# Bayesian Linear Regression

Turing is powerful when applied to complex hierarchical models, but it can also be put to work on common statistical procedures, like linear regression. This tutorial covers how to implement a linear regression model in Turing.

## Set Up

We begin by importing all the necessary libraries.

```julia
# Import Turing.
using Turing
# Package for loading the data set.
using RDatasets
# Package for visualization.
using StatsPlots
# Functionality for splitting the data.
using MLUtils: splitobs
# Functionality for constructing arrays with identical elements efficiently.
using FillArrays
# Functionality for normalizing the data and evaluating the model predictions.
using StatsBase
# Functionality for working with scaled identity matrices.
using LinearAlgebra
# Set a seed for reproducibility.
using Random
Random.seed!(0);
```

```julia
# Disable the sampling progress output to keep this tutorial's output tidy.
setprogress!(false)
```

We will use the mtcars dataset from the RDatasets package. mtcars contains a variety of statistics on different car models, including their miles per gallon, number of cylinders, and horsepower, among others.
We want to know if we can construct a Bayesian linear regression model to predict the miles per gallon of a car, given the other statistics it has. Let us take a look at the data we have.
```julia
# Load the dataset.
data = RDatasets.dataset("datasets", "mtcars")
# Show the first six rows of the dataset.
first(data, 6)
```

| Row | Model | MPG | Cyl | Disp | HP | DRat | WT | QSec | VS | AM | Gear | Carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String31 | Float64 | Int64 | Float64 | Int64 | Float64 | Float64 | Float64 | Int64 | Int64 | Int64 | Int64 |
| 1 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 2 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 3 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 4 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 5 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| 6 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
```julia
size(data)
```

```
(32, 12)
```
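If you want a quick per-column overview before modelling, the DataFrames API offers one; a minimal optional sketch (not part of the original workflow), relying on the same DataFrames functionality that the later `select!` and `DataFrame` calls already assume is in scope:

```julia
# Optional: per-column summary (mean, min, median, max, missing count, eltype).
describe(data)
```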
The next step is to get our data ready for testing. We’ll split the mtcars dataset into two subsets, one for training our model and one for evaluating it. Then we separate out the target we want to learn (MPG, in this case) and standardize both the features and the targets by subtracting each column’s mean and dividing by its standard deviation. The resulting data is not very familiar looking, but this standardization helps the sampler converge far more easily.
```julia
# Remove the model column.
select!(data, Not(:Model))
# Split our dataset 70%/30% into training/test sets.
trainset, testset = map(DataFrame, splitobs(data; at=0.7, shuffle=true))
# Turing requires data in matrix form.
target = :MPG
train = Matrix(select(trainset, Not(target)))
test = Matrix(select(testset, Not(target)))
train_target = trainset[:, target]
test_target = testset[:, target]
# Standardize the features.
dt_features = fit(ZScoreTransform, train; dims=1)
StatsBase.transform!(dt_features, train)
StatsBase.transform!(dt_features, test)
# Standardize the targets.
dt_targets = fit(ZScoreTransform, train_target)
StatsBase.transform!(dt_targets, train_target)
StatsBase.transform!(dt_targets, test_target);
```
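As an optional sanity check (an addition, not part of the original tutorial), we can confirm the effect of the standardization described above: each training-set column should now have mean roughly 0 and standard deviation roughly 1, while the test set is only approximately standardized because it was transformed with the training-set statistics.

```julia
# Optional check of the standardization above.
using Statistics  # standard library; `mean`/`std` are typically re-exported by StatsBase already

# Training columns: means ≈ 0 and standard deviations ≈ 1.
println(round.(mean(train; dims=1); digits=3))
println(round.(std(train; dims=1); digits=3))
```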
## Model Specification

In a traditional frequentist model using OLS, our model might look like:
\[ \mathrm{MPG}_i = \alpha + \boldsymbol{\beta}^\mathsf{T}\boldsymbol{X_i} \]
where \(\boldsymbol{\beta}\) is a vector of coefficients and \(\boldsymbol{X}\) is a vector of inputs for observation \(i\). The Bayesian model we are more concerned with is the following:
\[ \mathrm{MPG}_i \sim \mathcal{N}(\alpha + \boldsymbol{\beta}^\mathsf{T}\boldsymbol{X_i}, \sigma^2) \]
where \(\alpha\) is an intercept term common to all observations, \(\boldsymbol{\beta}\) is a coefficient vector, \(\boldsymbol{X_i}\) is the observed data for car \(i\), and \(\sigma^2\) is a common variance term.
For \(\sigma^2\), we assign a prior of `truncated(Normal(0, 100); lower=0)`. This is consistent with Andrew Gelman’s recommendations on noninformative priors for variance. The intercept term (\(\alpha\)) is assumed to be normally distributed with a mean of zero and a variance of three. This represents our assumption that miles per gallon can be explained mostly by our assorted variables, while the fairly wide prior reflects our uncertainty about that. Each coefficient is assumed to be normally distributed with a mean of zero and a variance of 10; we do not know that our coefficients are different from zero, and we don’t know which ones are likely to be the most important, so the variance term is quite high. Lastly, each observation \(y_i\) is normally distributed with mean \(\alpha + \boldsymbol{\beta}^\mathsf{T}\boldsymbol{X_i}\) and variance \(\sigma^2\).
```julia
# Bayesian linear regression.
@model function linear_regression(x, y)
# Set variance prior.
σ² ~ truncated(Normal(0, 100); lower=0)
# Set intercept prior.
intercept ~ Normal(0, sqrt(3))
# Set the priors on our coefficients.
nfeatures = size(x, 2)
coefficients ~ MvNormal(Zeros(nfeatures), 10.0 * I)
# Calculate all the mu terms.
mu = intercept .+ x * coefficients
return y ~ MvNormal(mu, σ² * I)
end
```

```
linear_regression (generic function with 2 methods)
```
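Before conditioning on the data, we can optionally inspect what the priors described above imply by sampling from the prior alone; a minimal sketch using Turing's `Prior` sampler (this step is an addition and not part of the original tutorial):

```julia
# Optional: draw parameter values from the prior only (no likelihood term),
# to see what the priors on σ², the intercept, and the coefficients imply.
prior_chain = sample(linear_regression(train, train_target), Prior(), 1_000)
summarystats(prior_chain)
```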
With our model specified, we can call the sampler. We will use the No U-Turn Sampler (NUTS) here.
```julia
model = linear_regression(train, train_target)
chain = sample(model, NUTS(), 5_000)
```

```
┌ Info: Found initial step size
└   ϵ = 0.4
Chains MCMC chain (5000×24×1 Array{Float64, 3}):
Iterations = 1001:1:6000
Number of chains = 1
Samples per chain = 5000
Wall duration = 9.65 seconds
Compute duration = 9.65 seconds
parameters = σ², intercept, coefficients[1], coefficients[2], coefficients[3], coefficients[4], coefficients[5], coefficients[6], coefficients[7], coefficients[8], coefficients[9], coefficients[10]
internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
parameters mean std mcse ess_bulk ess_tail ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Flo ⋯
σ² 0.4771 0.2898 0.0074 1614.9380 2191.3411 1. ⋯
intercept -0.0013 0.1451 0.0020 5552.1263 3209.9602 1. ⋯
coefficients[1] -0.5136 0.6048 0.0106 3317.1725 2619.6818 1. ⋯
coefficients[2] 0.3504 0.7844 0.0186 1796.9923 2367.1576 1. ⋯
coefficients[3] -0.4370 0.4952 0.0092 2931.6863 2818.2703 1. ⋯
coefficients[4] 0.0057 0.2603 0.0041 4093.9661 3294.3455 1. ⋯
coefficients[5] -0.2150 0.5640 0.0137 1720.8416 1902.1129 1. ⋯
coefficients[6] -0.0316 0.4717 0.0106 1990.2678 2560.0590 1. ⋯
coefficients[7] -0.1079 0.4394 0.0092 2254.5406 2355.3310 1. ⋯
coefficients[8] 0.1035 0.3865 0.0088 1942.8646 2229.7878 1. ⋯
coefficients[9] 0.2310 0.3759 0.0081 2227.6660 2642.4282 1. ⋯
coefficients[10] -0.1383 0.4822 0.0125 1511.5929 1904.5811 1. ⋯
2 columns omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
σ² 0.1851 0.2976 0.4002 0.5705 1.2121
intercept -0.2908 -0.0909 -0.0014 0.0925 0.2835
coefficients[1] -1.7340 -0.9019 -0.5232 -0.1452 0.6796
coefficients[2] -1.1938 -0.1630 0.3655 0.8612 1.8740
coefficients[3] -1.4388 -0.7584 -0.4321 -0.1262 0.5536
coefficients[4] -0.5240 -0.1602 0.0096 0.1715 0.5306
coefficients[5] -1.3393 -0.5671 -0.2142 0.1400 0.9210
coefficients[6] -0.9548 -0.3286 -0.0309 0.2617 0.8868
coefficients[7] -0.9870 -0.3874 -0.1067 0.1699 0.7843
coefficients[8] -0.6890 -0.1393 0.1100 0.3549 0.8447
coefficients[9] -0.5398 0.0054 0.2293 0.4598 1.0006
coefficients[10] -1.0991 -0.4471 -0.1301 0.1637 0.8141
```
We can also check the densities and traces of the parameters visually using the plot functionality.
```julia
plot(chain)
```

It looks like all parameters have converged.
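That visual judgement can be backed up numerically; a small sketch (an addition to the tutorial) that reprints the chain's summary table so the `rhat` column, truncated in the display above, can be inspected, together with a hedged note on running several chains:

```julia
# Numerical convergence checks: in the summary table, `rhat` should be close
# to 1 and `ess_bulk`/`ess_tail` should not be tiny.
summarystats(chain)

# rhat and the ESS estimates are most informative when several chains are run;
# with multiple Julia threads available, Turing can sample them in parallel, e.g.:
# chains = sample(model, NUTS(), MCMCThreads(), 5_000, 4)
```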
## Comparing to OLS
A satisfactory test of our model is to evaluate how well it predicts. Importantly, we want to compare our model to existing tools like OLS. The code below uses the GLM.jl package to generate a traditional OLS multiple regression model on the same data as our probabilistic model.
```julia
# Import the GLM package.
using GLM
# Perform multiple regression OLS.
train_with_intercept = hcat(ones(size(train, 1)), train)
ols = lm(train_with_intercept, train_target)
# Compute predictions on the training data set and unstandardize them.
train_prediction_ols = GLM.predict(ols)
StatsBase.reconstruct!(dt_targets, train_prediction_ols)
# Compute predictions on the test data set and unstandardize them.
test_with_intercept = hcat(ones(size(test, 1)), test)
test_prediction_ols = GLM.predict(ols, test_with_intercept)
StatsBase.reconstruct!(dt_targets, test_prediction_ols);
```

The function below accepts a chain and an input matrix and calculates predictions. We use the samples of the model parameters in the chain starting with sample 200.
```julia
# Make predictions given a matrix of inputs.
function prediction(chain, x)
p = get_params(chain[200:end, :, :])
targets = p.intercept' .+ x * reduce(hcat, p.coefficients)'
return vec(mean(targets; dims=2))
end
```

```
prediction (generic function with 1 method)
```
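The function above returns only the posterior mean of the regression mean for each observation. Since we have full posterior samples, we can also quantify uncertainty; the sketch below (a hypothetical `prediction_interval` helper, not part of the original tutorial) computes per-observation quantiles of the mean prediction on the standardized scale. Note that it does not add the observation noise \(\sigma^2\), so these are intervals for the regression mean rather than full posterior predictive intervals.

```julia
using Statistics: quantile  # standard library; may already be in scope via StatsBase

# Illustrative helper (not from the tutorial): per-observation credible
# intervals for the regression mean, on the standardized scale.
function prediction_interval(chain, x; q=(0.025, 0.975))
    p = get_params(chain[200:end, :, :])
    # Rows correspond to observations, columns to posterior samples.
    targets = p.intercept' .+ x * reduce(hcat, p.coefficients)'
    lower = [quantile(row, q[1]) for row in eachrow(targets)]
    upper = [quantile(row, q[2]) for row in eachrow(targets)]
    return lower, upper
end
```

The resulting bounds could then be unstandardized with `StatsBase.reconstruct!(dt_targets, ...)` in the same way as the point predictions below.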
When we make predictions, we unstandardize them so they are more understandable.
```julia
# Calculate the predictions for the training and testing sets and unstandardize them.
train_prediction_bayes = prediction(chain, train)
StatsBase.reconstruct!(dt_targets, train_prediction_bayes)
test_prediction_bayes = prediction(chain, test)
StatsBase.reconstruct!(dt_targets, test_prediction_bayes)
# Show the predictions on the test data set.
DataFrame(; MPG=testset[!, target], Bayes=test_prediction_bayes, OLS=test_prediction_ols)
```

| Row | MPG | Bayes | OLS |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 33.9 | 26.8226 | 26.804 |
| 2 | 21.0 | 22.3637 | 22.4669 |
| 3 | 21.4 | 20.4945 | 20.5666 |
| 4 | 26.0 | 28.7889 | 28.9169 |
| 5 | 15.0 | 11.7368 | 11.584 |
| 6 | 10.4 | 13.5482 | 13.7006 |
| 7 | 30.4 | 27.3307 | 27.4661 |
| 8 | 10.4 | 14.3284 | 14.5346 |
| 9 | 18.7 | 17.2343 | 17.2897 |
| 10 | 17.3 | 14.7345 | 14.6084 |
Now let’s evaluate the loss for each method, and each prediction set. We will use the mean squared error to evaluate loss, given by \[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^n {(y_i - \hat{y}_i)^2} \] where \(y_i\) is the actual value (true MPG) and \(\hat{y}_i\) is the predicted value using either OLS or Bayesian linear regression. A lower MSE indicates a closer fit to the data.
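The formula above translates directly into code; a one-line sketch for illustration (the evaluation below instead uses StatsBase's `msd`, which computes the same mean squared deviation):

```julia
# Mean squared error, exactly as in the formula above.
mse(y, ŷ) = mean(abs2, y .- ŷ)
```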
```julia
println(
"Training set:",
"\n\tBayes loss: ",
msd(train_prediction_bayes, trainset[!, target]),
"\n\tOLS loss: ",
msd(train_prediction_ols, trainset[!, target]),
)
println(
"Test set:",
"\n\tBayes loss: ",
msd(test_prediction_bayes, testset[!, target]),
"\n\tOLS loss: ",
msd(test_prediction_ols, testset[!, target]),
)
```

```
Training set:
Bayes loss: 4.269315689736475
OLS loss: 4.264328587912453
Test set:
Bayes loss: 11.468971155921656
	OLS loss: 11.920700014054287
```
As we can see above, OLS and our Bayesian model fit the training and test data sets about equally well.