using Turing
using RDatasets
using MCMCChains, Plots, StatsPlots
using StatsFuns: logistic
using MLUtils: splitobs
using StatsBase: fit, transform!, ZScoreTransform
# Set a seed for reproducibility.
using Random
Random.seed!(0);

Bayesian Logistic Regression
Bayesian logistic regression is the Bayesian counterpart to a common machine-learning tool, logistic regression. The goal of logistic regression is to predict a one or a zero for a given training item. An example might be predicting whether someone is sick or healthy given their symptoms and personal information.
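In symbols, logistic regression models the probability of a positive label as

$$
p(y_i = 1 \mid x_i) = \operatorname{logistic}(\beta_0 + \beta^\top x_i), \qquad \operatorname{logistic}(z) = \frac{1}{1 + e^{-z}},
$$

and the Bayesian version places priors on the intercept $\beta_0$ and the coefficients $\beta$ rather than fitting point estimates.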
In our example, we’ll be working to predict whether someone is likely to default, using a synthetic dataset found in the RDatasets package. This dataset, Default, comes from R’s ISLR package and contains information on borrowers.
To start, let’s import all the libraries we’ll need.
Data Cleaning & Set Up
Now we’re going to import our dataset. The first six rows of the dataset are shown below so you can get a good feel for what kind of data we have.
# Import the "Default" dataset.
data = RDatasets.dataset("ISLR", "Default");
# Show the first six rows of the dataset.
first(data, 6)

| Row | Default | Student | Balance | Income |
|---|---|---|---|---|
|  | Cat… | Cat… | Float64 | Float64 |
| 1 | No | No | 729.526 | 44361.6 |
| 2 | No | Yes | 817.18 | 12106.1 |
| 3 | No | No | 1073.55 | 31767.1 |
| 4 | No | No | 529.251 | 35704.5 |
| 5 | No | No | 785.656 | 38463.5 |
| 6 | No | Yes | 919.589 | 7491.56 |
Most machine learning processes require some effort to tidy up the data, and this is no different. We need to convert the Default and Student columns, which say “Yes” or “No” into 1s and 0s. Afterwards, we’ll get rid of the old words-based columns.
# Convert "Default" and "Student" to numeric values.
yesno = ["No", "Yes"]
data[!, :DefaultNum] = indexin(data[!, :Default], yesno) .- 1
data[!, :StudentNum] = indexin(data[!, :Student], yesno) .- 1
# Delete the old columns which say "Yes" and "No".
select!(data, Not([:Default, :Student]))
# Show the first six rows of our edited dataset.
first(data, 6)

| Row | Balance | Income | DefaultNum | StudentNum |
|---|---|---|---|---|
|  | Float64 | Float64 | Int64 | Int64 |
| 1 | 729.526 | 44361.6 | 0 | 0 |
| 2 | 817.18 | 12106.1 | 0 | 1 |
| 3 | 1073.55 | 31767.1 | 0 | 0 |
| 4 | 529.251 | 35704.5 | 0 | 0 |
| 5 | 785.656 | 38463.5 | 0 | 0 |
| 6 | 919.589 | 7491.56 | 0 | 1 |
Our predictor variables are StudentNum, Balance, and Income, and our target variable is DefaultNum; we separate those out into X and Y for ease of use later on. We’ll also convert them to Matrix and Vector types, respectively, by wrapping them in Array.
# `splitobs` later expects that `X` has observations in columns,
# hence the transpose on `X`.
X = Array(data[!, [:StudentNum, :Balance, :Income]])'
Y = Array(data[!, :DefaultNum])
size(X), size(Y)

((3, 10000), (10000,))
It’s now time to split our dataset into training and testing sets. We separate our data into two partitions, train and test. You can train on a larger (or smaller) share of the data by modifying the at = 0.05 argument; here we train on only 5% of the data to highlight the power of Bayesian inference with small sample sizes. To do the splitting we leverage MLUtils, which also lets us effortlessly shuffle our observations and perform a stratified split to get a representative test set.
(train_X, train_Y), (test_X, test_Y) = splitobs((X, Y); at=0.05, shuffle=true, stratified=Y)
# Let's check that the labels are distributed
# similarly in the training and test sets.
mean(train_Y), mean(test_Y)

(0.034, 0.03326315789473684)
We must now rescale our numeric variables so that they are centred around zero, subtracting each column's mean and dividing by its standard deviation. This rescaling puts the features on comparable scales, which improves sampler initialisation and convergence. Note that the transform is fitted on the training set only and then applied to both sets, so no test-set information leaks into the preprocessing.
Note here that we leave out the StudentNum variable (row 1) from the normalisation, since it is already binary and doesn’t need to be rescaled.
dt = fit(ZScoreTransform, view(train_X, 2:3, :); dims=2)
transform!(dt, view(train_X, 2:3, :))
transform!(dt, view(test_X, 2:3, :))

2×9500 view(adjoint(::Matrix{Float64}), 2:3, [7866, 9598, 7136, 1047, 3684, 1049, 5396, 8943, 4016, 6504 … 1940, 652, 741, 4698, 582, 6831, 5783, 5675, 8706, 8908]) with eltype Float64:
-0.5065 0.459941 0.0980184 0.16727 … 2.50011 2.47451 1.0281
0.407851 0.619717 0.426981 -1.12592 0.811405 -0.199269 -0.611277
Model Declaration
Finally, we can define our model.
logistic_regression takes three arguments:

- x is our set of independent variables;
- y is the element we want to predict;
- σ is the (fixed) standard deviation we want to assume for our priors.
Within the model, we create four parameters (intercept, student, balance, and income) and give each a normal prior with mean zero and standard deviation σ. We want to find values of these four parameters that predict any given y.
The for block computes v, the logistic function applied to the linear predictor for observation i, and then observes the label y[i] as a Bernoulli draw with success probability v.
@model function logistic_regression(x, y, σ)
N = size(x, 2)
@assert length(y) == N
intercept ~ Normal(0, σ)
student ~ Normal(0, σ)
balance ~ Normal(0, σ)
income ~ Normal(0, σ)
for i in 1:N
v = logistic(intercept + student * x[1, i] + balance * x[2, i] + income * x[3, i])
y[i] ~ Bernoulli(v)
end
end;

Sampling
Now we can run our sampler. Here we’ll use NUTS to sample from our posterior.
setprogress!(false)

# Sample using NUTS.
m = logistic_regression(train_X, train_Y, 1.0)
chain = sample(m, NUTS(), MCMCThreads(), 1_500, 3)

┌ Warning: Only a single thread available: MCMC chains are not sampled in parallel
└ @ AbstractMCMC ~/.julia/packages/AbstractMCMC/oqm6Y/src/sample.jl:544
┌ Info: Found initial step size
└   ϵ = 0.4
┌ Info: Found initial step size
└   ϵ = 0.2
┌ Info: Found initial step size
└   ϵ = 0.4
Chains MCMC chain (1500×18×3 Array{Float64, 3}):
Iterations = 751:1:2250
Number of chains = 3
Samples per chain = 1500
Wall duration = 8.72 seconds
Compute duration = 4.57 seconds
parameters = intercept, student, balance, income
internals = n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size, logprior, loglikelihood, logjoint
Use `describe(chains)` for summary statistics and quantiles.
The sample() call above assumes that you have at least nchains threads available in your Julia instance. If you do not, the multiple chains will run sequentially, and you may notice a warning. For more information, see the Turing documentation on sampling multiple chains.
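For instance, you can make more threads available by launching Julia with the --threads flag or by setting the JULIA_NUM_THREADS environment variable before starting Julia (a Unix shell sketch; adjust the thread count to your machine):

```shell
# Launch Julia with 4 threads so MCMCThreads() can sample chains in parallel.
julia --threads=4

# Equivalently, set the environment variable before starting Julia.
export JULIA_NUM_THREADS=4
julia
```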
Since we ran multiple chains, we may as well do a spot check to make sure each chain converges around similar points.
plot(chain)

Looks good!
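Beyond eyeballing the trace plots, we can spot-check numerical diagnostics. A minimal sketch, assuming a recent MCMCChains version where the summary table exposes an rhat column:

```julia
# R-hat values near 1.0 (and healthy effective sample sizes)
# indicate the three chains agree with each other.
using MCMCChains
ss = summarystats(chain)
ss[:, :rhat]  # expect values close to 1.0 for all four parameters
```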
We can also use the corner function from StatsPlots to show the distributions of the various parameters of our logistic regression.
# The labels to use.
l = [:student, :balance, :income]
# Use the corner function. Requires StatsPlots and MCMCChains.
corner(chain, l)

Fortunately, the corner plot shows a unimodal distribution for each of our parameters, so taking the mean of each parameter's sampled values should give a reasonable point estimate for making predictions.
Making Predictions
How do we test how well the model actually predicts whether someone is likely to default? We need to build a prediction function that takes the test data we made earlier and runs it through the averaged parameters calculated during sampling.
The prediction function below takes a Matrix and a Chain object. It takes the mean of each parameter’s sampled values and re-runs the logistic function using those mean values for every element in the test set.
using MCMCChains: MCMCChains
function prediction(x::AbstractMatrix, chain::MCMCChains.Chains, threshold)
# Pull the means from each parameter's sampled values in the chain.
intercept = mean(chain[:intercept])
student = mean(chain[:student])
balance = mean(chain[:balance])
income = mean(chain[:income])
# Retrieve the number of observations.
n = size(x, 2)
# Generate a vector to store our predictions.
v = Vector{Bool}(undef, n)
# Calculate the logistic function for each element in the test set.
for i in 1:n
num = logistic(
            intercept + student * x[1, i] + balance * x[2, i] + income * x[3, i]
)
v[i] = num >= threshold
end
return v
end

prediction (generic function with 1 method)
Let’s see how we did! We run the test matrix through the prediction function and compute the mean squared error (MSE) of our predictions; for 0/1 labels this is simply the misclassification rate. The threshold variable sets the decision boundary for classification. For example, a threshold of 0.07 will predict a default (value of 1) for any predicted probability greater than 0.07 and no default otherwise. Lower thresholds increase sensitivity but may increase false positives.
# Set the prediction threshold.
threshold = 0.07
# Make the predictions.
predictions = prediction(test_X, chain, threshold)
# Calculate MSE for our test set.
loss = sum((predictions - test_Y) .^ 2) / length(test_Y)

0.108
Perhaps more important is to see what percentage of defaults we correctly predicted. The code below simply counts defaults and predictions and presents the results.
defaults = sum(test_Y)
not_defaults = length(test_Y) - defaults
predicted_defaults = sum(test_Y .== predictions .== 1)
predicted_not_defaults = sum(test_Y .== predictions .== 0)
println("Defaults: $defaults
Predictions: $predicted_defaults
Percentage defaults correct $(predicted_defaults/defaults)")
println("Not defaults: $not_defaults
Predictions: $predicted_not_defaults
Percentage non-defaults correct $(predicted_not_defaults/not_defaults)")

Defaults: 316
Predictions: 260
Percentage defaults correct 0.8227848101265823
Not defaults: 9184
Predictions: 8214
Percentage non-defaults correct 0.8943815331010453
The above shows that with a threshold of 0.07 we correctly predict a respectable portion of the defaults and correctly identify most non-defaults. These results are fairly sensitive to the choice of threshold, so you may wish to experiment with it.
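To get a feel for that sensitivity, here is a minimal sketch of a threshold sweep, reusing the prediction function and test set defined above:

```julia
# Sweep several thresholds and report the fraction of defaults
# and non-defaults identified correctly at each one.
for t in (0.05, 0.07, 0.1, 0.25, 0.5)
    preds = prediction(test_X, chain, t)
    n_default = sum(test_Y)
    tpr = sum(test_Y .== preds .== 1) / n_default
    tnr = sum(test_Y .== preds .== 0) / (length(test_Y) - n_default)
    println("threshold = $t: defaults correct = $(round(tpr; digits=3)), " *
            "non-defaults correct = $(round(tnr; digits=3))")
end
```

As the threshold rises, fewer observations are flagged as defaults, so the default recall falls while the non-default accuracy rises.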