Bayesian Logistic Regression
Bayesian logistic regression is the Bayesian counterpart to a common tool in machine learning, logistic regression. The goal of logistic regression is to predict a one or a zero for a given item. An example might be predicting whether someone is sick or healthy given their symptoms and personal information.
In our example, we’ll be working to predict whether someone is likely to default, using a synthetic dataset found in the RDatasets package. This dataset, Default, comes from R’s ISLR package and contains information on borrowers.
To start, let’s import all the libraries we’ll need.
# Import Turing and Distributions.
using Turing, Distributions

# Import RDatasets.
using RDatasets

# Import MCMCChains, Plots, and StatsPlots for visualizations and diagnostics.
using MCMCChains, Plots, StatsPlots

# We need a logistic function, which is provided by StatsFuns.
using StatsFuns: logistic

# Functionality for splitting and normalizing the data
using MLDataUtils: shuffleobs, stratifiedobs, rescale!

# Set a seed for reproducibility.
using Random
Random.seed!(0);
Data Cleaning & Set Up
Now we’re going to import our dataset. The first six rows of the dataset are shown below so you can get a good feel for what kind of data we have.
# Import the "Default" dataset.
data = RDatasets.dataset("ISLR", "Default");

# Show the first six rows of the dataset.
first(data, 6)
Row | Default | Student | Balance | Income |
---|---|---|---|---|
 | Cat… | Cat… | Float64 | Float64 |
1 | No | No | 729.526 | 44361.6 |
2 | No | Yes | 817.18 | 12106.1 |
3 | No | No | 1073.55 | 31767.1 |
4 | No | No | 529.251 | 35704.5 |
5 | No | No | 785.656 | 38463.5 |
6 | No | Yes | 919.589 | 7491.56 |
Most machine learning processes require some effort to tidy up the data, and this is no different. We need to convert the Default and Student columns, which say “Yes” or “No”, into 1s and 0s. Afterwards, we’ll get rid of the old words-based columns.
# Convert "Default" and "Student" to numeric values.
data[!, :DefaultNum] = [r.Default == "Yes" ? 1.0 : 0.0 for r in eachrow(data)]
data[!, :StudentNum] = [r.Student == "Yes" ? 1.0 : 0.0 for r in eachrow(data)]

# Delete the old columns which say "Yes" and "No".
select!(data, Not([:Default, :Student]))

# Show the first six rows of our edited dataset.
first(data, 6)
Row | Balance | Income | DefaultNum | StudentNum |
---|---|---|---|---|
 | Float64 | Float64 | Float64 | Float64 |
1 | 729.526 | 44361.6 | 0.0 | 0.0 |
2 | 817.18 | 12106.1 | 0.0 | 1.0 |
3 | 1073.55 | 31767.1 | 0.0 | 0.0 |
4 | 529.251 | 35704.5 | 0.0 | 0.0 |
5 | 785.656 | 38463.5 | 0.0 | 0.0 |
6 | 919.589 | 7491.56 | 0.0 | 1.0 |
After we’ve done that tidying, it’s time to split our dataset into training and testing sets, and separate the labels from the data. We separate our data into two parts, train and test. You can use a larger (or smaller) share of the data for training by modifying the at = 0.05 argument. We deliberately train on only a 5% sample to show the power of Bayesian inference with small sample sizes.
We must rescale our variables so that they are centered around zero, by subtracting the mean from each column and dividing by its standard deviation. Without this step, Turing’s sampler will have a hard time finding a place to start searching for parameter estimates. To do this we will leverage MLDataUtils, which also lets us effortlessly shuffle our observations and perform a stratified split to get a representative test set.
function split_data(df, target; at=0.70)
    shuffled = shuffleobs(df)
    return trainset, testset = stratifiedobs(row -> row[target], shuffled; p=at)
end

features = [:StudentNum, :Balance, :Income]
numerics = [:Balance, :Income]
target = :DefaultNum

trainset, testset = split_data(data, target; at=0.05)
for feature in numerics
    # Standardize the training column, then apply the same μ and σ to the test column.
    μ, σ = rescale!(trainset[!, feature]; obsdim=1)
    rescale!(testset[!, feature], μ, σ; obsdim=1)
end

# Turing requires data in matrix form, not dataframe
train = Matrix(trainset[:, features])
test = Matrix(testset[:, features])
train_label = trainset[:, target]
test_label = testset[:, target];
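To make the rescaling step concrete, here is a minimal sketch of standardizing by hand, using only the Statistics standard library. The toy vectors train_balance and test_balance are hypothetical values, not rows from the Default dataset, and rescale! may differ in its internal details. The point it illustrates is that the mean and standard deviation come from the training data only and are then reused for the test data.

```julia
using Statistics

# Hypothetical toy values standing in for one numeric column (not real data).
train_balance = [700.0, 800.0, 1100.0, 500.0]
test_balance = [900.0, 650.0]

# Compute the statistics on the training data only.
μ = mean(train_balance)
σ = std(train_balance)

# Apply the same shift and scale to both sets.
train_scaled = (train_balance .- μ) ./ σ
test_scaled = (test_balance .- μ) ./ σ

println("train mean after scaling: ", mean(train_scaled))
println("train std after scaling:  ", std(train_scaled))
println("scaled test values:       ", test_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step, which is why the loop above passes μ and σ into the second rescale! call.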
Model Declaration
Finally, we can define our model.
logistic_regression takes four arguments: x is our set of independent variables; y is the element we want to predict; n is the number of observations we have; and σ is the standard deviation we want to assume for our priors.
Within the model, we create four coefficients (intercept, student, balance, and income) and assign each a normal prior with a mean of zero and a standard deviation of σ. We want to find values of these four coefficients to predict any given y.
The for block computes a variable v for each observation: the logistic function applied to the linear combination of that observation’s predictors, which is the predicted probability that the label is 1. We then observe the actual label, y[i], as a draw from a Bernoulli distribution with success probability v.
# Bayesian logistic regression (LR)
@model function logistic_regression(x, y, n, σ)
    intercept ~ Normal(0, σ)

    student ~ Normal(0, σ)
    balance ~ Normal(0, σ)
    income ~ Normal(0, σ)

    for i in 1:n
        v = logistic(intercept + student * x[i, 1] + balance * x[i, 2] + income * x[i, 3])
        y[i] ~ Bernoulli(v)
    end
end;
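To see what the loop in the model computes for a single observation, the following sketch evaluates it by hand without Turing. The coefficient values and the observation x_i are hypothetical, chosen only for illustration, and my_logistic and bernoulli_logpdf are stand-in helpers written here rather than the StatsFuns or Distributions functions.

```julia
# Hand-written logistic function (a stand-in for StatsFuns.logistic).
my_logistic(z) = 1 / (1 + exp(-z))

# Log-probability of a 0/1 label under a Bernoulli with success probability p.
bernoulli_logpdf(y, p) = y == 1 ? log(p) : log(1 - p)

# Hypothetical coefficient values, for illustration only.
intercept, student, balance, income = -4.0, -0.5, 2.0, 0.3

# Hypothetical observation in the column order [StudentNum, Balance, Income]:
# a non-student with standardized balance 1.5 and standardized income 0.0.
x_i = [0.0, 1.5, 0.0]

# Same expression as inside the model's for loop.
v = my_logistic(intercept + student * x_i[1] + balance * x_i[2] + income * x_i[3])

println("predicted probability of default: ", v)
println("log-likelihood if y = 1: ", bernoulli_logpdf(1, v))
println("log-likelihood if y = 0: ", bernoulli_logpdf(0, v))
```

The sampler explores coefficient values that make the sum of these per-observation log-likelihoods, plus the log-prior, large.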
Sampling
Now we can run our sampler. This time we’ll use NUTS to sample from our posterior.
setprogress!(false)

# Retrieve the number of observations.
n, _ = size(train)

# Sample using NUTS: 3 chains of 1,500 samples each, run on separate threads.
m = logistic_regression(train, train_label, n, 1)
chain = sample(m, NUTS(), MCMCThreads(), 1_500, 3)
Chains MCMC chain (1500×16×3 Array{Float64, 3}):
Iterations = 751:1:2250
Number of chains = 3
Samples per chain = 1500
Wall duration = 11.95 seconds
Compute duration = 9.72 seconds
parameters = intercept, student, balance, income
internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
intercept -4.4252 0.4507 0.0097 2212.2795 2660.0075 1.0016 ⋯
student -0.3804 0.6210 0.0122 2588.7488 3015.3927 1.0002 ⋯
balance 1.8638 0.2946 0.0063 2171.5382 2368.9974 1.0021 ⋯
income 0.3009 0.3012 0.0058 2665.3003 2582.0627 1.0014 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
intercept -5.3269 -4.7235 -4.4066 -4.1145 -3.5959
student -1.6015 -0.8025 -0.3796 0.0417 0.8442
balance 1.3040 1.6647 1.8505 2.0563 2.4601
income -0.3080 0.1045 0.2991 0.4972 0.9064
The sample() call above assumes that you have at least nchains threads available in your Julia instance (here, nchains is 3). If you do not, the multiple chains will run sequentially, and you may notice a warning. For more information, see the Turing documentation on sampling multiple chains.
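A quick way to check the thread situation before sampling is sketched below. It uses only Base Julia; the variable nchains is introduced here for illustration and simply mirrors the number of chains passed to sample(). Threads are typically set when Julia starts, for example with julia --threads 3 or the JULIA_NUM_THREADS environment variable.

```julia
# Number of chains requested in the sample() call (assumed to be 3, as above).
nchains = 3

# Threads.nthreads() reports how many threads this Julia session was started with.
available = Threads.nthreads()

if available >= nchains
    println("Running with $available threads: the $nchains chains can run in parallel.")
else
    println("Only $available thread(s) available: the $nchains chains will not all run in parallel.")
    println("Restart Julia with `julia --threads $nchains` to run them concurrently.")
end
```

If parallel sampling is not needed, replacing MCMCThreads() with MCMCSerial() in the sample() call should run the chains one after another explicitly.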
Since we ran multiple chains, we may as well do a spot check to make sure each chain converges around similar points.
plot(chain)
Looks good!
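The visual check can be complemented by a number. The sketch below computes the classic Gelman-Rubin statistic for one parameter, comparing between-chain and within-chain variance. It is a simplified illustration and not the rank-normalized estimator behind the rhat column reported by MCMCChains. The function name classic_rhat and the toy draws are hypothetical, and the draws are assumed to be arranged as an iterations × chains matrix.

```julia
using Statistics

# Classic (non-split) Gelman-Rubin statistic for a single parameter.
# `draws` is assumed to be an iterations × chains matrix.
function classic_rhat(draws::AbstractMatrix{<:Real})
    n, _ = size(draws)
    chain_means = vec(mean(draws; dims=1))
    chain_vars = vec(var(draws; dims=1))
    W = mean(chain_vars)                   # average within-chain variance
    B = n * var(chain_means)               # between-chain variance
    var_plus = (n - 1) / n * W + B / n     # pooled variance estimate
    return sqrt(var_plus / W)
end

# Hypothetical draws: 100 iterations for each of 3 chains.
base = collect(1.0:100.0)
agreeing = hcat(base, base, base)                     # chains cover the same region
disagreeing = hcat(base, base .+ 1000, base .- 1000)  # chains stuck in different regions

println("agreeing chains:    ", classic_rhat(agreeing))
println("disagreeing chains: ", classic_rhat(disagreeing))
```

Values close to 1, like those in the rhat column of the summary above, indicate that the chains agree with one another; values well above 1 indicate that they have not mixed.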
We can also use the corner function from MCMCChains to show the distributions of the various parameters of our logistic regression.
# The labels to use.
l = [:student, :balance, :income]

# Use the corner function. Requires StatsPlots and MCMCChains.
corner(chain, l)