using Turing
using Distributions
using StatsPlots
using Random
using LinearAlgebra
Random.seed!(123)
# Disable progress bars for cleaner output
Turing.setprogress!(false)
[ Info: [Turing]: progress logging is disabled globally
false
Turing.jl provides stochastic gradient-based MCMC samplers: Stochastic Gradient Langevin Dynamics (SGLD) and Stochastic Gradient Hamiltonian Monte Carlo (SGHMC).
These samplers are primarily intended for research purposes and require significant expertise to use effectively. For production use and most practical applications, we strongly recommend using HMC or NUTS instead, which are more robust and efficient.
The current implementation in Turing.jl is primarily useful for:
- Research purposes: studying stochastic gradient MCMC methods
- Educational purposes: understanding stochastic gradient MCMC algorithms
- Streaming data: when data arrives continuously (with careful tuning)
- Experimental applications: testing stochastic sampling approaches
Important: The current implementation computes full gradients with added stochastic noise rather than true mini-batch stochastic gradients. This means these samplers don’t currently provide the computational benefits typically associated with stochastic gradient methods for large datasets. They require very careful hyperparameter tuning and often perform slower than standard samplers like HMC or NUTS for most practical applications.
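To make this distinction concrete, here is what a true mini-batch gradient estimate looks like for a toy Gaussian log-likelihood, in contrast to the full-gradient-plus-noise scheme described above. The names `full_grad` and `minibatch_grad` are illustrative for this sketch, not part of Turing's API:

```julia
using Random, Statistics

# For x ~ N(μ, 1), the gradient of the log-likelihood w.r.t. μ is Σᵢ (xᵢ - μ).
Random.seed!(0)
x = randn(1_000) .+ 2.0

full_grad(μ) = sum(x .- μ)

# A true stochastic gradient: rescale the gradient of a random subsample
# so that it is an unbiased estimate of the full gradient.
function minibatch_grad(μ, batchsize)
    idx = rand(1:length(x), batchsize)
    return length(x) / batchsize * sum(x[idx] .- μ)
end

full_grad(0.0)            # one deterministic value, touches all 1,000 points
minibatch_grad(0.0, 100)  # noisy, but correct in expectation, touches only 100
```

Averaging many mini-batch estimates recovers the full gradient; this unbiasedness is the property mini-batch SGLD and SGHMC rely on, and it is what the current full-gradient implementation forgoes.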
Future Development: These stochastic gradient samplers are being migrated to AdvancedHMC.jl for better maintenance and development. Once migration is complete, Turing.jl will support AbstractMCMC-compatible algorithms, and users requiring research-grade stochastic gradient algorithms will be directed to AdvancedHMC.
SGLD adds properly scaled noise to gradient descent steps to enable MCMC sampling. The key insight is that the right amount of noise transforms optimization into sampling from the posterior distribution.
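The update rule behind this idea fits in a few lines: θ ← θ + (ϵ/2)∇log p(θ) + N(0, ϵ). Below is a toy, self-contained sketch of vanilla SGLD on a 1D standard-normal target (assumed here purely for illustration; this is not Turing's internal implementation):

```julia
using Random

grad_logp(θ) = -θ   # ∇ log p(θ) for a standard-normal target

# One SGLD step: half a gradient step on log p, plus Gaussian noise of variance ϵ
sgld_step(θ, ϵ) = θ + (ϵ / 2) * grad_logp(θ) + sqrt(ϵ) * randn()

function run_sgld(θ0, ϵ, nsteps)
    θ = θ0
    for _ in 1:nsteps
        θ = sgld_step(θ, ϵ)
    end
    return θ
end

Random.seed!(1)
θ = run_sgld(5.0, 1e-2, 5_000)   # ends up distributed roughly as N(0, 1)
```

Without the `sqrt(ϵ) * randn()` term this is plain gradient ascent converging to the mode; with it, the iterates wander the posterior in proportion to its probability mass.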
Let’s start with a simple Gaussian model:
# Generate synthetic data
true_μ = 2.0
true_σ = 1.5
N = 100
data = rand(Normal(true_μ, true_σ), N)
# Define a simple Gaussian model
@model function gaussian_model(x)
μ ~ Normal(0, 10)
σ ~ truncated(Normal(0, 5); lower=0)
for i in eachindex(x)
x[i] ~ Normal(μ, σ)
end
end
model = gaussian_model(data)
DynamicPPL.Model{typeof(gaussian_model), (:x,), (), (), Tuple{Vector{Float64}}, Tuple{}, DynamicPPL.DefaultContext}(gaussian_model, (x = [3.21243189269745, 0.31689123782873996, 0.3430458465060562, 1.3745110472525999, 2.431381970935784, 2.3447280470778016, 1.367347003400461, -0.033385931651795486, 2.1041887116378404, 1.82401579320378 … 1.9524391544746311, 2.6525022164657783, 3.260442669093176, 2.5685766805908874, 1.837062568926331, 0.9818378577483255, 0.3676722290955695, 3.0556374886884523, 2.2149888398562707, 2.222563050014123],), NamedTuple(), DynamicPPL.DefaultContext())
SGLD requires very small step sizes to ensure stability. We use a `PolynomialStepsize` that decreases over time. Note: currently, `PolynomialStepsize` is the primary stepsize schedule available in Turing for SGLD.
Important Note on Convergence: The examples below use longer chains (10,000-15,000 samples) with the first half discarded as burn-in to ensure proper convergence. This is typical for stochastic gradient samplers, which require more samples than standard HMC/NUTS to achieve reliable results:
# SGLD with polynomial stepsize schedule
# stepsize(t) = a / (b + t)^γ
# Using smaller step size and longer chain for better convergence
sgld_stepsize = Turing.PolynomialStepsize(0.00005, 20000, 0.55)
chain_sgld = sample(model, SGLD(stepsize=sgld_stepsize), 10000)
# Note: We use a longer chain (10000 samples) to ensure convergence
# The first half can be considered burn-in
summarystats(chain_sgld[5001:end])
Summary Statistics
  parameters       mean      std     mcse   ess_bulk   ess_tail     rhat  ⋯
      Symbol    Float64  Float64  Float64    Float64    Float64  Float64  ⋯
           μ   -17.0265   0.0068   0.0017    15.5131    24.6317   1.1585  ⋯
           σ     6.8241   0.5436   0.1698    10.5238    18.3923   2.1157  ⋯
                                                            1 column omitted
SGHMC extends HMC to the stochastic gradient setting by incorporating friction to counteract the noise from stochastic gradients:
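The mechanics can be sketched in a few lines: a momentum variable v accumulates gradient information, a friction term −αv removes the extra energy injected by noisy gradients, and matched Gaussian noise keeps the stationary distribution correct. The toy target and constants below are assumptions for illustration, not Turing internals:

```julia
using Random

grad_logp(θ) = -θ   # 1D standard-normal target

# One SGHMC step (Chen et al. 2014, with the gradient-noise estimate set to 0):
# friction α damps the momentum; the injected noise has matching variance 2αϵ.
function sghmc_step(θ, v, ϵ, α)
    v = v + ϵ * grad_logp(θ) - α * v + sqrt(2 * α * ϵ) * randn()
    θ = θ + v
    return θ, v
end

function run_sghmc(θ0, ϵ, α, nsteps)
    θ, v = θ0, 0.0
    for _ in 1:nsteps
        θ, v = sghmc_step(θ, v, ϵ, α)
    end
    return θ
end

Random.seed!(2)
θ = run_sghmc(5.0, 1e-2, 0.2, 5_000)
```

Setting α too low lets gradient noise accumulate in the momentum and the chain heats up; setting it too high turns the sampler back into sluggish Langevin-like dynamics, which is the stability/speed trade-off the tuning guidance below describes.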
# SGHMC with very small learning rate and longer chain
chain_sghmc = sample(model, SGHMC(learning_rate=0.000005, momentum_decay=0.2), 10000)
# Using the second half of the chain after burn-in
summarystats(chain_sghmc[5001:end])
Summary Statistics
  parameters      mean      std     mcse   ess_bulk   ess_tail     rhat  ⋯
      Symbol   Float64  Float64  Float64    Float64    Float64  Float64  ⋯
           μ    1.9239   0.1147   0.0280    17.5165    36.5332   1.0674  ⋯
           σ    1.4338   0.1111   0.0293    15.2635    43.5824   1.1500  ⋯
                                                           1 column omitted
For comparison, let’s sample the same model using standard HMC:
chain_hmc = sample(model, HMC(0.01, 10), 1000)
println("True values: μ = ", true_μ, ", σ = ", true_σ)
summarystats(chain_hmc)
True values: μ = 2.0, σ = 1.5
Summary Statistics
  parameters      mean      std     mcse   ess_bulk   ess_tail     rhat  ⋯
      Symbol   Float64  Float64  Float64    Float64    Float64  Float64  ⋯
           μ    5.7379   0.0000   0.0000        NaN        NaN      NaN  ⋯
           σ    0.1845   0.0000   0.0000        NaN        NaN      NaN  ⋯
                                                           1 column omitted
Compare the trace plots to see how the different samplers explore the posterior:
# Compare converged portions of the chains
p1 = plot(chain_sgld[5001:end][:μ], label="SGLD (after burn-in)", title="μ parameter traces")
hline!([true_μ], label="True value", linestyle=:dash, color=:red)
p2 = plot(chain_sghmc[5001:end][:μ], label="SGHMC (after burn-in)")
hline!([true_μ], label="True value", linestyle=:dash, color=:red)
p3 = plot(chain_hmc[:μ], label="HMC")
hline!([true_μ], label="True value", linestyle=:dash, color=:red)
plot(p1, p2, p3, layout=(3,1), size=(800,600))
The comparison (using the post-burn-in portions of the chains) shows that:
- SGLD exhibits slow convergence and high variance due to the injected noise; even with 10,000+ samples and half discarded as burn-in, the run above remains far from the true values (note the large R̂), illustrating how sensitive the sampler is to tuning
- SGHMC mixes somewhat better than SGLD thanks to the momentum term, and here recovers the true parameters reasonably well, but it still requires careful tuning and a burn-in period
- HMC typically converges quickly and explores the posterior efficiently from the start, which is why it is preferred for small to medium-sized problems; note, however, that the fixed-step HMC run above produced a stuck chain (zero standard deviation, NaN ESS), so its settings would also need adjustment in practice
Here’s a more complex example using Bayesian linear regression:
# Generate regression data
n_features = 3
n_samples = 100
X = randn(n_samples, n_features)
true_β = [0.5, -1.2, 2.1]
true_σ_noise = 0.3
y = X * true_β + true_σ_noise * randn(n_samples)
@model function linear_regression(X, y)
n_features = size(X, 2)
# Priors
β ~ MvNormal(zeros(n_features), 3 * I)
σ ~ truncated(Normal(0, 1); lower=0)
# Likelihood
y ~ MvNormal(X * β, σ^2 * I)
end
lr_model = linear_regression(X, y)
DynamicPPL.Model{typeof(linear_regression), (:X, :y), (), (), Tuple{Matrix{Float64}, Vector{Float64}}, Tuple{}, DynamicPPL.DefaultContext}(linear_regression, (X = [0.5021022262583117 -0.2163017270531885 -0.8799151436747112; -1.8257514762093627 -0.997310711530162 -0.1162077436234999; … ; -0.7183745448326679 -1.7727770719892872 0.2551774783721763; -1.039705390746218 -0.22634562222509952 1.5634916742483351], y = [-1.3426644104594059, -0.3084539652596656, -3.45973591418794, 4.859182414077877, 0.9092743421074609, 1.1352422264208615, 0.8867287190483546, 2.7119562589321276, 0.45537595630955163, -0.3300190196614182 … 0.2986368584892437, -2.5471155252763245, -2.6493391744329657, -2.32337783644628, -2.1144294002463173, 0.28980708828097207, -0.1773208935900904, -4.57675827800405, 2.2401162600255566, 2.447332194129415]), NamedTuple(), DynamicPPL.DefaultContext())
Sample using the stochastic gradient methods:
# Very conservative parameters for stability with longer chains
sgld_lr_stepsize = Turing.PolynomialStepsize(0.00002, 30000, 0.55)
chain_lr_sgld = sample(lr_model, SGLD(stepsize=sgld_lr_stepsize), 15000)
chain_lr_sghmc = sample(lr_model, SGHMC(learning_rate=0.000002, momentum_decay=0.3), 15000)
chain_lr_hmc = sample(lr_model, HMC(0.01, 10), 1000)
Chains MCMC chain (1000×14×1 Array{Float64, 3}):
Iterations = 1:1:1000
Number of chains = 1
Samples per chain = 1000
Wall duration = 1.57 seconds
Compute duration = 1.57 seconds
parameters = β[1], β[2], β[3], σ
internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, numerical_error, step_size, nom_step_size
Use `describe(chains)` for summary statistics and quantiles.
Compare the results to evaluate the performance of stochastic gradient samplers on a more complex model:
println("True β values: ", true_β)
println("True σ value: ", true_σ_noise)
println()
println("SGLD estimates (after burn-in):")
summarystats(chain_lr_sgld[7501:end])
True β values: [0.5, -1.2, 2.1]
True σ value: 0.3
SGLD estimates (after burn-in):
Summary Statistics
  parameters      mean      std     mcse   ess_bulk   ess_tail     rhat  ⋯
      Symbol   Float64  Float64  Float64    Float64    Float64  Float64  ⋯
        β[1]   0.9536   0.0050   0.0012    17.5148    26.3120   1.7430  ⋯
        β[2]   1.5811   0.0052   0.0013    17.0790    36.2978   1.7915  ⋯
        β[3]  -0.4055   0.0136   0.0035    15.7641    21.5733   2.1068  ⋯
           σ   2.2495   0.0326   0.0083    16.3483    21.5851   1.9978  ⋯
                                                           1 column omitted
The linear regression example demonstrates how demanding these samplers are on a more complex model:
- They require significantly longer chains (15,000 vs 1,000 for HMC)
- We discard the first half as burn-in, yet the SGLD estimates above are still far from the true parameters, with R̂ values well above 1, indicating the chain has not converged
- Even when estimates look plausible, they typically have higher variance than HMC
- Convergence diagnostics should always be examined carefully before trusting the results
Both samplers support different AD backends. For more information about automatic differentiation in Turing, see the Automatic Differentiation documentation.
using ADTypes
# ForwardDiff (default) - good for few parameters
sgld_forward = SGLD(stepsize=sgld_stepsize, adtype=AutoForwardDiff())
# ReverseDiff - better for many parameters
sgld_reverse = SGLD(stepsize=sgld_stepsize, adtype=AutoReverseDiff())
# Zygote - good for complex models
sgld_zygote = SGLD(stepsize=sgld_stepsize, adtype=AutoZygote())
SGLD{AutoZygote, PolynomialStepsize{Float64}}(PolynomialStepsize{Float64}(5.0e-5, 20000.0, 0.55), AutoZygote())
For SGLD:
- Use `PolynomialStepsize` with very small initial values (≤ 0.0001)
- Larger `b` values in `PolynomialStepsize(a, b, γ)` provide more stability
- The stepsize decreases as a / (b + t)^γ
- Recommended starting point: `PolynomialStepsize(0.0001, 10000, 0.55)`
- For unstable models: reduce `a` to 0.00001 or increase `b` to 50000
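To build intuition for how slowly this schedule decays, you can evaluate a / (b + t)^γ directly (plain arithmetic, independent of Turing):

```julia
# stepsize(t) = a / (b + t)^γ for the recommended starting point
a, b, γ = 0.0001, 10_000, 0.55
stepsize(t) = a / (b + t)^γ

stepsize(0)        # ≈ 6.3e-7: the effective initial step is already tiny
stepsize(10_000)   # smaller by a factor of 2^0.55 ≈ 1.46 once t reaches b
```

Because the step shrinks only when t becomes comparable to b, a large b effectively freezes the step size near its initial value for the first several thousand iterations, which is why increasing b stabilises the sampler.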
For SGHMC:
- Use extremely small learning rates (≤ 0.00001)
- Momentum decay (friction) typically between 0.1 and 0.5
- Higher momentum decay improves stability but slows convergence
- Recommended starting point: learning_rate=0.00001, momentum_decay=0.1
- For high-dimensional problems: increase momentum_decay to 0.3-0.5
Tuning Strategy:
1. Start with the recommended values and run a short chain (500-1000 samples)
2. If chains diverge or parameters explode, reduce the step size by a factor of 10
3. If mixing is too slow, carefully increase the step size by a factor of 2
4. Always validate against HMC/NUTS results when possible
Due to the high variance and slow convergence of stochastic gradient samplers, careful diagnostics are essential:
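As a minimal illustration of what those diagnostics measure, here is a didactic split-R̂ computation for a single chain. MCMCChains' `summarystats` reports this for you; the helper name `split_rhat` is an assumption made for this sketch:

```julia
using Random, Statistics

# Split the chain in half, treat the halves as two chains, and compare
# between-half and within-half variance (Gelman et al.'s split-R̂).
function split_rhat(x::AbstractVector)
    n = length(x) ÷ 2
    halves = (x[1:n], x[(end - n + 1):end])
    W = mean(var.(halves))                 # within-chain variance
    B = n * var(collect(mean.(halves)))    # between-chain variance
    varplus = (n - 1) / n * W + B / n
    return sqrt(varplus / W)
end

Random.seed!(3)
split_rhat(randn(10_000))                                # ≈ 1: stationary chain
split_rhat(randn(10_000) .+ range(0, 5; length=10_000))  # ≫ 1: drifting chain
```

A slowly drifting SGLD or SGHMC chain inflates the between-half variance relative to the within-half variance, which is exactly why the large R̂ values in the summaries above signal that those chains cannot yet be trusted.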
Stochastic gradient samplers in Turing.jl provide an interface to gradient-based MCMC methods with added stochasticity. While designed for large-scale problems, the current implementation uses full gradients, making them primarily useful for research or specialized applications. For most practical Bayesian inference tasks, standard samplers like HMC or NUTS will be more efficient and easier to tune.