Automatic Differentiation
Switching AD Modes
Turing currently supports four automatic differentiation (AD) backends for sampling: ForwardDiff for forward-mode AD; and ReverseDiff, Zygote, and Tracker for reverse-mode AD.
While Tracker
is still available, its use is discouraged due to a lack of active maintenance.
ForwardDiff
is automatically imported by Turing. To utilize Zygote
or ReverseDiff
for AD, users must explicitly import them with using Zygote
or using ReverseDiff
, alongside using Turing
.
As of Turing version v0.30, the global configuration flag for the AD backend has been removed in favour of AdTypes.jl
, allowing users to specify the AD backend for individual samplers independently.
Users can pass the adtype
keyword argument to the sampler constructor to select the desired AD backend, with the default being AutoForwardDiff(; chunksize=0)
.
For ForwardDiff
, pass adtype=AutoForwardDiff(; chunksize)
to the sampler constructor. A chunksize
of 0 permits the chunk size to be automatically determined. For more information regarding the selection of chunksize
, please refer to related section of ForwardDiff
's documentation.
For ReverseDiff
, pass adtype=AutoReverseDiff()
to the sampler constructor. An additional argument can be provided to AutoReverseDiff
to specify whether to to compile the tape only once and cache it for later use (false
by default, which means no caching tape). Be aware that the use of caching in certain types of models can lead to incorrect results and/or errors.
Compiled tapes should only be used if you are absolutely certain that the computation doesn't change between different executions of your model.
Thus, e.g., in the model definition and all im- and explicitly called functions in the model all loops should be of fixed size, and if
-statements should consistently execute the same branches.
For instance, if
-statements with conditions that can be determined at compile time or conditions that depend only on the data will always execute the same branches during sampling (if the data is constant throughout sampling and, e.g., no mini-batching is used).
However, if
-statements that depend on the model parameters can take different branches during sampling; hence, the compiled tape might be incorrect.
Thus you must not use compiled tapes when your model makes decisions based on the model parameters, and you should be careful if you compute functions of parameters that those functions do not have branching which might cause them to execute different code for different values of the parameter.
For Zygote
, pass adtype=AutoZygote()
to the sampler constructor.
And the previously used interface functions including ADBackend
, setadbackend
, setsafe
, setchunksize
, and setrdcache
are deprecated and removed.
Compositional Sampling with Differing AD Modes
Turing supports intermixed automatic differentiation methods for different variable spaces. The snippet below shows using ForwardDiff
to sample the mean (m
) parameter, and using ReverseDiff
for the variance (s
) parameter:
using Turing
using ReverseDiff
# Define a simple Normal model with unknown mean and variance.
@model function gdemo(x, y)
s² ~ InverseGamma(2, 3)
m ~ Normal(0, sqrt(s²))
x ~ Normal(m, sqrt(s²))
return y ~ Normal(m, sqrt(s²))
end
# Sample using Gibbs and varying autodiff backends.
c = sample(
gdemo(1.5, 2),
Gibbs(
HMC(0.1, 5, :m; adtype=AutoForwardDiff(; chunksize=0)),
HMC(0.1, 5, :s²; adtype=AutoReverseDiff(false)),
),
1000,
)
Chains MCMC chain (1000×3×1 Array{Float64, 3}):
Iterations = 1:1:1000
Number of chains = 1
Samples per chain = 1000
Wall duration = 4.91 seconds
Compute duration = 4.91 seconds
parameters = s², m
internals = lp
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat
e ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64
⋯
s² 2.1057 1.6297 0.1243 183.4312 273.2658 1.0071
⋯
m 1.0908 0.7491 0.0695 123.2781 120.7066 1.0177
⋯
1 column om
itted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
s² 0.5950 1.0897 1.6096 2.5383 6.8398
m -0.3297 0.6017 1.0654 1.5179 2.8525
Generally, reverse-mode AD, for instance ReverseDiff
, is faster when sampling from variables of high dimensionality (greater than 20), while forward-mode AD, for instance ForwardDiff
, is more efficient for lower-dimension variables. This functionality allows those who are performance sensitive to fine tune their automatic differentiation for their specific models.
If the differentiation method is not specified in this way, Turing will default to using whatever the global AD backend is. Currently, this defaults to ForwardDiff
.