After defining a statistical model, in addition to sampling from its distributions, one may be interested in finding the parameter values that maximise, for instance, the posterior density or the likelihood. This is called mode estimation. Turing supports two mode estimation techniques: maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation.
To demonstrate mode estimation, let us load Turing and declare a model:
using Turing

@model function gdemo(x)
    s² ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s²))
    for i in eachindex(x)
        x[i] ~ Normal(m, sqrt(s²))
    end
end
gdemo (generic function with 2 methods)
Once the model is defined, we can construct a model instance as we normally would:
# Instantiate the gdemo model with our data.
data = [1.5, 2.0]
model = gdemo(data)
Finding the maximum a posteriori or maximum likelihood parameters is as simple as:
# Generate a MLE estimate.
mle_estimate = maximum_likelihood(model)

# Generate a MAP estimate.
map_estimate = maximum_a_posteriori(model)
ModeResult with maximized lp of -4.62
[0.9074074072016184, 1.1666666667008674]
The estimates are returned as instances of the ModeResult type. It has the fields values for the parameter values found and lp for the log probability at the optimum, as well as f for the objective function and optim_result for more detailed results of the optimisation procedure.
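For instance, these fields can be inspected directly (a minimal sketch; the exact printed form depends on the model):

# Parameter values at the optimum.
map_estimate.values

# Log probability (here, the log joint) at the optimum.
map_estimate.lp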
Under the hood, maximum_likelihood and maximum_a_posteriori use the Optimization.jl package, which provides a unified interface to many different optimisation packages. By default, Turing typically uses the LBFGS method from Optim.jl to find the mode estimate, but we can easily change that:
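For example, the result shown below can be produced by passing a different solver; the import lines here are assumptions (Nelder-Mead comes from Optim.jl and is made available to Optimization.jl through the OptimizationOptimJL wrapper):

# Assumed imports: OptimizationOptimJL wraps the Optim.jl solvers for Optimization.jl.
using OptimizationOptimJL
using Optim: NelderMead

maximum_likelihood(model, NelderMead())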
maximum_likelihood(model, NelderMead()) = [0.062492750810579485, 1.750000846443205]
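Similarly, the NLopt-based result below can be obtained by passing an NLopt solver; again, the import lines are assumptions (the NLopt solvers are made available to Optimization.jl through the OptimizationNLopt wrapper):

# Assumed imports: OptimizationNLopt wraps the NLopt solvers for Optimization.jl.
using OptimizationNLopt
using NLopt: LD_TNEWTON_PRECOND_RESTART

maximum_likelihood(model, LD_TNEWTON_PRECOND_RESTART())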
maximum_likelihood(model, LD_TNEWTON_PRECOND_RESTART()) = [0.062499999912016886, 1.7499999998263396]
The above are just two examples; Optimization.jl supports many more.
We can also help the optimisation by giving it a starting point we know is close to the final solution, or by specifying an automatic differentiation method:
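A call of that kind might look like the following sketch; the particular starting point and AD backend are illustrative assumptions (initial_params is the keyword discussed below, and adtype is assumed to select the AD backend):

using ADTypes: AutoForwardDiff   # assumed import; Turing may already provide this

maximum_likelihood(
    model;
    initial_params=[0.1, 2.0],   # assumed starting point: s² first, then m
    adtype=AutoForwardDiff(),    # assumed keyword selecting the AD backend
)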
ModeResult with maximized lp of -0.07
[0.062494553692639856, 1.7500042095865365]
When providing values to arguments like initial_params, the parameters are typically specified in the order in which they appear in the code of the model, so in this case first s² and then m. More precisely, it is the order returned by Turing.Inference.getparams(model, DynamicPPL.VarInfo(model)).
We can also do constrained optimisation, by providing either intervals within which the parameters must stay, or constraint functions that they need to respect. For instance, here is how one can find the MLE with the constraints that the variance must be less than 0.01 and the mean must be between -1 and 1:
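A sketch of such a call, using the lb and ub keywords described below (the bound vectors are ordered as s², then m):

maximum_likelihood(
    model;
    lb=[0.0, -1.0],   # lower bounds: s² ≥ 0, m ≥ -1
    ub=[0.01, 1.0],   # upper bounds: s² ≤ 0.01, m ≤ 1
)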
ModeResult with maximized lp of -59.73
[0.009999999897804623, 0.9999999958099894]
The arguments for lower (lb) and upper (ub) bounds follow the arguments of Optimization.OptimizationProblem, as do other parameters for providing constraints, such as cons. Any extraneous keyword arguments given to maximum_likelihood or maximum_a_posteriori are passed to Optimization.solve. Commonly useful ones include maxiters, for controlling the maximum number of iterations, and abstol and reltol, for the absolute and relative convergence tolerances:
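For instance (the particular values here are illustrative assumptions):

maximum_likelihood(model; maxiters=10_000, abstol=1e-8, reltol=1e-8)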
For more details, such as a full list of possible arguments, we encourage the reader to consult the docstring of Turing.Optimisation.estimate_mode, which maximum_likelihood and maximum_a_posteriori call under the hood, as well as the documentation of Optimization.jl.
Analyzing your mode estimate
Turing extends several methods from StatsBase that can be used to analyze your mode estimation results. Methods implemented include vcov, informationmatrix, coeftable, params, and coef, among others.
For example, let’s examine our ML estimate from above using coeftable:
using StatsBase: coeftable
coeftable(mle_estimate)
──────────────────────────────────────────────────────────────────────
       Coef.  Std. Error        z     Pr(>|z|)    Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────
s²    0.0625    0.0625    1.0      0.317311     -0.0599977   0.184998
m     1.75      0.176777  9.89949  4.18383e-23   1.40352     2.09648
──────────────────────────────────────────────────────────────────────
Standard errors are calculated from the Fisher information matrix (inverse Hessian of the log likelihood or log joint). Note that standard errors calculated in this way may not always be appropriate for MAP estimates, so please be cautious in interpreting them.
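As a sketch, the covariance matrix and the standard errors shown in the table can be recovered with the StatsBase methods listed above (the use of diag here to extract them is an assumption):

using StatsBase: vcov
using LinearAlgebra: diag

vcov(mle_estimate)               # estimated covariance matrix of the parameters
sqrt.(diag(vcov(mle_estimate)))  # standard errors, matching the table above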
Sampling with the MAP/MLE as initial states
You can begin sampling your chain from an MLE/MAP estimate by extracting the vector of parameter values and providing it to the sample function with the keyword initial_params. For example, here is how to sample from the full posterior using the MAP estimate as the starting point:
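A sketch of such a call; the choice of sampler and the way the parameter vector is extracted from the ModeResult are assumptions:

map_estimate = maximum_a_posteriori(model)

# Use the MAP parameter values (as a plain vector) as the chain's starting point.
# The .values.array access is an assumption about the ModeResult layout.
chain = sample(model, NUTS(), 1_000; initial_params=map_estimate.values.array)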