The Probability Interface
The easiest way to manipulate and query DynamicPPL models is via the DynamicPPL probability interface.
Let's use a simple model of normally-distributed data as an example.
using DynamicPPL
using Distributions
using FillArrays
using LinearAlgebra
using Random
@model function gdemo(n)
μ ~ Normal(0, 1)
x ~ MvNormal(Fill(μ, n), I)
return nothing
end
We generate some data using μ = 0
:
Random.seed!(1776)
dataset = randn(100)
Conditioning and Deconditioning
Bayesian models can be transformed with two main operations, conditioning and deconditioning (also known as marginalization). Conditioning takes a variable and fixes its value as known. We do this by passing a model and a collection of conditioned variables to |
or its alias condition
:
model = gdemo(length(dataset)) | (x=dataset, μ=0)
This operation can be reversed by applying decondition
:
decondition(model)
We can also decondition only some of the variables:
decondition(model, :μ)
Sometimes it is helpful to define convenience functions for conditioning on some variable(s). For instance, in this example we might want to define a version of gdemo
that conditions on some observations of x
:
gdemo(x::AbstractVector{<:Real}) = gdemo(length(x)) | (; x)
For illustrative purposes, however, we do not use this function in the examples below.
Probabilities and Densities
We often want to calculate the (unnormalized) probability density for an event. This probability might be a prior, a likelihood, or a posterior (joint) density. DynamicPPL provides convenient functions for this. For example, we can calculate the joint probability of a set of samples (here drawn from the prior) with logjoint
:
model = gdemo(length(dataset)) | (x=dataset,)
Random.seed!(124)
sample = rand(model)
logjoint(model, sample)
-181.7247437162069
For models with many variables rand(model)
can be prohibitively slow since it returns a NamedTuple
of samples from the prior distribution of the unconditioned variables. We recommend working with samples of type DataStructures.OrderedDict
in this case:
using DataStructures
Random.seed!(124)
sample_dict = rand(OrderedDict, model)
logjoint(model, sample_dict)
-181.7247437162069
The prior probability and the likelihood of a set of samples can be calculated with the functions loglikelihood
and logjoint
, respectively:
logjoint(model, sample) ≈ loglikelihood(model, sample) + logprior(model, sample)
true
logjoint(model, sample_dict) ≈
loglikelihood(model, sample_dict) + logprior(model, sample_dict)
true
Example: Cross-validation
To give an example of the probability interface in use, we can use it to estimate the performance of our model using cross-validation. In cross-validation, we split the dataset into several equal parts. Then, we choose one of these sets to serve as the validation set. Here, we measure fit using the cross entropy (Bayes loss).[1] (For the sake of simplicity, in the following code, we enforce that nfolds
must divide the number of data points. For a more competent implementation, see MLUtils.jl.)
# Calculate the train/validation splits across `nfolds` partitions, assume `length(dataset)` divides `nfolds`
function kfolds(dataset::Array{<:Real}, nfolds::Int)
fold_size, remaining = divrem(length(dataset), nfolds)
if remaining != 0
error("The number of folds must divide the number of data points.")
end
first_idx = firstindex(dataset)
last_idx = lastindex(dataset)
splits = map(0:(nfolds - 1)) do i
start_idx = first_idx + i * fold_size
end_idx = start_idx + fold_size
train_set_indices = [first_idx:(start_idx - 1); end_idx:last_idx]
return (view(dataset, train_set_indices), view(dataset, start_idx:(end_idx - 1)))
end
return splits
end
function cross_val(
dataset::Vector{<:Real};
nfolds::Int=5,
nsamples::Int=1_000,
rng::Random.AbstractRNG=Random.default_rng(),
)
# Initialize `loss` in a way such that the loop below does not change its type
model = gdemo(1) | (x=[first(dataset)],)
loss = zero(logjoint(model, rand(rng, model)))
for (train, validation) in kfolds(dataset, nfolds)
# First, we train the model on the training set, i.e., we obtain samples from the posterior.
# For normally-distributed data, the posterior can be computed in closed form.
# For general models, however, typically samples will be generated using MCMC with Turing.
posterior = Normal(mean(train), 1)
samples = rand(rng, posterior, nsamples)
# Evaluation on the validation set.
validation_model = gdemo(length(validation)) | (x=validation,)
loss += sum(samples) do sample
logjoint(validation_model, (μ=sample,))
end
end
return loss
end
cross_val(dataset)
-212760.30282411768
- 1See ParetoSmooth.jl for a faster and more accurate implementation of cross-validation than the one provided here.