Probabilistic Modelling using the Infinite Mixture Model

In many applications it is desirable to allow the model to adjust its complexity to the amount of data. Consider for example the task of assigning objects into clusters or groups. This task often involves the specification of the number of groups. However, often times it is not known beforehand how many groups exist. Moreover, in some applictions, e.g. modelling topics in text documents or grouping species, the number of examples per group is heavy tailed. This makes it impossible to predefine the number of groups and requiring the model to form new groups when data points from previously unseen groups are observed.

A natural approach for such applications is the use of non-parametric models. This tutorial will introduce how to use the Dirichlet process in a mixture of infinitely many Gaussians using Turing. For further information on Bayesian nonparametrics and the Dirichlet process we refer to the introduction by Zoubin Ghahramani and the book “Fundamentals of Nonparametric Bayesian Inference” by Subhashis Ghosal and Aad van der Vaart.

using Turing

Mixture Model

Before introducing infinite mixture models in Turing, we will briefly review the construction of finite mixture models. Subsequently, we will define how to use the Chinese restaurant process construction of a Dirichlet process for non-parametric clustering.

Two-Component Model

First, consider the simple case of a mixture model with two Gaussian components with fixed covariance. The generative process of such a model can be written as:

\[\begin{equation*} \begin{aligned} \pi_1 &\sim \mathrm{Beta}(a, b) \\ \pi_2 &= 1-\pi_1 \\ \mu_1 &\sim \mathrm{Normal}(\mu_0, \Sigma_0) \\ \mu_2 &\sim \mathrm{Normal}(\mu_0, \Sigma_0) \\ z_i &\sim \mathrm{Categorical}(\pi_1, \pi_2) \\ x_i &\sim \mathrm{Normal}(\mu_{z_i}, \Sigma) \end{aligned} \end{equation*}\]

where \(\pi_1, \pi_2\) are the mixing weights of the mixture model, i.e. \(\pi_1 + \pi_2 = 1\), and \(z_i\) is a latent assignment of the observation \(x_i\) to a component (Gaussian).

We can implement this model in Turing for 1D data as follows:

@model function two_model(x)
    # Hyper-parameters
    μ0 = 0.0
    σ0 = 1.0

    # Draw weights.
    π1 ~ Beta(1, 1)
    π2 = 1 - π1

    # Draw locations of the components.
    μ1 ~ Normal0, σ0)
    μ2 ~ Normal0, σ0)

    # Draw latent assignment.
    z ~ Categorical([π1, π2])

    # Draw observation from selected component.
    if z == 1
        x ~ Normal1, 1.0)
    else
        x ~ Normal2, 1.0)
    end
end
two_model (generic function with 2 methods)

Finite Mixture Model

If we have more than two components, this model can elegantly be extended using a Dirichlet distribution as prior for the mixing weights \(\pi_1, \dots, \pi_K\). Note that the Dirichlet distribution is the multivariate generalization of the beta distribution. The resulting model can be written as:

\[ \begin{align} (\pi_1, \dots, \pi_K) &\sim Dirichlet(K, \alpha) \\ \mu_k &\sim \mathrm{Normal}(\mu_0, \Sigma_0), \;\; \forall k \\ z &\sim Categorical(\pi_1, \dots, \pi_K) \\ x &\sim \mathrm{Normal}(\mu_z, \Sigma) \end{align} \]

which resembles the model in the Gaussian mixture model tutorial with a slightly different notation.

Infinite Mixture Model

The question now arises, is there a generalization of a Dirichlet distribution for which the dimensionality \(K\) is infinite, i.e. \(K = \infty\)?

But first, to implement an infinite Gaussian mixture model in Turing, we first need to load the Turing.RandomMeasures module. RandomMeasures contains a variety of tools useful in nonparametrics.

using Turing.RandomMeasures

We now will utilize the fact that one can integrate out the mixing weights in a Gaussian mixture model allowing us to arrive at the Chinese restaurant process construction. See Carl E. Rasmussen: The Infinite Gaussian Mixture Model, NIPS (2000) for details.

In fact, if the mixing weights are integrated out, the conditional prior for the latent variable \(z\) is given by:

\[ p(z_i = k \mid z_{\not i}, \alpha) = \frac{n_k + \alpha K}{N - 1 + \alpha} \]

where \(z_{\not i}\) are the latent assignments of all observations except observation \(i\). Note that we use \(n_k\) to denote the number of observations at component \(k\) excluding observation \(i\). The parameter \(\alpha\) is the concentration parameter of the Dirichlet distribution used as prior over the mixing weights.

Chinese Restaurant Process

To obtain the Chinese restaurant process construction, we can now derive the conditional prior if \(K \rightarrow \infty\).

For \(n_k > 0\) we obtain:

\[ p(z_i = k \mid z_{\not i}, \alpha) = \frac{n_k}{N - 1 + \alpha} \]

and for all infinitely many clusters that are empty (combined) we get:

\[ p(z_i = k \mid z_{\not i}, \alpha) = \frac{\alpha}{N - 1 + \alpha} \]

Those equations show that the conditional prior for component assignments is proportional to the number of such observations, meaning that the Chinese restaurant process has a rich get richer property.

To get a better understanding of this property, we can plot the cluster choosen by for each new observation drawn from the conditional prior.

# Concentration parameter.
α = 10.0

# Random measure, e.g. Dirichlet process.
rpm = DirichletProcess(α)

# Cluster assignments for each observation.
z = Vector{Int}()

# Maximum number of observations we observe.
Nmax = 500

for i in 1:Nmax
    # Number of observations per cluster.
    K = isempty(z) ? 0 : maximum(z)
    nk = Vector{Int}(map(k -> sum(z .== k), 1:K))

    # Draw new assignment.
    push!(z, rand(ChineseRestaurantProcess(rpm, nk)))
end
using Plots

# Plot the cluster assignments over time 
@gif for i in 1:Nmax
    scatter(
        collect(1:i),
        z[1:i];
        markersize=2,
        xlabel="observation (i)",
        ylabel="cluster (k)",
        legend=false,
    )
end
GKS: cannot open display - headless operation mode active
[ Info: Saved animation to /tmp/jl_e5yV0HMNYd.gif