Unsupervised Learning using Bayesian Mixture Models

The following tutorial illustrates the use of Turing for clustering data using a Bayesian mixture model. The aim of this task is to infer a latent grouping (hidden structure) from unlabelled data.

Synthetic Data

We generate a synthetic dataset of \(N = 60\) two-dimensional points \(x_i \in \mathbb{R}^2\) drawn from a Gaussian mixture model. For simplicity, we use \(K = 2\) clusters with

  • equal weights, i.e., we use mixture weights \(w = [0.5, 0.5]\), and
  • isotropic Gaussian distributions of the points in each cluster.

More concretely, we use the Gaussian distributions \(\mathcal{N}([\mu_k, \mu_k]^\mathsf{T}, I)\) with parameters \(\mu_1 = -3.5\) and \(\mu_2 = 0.5\).

using Distributions
using FillArrays
using StatsPlots

using LinearAlgebra
using Random

# Set a random seed.
Random.seed!(3)

# Define Gaussian mixture model.
w = [0.5, 0.5]
μ = [-3.5, 0.5]
mixturemodel = MixtureModel([MvNormal(Fill(μₖ, 2), I) for μₖ in μ], w)

# We draw the data points.
N = 60
x = rand(mixturemodel, N);

The following plot shows the dataset.

scatter(x[1, :], x[2, :]; legend=false, title="Synthetic Dataset")

Gaussian Mixture Model in Turing

We are interested in recovering the grouping from the dataset. More precisely, we want to infer the mixture weights, the parameters \(\mu_1\) and \(\mu_2\), and the assignment of each datum to a cluster for the generative Gaussian mixture model.

In a Bayesian Gaussian mixture model with \(K\) components, each data point \(x_i\) (\(i = 1,\ldots,N\)) is generated according to the following generative process.

First we draw the model parameters, i.e., in our example we draw parameters \(\mu_k\) for the mean of the isotropic normal distributions and the mixture weights \(w\) of the \(K\) clusters. We use standard normal distributions as priors for \(\mu_k\) and a Dirichlet distribution with parameters \(\alpha_1 = \cdots = \alpha_K = 1\) as prior for \(w\):
\[
\begin{aligned}
\mu_k &\sim \mathcal{N}(0, 1) \qquad (k = 1,\ldots,K)\\
w &\sim \operatorname{Dirichlet}(\alpha_1, \ldots, \alpha_K)
\end{aligned}
\]

After having constructed all the necessary model parameters, we can generate an observation by first selecting one of the clusters
\[
z_i \sim \operatorname{Categorical}(w) \qquad (i = 1,\ldots,N),
\]
and then drawing the datum accordingly, i.e., in our example drawing
\[
x_i \sim \mathcal{N}([\mu_{z_i}, \mu_{z_i}]^\mathsf{T}, I) \qquad (i = 1,\ldots,N).
\]

For more details on Gaussian mixture models, we refer to Christopher M. Bishop, Pattern Recognition and Machine Learning, Section 9.
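
To make the generative story concrete before turning to inference, here is a purely illustrative sketch that simulates it directly: it draws \(\mu\) and \(w\) from their priors and then samples assignments and observations. The variable names (μ_prior, w_prior, z, x_prior) are chosen for this example only and are not used elsewhere in the tutorial.

# Illustrative ancestral sampling from the generative process above (not part of the inference).
K = 2
μ_prior = rand(Normal(0, 1), K)     # μₖ ~ Normal(0, 1)
w_prior = rand(Dirichlet(K, 1.0))   # w ~ Dirichlet(1, ..., 1)
z = rand(Categorical(w_prior), N)   # zᵢ ~ Categorical(w)
x_prior = reduce(hcat, [rand(MvNormal(Fill(μ_prior[zᵢ], 2), I)) for zᵢ in z]);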

We specify the model with Turing.

using Turing

@model function gaussian_mixture_model(x)
    # Draw the parameters for each of the K=2 clusters from a standard normal distribution.
    K = 2
    μ ~ MvNormal(Zeros(K), I)

    # Draw the weights for the K clusters from a Dirichlet distribution with parameters αₖ = 1.
    w ~ Dirichlet(K, 1.0)
    # Alternatively, one could use a fixed set of weights.
    # w = fill(1/K, K)

    # Construct categorical distribution of assignments.
    distribution_assignments = Categorical(w)

    # Construct multivariate normal distributions of each cluster.
    D, N = size(x)
    distribution_clusters = [MvNormal(Fill(μₖ, D), I) for μₖ in μ]

    # Draw assignments for each datum and generate it from the multivariate normal distribution.
    k = Vector{Int}(undef, N)
    for i in 1:N
        k[i] ~ distribution_assignments
        x[:, i] ~ distribution_clusters[k[i]]
    end

    return k
end

model = gaussian_mixture_model(x);

We run an MCMC simulation to obtain an approximation of the posterior distribution of the parameters \(\mu\) and \(w\) and of the assignments \(k\). We use a Gibbs sampler that combines a particle Gibbs (PG) sampler for the discrete parameters (assignments \(k\)) with a Hamiltonian Monte Carlo (HMC) sampler for the continuous parameters (\(\mu\) and \(w\)). We generate multiple chains in parallel using multi-threading.

sampler = Gibbs(PG(100, :k), HMC(0.05, 10, :μ, :w))
nsamples = 150
nchains = 4
burn = 10
chains = sample(model, sampler, MCMCThreads(), nsamples, nchains; discard_initial=burn);

The sample() call above assumes that you have at least nchains threads available in your Julia instance. If you do not, the multiple chains will run sequentially, and you may notice a warning. For more information, see the Turing documentation on sampling multiple chains.
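
For example, one way to guard against this explicitly is to check the number of available threads first and fall back to sequential sampling via MCMCSerial if there are too few. This is a minimal sketch of such a guard, not part of the original workflow:

# Sample in parallel only if enough threads are available; otherwise sample the chains sequentially.
if Threads.nthreads() >= nchains
    chains = sample(model, sampler, MCMCThreads(), nsamples, nchains; discard_initial=burn)
else
    chains = sample(model, sampler, MCMCSerial(), nsamples, nchains; discard_initial=burn)
end;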

Inferred Mixture Model

After sampling we can visualize the trace and density of the parameters of interest.

We consider the samples of the location parameters \(\mu_1\) and \(\mu_2\) for the two clusters.

plot(chains[["μ[1]", "μ[2]"]]; legend=true)

It can happen that the modes of \(\mu_1\) and \(\mu_2\) switch between chains, a problem known as label switching. For more information see the Stan documentation. This happens because either model parameter \(\mu_k\) can be assigned to either of the true cluster means, and this assignment need not be consistent between chains.

That is, the posterior is fundamentally multimodal, and different chains can end up in different modes, complicating inference. One solution here is to enforce an ordering on our \(\mu\) vector, requiring \(\mu_k > \mu_{k-1}\) for all \(k > 1\). Bijectors.jl provides an easy transformation (ordered()) for this purpose:

using Bijectors

@model function gaussian_mixture_model_ordered(x)
    # Draw the parameters for each of the K=2 clusters from a standard normal distribution,
    # constrained such that μ[1] < μ[2].
    K = 2
    μ ~ Bijectors.ordered(MvNormal(Zeros(K), I))

    # Draw the weights for the K clusters from a Dirichlet distribution with parameters αₖ = 1.
    w ~ Dirichlet(K, 1.0)
    # Alternatively, one could use a fixed set of weights.
    # w = fill(1/K, K)

    # Construct categorical distribution of assignments.
    distribution_assignments = Categorical(w)

    # Construct multivariate normal distributions of each cluster.
    D, N = size(x)
    distribution_clusters = [MvNormal(Fill(μₖ, D), I) for μₖ in μ]

    # Draw assignments for each datum and generate it from the multivariate normal distribution.
    k = Vector{Int}(undef, N)
    for i in 1:N
        k[i] ~ distribution_assignments
        x[:, i] ~ distribution_clusters[k[i]]
    end

    return k
end

model = gaussian_mixture_model_ordered(x);

Now, re-running our model, we can see that the assigned means are consistent across chains:

chains = sample(model, sampler, MCMCThreads(), nsamples, nchains; discard_initial=burn);
plot(chains[["μ[1]", "μ[2]"]]; legend=true)

We also inspect the samples of the mixture weights \(w\).

plot(chains[["w[1]", "w[2]"]]; legend=true)

As the distributions of the samples for the parameters \(\mu_1\), \(\mu_2\), \(w_1\), and \(w_2\) are unimodal, we can safely visualize the density region of our model using the posterior mean values.

# Model with mean of samples as parameters.
μ_mean = [mean(chains, "μ[$i]") for i in 1:2]
w_mean = [mean(chains, "w[$i]") for i in 1:2]
mixturemodel_mean = MixtureModel([MvNormal(Fill(μₖ, 2), I) for μₖ in μ_mean], w_mean)
contour(
    range(-7.5, 3; length=1_000),
    range(-6.5, 3; length=1_000),
    (x, y) -> logpdf(mixturemodel_mean, [x, y]);
    widen=false,
)
scatter!(x[1, :], x[2, :]; legend=false, title="Synthetic Dataset")
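
Finally, since the aim of the tutorial is to infer the latent grouping, we can also extract a point estimate of the assignments from the chain. As a sketch, the posterior mean of each assignment k[i] (computed with the same mean(chains, ...) pattern as above) can be used to colour the data points:

# Posterior mean of the assignment of each datum (a value between 1 and 2).
assignments = [mean(chains, "k[$i]") for i in 1:N]
scatter(
    x[1, :],
    x[2, :];
    legend=false,
    title="Assignments on Synthetic Dataset",
    zcolor=assignments,
)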