Distributions and the Jacobian

This series of articles will seek to motivate the Bijectors.jl package, which provides the tools for transforming distributions in the Turing.jl probabilistic programming language.

It assumes the following setup:

import Random
Random.seed!(468);

using Distributions: Normal, LogNormal, logpdf, Distributions
using Plots: histogram

Sampling from a distribution

To sample from a distribution (as defined in Distributions.jl), we can use the rand function. Let’s sample from a normal distribution and then plot a histogram of the samples.

samples = rand(Normal(), 5000)
histogram(samples, bins=50)

(Calling Normal() without any arguments, as we do here, gives us a normal distribution with mean 0 and standard deviation 1.) If you want to know the log probability density of observing any of the samples, you can use logpdf:

println("sample: $(samples[1])")
println("logpdf: $(logpdf(Normal(), samples[1]))")
sample: 0.04374853981619864
logpdf: -0.9198955005726975

The probability density function for the normal distribution with mean 0 and standard deviation 1 is

\[p(x) = \frac{1}{\sqrt{2\pi}} \exp{\left(-\frac{x^2}{2}\right)},\]

so we could also have calculated this manually using:

log(1 / sqrt(2π) * exp(-samples[1]^2 / 2))
-0.9198955005726974

(or more efficiently, -(samples[1]^2 + log2π) / 2, where log2π is from the IrrationalConstants.jl package).
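If you want to convince yourself that these expressions agree, a quick check is shown below. This assumes IrrationalConstants.jl is installed; the variable name manual_logpdf is just for illustration.

using IrrationalConstants: log2π

manual_logpdf = -(samples[1]^2 + log2π) / 2
manual_logpdf ≈ logpdf(Normal(), samples[1])   # should be `true`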

Sampling from a transformed distribution

Say that \(x\) is distributed according to Normal(), and we want to draw samples of \(y = \exp(x)\). Now, \(y\) is itself a random variable, and like any other random variable, will have a probability distribution, which we’ll call \(q(y)\).

In this specific case, the distribution of \(y\) is known as a log-normal distribution. For the purposes of this tutorial, let’s implement our own MyLogNormal distribution that we can sample from. (Distributions.jl already defines its own LogNormal, so we have to use a different name.) To do this, we need to overload Base.rand for our new distribution.

struct MyLogNormal <: Distributions.ContinuousUnivariateDistribution
    μ::Float64
    σ::Float64
end
MyLogNormal() = MyLogNormal(0.0, 1.0)

function Base.rand(rng::Random.AbstractRNG, d::MyLogNormal)
    exp(rand(rng, Normal(d.μ, d.σ)))
end

Now we can do the same as above:

samples_lognormal = rand(MyLogNormal(), 5000)
# Cut off the tail for clearer visualisation
histogram(samples_lognormal, bins=0:0.1:5; xlims=(0, 5))
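As a quick sanity check (not needed for anything that follows), we could plot samples from Distributions.jl’s own LogNormal in the same way; up to sampling noise, the two histograms should look essentially identical.

# Reference samples from the built-in log-normal distribution
samples_lognormal_ref = rand(LogNormal(), 5000)
histogram(samples_lognormal_ref, bins=0:0.1:5; xlims=(0, 5))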

How do we implement logpdf for our new distribution, though? Or in other words, if we observe a sample \(y\), how do we know what the probability of drawing that sample was?

Naively, we might think to just un-transform the variable y by reversing the exponential, i.e. taking the logarithm. We could then use the logpdf of the original distribution of x.

naive_logpdf(d::MyLogNormal, y) = logpdf(Normal(d.μ, d.σ), log(y))
naive_logpdf (generic function with 1 method)

We can compare this function against the logpdf implemented in Distributions.jl:

println("Sample   : $(samples_lognormal[1])")
println("Expected : $(logpdf(LogNormal(), samples_lognormal[1]))")
println("Actual   : $(naive_logpdf(MyLogNormal(), samples_lognormal[1]))")
Sample   : 2.2331001636281114
Expected : -2.0450477723405234
Actual   : -1.2416569444078478

Clearly this approach is not quite correct!

The derivative

The reason why this doesn’t work is because transforming a (continuous) distribution causes probability density to be stretched and otherwise moved around. For example, in the normal distribution, half of the probability density is between \(-\infty\) and \(0\), and half is between \(0\) and \(\infty\). When exponentiated (i.e. in the log-normal distribution), the first half of the density is mapped to the interval \((0, 1)\), and the second half to \((1, \infty)\).

This ‘explanation’ on its own does not really mean much, though. A more useful approach is to make things concrete by relating probability densities to actual probabilities. If we think of the normal distribution as a continuous curve, then what the probability density function \(p(x)\) really tells us is this: for any two points \(a\) and \(b\) (where \(a \leq b\)), the probability of drawing a sample between \(a\) and \(b\) is the corresponding area under the curve, i.e.

\[\int_a^b p(x) \, \mathrm{d}x.\]

For example, if \((a, b) = (-\infty, \infty)\), then the probability of drawing a sample between \(a\) and \(b\) is 1.

Let’s say that the probability density function of the log-normal distribution is \(q(y)\). Then, the area under the curve between the two points \(\exp(a)\) and \(\exp(b)\) is:

\[\int_{\exp(a)}^{\exp(b)} q(y) \, \mathrm{d}y.\]

This integral should be equal to the one above, because the probability of drawing from \([a, b]\) in the original distribution should be the same as the probability of drawing from \([\exp(a), \exp(b)]\) in the transformed distribution. The question we have to solve here is: how do we find a function \(q(y)\) such that this equality holds?
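Before answering that question, we can at least verify empirically that the two probabilities match, using the samples we already have. The endpoints a and b below are arbitrary; both fractions should agree up to Monte Carlo error (for these particular endpoints, somewhere around 0.53).

a, b = -1.0, 0.5
frac_x = count(x -> a <= x <= b, samples) / length(samples)
frac_y = count(y -> exp(a) <= y <= exp(b), samples_lognormal) / length(samples_lognormal)
println("P(a ≤ x ≤ b):           $frac_x")
println("P(exp(a) ≤ y ≤ exp(b)): $frac_y")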

We can approach this by making the substitution \(y = \exp(x)\) in the first integral (see Wikipedia for a refresher on substitutions in integrals, if needed). We have that:

\[\frac{\mathrm{d}y}{\mathrm{d}x} = \exp(x) = y \implies \mathrm{d}x = \frac{1}{y}\,\mathrm{d}y\]

and so

\[\int_{x=a}^{x=b} p(x) \, \mathrm{d}x = \int_{y=\exp(a)}^{y=\exp(b)} p(\log(y)) \frac{1}{y} \,\mathrm{d}y = \int_{\exp(a)}^{\exp(b)} q(y) \, \mathrm{d}y, \]

from which we can read off \(q(y) = p(\log(y)) / y\).

In contrast, when we implemented naive_logpdf

naive_logpdf(d::MyLogNormal, y) = logpdf(Normal(d.μ, d.σ), log(y))
naive_logpdf (generic function with 1 method)

that was the equivalent of saying that \(q(y) = p(\log(y))\). We left out a factor of \(1/y\)!

Indeed, now we can define the correct logpdf function. Since everything is a logarithm here, instead of multiplying by \(1/y\) we subtract \(\log(y)\):

Distributions.logpdf(d::MyLogNormal, y) = logpdf(Normal(d.μ, d.σ), log(y)) - log(y)

and check that it works:

println("Sample   : $(samples_lognormal[1])")
println("Expected : $(logpdf(LogNormal(), samples_lognormal[1]))")
println("Actual   : $(logpdf(MyLogNormal(), samples_lognormal[1]))")
Sample   : 2.2331001636281114
Expected : -2.0450477723405234
Actual   : -2.0450477723405234

The same process can be applied to any kind of (invertible) transformation. If we have some transformation from \(x\) to \(y\), and the probability density functions of \(x\) and \(y\) are \(p(x)\) and \(q(y)\) respectively, then we have a general formula that:

\[q(y) = p(x) \left| \frac{\mathrm{d}x}{\mathrm{d}y} \right|.\]

In this case, we had \(y = \exp(x)\), so \(\mathrm{d}x/\mathrm{d}y = 1/y\). (This equation is (11.5) in Bishop’s textbook.)

Note

The absolute value here takes care of the case where the transformation \(y = f(x)\) is a decreasing function, i.e., \(f(a) > f(b)\) when \(a < b\). You can try this out with the transformation \(y = -\exp(x)\): if \(a < b\), then \(-\exp(a) > -\exp(b)\), and so you have to swap the integration limits to ensure that the integral comes out positive.

Note that \(\mathrm{d}y/\mathrm{d}x\) is equal to \((\mathrm{d}x/\mathrm{d}y)^{-1}\), so the formula above can also be written as:

\[q(y) \left| \frac{\mathrm{d}y}{\mathrm{d}x} \right| = p(x).\]
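To see this formula at work with a different transformation (a small illustration, not part of the derivation above): if \(x\) is distributed according to Normal() and \(y = 2x + 1\), then \(y\) is normally distributed with mean 1 and standard deviation 2, i.e. Normal(1, 2). Here \(\mathrm{d}x/\mathrm{d}y = 1/2\), so \(q(y) = p(x)/2\), which in log terms is a subtraction of \(\log(2)\):

x = samples[2]    # any standard normal sample will do
y = 2x + 1
# log q(y) = log p(x) - log(2)
logpdf(Normal(1, 2), y) ≈ logpdf(Normal(), x) - log(2)   # should be `true`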

The Jacobian

In general, we may have transforms that act on multivariate distributions: for example, something mapping \(p(x_1, x_2)\) to \(q(y_1, y_2)\). In this case, the derivative in the rules above has to be replaced by the determinant of what is known as the Jacobian matrix:

\[\mathbf{J} = \begin{pmatrix} \partial y_1/\partial x_1 & \partial y_1/\partial x_2 \\ \partial y_2/\partial x_1 & \partial y_2/\partial x_2 \end{pmatrix}.\]

This allows us to write the direct generalisation as:

\[q(y_1, y_2) \left| \det(\mathbf{J}) \right| = p(x_1, x_2),\]

or equivalently,

\[q(y_1, y_2) = p(x_1, x_2) \left| \det(\mathbf{J}^{-1}) \right|,\]

where \(\mathbf{J}^{-1}\) is the inverse of the Jacobian matrix. This is the same as equation (11.9) in Bishop.

Note

Instead of inverting the original Jacobian matrix to get \(\mathbf{J}^{-1}\), we could also use the Jacobian of the inverse function:

\[\mathbf{J}_\text{inv} = \begin{pmatrix} \partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\ \partial x_2/\partial y_1 & \partial x_2/\partial y_2 \end{pmatrix}.\]

As it turns out, these are entirely equivalent: the Jacobian of the inverse function is the inverse of the original Jacobian matrix.
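A quick way to convince yourself of this is to look at a linear map, where the Jacobians are constant matrices. The example below is purely illustrative: for \(\mathbf{y} = A\mathbf{x}\) the Jacobian is \(A\) itself, the Jacobian of the inverse map \(\mathbf{x} = A^{-1}\mathbf{y}\) is \(A^{-1}\), and their determinants are reciprocals of one another.

using LinearAlgebra: det

A = [1.0 2.0; 0.5 3.0]       # an arbitrary invertible matrix, playing the role of J
J_inv = inv(A)               # Jacobian of the inverse map
det(A) * det(J_inv) ≈ 1.0    # should be `true`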

The rest of this section will be devoted to an example to show that this works, and contains some slightly less pretty mathematics. If you are already suitably convinced by this stage, then you can skip the rest of this section. (Or if you prefer something more formal, the Wikipedia article on integration by substitution discusses the multivariate case as well.)

An example: the Box–Muller transform

A motivating example where one might like to use a Jacobian is the Box–Muller transform, which is a technique for sampling from a normal distribution.

The Box–Muller transform works by first sampling two random variables from the uniform distribution between 0 and 1:

\[\begin{align} x_1 &\sim U(0, 1) \\ x_2 &\sim U(0, 1). \end{align}\]

Both of these have a probability density function of \(p(x) = 1\) for \(0 < x \leq 1\), and 0 otherwise. Because they are independent, we can write that

\[p(x_1, x_2) = p(x_1) p(x_2) = \begin{cases} 1 & \text{if } 0 < x_1 \leq 1 \text{ and } 0 < x_2 \leq 1, \\ 0 & \text{otherwise}. \end{cases}\]

The next step is to perform the transforms

\[\begin{align} y_1 &= \sqrt{-2 \log(x_1)} \cos(2\pi x_2); \\ y_2 &= \sqrt{-2 \log(x_1)} \sin(2\pi x_2), \end{align}\]

and it turns out that with these transforms, both \(y_1\) and \(y_2\) are independent and normally distributed with mean 0 and standard deviation 1, i.e.

\[q(y_1, y_2) = \frac{1}{2\pi} \exp{\left(-\frac{y_1^2}{2}\right)} \exp{\left(-\frac{y_2^2}{2}\right)}.\]

How can we show that this is the case?

There are many ways to work out the required calculus. Some are more elegant and some rather less so! One of the less headache-inducing ways is to define the intermediate variables:

\[r = \sqrt{-2 \log(x_1)}; \quad \theta = 2\pi x_2,\]

from which we can see that \(y_1 = r\cos\theta\) and \(y_2 = r\sin\theta\), and hence

\[\begin{align} x_1 &= \exp{\left(-\frac{r^2}{2}\right)} = \exp{\left(-\frac{y_1^2}{2}\right)}\exp{\left(-\frac{y_2^2}{2}\right)}; \\ x_2 &= \frac{\theta}{2\pi} = \frac{1}{2\pi} \, \arctan\left(\frac{y_2}{y_1}\right). \end{align}\]
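As a quick numerical sanity check of the expression for \(x_1\) (purely illustrative; the variable names below are arbitrary):

u1, u2 = 0.3, 0.8                          # a point in the unit square
z1 = sqrt(-2 * log(u1)) * cos(2π * u2)     # y₁
z2 = sqrt(-2 * log(u1)) * sin(2π * u2)     # y₂
exp(-(z1^2 + z2^2) / 2) ≈ u1               # should be `true`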

These expressions let us obtain the requisite partial derivatives in a way that doesn’t involve too much algebra. As an example, we have

\[\frac{\partial x_1}{\partial y_1} = -y_1 \exp{\left(-\frac{y_1^2}{2}\right)}\exp{\left(-\frac{y_2^2}{2}\right)} = -y_1 x_1,\]

(where we used the chain rule on the exponential), and

\[\frac{\partial x_2}{\partial y_1} = \frac{1}{2\pi} \left(\frac{1}{1 + (y_2/y_1)^2}\right) \left(-\frac{y_2}{y_1^2}\right),\]

(where we used the chain rule, and the derivative \(\mathrm{d}(\arctan(a))/\mathrm{d}a = 1/(1 + a^2)\)).
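If you would rather not trust the algebra, a rough finite-difference check of this partial derivative is sketched below (the helper x2_of, the test point, the step size, and the tolerance are all arbitrary choices for illustration):

x2_of(y1, y2) = atan(y2 / y1) / (2π)       # x₂ as a function of (y₁, y₂)
y1, y2 = 0.7, -0.3                         # an arbitrary test point
h = 1e-6
fd = (x2_of(y1 + h, y2) - x2_of(y1, y2)) / h
analytic = (1 / (2π)) * (1 / (1 + (y2 / y1)^2)) * (-y2 / y1^2)
isapprox(fd, analytic; rtol=1e-4)          # should be `true`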

Since we have computed the partial derivatives of \(x_1\) and \(x_2\) with respect to \(y_1\) and \(y_2\), the matrix we are assembling is the Jacobian of the inverse function which, as noted above, is the same as \(\mathbf{J}^{-1}\). Putting it together, we have:

\[\mathbf{J}^{-1} = \begin{pmatrix} -y_1 x_1 & -y_2 x_1 \\ -cy_2/y_1^2 & c/y_1 \\ \end{pmatrix},\]

where \(c = [2\pi(1 + (y_2/y_1)^2)]^{-1}\). The determinant of this matrix is

\[\begin{align} \det(\mathbf{J}^{-1}) &= -cx_1 - cx_1(y_2/y_1)^2 \\ &= -cx_1\left[1 + \left(\frac{y_2}{y_1}\right)^2\right] \\ &= -\frac{1}{2\pi} x_1 \\ &= -\frac{1}{2\pi}\exp{\left(-\frac{y_1^2}{2}\right)}\exp{\left(-\frac{y_2^2}{2}\right)}. \end{align}\]

Coming back to our probability density, we have that

\[\begin{align} q(y_1, y_2) &= p(x_1, x_2) \cdot |\det(\mathbf{J}^{-1})| \\ &= \frac{1}{2\pi}\exp{\left(-\frac{y_1^2}{2}\right)}\exp{\left(-\frac{y_2^2}{2}\right)}, \end{align}\]

as desired.

Note

We haven’t yet explicitly accounted for the fact that \(p(x_1, x_2)\) is 0 if either \(x_1\) or \(x_2\) are outside the range \((0, 1]\). For example, if this constraint on \(x_1\) and \(x_2\) were to result in inaccessible values of \(y_1\) or \(y_2\), then \(q(y_1, y_2)\) should be 0 for those values. Formally, for the transformation \(f: X \to Y\) where \(X\) is the unit square (i.e. \(0 < x_1, x_2 \leq 1\)), \(q(y_1, y_2)\) should only take the above value for the image of \(f\), and anywhere outside of the image it should be 0.

In our case, the \(\sqrt{-2\log(x_1)}\) factor in the transform varies between \(0\) and \(\infty\) (since \(0 < x_1 \leq 1\)), and the \(\cos(2\pi x_2)\) factor ranges from \(-1\) to \(1\). Hence \(y_1\), which is the product of these two factors, ranges from \(-\infty\) to \(\infty\), and likewise for \(y_2\). So the image of \(f\) is the entire real plane, and we don’t have to worry about this.
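To round the example off, we can implement the Box–Muller transform directly and check empirically that the resulting samples look normally distributed (a quick illustration; the exact histogram will vary from run to run):

xs1, xs2 = rand(5000), rand(5000)
ys1 = sqrt.(-2 .* log.(xs1)) .* cos.(2π .* xs2)
histogram(ys1, bins=50)

The same can be done for the second output, \(y_2\), using sin instead of cos.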

Having seen the theory that underpins how distributions can be transformed, let’s now turn to how this is implemented in the Turing ecosystem.
