KLMinScoreGradDescent

Description

This algorithms aim to minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence via stochastic gradient descent in the space of parameters. Specifically, it uses the the score gradient estimator, which is similar to the algorithm that was originally referred to as black-box variational inference (BBVI; [RGB2014][WW2013]). (The term BBVI has also recently been used to refer to the more general setup of maximizing the ELBO in parameter space. We are using the more narrow definition, which restricts to the use of the score gradient.) However, instead of using the vanilla score gradient estimator, we differentiate the "VarGrad" objective[RBNRA2020], which results in the score gradient variance-reduced by the leave-one-out control variate[SK2014][KvHW2019]. KLMinScoreGradDescent is also an alias of BBVI.

AdvancedVI.KLMinScoreGradDescentType
KLMinScoreGradDescent(adtype; optimizer, n_samples, averager, operator)

KL divergence minimization by running stochastic gradient descent with the score gradient in the Euclidean space of variational parameters.

Note

If a <:MvLocationScale variational family is used, for operator, IdentityOperator should be avoided since optimization can result in a singular scale matrix. Instead, consider using ClipScale.

Arguments

  • adtype: Automatic differentiation backend.

Keyword Arguments

  • optimizer::Optimisers.AbstractRule: Optimization algorithm to be used. Only DoG, DoWG and Optimisers.Descent are supported. (default: DoWG())
  • n_samples::Int: Number of Monte Carlo samples to be used for estimating each gradient.
  • averager::AbstractAverager: Parameter averaging strategy. (default: PolynomialAveraging())
  • operator::Union{<:IdentityOperator, <:ClipScale}: Operator to be applied after each gradient descent step. (default: IdentityOperator())
  • subsampling::Union{<:Nothing,<:AbstractSubsampling}: Data point subsampling strategy. If nothing, subsampling is not used. (default: nothing)

Output

  • q_averaged: The variational approximation formed by the averaged SGD iterates.

Callback

The callback function callback has a signature of

callback(; rng, iteration, restructure, params, averaged_params, restructure, gradient)

The arguments are as follows:

  • rng: Random number generator internally used by the algorithm.
  • iteration: The index of the current iteration.
  • restructure: Function that restructures the variational approximation from the variational parameters. Calling restructure(params) reconstructs the current variational approximation.
  • params: Current variational parameters.
  • averaged_params: Variational parameters averaged according to the averaging strategy.
  • gradient: The estimated (possibly stochastic) gradient.

Requirements

  • The trainable parameters in the variational approximation are expected to be extractable through Optimisers.destructure. This requires the variational approximation to be marked as a functor through Functors.@functor.
  • The variational approximation $q_{\lambda}$ implements rand.
  • The variational approximation $q_{\lambda}$ implements logpdf(q, x), which should also be differentiable with respect to x.
  • The target distribution and the variational approximation have the same support.
source

The associated objective value can be estimated through the following:

AdvancedVI.estimate_objectiveMethod
estimate_objective([rng,] alg, q, prob; n_samples, entropy)

Estimate the ELBO of the variational approximation q against the target log-density prob.

Arguments

  • rng::Random.AbstractRNG: Random number generator.
  • alg::Union{<:KLMinRepGradDescent,<:KLMinRepGradProxDescent,<:KLMinScoreGradDescent}: Variational inference algorithm.
  • q: Variational approximation.
  • prob: The target log-joint likelihood implementing the LogDensityProblem interface.

Keyword Arguments

  • n_samples::Int: Number of Monte Carlo samples for estimating the objective. (default: Same as the the number of samples used for estimating the gradient during optimization.)
  • entropy::AbstractEntropyEstimator: Entropy estimator. (default: MonteCarloEntropy())

Returns

  • obj_est: Estimate of the objective value.
source

Methodology

This algorithm aims to solve the problem

\[ \mathrm{minimize}_{q \in \mathcal{Q}}\quad \mathrm{KL}\left(q, \pi\right)\]

where $\mathcal{Q}$ is some family of distributions, often called the variational family, by running stochastic gradient descent in the (Euclidean) space of parameters. That is, for all $q_{\lambda} \in \mathcal{Q}$, we assume $q_{\lambda}$ there is a corresponding vector of parameters $\lambda \in \Lambda$, where the space of parameters is Euclidean such that $\Lambda \subset \mathbb{R}^p$.

Since we usually only have access to the unnormalized densities of the target distribution $\pi$, we don't have direct access to the KL divergence. Instead, the ELBO maximization strategy maximizes a surrogate objective, the evidence lower bound (ELBO; [JGJS1999])

\[ \mathrm{ELBO}\left(q\right) \triangleq \mathbb{E}_{\theta \sim q} \log \pi\left(\theta\right) + \mathbb{H}\left(q\right),\]

which is equivalent to the KL up to an additive constant (the evidence).

Algorithmically, KLMinRepGradDescent iterates the step

\[ \lambda_{t+1} = \mathrm{operator}\big( \lambda_{t} + \gamma_t \widehat{\nabla_{\lambda} \mathrm{ELBO}} (q_{\lambda_t}) \big) , \]

where $\widehat{\nabla \mathrm{ELBO}}(q_{\lambda})$ is the score gradient estimate[G1990][KR1996][RSU1996][W1992] of the ELBO gradient and $\mathrm{operator}$ is an optional operator (e.g. projections, identity mapping).

Let us describe the score gradient estimator[G1990][KR1996][RSU1996][W1992] of the ELBO gradient, also known as the score function method and the REINFORCE gradient. For variational inference, the use of the score gradient was proposed in [WW2013][RGB2014]. Unlike the reparameterization gradient, the score gradient does not require the target log density to be differentiable, and does not differentiate through the sampling process of the variational approximation $q$. Instead, it only requires gradients of the log density $\log q$. For this reason, the score gradient is the standard method to deal with discrete variables and targets with discrete support. In more detail, the score gradient uses the Fisher log-derivative identity: For any regular $f$,

\[\nabla_{\lambda} \mathbb{E}_{z \sim q_{\lambda}} f = \mathbb{E}_{z \sim q_{\lambda}}\left[ f(z) \log q_{\lambda}(z) \right] \; .\]

The ELBO corresponds to the case where $f = \log \pi / \log q$, where $\log \pi$ is the target log density. Instead of implementing the canonical score gradient, KLMinScoreGradDescent internally uses the "VarGrad" objective[RBNRA2020]:

\[\widehat{\mathrm{VarGrad}}(\lambda) = \mathrm{Var}\left( \log q_{\lambda}(z_i) - \log \pi\left(z_i\right) \right) \; ,\]

where the variance is computed over the samples $z_1, \ldots, z_m \sim q_{\lambda}$. Differentiating the VarGrad objective corresponds to the canonical score gradient combined with the "leave-one-out" control variate[SK2014][KvHW2019].

  • RBNRA2020Richter, L., Boustati, A., Nüsken, N., Ruiz, F., & Akyildiz, O. D. (2020). Vargrad: a low-variance gradient estimator for variational inference. Advances in Neural Information Processing Systems, 33, 13481-13492.
  • SK2014Salimans, T., & Knowles, D. A. (2014). On using control variates with stochastic approximation for variational bayes and its connection to stochastic linear regression. arXiv preprint arXiv:1401.1022.
  • JGJS1999Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37, 183-233.
  • G1990Glynn, P. W. (1990). Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10), 75-84.
  • KR1996Kleijnen, J. P., & Rubinstein, R. Y. (1996). Optimization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research, 88(3), 413-427.
  • RSU1996Rubinstein, R. Y., Shapiro, A., & Uryasev, S. (1996). The score function method. Encyclopedia of Management Sciences, 1363-1366.
  • W1992Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8, 229-256.
  • WW2013Wingate, D., & Weber, T. (2013). Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299.
  • RGB2014Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial intelligence and statistics (pp. 814-822). PMLR.
  • KvHW2019Kool, W., van Hoof, H., & Welling, M. (2019). Buy 4 reinforce samples, get a baseline for free!.