Optimization

Parameter-Free Optimization Rules

We provide custom optimization rules that are not available out-of-the-box in Optimisers.jl. The common theme of these rules is that they are parameter-free: they should require little to no tuning to obtain performance competitive with well-tuned alternatives.

AdvancedVI.DoGType
DoG(repsilon)

Distance over gradient (DoG[IHC2023]) optimizer. Its only parameter is repsilon, the initial guess of the Euclidean distance to the optimum. The original paper recommends $ 10^{-4} ( 1 + \lVert \lambda_0 \rVert ) $, but the default value is $ 10^{-6} $.

Parameters

  • repsilon: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
source
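
For example, one might construct the rule with either the default value or the paper's recommendation, as in the following sketch (here `λ0` stands in for a hypothetical flattened vector of initial variational parameters):

```julia
using AdvancedVI
using LinearAlgebra: norm

λ0 = randn(10)  # hypothetical flattened initial variational parameters

opt_default = DoG(1e-6)                   # the default value
opt_paper   = DoG(1e-4*(1 + norm(λ0)))    # value recommended in the original paper
```
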
AdvancedVI.DoWGType
DoWG(repsilon)

Distance over weighted gradient (DoWG[KMJ2024]) optimizer. Its only parameter is repsilon, the initial guess of the Euclidean distance to the optimum.

Parameters

  • repsilon: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
source
AdvancedVI.COCOBType
COCOB(alpha)

Continuous Coin Betting (COCOB[OT2017]) optimizer. We use the "COCOB-Backprop" variant, which is closer to the Adam optimizer. Its only parameter is the maximum change per parameter, α, which should not need much tuning.

Parameters

  • alpha: Scaling parameter. (default value: 100)
source
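
As a rough sketch of how these rules plug in, assuming the `optimizer` keyword argument of AdvancedVI.optimize (the positional arguments below are placeholders; consult the AdvancedVI.optimize docstring for the exact call):

```julia
using AdvancedVI

# `model`, `objective`, `q0`, and `n_iters` are placeholders for the target problem,
# the variational objective, the initial variational approximation, and the number
# of iterations.
result = AdvancedVI.optimize(
    model, objective, q0, n_iters;
    optimizer = DoWG(1e-6),   # or DoG(1e-6), or COCOB(100)
)
```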

Parameter Averaging Strategies

In some cases, the best optimization performance is obtained by averaging the sequence of parameters generated by the optimization algorithm. For instance, the DoG[IHC2023] and DoWG[KMJ2024] papers report their best performance through averaging. The benefits of parameter averaging have been specifically confirmed for ELBO maximization[DCAMHV2020].

AdvancedVI.PolynomialAveragingType
PolynomialAveraging(eta)

Polynomial averaging rule proposed by Shamir and Zhang[SZ2013]. At iteration t, the parameter average $ \bar{\lambda}_t $ according to the polynomial averaging rule is given as

\[ \bar{\lambda}_t = (1 - w_t) \bar{\lambda}_{t-1} + w_t \lambda_t \, ,\]

where the averaging weight is

\[ w_t = \frac{\eta + 1}{t + \eta} \, .\]

Higher eta ($\eta$) down-weights earlier iterations. When $\eta=0$, this is equivalent to uniformly averaging the iterates in an online fashion. The DoG paper[IHC2023] suggests $\eta=8$.

Parameters

  • eta: Regularization term. (default: 8)
source
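
To make the recursion concrete, the following self-contained sketch implements the averaging rule directly (an illustration, not the library's implementation) and shows that $\eta = 0$ reproduces the plain running mean of the iterates:

```julia
# Illustration of the polynomial averaging recursion (not the library implementation).
function polynomial_average(λs, η)
    λ_avg = zero(first(λs))
    for (t, λ) in enumerate(λs)
        w     = (η + 1) / (t + η)          # averaging weight w_t
        λ_avg = (1 - w) * λ_avg + w * λ    # \bar{λ}_t = (1 - w_t) \bar{λ}_{t-1} + w_t λ_t
    end
    return λ_avg
end

λs = [randn(3) for _ in 1:100]
polynomial_average(λs, 0) ≈ sum(λs) / length(λs)   # true: η = 0 gives uniform averaging
```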

Operators

Depending on the variational family, variational objective, and optimization strategy, it might be necessary to modify the variational parameters after performing a gradient-based update. For this, an operator acting on the parameters can be supplied via the operator keyword argument of AdvancedVI.optimize.

ClipScale

For the location-scale family, it is often the case that optimization is stable only when the smallest eigenvalue of the scale matrix is strictly positive[D2020]. To ensure this, we provide the following projection operator:

AdvancedVI.ClipScaleType
ClipScale(ϵ = 1e-5)

Projection operator ensuring that an MvLocationScale or MvLocationScaleLowRank has a scale with eigenvalues larger than ϵ. ClipScale also works on MvLocationScale and MvLocationScaleLowRank objects wrapped in a Bijectors.TransformedDistribution.

source
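
As a sketch using the `operator` keyword argument described above (the positional arguments are again placeholders):

```julia
# Project the scale's eigenvalues to be at least 1e-5 after every gradient step.
result = AdvancedVI.optimize(
    model, objective, q0, n_iters;
    operator = ClipScale(1e-5),
)
```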

ProximalLocationScaleEntropy

ELBO maximization with the location-scale family tends to be unstable when the scale has small eigenvalues or the stepsize is large. To remedy this, a proximal operator of the entropy[D2020] can be used.

AdvancedVI.ProximalLocationScaleEntropyType
ProximalLocationScaleEntropy()

Proximal operator for the entropy of a location-scale distribution, which is defined as

\[ \mathrm{prox}(\lambda) = \argmin_{\lambda^{\prime}} - \mathbb{H}\left(q_{\lambda^{\prime}}\right) + \frac{1}{2 \gamma_t} {\left\lVert \lambda - \lambda^{\prime} \right\rVert}_2^2 \, ,\]

where $\gamma_t$ is the stepsize used by the optimizer paired with the proximal operator. This assumes the variational family is <:VILocationScale and the optimizer is one of the following:

  • DoG
  • DoWG
  • Descent

For ELBO maximization, since this proximal operator handles the entropy, the gradient estimator for the ELBO must ignore the entropy term. That is, the entropy keyword argument of RepGradELBO must be one of the following:

  • ClosedFormEntropyZeroGradient
  • StickingTheLandingEntropyZeroGradient
source
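
Putting these requirements together, a proximal setup might look like the following sketch, where the positional arguments are placeholders and the Monte Carlo sample count passed to RepGradELBO is an arbitrary choice for illustration:

```julia
# The proximal operator handles the entropy, so the ELBO estimator must use a
# zero-gradient entropy estimator.
objective = RepGradELBO(10; entropy = StickingTheLandingEntropyZeroGradient())

result = AdvancedVI.optimize(
    model, objective, q0, n_iters;
    optimizer = DoWG(1e-6),                        # DoG, DoWG, or Descent
    operator  = ProximalLocationScaleEntropy(),
)
```
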
  • OT2017: Orabona, F., & Tommasi, T. (2017). Training deep networks without learning rates through coin betting. Advances in Neural Information Processing Systems, 30.
  • SZ2013: Shamir, O., & Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning (pp. 71-79). PMLR.
  • DCAMHV2020: Dhaka, A. K., Catalina, A., Andersen, M. R., Magnusson, M., Huggins, J., & Vehtari, A. (2020). Robust, accurate stochastic optimization for variational inference. Advances in Neural Information Processing Systems, 33, 10961-10973.
  • KMJ2024: Khaled, A., Mishchenko, K., & Jin, C. (2023). DoWG unleashed: An efficient universal parameter-free gradient descent method. Advances in Neural Information Processing Systems, 36, 6748-6769.
  • IHC2023: Ivgi, M., Hinder, O., & Carmon, Y. (2023). DoG is SGD's best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning (pp. 14465-14499). PMLR.
  • D2020: Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning.