Optimization
Parameter-Free Optimization Rules
We provide custom optimization rules that are not provided out-of-the-box by Optimisers.jl. The main theme of the provided optimizers is that they are parameter-free: these optimization rules should require little to no tuning to obtain performance competitive with well-tuned alternatives.
AdvancedVI.DoG — Type

DoG(alpha)

Distance over gradient (DoG[IHC2023]) optimizer. Its only parameter is `alpha`, a guess for the distance between the initialization and the optimum, which shouldn't need much tuning.
DoG works by starting from an AdaGrad-like update rule with a small step size, but then automatically increases the step size ("warming up") to be as large as possible. If `alpha` is too large, the optimizer can initially diverge, while if it is too small, the warm-up period can be too long.
Parameters

- `alpha`: Scaling factor for the initial guess (`repsilon` in the original paper) of the Euclidean distance between the initial point and the optimum. For the initial parameter `lambda0`, `repsilon` is calculated as `repsilon = alpha*(1 + norm(lambda0))`. (default value: `1e-6`)
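To make the role of `alpha` concrete, here is a small sketch of how the implied initial distance estimate follows from the formula above; `lambda0` stands in for a hypothetical vector of initial variational parameters.

```julia
using LinearAlgebra: norm
using AdvancedVI

lambda0 = zeros(5)                      # hypothetical initial variational parameters
alpha   = 1e-6                          # documented default

# Implied initial guess of the distance to the optimum (repsilon in the original paper)
repsilon = alpha * (1 + norm(lambda0))  # == 1e-6 here, since norm(zeros(5)) == 0

opt = AdvancedVI.DoG(alpha)
```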
AdvancedVI.DoWG — Type

DoWG(alpha)

Distance over weighted gradient (DoWG[KMJ2024]) optimizer. Its only parameter is `alpha`, a guess for the distance between the initialization and the optimum, which shouldn't need much tuning.
DoWG is a minor modification of DoG such that its step sizes are provably always larger than DoG's. Similarly to DoG, it works by starting from an AdaGrad-like update rule with a small step size, but then automatically increases the step size ("warming up") to be as large as possible. If `alpha` is too large, the optimizer can initially diverge, while if it is too small, the warm-up period can be too long. Depending on the problem, DoWG can be too aggressive and result in unstable behavior. If this is suspected, try using DoG instead.
Parameters

- `alpha`: Scaling factor for the initial guess (`repsilon` in the original paper) of the Euclidean distance between the initial point and the optimum. For the initial parameter `lambda0`, `repsilon` is calculated as `repsilon = alpha*(1 + norm(lambda0))`. (default value: `1e-6`)
AdvancedVI.COCOB — Type

COCOB(alpha)

Continuous Coin Betting (COCOB[OT2017]) optimizer. We use the "COCOB-Backprop" variant, which is closer to the Adam optimizer. Its only parameter is the maximum change per parameter `alpha`, which shouldn't need much tuning.
Parameters

- `alpha`: Scaling parameter. (default value: `100`)
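As a quick reference, the three rules above can be constructed with their documented default values as shown in the sketch below; how they are then handed to the optimization routine depends on your `AdvancedVI.optimize` setup.

```julia
using AdvancedVI

# The parameter-free rules above, constructed with their documented default values.
opt_dog   = AdvancedVI.DoG(1e-6)    # alpha = 1e-6
opt_dowg  = AdvancedVI.DoWG(1e-6)   # alpha = 1e-6
opt_cocob = AdvancedVI.COCOB(100)   # alpha = 100 (maximum change per parameter)
```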
Parameter Averaging Strategies
In some cases, the best optimization performance is obtained by averaging the sequence of parameters generated by the optimization algorithm. For instance, the DoG
[IHC2023] and DoWG
[KMJ2024] papers report their best performance through averaging. The benefits of parameter averaging have been specifically confirmed for ELBO maximization[DCAMHV2020].
AdvancedVI.NoAveraging — Type

NoAveraging()

No averaging. This returns the last iterate of the optimization rule.
AdvancedVI.PolynomialAveraging — Type

PolynomialAveraging(eta)

Polynomial averaging rule proposed by Shamir and Zhang[SZ2013]. At iteration $t$, the parameter average $\bar{\lambda}_t$ according to the polynomial averaging rule is given as
\[ \bar{\lambda}_t = (1 - w_t) \bar{\lambda}_{t-1} + w_t \lambda_t \, ,\]
where the averaging weight is
\[ w_t = \frac{\eta + 1}{t + \eta} \, .\]
A higher `eta` ($\eta$) down-weights earlier iterations. When $\eta=0$, this is equivalent to uniformly averaging the iterates in an online fashion. The DoG paper[IHC2023] suggests $\eta=8$.
Parameters

- `eta`: Regularization term. (default: `8`)
Operators
Depending on the variational family, variational objective, and optimization strategy, it might be necessary to modify the variational parameters after performing a gradient-based update. For this, an operator acting on the parameters can be supplied via the `operator` keyword argument of `AdvancedVI.optimize`.
ClipScale
For the location-scale family, it is often the case that optimization is stable only when the smallest eigenvalue of the scale matrix is strictly positive[D2020]. To ensure this, we provide the following projection operator:
AdvancedVI.ClipScale — Type

ClipScale(ϵ = 1e-5)

Projection operator ensuring that an `MvLocationScale` or `MvLocationScaleLowRank` has a scale with eigenvalues larger than `ϵ`. `ClipScale` also supports operating on `MvLocationScale` and `MvLocationScaleLowRank` objects wrapped in a `Bijectors.TransformedDistribution`.
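A minimal usage sketch: construct the projection operator and supply it through the `operator` keyword mentioned above. The surrounding call to `AdvancedVI.optimize` is elided here since its arguments depend on your problem setup.

```julia
using AdvancedVI

# Keep the scale eigenvalues above 1e-5 after every gradient-based update.
op = AdvancedVI.ClipScale(1e-5)

# Supplied through the `operator` keyword of AdvancedVI.optimize, e.g.
# AdvancedVI.optimize(...; operator = op)
```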
ProximalLocationScaleEntropy
ELBO maximization with the location-scale family tends to be unstable when the scale has small eigenvalues or the step size is large. To remedy this, a proximal operator of the entropy[D2020] can be used.
AdvancedVI.ProximalLocationScaleEntropy — Type

ProximalLocationScaleEntropy()

Proximal operator for the entropy of a location-scale distribution, which is defined as
\[ \mathrm{prox}(\lambda) = \argmin_{\lambda^{\prime}} - \mathbb{H}(q_{\lambda^{\prime}}) + \frac{1}{2 \gamma_t} {\left\lVert \lambda - \lambda^{\prime} \right\rVert}_2^2 ,\]
where $\gamma_t$ is the step size of the optimizer used alongside the proximal operator. This assumes the variational family is `<:VILocationScale` and that the optimizer is one of the following:
- DoG
- DoWG
- Descent
For ELBO maximization, since this proximal operator handles the entropy, the gradient estimator for the ELBO must ignore the entropy term. That is, the `entropy` keyword argument of `RepGradELBO` must be one of the following:
- ClosedFormEntropyZeroGradient
- StickingTheLandingEntropyZeroGradient
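Putting the requirements above together, a compatible configuration might look like the following sketch. The number of Monte Carlo samples passed to `RepGradELBO` and the exact keywords accepted by `AdvancedVI.optimize` are assumptions here, so consult the API reference for your version.

```julia
using AdvancedVI

# The proximal operator handles the entropy, so the ELBO gradient estimator
# must use a zero-gradient entropy estimator.
objective = AdvancedVI.RepGradELBO(
    10;  # number of Monte Carlo samples (assumed positional argument)
    entropy = AdvancedVI.StickingTheLandingEntropyZeroGradient(),
)

optimizer = AdvancedVI.DoWG(1e-6)                    # DoG, DoWG, or Descent
operator  = AdvancedVI.ProximalLocationScaleEntropy()

# These would then be passed to AdvancedVI.optimize, e.g.
# AdvancedVI.optimize(...; optimizer = optimizer, operator = operator)
```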
- [OT2017] Orabona, F., & Tommasi, T. (2017). Training deep networks without learning rates through coin betting. Advances in Neural Information Processing Systems, 30.
- [SZ2013] Shamir, O., & Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning (pp. 71-79). PMLR.
- [DCAMHV2020] Dhaka, A. K., Catalina, A., Andersen, M. R., Magnusson, M., Huggins, J., & Vehtari, A. (2020). Robust, accurate stochastic optimization for variational inference. Advances in Neural Information Processing Systems, 33, 10961-10973.
- [KMJ2024] Khaled, A., Mishchenko, K., & Jin, C. (2023). DoWG unleashed: An efficient universal parameter-free gradient descent method. Advances in Neural Information Processing Systems, 36, 6748-6769.
- [IHC2023] Ivgi, M., Hinder, O., & Carmon, Y. (2023). DoG is SGD's best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning (pp. 14465-14499). PMLR.
- [D2020] Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning.