Optimization
Parameter-Free Optimization Rules
We provide custom optimization rules that are not available out of the box in Optimisers.jl. The common theme of these rules is that they are parameter-free: they should require little to no tuning to obtain performance competitive with well-tuned alternatives.
AdvancedVI.DoG
— TypeDoG(repsilon)
Distance over gradient (DoG[IHC2023]) optimizer. Its only parameter is the initial guess of the Euclidean distance to the optimum, repsilon. The original paper recommends $ 10^{-4} ( 1 + \lVert \lambda_0 \rVert ) $, but the default value is $ 10^{-6} $.
Parameters
repsilon
: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
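For example, the initial guess recommended by the DoG paper can be computed from the initial variational parameters. This is a minimal sketch, assuming the initial parameters are available as a flat vector; the variable names are illustrative:

```julia
using AdvancedVI, LinearAlgebra

# Hypothetical flattened initial variational parameters.
λ0 = randn(10)

# Initial distance guess recommended by the DoG paper:
# 10^{-4} * (1 + ‖λ0‖). The package default is 1e-6.
repsilon = 1e-4 * (1 + norm(λ0))

optimizer = DoG(repsilon)
```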
AdvancedVI.DoWG
— TypeDoWG(repsilon)
Distance over weighted gradient (DoWG[KMJ2024]) optimizer. Its only parameter is the initial guess of the Euclidean distance to the optimum, repsilon.
Parameters
repsilon
: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
AdvancedVI.COCOB
— TypeCOCOB(alpha)
Continuous Coin Betting (COCOB[OT2017]) optimizer. We use the "COCOB-Backprop" variant, which is closer to the Adam optimizer. Its only parameter is the maximum change per parameter, α, which shouldn't need much tuning.
Parameters
alpha
: Scaling parameter. (default value: 100)
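Since all three rules are parameter-free, constructing them with their documented defaults is typically sufficient. A minimal sketch, using the default values listed above:

```julia
using AdvancedVI

# Parameter-free rules with their documented defaults.
opt_dog   = DoG(1e-6)   # initial distance guess
opt_dowg  = DoWG(1e-6)  # initial distance guess
opt_cocob = COCOB(100)  # maximum per-parameter change
```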
Parameter Averaging Strategies
In some cases, the best optimization performance is obtained by averaging the sequence of parameters generated by the optimization algorithm. For instance, the DoG[IHC2023] and DoWG[KMJ2024] papers report their best performance through averaging. The benefits of parameter averaging have been specifically confirmed for ELBO maximization[DCAMHV2020].
AdvancedVI.NoAveraging
— TypeNoAveraging()
No averaging. This returns the last-iterate of the optimization rule.
AdvancedVI.PolynomialAveraging
— TypePolynomialAveraging(eta)
Polynomial averaging rule proposed by Shamir and Zhang[SZ2013]. At iteration $t$, the parameter average $ \bar{\lambda}_t $ according to the polynomial averaging rule is given as
\[ \bar{\lambda}_t = (1 - w_t) \bar{\lambda}_{t-1} + w_t \lambda_t \, ,\]
where the averaging weight is
\[ w_t = \frac{\eta + 1}{t + \eta} \, .\]
Higher eta ($\eta$) down-weights earlier iterations. When $\eta=0$, this is equivalent to uniformly averaging the iterates in an online fashion. The DoG paper[IHC2023] suggests $\eta=8$.
Parameters
eta
: Regularization term. (default: 8)
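To make the recursion concrete, here is a standalone sketch of the polynomial averaging update (not using the package type). Note that $w_1 = 1$ regardless of $\eta$, so the average is initialized with the first iterate; with $\eta = 0$, $w_t = 1/t$ and the recursion reduces to the uniform running mean:

```julia
# Online polynomial averaging of a sequence of scalar iterates.
function poly_average(iterates, eta)
    λ̄ = 0.0  # overwritten at t = 1, since w_1 = (eta + 1)/(1 + eta) = 1
    for (t, λ) in enumerate(iterates)
        w = (eta + 1) / (t + eta)       # averaging weight w_t
        λ̄ = (1 - w) * λ̄ + w * λ        # recursive average update
    end
    return λ̄
end

poly_average(1:5, 0)  # ≈ 3.0, the plain mean of 1:5
```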
Operators
Depending on the variational family, variational objective, and optimization strategy, it might be necessary to modify the variational parameters after performing a gradient-based update. For this, an operator acting on the parameters can be supplied via the operator keyword argument of AdvancedVI.optimize.
ClipScale
For the location-scale family, it is often the case that optimization is stable only when the smallest eigenvalue of the scale matrix is strictly positive[D2020]. To ensure this, we provide the following projection operator:
AdvancedVI.ClipScale
— TypeClipScale(ϵ = 1e-5)
Projection operator ensuring that an MvLocationScale or MvLocationScaleLowRank has a scale with eigenvalues larger than ϵ. ClipScale also supports operating on an MvLocationScale or MvLocationScaleLowRank wrapped in a Bijectors.TransformedDistribution object.
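A minimal sketch of supplying this operator; the surrounding problem setup and the other arguments to AdvancedVI.optimize are elided and assumed:

```julia
using AdvancedVI

# Keep the scale's eigenvalues above 1e-5 after every update.
operator = ClipScale(1e-5)

# Supplied through the `operator` keyword of `AdvancedVI.optimize`,
# e.g. AdvancedVI.optimize(...; operator=ClipScale(1e-5)).
```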
ProximalLocationScaleEntropy
ELBO maximization with the location-scale family tends to be unstable when the scale has small eigenvalues or the stepsize is large. To remedy this, a proximal operator of the entropy[D2020] can be used.
AdvancedVI.ProximalLocationScaleEntropy
— TypeProximalLocationScaleEntropy()
Proximal operator for the entropy of a location-scale distribution, which is defined as
\[ \mathrm{prox}(\lambda) = \argmin_{\lambda^{\prime}} - \mathbb{H}(q_{\lambda^{\prime}}) + \frac{1}{2 \gamma_t} \left\lVert \lambda - \lambda^{\prime} \right\rVert^2 \, ,\]
where $\gamma_t$ is the stepsize of the optimizer used with the proximal operator. This assumes that the variational family is <:VILocationScale and that the optimizer is one of the following:
DoG
DoWG
Descent
For ELBO maximization, since this proximal operator handles the entropy, the gradient estimator for the ELBO must ignore the entropy term. That is, the entropy keyword argument of RepGradELBO must be one of the following:
ClosedFormEntropyZeroGradient
StickingTheLandingEntropyZeroGradient
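Putting these requirements together, a compatible configuration might look like the following sketch. The number of Monte Carlo samples (10) is arbitrary, and the zero-argument DoG constructor is assumed to use the documented default; only the pairing of components is prescribed above:

```julia
using AdvancedVI

# The proximal operator handles the entropy, so the ELBO estimator
# must not differentiate through the entropy term.
objective = RepGradELBO(10; entropy=StickingTheLandingEntropyZeroGradient())

# The proximal step assumes one of DoG, DoWG, or Descent as the base rule.
optimizer = DoG()
operator  = ProximalLocationScaleEntropy()
```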
- OT2017Orabona, F., & Tommasi, T. (2017). Training deep networks without learning rates through coin betting. Advances in Neural Information Processing Systems, 30.
- SZ2013Shamir, O., & Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning (pp. 71-79). PMLR.
- DCAMHV2020Dhaka, A. K., Catalina, A., Andersen, M. R., Magnusson, M., Huggins, J., & Vehtari, A. (2020). Robust, accurate stochastic optimization for variational inference. Advances in Neural Information Processing Systems, 33, 10961-10973.
- KMJ2024Khaled, A., Mishchenko, K., & Jin, C. (2023). DoWG unleashed: An efficient universal parameter-free gradient descent method. Advances in Neural Information Processing Systems, 36, 6748-6769.
- IHC2023Ivgi, M., Hinder, O., & Carmon, Y. (2023). DoG is SGD's best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning (pp. 14465-14499). PMLR.
- D2020Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning.