Optimization
Parameter-Free Optimization Rules
We provide custom optimization rules that are not provided out-of-the-box by Optimisers.jl. The main theme of the provided optimizers is that they are parameter-free. This means that these optimization rules should require little to no tuning to obtain performance competitive with well-tuned alternatives.
AdvancedVI.DoG — Type
DoG(repsilon)
Distance over gradient (DoG[IHC2023]) optimizer. Its only parameter is repsilon, the initial guess of the Euclidean distance to the optimum. The original paper recommends $ 10^{-4} ( 1 + \lVert \lambda_0 \rVert ) $, but the default value is $ 10^{-6} $.
Parameters
repsilon: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
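As a hedged sketch of how one might set repsilon (λ0 below is a hypothetical flattened initial variational parameter vector used only for illustration, not something provided by AdvancedVI):

```julia
using AdvancedVI
using LinearAlgebra: norm

# Default rule: repsilon = 1e-6
rule = AdvancedVI.DoG()

# Paper-recommended initial distance guess, 1e-4 * (1 + ||λ0||),
# where λ0 is a hypothetical initial parameter vector
λ0   = randn(10)
rule = AdvancedVI.DoG(1e-4 * (1 + norm(λ0)))
```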
AdvancedVI.DoWG — Type
DoWG(repsilon)
Distance over weighted gradient (DoWG[KMJ2024]) optimizer. Its only parameter is repsilon, the initial guess of the Euclidean distance to the optimum.
Parameters
repsilon: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
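Analogously, a minimal construction sketch for DoWG; the non-default value is purely illustrative:

```julia
using AdvancedVI

# Default rule: repsilon = 1e-6
rule = AdvancedVI.DoWG()

# A larger initial distance guess (value purely illustrative)
rule = AdvancedVI.DoWG(1e-4)
```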
AdvancedVI.COCOB — Type
COCOB(alpha)
Continuous Coin Betting (COCOB[OT2017]) optimizer. We use the "COCOB-Backprop" variant, which is closer to the Adam optimizer. Its only parameter is the maximum per-parameter change alpha ($\alpha$), which shouldn't need much tuning.
Parameters
alpha: Scaling parameter. (default value: 100)
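A minimal construction sketch; the non-default alpha value below is purely illustrative:

```julia
using AdvancedVI

# Default rule: alpha = 100
rule = AdvancedVI.COCOB()

# A smaller alpha caps the maximum per-parameter change more tightly
rule = AdvancedVI.COCOB(10.0)
```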
Parameter Averaging Strategies
In some cases, the best optimization performance is obtained by averaging the sequence of parameters generated by the optimization algorithm. For instance, the DoG[IHC2023] and DoWG[KMJ2024] papers report their best performance through averaging. The benefits of parameter averaging have been specifically confirmed for ELBO maximization[DCAMHV2020].
AdvancedVI.NoAveraging — Type
NoAveraging()
No averaging. This returns the last iterate of the optimization rule.
AdvancedVI.PolynomialAveraging — Type
PolynomialAveraging(eta)
Polynomial averaging rule proposed by Shamir and Zhang[SZ2013]. At iteration t, the parameter average $ \bar{\lambda}_t $ according to the polynomial averaging rule is given as
\[ \bar{\lambda}_t = (1 - w_t) \bar{\lambda}_{t-1} + w_t \lambda_t \, ,\]
where the averaging weight is
\[ w_t = \frac{\eta + 1}{t + \eta} \, .\]
Higher eta ($\eta$) down-weights earlier iterations. When $\eta=0$, this is equivalent to uniformly averaging the iterates in an online fashion. The DoG paper[IHC2023] suggests $\eta=8$.
Parameters
eta: Regularization term. (default value: 8)
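To make the recursion above concrete, here is a small self-contained Julia sketch; poly_weight and polynomial_average are illustrative helpers, not part of AdvancedVI:

```julia
# Weight w_t given to the current iterate under polynomial averaging
poly_weight(t, eta) = (eta + 1) / (t + eta)

# Online polynomial averaging of a sequence of iterates.
# With eta = 0, w_t = 1/t and this reduces to the uniform running mean.
function polynomial_average(iterates; eta = 8)
    avg = zero(first(iterates))
    for (t, λ) in enumerate(iterates)
        w   = poly_weight(t, eta)
        avg = (1 - w) * avg + w * λ   # λ̄_t = (1 - w_t) λ̄_{t-1} + w_t λ_t
    end
    return avg
end

# Example: with eta = 0 the result equals the plain mean of the iterates
xs = [1.0, 2.0, 3.0, 4.0]
polynomial_average(xs; eta = 0) ≈ sum(xs) / length(xs)  # true
```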
Operators
Depending on the variational family, variational objective, and optimization strategy, it might be necessary to modify the variational parameters after performing a gradient-based update. For this, an operator acting on the parameters can be supplied via the operator keyword argument of AdvancedVI.optimize.
ClipScale
For location-scale variational families, it is often the case that optimization is stable only when the smallest eigenvalue of the scale matrix is strictly positive[D2020]. To ensure this, we provide the following projection operator:
AdvancedVI.ClipScale — Type
ClipScale(ϵ = 1e-5)
Projection operator ensuring that an MvLocationScale or MvLocationScaleLowRank has a scale with eigenvalues larger than ϵ. ClipScale also supports operating on MvLocationScale and MvLocationScaleLowRank distributions wrapped in a Bijectors.TransformedDistribution object.
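Putting the pieces on this page together, here is a hedged sketch of how these options might be passed to AdvancedVI.optimize. Only the operator keyword is documented above; the positional arguments, the optimizer and averager keyword names, and the return value are assumptions and may differ across versions (an AD-backend keyword may also be required).

```julia
using AdvancedVI

# Hedged usage sketch; `problem`, `objective`, and `q_init` are hypothetical
# placeholders for the target, the variational objective, and the initial
# variational approximation, respectively.
result = AdvancedVI.optimize(
    problem,
    objective,
    q_init,
    10_000;                                        # number of iterations
    optimizer = AdvancedVI.DoWG(),                 # parameter-free rule
    averager  = AdvancedVI.PolynomialAveraging(),  # polynomial iterate averaging
    operator  = AdvancedVI.ClipScale(),            # keep scale eigenvalues above ϵ
)
```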
- [OT2017] Orabona, F., & Tommasi, T. (2017). Training deep networks without learning rates through coin betting. Advances in Neural Information Processing Systems, 30.
- [SZ2013] Shamir, O., & Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning (pp. 71-79). PMLR.
- [DCAMHV2020] Dhaka, A. K., Catalina, A., Andersen, M. R., Magnusson, M., Huggins, J., & Vehtari, A. (2020). Robust, accurate stochastic optimization for variational inference. Advances in Neural Information Processing Systems, 33, 10961-10973.
- [KMJ2024] Khaled, A., Mishchenko, K., & Jin, C. (2023). DoWG unleashed: An efficient universal parameter-free gradient descent method. Advances in Neural Information Processing Systems, 36, 6748-6769.
- [IHC2023] Ivgi, M., Hinder, O., & Carmon, Y. (2023). DoG is SGD's best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning (pp. 14465-14499). PMLR.
- [D2020] Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning.