Optimization
Parameter-Free Optimization Rules
We provide custom optimization rules that are not provided out-of-the-box by Optimisers.jl. The main theme of the provided optimizers is that they are parameter-free: these optimization rules should require little to no tuning to obtain performance competitive with well-tuned alternatives.
AdvancedVI.DoG — Type

DoG(alpha)

Distance over gradient (DoG[IHC2023]) optimizer. Its only parameter is `alpha`, a guess for the distance between the initialization and the optimum, which shouldn't need much tuning.
DoG works by starting from an AdaGrad-like update rule with a small step size, but then automatically increases the step size ("warming up") to be as large as possible. If `alpha` is too large, the optimizer can initially diverge, while if it is too small, the warm-up period can be too long.
Parameters

- `alpha`: Scaling factor for the initial guess (`repsilon` in the original paper) of the Euclidean distance between the initial point and the optimum. For the initial parameter `lambda0`, `repsilon` is calculated as `repsilon = alpha*(1 + norm(lambda0))`. (default value: `1e-6`)
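To make the role of `alpha` concrete, here is a small sketch of how the implied initial distance estimate follows from the formula above; `lambda0` stands in for a hypothetical vector of initial variational parameters.

```julia
using LinearAlgebra: norm
using AdvancedVI

lambda0 = zeros(5)                      # hypothetical initial variational parameters
alpha   = 1e-6                          # documented default

# Implied initial guess of the distance to the optimum (repsilon in the original paper)
repsilon = alpha * (1 + norm(lambda0))  # == 1e-6 here, since norm(zeros(5)) == 0

opt = AdvancedVI.DoG(alpha)
```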
AdvancedVI.DoWG — Type

DoWG(alpha)

Distance over weighted gradient (DoWG[KMJ2024]) optimizer. Its only parameter is `alpha`, a guess for the distance between the initialization and the optimum, which shouldn't need much tuning.
DoWG is a minor modification of DoG such that its step sizes are provably always larger than DoG's. Similarly to DoG, it works by starting from an AdaGrad-like update rule with a small step size, but then automatically increases the step size ("warming up") to be as large as possible. If `alpha` is too large, the optimizer can initially diverge, while if it is too small, the warm-up period can be too long. Depending on the problem, DoWG can be too aggressive and result in unstable behavior. If this is suspected, try using DoG instead.
Parameters

- `alpha`: Scaling factor for the initial guess (`repsilon` in the original paper) of the Euclidean distance between the initial point and the optimum. For the initial parameter `lambda0`, `repsilon` is calculated as `repsilon = alpha*(1 + norm(lambda0))`. (default value: `1e-6`)
AdvancedVI.COCOB — Type

COCOB(alpha)

Continuous Coin Betting (COCOB[OT2017]) optimizer. We use the "COCOB-Backprop" variant, which is closer to the Adam optimizer. Its only parameter is the maximum change per parameter `alpha`, which shouldn't need much tuning.
Parameters

- `alpha`: Scaling parameter. (default value: `100`)
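As a quick reference, the three rules above can be constructed with their documented default values as shown in the sketch below; how they are then handed to the optimization routine depends on your `AdvancedVI.optimize` setup.

```julia
using AdvancedVI

# The parameter-free rules above, constructed with their documented default values.
opt_dog   = AdvancedVI.DoG(1e-6)    # alpha = 1e-6
opt_dowg  = AdvancedVI.DoWG(1e-6)   # alpha = 1e-6
opt_cocob = AdvancedVI.COCOB(100)   # alpha = 100 (maximum change per parameter)
```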
Parameter Averaging Strategies
In some cases, the best optimization performance is obtained by averaging the sequence of parameters generated by the optimization algorithm. For instance, the DoG
[IHC2023] and DoWG
[KMJ2024] papers report their best performance through averaging. The benefits of parameter averaging have been specifically confirmed for ELBO maximization[DCAMHV2020].
AdvancedVI.NoAveraging — Type

NoAveraging()

No averaging. This returns the last iterate of the optimization rule.
AdvancedVI.PolynomialAveraging — Type

PolynomialAveraging(eta)

Polynomial averaging rule proposed by Shamir and Zhang[SZ2013]. At iteration $t$, the parameter average $\bar{\lambda}_t$ according to the polynomial averaging rule is given as
\[ \bar{\lambda}_t = (1 - w_t) \bar{\lambda}_{t-1} + w_t \lambda_t \, ,\]
where the averaging weight is
\[ w_t = \frac{\eta + 1}{t + \eta} \, .\]
A higher `eta` ($\eta$) down-weights earlier iterations. When $\eta=0$, this is equivalent to uniformly averaging the iterates in an online fashion. The DoG paper[IHC2023] suggests $\eta=8$.
Parameters

- `eta`: Regularization term. (default: `8`)
Operators
Depending on the variational family, variational objective, and optimization strategy, it might be necessary to modify the variational parameters after performing a gradient-based update. For this, an operator acting on the parameters can be supplied via the `operator` keyword argument of `AdvancedVI.optimize`.
ClipScale
For the location-scale family, it is often the case that optimization is stable only when the smallest eigenvalue of the scale matrix is strictly positive[D2020]. To ensure this, we provide the following projection operator:
AdvancedVI.ClipScale — Type

ClipScale(ϵ = 1e-5)

Projection operator ensuring that an `MvLocationScale` or `MvLocationScaleLowRank` has a scale with eigenvalues larger than `ϵ`. `ClipScale` also supports operating on `MvLocationScale` and `MvLocationScaleLowRank` objects wrapped in a `Bijectors.TransformedDistribution`.
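A minimal usage sketch: construct the projection operator and supply it through the `operator` keyword mentioned above. The surrounding call to `AdvancedVI.optimize` is elided here since its arguments depend on your problem setup.

```julia
using AdvancedVI

# Keep the scale eigenvalues above 1e-5 after every gradient-based update.
op = AdvancedVI.ClipScale(1e-5)

# Supplied through the `operator` keyword of AdvancedVI.optimize, e.g.
# AdvancedVI.optimize(...; operator = op)
```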
ProximalLocationScaleEntropy
ELBO maximization with the location-scale family tends to be unstable when the scale has small eigenvalues or the step size is large. To remedy this, a proximal operator of the entropy[D2020] can be used.
AdvancedVI.ProximalLocationScaleEntropy — Type

ProximalLocationScaleEntropy()

Proximal operator for the entropy of a location-scale distribution, which is defined as
\[ \mathrm{prox}(\lambda) = \argmin_{\lambda^{\prime}} - \mathbb{H}(q_{\lambda^{\prime}}) + \frac{1}{2 \gamma_t} {\left\lVert \lambda - \lambda^{\prime} \right\rVert}_2^2 ,\]
where $\gamma_t$ is the step size of the optimizer used alongside the proximal operator. This assumes the variational family is `<:VILocationScale` and that the optimizer is one of the following:
- DoG
- DoWG
- Descent
For ELBO maximization, since this proximal operator handles the entropy, the gradient estimator for the ELBO must ignore the entropy term. That is, the `entropy` keyword argument of `RepGradELBO` must be one of the following:
- ClosedFormEntropyZeroGradient
- StickingTheLandingEntropyZeroGradient
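Putting the requirements above together, a compatible configuration might look like the following sketch. The number of Monte Carlo samples passed to `RepGradELBO` and the exact keywords accepted by `AdvancedVI.optimize` are assumptions here, so consult the API reference for your version.

```julia
using AdvancedVI

# The proximal operator handles the entropy, so the ELBO gradient estimator
# must use a zero-gradient entropy estimator.
objective = AdvancedVI.RepGradELBO(
    10;  # number of Monte Carlo samples (assumed positional argument)
    entropy = AdvancedVI.StickingTheLandingEntropyZeroGradient(),
)

optimizer = AdvancedVI.DoWG(1e-6)                    # DoG, DoWG, or Descent
operator  = AdvancedVI.ProximalLocationScaleEntropy()

# These would then be passed to AdvancedVI.optimize, e.g.
# AdvancedVI.optimize(...; optimizer = optimizer, operator = operator)
```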
- [OT2017] Orabona, F., & Tommasi, T. (2017). Training deep networks without learning rates through coin betting. Advances in Neural Information Processing Systems, 30.
- [SZ2013] Shamir, O., & Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning (pp. 71-79). PMLR.
- [DCAMHV2020] Dhaka, A. K., Catalina, A., Andersen, M. R., Magnusson, M., Huggins, J., & Vehtari, A. (2020). Robust, accurate stochastic optimization for variational inference. Advances in Neural Information Processing Systems, 33, 10961-10973.
- [KMJ2024] Khaled, A., Mishchenko, K., & Jin, C. (2023). DoWG unleashed: An efficient universal parameter-free gradient descent method. Advances in Neural Information Processing Systems, 36, 6748-6769.
- [IHC2023] Ivgi, M., Hinder, O., & Carmon, Y. (2023). DoG is SGD's best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning (pp. 14465-14499). PMLR.
- [D2020] Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning.