Optimization

Parameter-Free Optimization Rules

We provide custom optimization rules that are not available out-of-the-box in Optimisers.jl. The common theme of these optimizers is that they are parameter-free: the rules should require little to no tuning to obtain performance competitive with well-tuned alternatives.

AdvancedVI.DoGType
DoG(repsilon)

Distance over gradient (DoG[IHC2023]) optimizer. Its only parameter is the initial guess of the Euclidean distance to the optimum, repsilon. The original paper recommends $ 10^{-4} ( 1 + \lVert \lambda_0 \rVert ) $, but the default value is $ 10^{-6} $.

Parameters

  • repsilon: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
source
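
As a concrete illustration of the recommendation above, here is a minimal sketch constructing DoG with the paper-suggested initial distance. It assumes DoG is exported by AdvancedVI, that a zero-argument constructor falls back to the 1e-6 default, and that λ0 is a hypothetical flattened vector of initial variational parameters.

```julia
using AdvancedVI
using LinearAlgebra

λ0 = randn(10)  # hypothetical initial variational parameters (flattened)

opt_default = DoG()                       # assumes repsilon defaults to 1e-6
opt_paper   = DoG(1e-4 * (1 + norm(λ0)))  # paper-recommended initial distance estimate
```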
AdvancedVI.DoWGType
DoWG(repsilon)

Distance over weighted gradient (DoWG[KMJ2024]) optimizer. Its only parameter is the initial guess of the Euclidean distance to the optimum, repsilon.

Parameters

  • repsilon: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: 1e-6)
source
AdvancedVI.COCOBType
COCOB(alpha)

Continuous Coin Betting (COCOB[OT2017]) optimizer. We use the "COCOB-Backprop" variant, which is closer to the Adam optimizer. Its only parameter is the maximum change per parameter, α, which shouldn't need much tuning.

Parameters

  • alpha: Scaling parameter. (default value: 100)
source
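
Since each rule takes at most one scalar hyperparameter, constructing them is straightforward. A minimal sketch, assuming the rules are exported by AdvancedVI and that the zero-argument constructors use the defaults stated in the docstrings above:

```julia
using AdvancedVI

rules = (
    DoG(),       # default initial distance estimate repsilon = 1e-6
    DoWG(1e-6),  # explicit initial distance estimate
    COCOB(),     # default maximum per-parameter change α = 100
)
```

Any of these can be supplied to AdvancedVI.optimize wherever a tuned first-order rule such as Optimisers.Adam would otherwise be used; a full call is sketched in the ClipScale example further below.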

Parameter Averaging Strategies

In some cases, the best optimization performance is obtained by averaging the sequence of parameters generated by the optimization algorithm. For instance, the DoG[IHC2023] and DoWG[KMJ2024] papers report their best performance through averaging. The benefits of parameter averaging have been specifically confirmed for ELBO maximization[DCAMHV2020].

AdvancedVI.PolynomialAveragingType
PolynomialAveraging(eta)

Polynomial averaging rule proposed by Shamir and Zhang[SZ2013]. At iteration t, the parameter average $ \bar{\lambda}_t $ under the polynomial averaging rule is given as

\[ \bar{\lambda}_t = (1 - w_t) \bar{\lambda}_{t-1} + w_t \lambda_t \, ,\]

where the averaging weight is

\[ w_t = \frac{\eta + 1}{t + \eta} \, .\]

Higher eta ($\eta$) down-weights earlier iterations. When $\eta=0$, this is equivalent to uniformly averaging the iterates in an online fashion. The DoG paper[IHC2023] suggests $\eta=8$.

Parameters

  • eta: Regularization term. (default: 8)
source
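
To make the recurrence concrete, here is a small self-contained sketch (independent of the library's implementation, and written for scalar iterates) checking that $\eta = 0$ recovers the uniform running average:

```julia
# Polynomial averaging: λ̄_t = (1 - w_t) λ̄_{t-1} + w_t λ_t,  with w_t = (η + 1)/(t + η)
function polynomial_average(iterates, η)
    avg = iterates[1]  # w_1 = 1, so the average starts at the first iterate
    for t in 2:length(iterates)
        w   = (η + 1) / (t + η)
        avg = (1 - w) * avg + w * iterates[t]
    end
    return avg
end

iterates = randn(100)

# η = 0 reduces to the uniform running mean of the iterates.
polynomial_average(iterates, 0) ≈ sum(iterates) / length(iterates)  # true

# Higher η down-weights early iterates; the DoG paper suggests η = 8.
polynomial_average(iterates, 8)
```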

Operators

Depending on the variational family, variational objective, and optimization strategy, it might be necessary to modify the variational parameters after performing a gradient-based update. For this, an operator acting on the parameters can be supplied via the operator keyword argument of AdvancedVI.optimize.

ClipScale

For the location-scale family, optimization is often stable only when the smallest eigenvalue of the scale matrix is strictly positive[D2020]. To ensure this, we provide the following projection operator:

AdvancedVI.ClipScaleType
ClipScale(ϵ = 1e-5)

Projection operator ensuring that an MvLocationScale or MvLocationScaleLowRank has a scale matrix with eigenvalues larger than ϵ. ClipScale also operates on an MvLocationScale or MvLocationScaleLowRank wrapped in a Bijectors.TransformedDistribution object.

source
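
A minimal sketch of supplying the operator is shown below. The operator keyword argument is documented above; the positional argument order and the optimizer keyword name are assumptions and should be checked against the AdvancedVI.optimize docstring. Here, prob, objective, and q0 (a location-scale variational approximation) are hypothetical and assumed to be constructed elsewhere.

```julia
using AdvancedVI

# Hypothetical setup: `prob`, `objective`, and `q0` (a location-scale family,
# e.g. a full-rank Gaussian) are assumed to be constructed elsewhere.
n_iters = 10_000

# Argument order and the `optimizer` keyword name follow an assumed signature;
# consult the local `AdvancedVI.optimize` docstring for the exact interface.
result = AdvancedVI.optimize(
    prob, objective, q0, n_iters;
    optimizer = DoWG(),           # parameter-free rule from above
    operator  = ClipScale(1e-5),  # keep the scale's eigenvalues above 1e-5
)
```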
  • [OT2017]: Orabona, F., & Tommasi, T. (2017). Training deep networks without learning rates through coin betting. Advances in Neural Information Processing Systems, 30.
  • [SZ2013]: Shamir, O., & Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning (pp. 71-79). PMLR.
  • [DCAMHV2020]: Dhaka, A. K., Catalina, A., Andersen, M. R., Magnusson, M., Huggins, J., & Vehtari, A. (2020). Robust, accurate stochastic optimization for variational inference. Advances in Neural Information Processing Systems, 33, 10961-10973.
  • [KMJ2024]: Khaled, A., Mishchenko, K., & Jin, C. (2023). DoWG unleashed: An efficient universal parameter-free gradient descent method. Advances in Neural Information Processing Systems, 36, 6748-6769.
  • [IHC2023]: Ivgi, M., Hinder, O., & Carmon, Y. (2023). DoG is SGD's best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning (pp. 14465-14499). PMLR.
  • [D2020]: Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning.