Momentum-Based Variance Reduction in Non-Convex SGD

Cutkosky, Ashok; Orabona, Francesco


Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and a willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $$F$$, STORM finds a point $$\boldsymbol{x}$$ with $$E[\|\nabla F(\boldsymbol{x})\|]\le O(1/\sqrt{T}+\sigma^{1/3}/T^{1/3})$$ in $$T$$ iterations with $$\sigma^2$$ variance in the gradients, matching the optimal rate and without requiring knowledge of $$\sigma$$.
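The abstract does not state the update rule, so the following is only a minimal sketch of a STORM-style recursive-momentum loop as it is commonly described, and should be checked against the full paper. The idea it illustrates is the one named above: instead of a large batch, a single fresh sample is evaluated at both the current and the previous iterate, and the difference corrects a momentum-like running estimate. The toy loss, the noise level `SIGMA`, the constants `k`, `w`, `c`, and the clipping of the momentum weight to at most 1 are all assumptions made for illustration.

```python
import math
import random

SIGMA = 0.1  # noise level of the toy problem (hypothetical value)


def true_grad(x):
    # Gradient of the smooth non-convex toy loss F(x) = sum_j x_j^2 / (1 + x_j^2).
    return [2.0 * xj / (1.0 + xj * xj) ** 2 for xj in x]


def noisy_grad(x, xi):
    # Stochastic gradient: the true gradient plus the noise sample xi.
    return [g + SIGMA * z for g, z in zip(true_grad(x), xi)]


def true_grad_norm(x):
    return math.sqrt(sum(g * g for g in true_grad(x)))


def storm(grad, x0, steps, k=0.1, w=0.1, c=100.0, seed=0):
    """STORM-style loop (sketch). grad(x, xi) is a stochastic gradient at x
    for the sample xi; k, w, c are illustrative constants, not tuned values."""
    rng = random.Random(seed)
    dim = len(x0)
    x = list(x0)
    xi = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    d = grad(x, xi)                     # first estimate: a plain stochastic gradient
    sum_sq = sum(g * g for g in d)      # running sum of squared gradient norms
    for _ in range(steps):
        # Adaptive step size: shrinks as observed gradient norms accumulate.
        eta = k / (w + sum_sq) ** (1.0 / 3.0)
        # Momentum weight tied to the step size; the clip to 1 is an assumption.
        a = min(1.0, c * eta * eta)
        x_prev = x
        x = [xj - eta * dj for xj, dj in zip(x, d)]
        # One fresh sample, evaluated at the new AND the previous iterate.
        xi = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        g_new = grad(x, xi)
        g_old = grad(x_prev, xi)
        # Recursive momentum: new gradient plus the corrected old estimate.
        d = [gn + (1.0 - a) * (dj - go) for gn, dj, go in zip(g_new, d, g_old)]
        sum_sq += sum(g * g for g in g_new)
    return x


x_final = storm(noisy_grad, [1.0, -0.5], steps=2000)
print(true_grad_norm(x_final))
```

With `a = 1` the estimate `d` reduces to a plain stochastic gradient, and with `a = 0` it becomes a purely recursive estimator; intermediate values trade off the two, which is why no batch is needed in this sketch.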

Award ID(s): 1908111

PAR ID: 10208397

Author(s) / Creator(s): Cutkosky, Ashok; Orabona, Francesco

Editor(s): Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; Garnett, R.

Date Published: 2019-01-01

Journal Name: Advances in Neural Information Processing Systems

Volume: 32

ISSN: 1049-5258

Page Range / eLocation ID: 15236 - 15245

Format(s): Medium: X

Sponsoring Org: National Science Foundation
