skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Momentum-Based Variance Reduction in Non-Convex SGD
Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the converge rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $$F$$, STORM finds a point $$\boldsymbol{x}$$ with $$E[\|\nabla F(\boldsymbol{x})\|]\le O(1/\sqrt{T}+\sigma^{1/3}/T^{1/3})$$ in $$T$$ iterations with $$\sigma^2$$ variance in the gradients, matching the optimal rate and without requiring knowledge of $$\sigma$$.  more » « less
Award ID(s):
1908111
PAR ID:
10208397
Author(s) / Creator(s):
;
Editor(s):
Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; Garnett, R.
Date Published:
Journal Name:
Advances in neural information processing systems
Volume:
32
ISSN:
1049-5258
Page Range / eLocation ID:
15236 - 15245
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Deep learning methods achieve state-of-the-art performance in many application scenarios. Yet, these methods require a significant amount of hyperparameters tuning in order to achieve the best results. In particular, tuning the learning rates in the stochastic optimization process is still one of the main bottlenecks. In this paper, we propose a new stochastic gradient descent procedure for deep networks that does not require any learning rate setting. Contrary to previous methods, we do not adapt the learning rates nor we make use of the assumed curvature of the objective function. Instead, we reduce the optimization process to a game of betting on a coin and propose a learning rate free optimal algorithm for this scenario. Theoretical convergence is proven for convex and quasi-convex functions and empirical evidence shows the advantage of our algorithm over popular stochastic gradient algorithms. 
    more » « less
  2. Stochastic gradient descent is the method of choice for large scale optimization of machine learning objective functions. Yet, its performance is greatly variable and heavily depends on the choice of the stepsizes. This has motivated a large body of research on adaptive stepsizes. However, there is currently a gap in our theoretical understanding of these methods, especially in the non-convex setting. In this paper, we start closing this gap: we theoretically analyze in the convex and non-convex settings a generalized version of the AdaGrad stepsizes. We show sufficient conditions for these stepsizes to achieve almost sure asymptotic convergence of the gradients to zero, proving the first guarantee for generalized AdaGrad stepsizes in the non-convex setting. Moreover, we show that these stepsizes allow to automatically adapt to the level of noise of the stochastic gradients in both the convex and non-convex settings, interpolating between O(1/T) and O(1/sqrt(T)), up to logarithmic terms. 
    more » « less
  3. In this work, we describe a generic approach to show convergence with high probability for both stochastic convex and non-convex optimization with sub-Gaussian noise. In previous works for convex optimization, either the convergence is only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds depending on the initial distance to the optimal solution. The algorithms use step sizes analogous to the standard settings and are universal to Lipschitz functions, smooth functions, and their linear combinations. The method can be applied to the non-convex case. We demonstrate an $$O((1+\sigma^{2}\log(1/\delta))/T+\sigma/\sqrt{T})$$ convergence rate when the number of iterations $$T$$ is known and an $$O((1+\sigma^{2}\log(T/\delta))/\sqrt{T})$$ convergence rate when $$T$$ is unknown for SGD, where $$1-\delta$$ is the desired success probability. These bounds improve over existing bounds in the literature. We also revisit AdaGrad-Norm (Ward et al., 2019) and show a new analysis to obtain a high probability bound that does not require the bounded gradient assumption made in previous works. The full version of our paper contains results for the standard per-coordinate AdaGrad. 
    more » « less
  4. In this work, we describe a generic approach to show convergence with high probability for both stochastic convex and non-convex optimization with sub-Gaussian noise. In previous works for convex optimization, either the convergence is only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds depending on the initial distance to the optimal solution. The algorithms use step sizes analogous to the standard settings and are universal to Lipschitz functions, smooth functions, and their linear combinations. The method can be applied to the non-convex case. We demonstrate an $$O((1+\sigma^{2}\log(1/\delta))/T+\sigma/\sqrt{T})$$ convergence rate when the number of iterations $$T$$ is known and an $$O((1+\sigma^{2}\log(T/\delta))/\sqrt{T})$$ convergence rate when $$T$$ is unknown for SGD, where $$1-\delta$$ is the desired success probability. These bounds improve over existing bounds in the literature. We also revisit AdaGrad-Norm \cite{ward2019adagrad} and show a new analysis to obtain a high probability bound that does not require the bounded gradient assumption made in previous works. The full version of our paper contains results for the standard per-coordinate AdaGrad. 
    more » « less
  5. We study stochastic gradient descent (SGD) in the master-worker architecture under Byzantine attacks. Building upon the recent advances in algorithmic high-dimensional robust statistics, in each SGD iteration, master employs a non-trivial decoding to estimate the true gradient from the unbiased stochastic gradients received from workers, some of which may be corrupt. We provide convergence analyses for both strongly-convex and non-convex smooth objectives under standard SGD assumptions. We can control the approximation error of our solution in both these settings by the mini-batch size of stochastic gradients; and we can make the approximation error as small as we want, provided that workers use a sufficiently large mini-batch size. Our algorithm can tolerate less than 1/3 fraction of Byzantine workers. It can approximately find the optimal parameters in the strongly-convex setting exponentially fast, and reaches to an approximate stationary point in the non-convex setting with linear speed, i.e., with a rate of 1/T, thus, matching the convergence rates of vanilla SGD in the Byzantine-free setting. 
    more » « less