Title: A High Probability Analysis of Adaptive SGD with Momentum
Stochastic Gradient Descent (SGD) and its variants are the most widely used algorithms in machine learning applications. In particular, SGD with adaptive learning rates and momentum is the industry standard for training deep networks. Despite the enormous success of these methods, our theoretical understanding of these variants in the non-convex setting is incomplete, with most results proving only convergence in expectation and under strong assumptions on the stochastic gradients. In this paper, we present a high probability analysis for adaptive and momentum algorithms, under weak assumptions on the function, stochastic gradients, and learning rates. We use it to prove, for the first time, the convergence of the gradients to zero in high probability in the smooth non-convex setting for Delayed AdaGrad with momentum.
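To make the analyzed algorithm concrete, here is a minimal sketch of Delayed AdaGrad with momentum, assuming a scalar (global) step size built from past squared gradient norms and a heavy-ball-style moving average of gradients; the exact update and constants analyzed in the paper may differ.

```python
import numpy as np

def delayed_adagrad_momentum(grad, x0, T=2000, alpha=1.0, beta=0.9, seed=0):
    """Sketch of Delayed AdaGrad with momentum: the step size at
    iteration t is built only from gradients up to t-1 ("delayed"),
    so it is independent of the current stochastic gradient."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # momentum buffer (moving average of gradients)
    v = 1.0                # positive initialization keeps the first step bounded
    for t in range(T):
        eta = alpha / np.sqrt(v)                            # uses g_1..g_{t-1} only
        g = grad(x) + 0.01 * rng.standard_normal(x.shape)   # stochastic gradient
        m = beta * m + (1 - beta) * g                       # momentum update
        x = x - eta * m
        v += float(g @ g)   # accumulated *after* the step -> "delayed"
    return x

# Smooth non-convex toy objective: f(x) = sum_i x_i^2 / (1 + x_i^2)
grad_f = lambda x: 2 * x / (1 + x**2) ** 2
x = delayed_adagrad_momentum(grad_f, x0=2.0 * np.ones(5))
print(np.linalg.norm(grad_f(x)))   # gradient norm should be small
```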
Award ID(s):
1925930
PAR ID:
10208410
Author(s) / Creator(s):
Date Published:
Journal Name:
Workshop on Beyond First Order Methods in ML Systems at ICML'20
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) for stochastic convex optimization under favorable geometry, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal in terms of finding a near-stationary point with respect to the $$\ell_2$$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $$\ell_1$$-norm of the gradient as the stationarity measure, as opposed to the standard $$\ell_2$$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $$\ell_1$$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, we identify non-convex settings in which the iteration complexity of AdaGrad is favorable over SGD and show that, for certain configurations of problem parameters, it outperforms SGD by a factor of $$d$$, where $$d$$ is the problem dimension. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
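As an illustration of the coordinate-wise structure these refined assumptions target, here is a minimal sketch of diagonal AdaGrad that tracks the $$\ell_1$$-norm of the gradient as the stationarity measure; the toy objective and constants are illustrative only.

```python
import numpy as np

def adagrad(grad, x0, T=2000, alpha=0.5, eps=1e-8):
    """Diagonal (coordinate-wise) AdaGrad: each coordinate gets its own
    step size based on that coordinate's accumulated squared gradients."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                      # per-coordinate sum of g_i^2
    l1_history = []
    for _ in range(T):
        g = grad(x)
        v += g * g
        x -= alpha * g / (np.sqrt(v) + eps)   # coordinate-wise update
        l1_history.append(np.abs(g).sum())    # l1-norm stationarity measure
    return x, min(l1_history)

# Anisotropic toy objective: steep in x[0], flat in x[1]
grad_f = lambda x: np.array([10.0 * x[0], 0.1 * x[1]])
x, best_l1 = adagrad(grad_f, x0=np.array([1.0, 1.0]))
print(best_l1)   # best l1 gradient norm seen along the run
```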
  2. We propose a novel framework for analyzing convergence rates of stochastic optimization algorithms with adaptive step sizes. This framework is based on analyzing properties of an underlying generic stochastic process; in particular, we derive a bound on the expected stopping time of this process. We utilize this framework to analyze the expected global convergence rates of a stochastic variant of a traditional trust-region method. Although traditional trust-region methods rely on exact computations of the gradient, Hessian, and values of the objective function, this method assumes that these values are available only up to some dynamically adjusted accuracy. Moreover, this accuracy is assumed to hold only with some sufficiently large, but fixed, probability, without any additional restrictions on the variance of the errors. This setting applies, for example, to standard stochastic optimization and machine learning formulations. Improving upon prior analysis, we show that the stochastic process defined by the trust-region method satisfies the assumptions of our proposed general framework. The stopping time in this setting is defined by an iterate satisfying a first-order accuracy condition. We demonstrate the first global complexity bound for a stochastic trust-region method under the assumption of sufficiently accurate stochastic gradients. Finally, we apply the same framework to derive second-order complexity bounds under additional assumptions.
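For context, the sketch below shows a classical trust-region loop with a Cauchy-point step and ratio test; in the stochastic variant analyzed above, the function, gradient, and Hessian would be replaced by estimates that are accurate only with a fixed probability, machinery this sketch omits.

```python
import numpy as np

def trust_region(f, grad, hess, x0, T=100, delta=1.0, eta=0.1):
    """Classical trust-region loop with a Cauchy-point step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        g, H = grad(x), hess(x)
        # Cauchy point: minimize the quadratic model along -g within the region
        gHg = g @ H @ g
        tau = 1.0 if gHg <= 0 else min(1.0, np.linalg.norm(g) ** 3 / (delta * gHg))
        s = -tau * (delta / (np.linalg.norm(g) + 1e-12)) * g
        pred = -(g @ s + 0.5 * s @ H @ s)            # predicted model decrease
        rho = (f(x) - f(x + s)) / max(pred, 1e-12)   # actual vs. predicted
        if rho >= eta:                   # sufficient decrease: accept step
            x = x + s
            delta = min(2.0 * delta, 10.0)
        else:                            # reject step and shrink the region
            delta *= 0.5
    return x

# Rosenbrock test function
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                           200*(x[1] - x[0]**2)])
hess = lambda x: np.array([[2 - 400*(x[1] - 3*x[0]**2), -400*x[0]],
                           [-400*x[0], 200.0]])
print(trust_region(f, grad, hess, np.zeros(2)))
```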
  3. We study stochastic gradient descent (SGD) in the master-worker architecture under Byzantine attacks. Building upon recent advances in algorithmic high-dimensional robust statistics, in each SGD iteration the master employs a non-trivial decoding to estimate the true gradient from the unbiased stochastic gradients received from workers, some of which may be corrupt. We provide convergence analyses for both strongly convex and non-convex smooth objectives under standard SGD assumptions. In both settings, the approximation error of our solution is controlled by the mini-batch size of the stochastic gradients, and it can be made as small as desired, provided that the workers use a sufficiently large mini-batch size. Our algorithm can tolerate a fraction of Byzantine workers of less than 1/3. It approximately finds the optimal parameters in the strongly convex setting exponentially fast, and reaches an approximate stationary point in the non-convex setting at a rate of $$1/T$$, thus matching the convergence rates of vanilla SGD in the Byzantine-free setting.
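The decoding step in the paper builds on high-dimensional robust mean estimation; as a simpler stand-in, the sketch below uses coordinate-wise median aggregation at the master, which is not the paper's estimator but illustrates the master-worker structure and the tolerance of a sub-1/3 Byzantine fraction.

```python
import numpy as np

def byzantine_sgd(grad, x0, n_workers=10, n_byz=3, T=500, lr=0.1, batch=32, seed=0):
    """Master-worker SGD with robust aggregation: the coordinate-wise
    median is a simple stand-in for the robust mean "decoding" step."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        grads = []
        for w in range(n_workers):
            if w < n_byz:
                # Byzantine worker: arbitrary (here, large random) gradient
                grads.append(100.0 * rng.standard_normal(x.shape))
            else:
                # honest worker: unbiased mini-batch stochastic gradient
                noise = rng.standard_normal(x.shape) / np.sqrt(batch)
                grads.append(grad(x) + noise)
        g_hat = np.median(np.stack(grads), axis=0)   # robust aggregation
        x -= lr * g_hat
    return x

# Strongly convex toy objective f(x) = 0.5 * ||x||^2, so grad(x) = x
x = byzantine_sgd(lambda x: x, x0=5.0 * np.ones(4))
print(np.linalg.norm(x))   # near 0 despite 3/10 corrupt workers
```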
  4. SGD with Momentum (SGDM) is a widely used family of algorithms for large-scale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, such as averaging of the iterates and projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $$\Omega(\frac{\ln T}{\sqrt{T}})$$ after $$T$$ iterations. Based on this fact, we study a class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with increasing momentum and shrinking updates. For these algorithms, we show that the last iterate has the optimal convergence rate of $$O(\frac{1}{\sqrt{T}})$$ for unconstrained convex stochastic optimization problems, without projections onto bounded domains or knowledge of $$T$$. Further, we show a variety of results for FTRL-based SGDM used with adaptive stepsizes. Empirical results are presented as well.
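A minimal sketch of the increasing-momentum, shrinking-update idea follows; the schedules used here ($$\beta_t = \frac{t}{t+2}$$ and $$\alpha_t \propto \frac{1}{\sqrt{t}}$$) are illustrative choices, not necessarily those of the FTRL-based algorithms in the paper.

```python
import numpy as np

def sgdm_increasing_momentum(grad, x0, T=5000, c=0.5, seed=0):
    """SGDM where the momentum factor increases toward 1 and the update
    magnitude shrinks over time; no projection, no iterate averaging."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for t in range(1, T + 1):
        g = grad(x) + 0.1 * rng.standard_normal(x.shape)  # stochastic (sub)gradient
        beta = t / (t + 2)                # momentum factor -> 1
        m = beta * m + (1 - beta) * g     # averaged gradient direction
        x -= (c / np.sqrt(t)) * m         # shrinking update, no projection
    return x                              # last iterate, no averaging

# Convex, Lipschitz toy objective f(x) = ||x||_1, subgradient sign(x)
x_last = sgdm_increasing_momentum(np.sign, x0=2.0 * np.ones(3))
print(np.abs(x_last).sum())   # f(last iterate) should be small
```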