skip to main content


Title: On the Convergence of AdaGrad(Norm) on R^d: Beyond Convexity, Non-Asymptotic Rate and Acceleration
Existing analysis of AdaGrad and other adaptive methods for smooth convex optimization is typically for functions with bounded domain diameter. In unconstrained problems, previous works guarantee an asymptotic convergence rate without an explicit constant factor that holds true for the entire function class. Furthermore, in the stochastic setting, only a modified version of AdaGrad, different from the one commonly used in practice, in which the latest gradient is not used to update the stepsize, has been analyzed. Our paper aims at bridging these gaps and developing a deeper understanding of AdaGrad and its variants in the standard setting of smooth convex functions as well as the more general setting of quasar convex functions. First, we demonstrate new techniques to explicitly bound the convergence rate of the vanilla AdaGrad for unconstrained problems in both deterministic and stochastic settings. Second, we propose a variant of AdaGrad for which we can show the convergence of the last iterate, instead of the average iterate. Finally, we give new accelerated adaptive algorithms and their convergence guarantee in the deterministic setting with explicit dependency on the problem parameters, improving upon the asymptotic rate shown in previous works.  more » « less
Award ID(s):
1908510 1750333
NSF-PAR ID:
10465713
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
International Conference on Learning Representations
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Existing analysis of AdaGrad and other adaptive methods for smooth convex optimization is typically for functions with bounded domain diameter. In unconstrained problems, previous works guarantee an asymptotic convergence rate without an explicit constant factor that holds true for the entire function class. Furthermore, in the stochastic setting, only a modified version of AdaGrad, different from the one commonly used in practice, in which the latest gradient is not used to update the stepsize, has been analyzed. Our paper aims at bridging these gaps and developing a deeper understanding of AdaGrad and its variants in the standard setting of smooth convex functions as well as the more general setting of quasar convex functions. First, we demonstrate new techniques to explicitly bound the convergence rate of the vanilla AdaGrad for unconstrained problems in both deterministic and stochastic settings. Second, we propose a variant of AdaGrad for which we can show the convergence of the last iterate, instead of the average iterate. Finally, we give new accelerated adaptive algorithms and their convergence guarantee in the deterministic setting with explicit dependency on the problem parameters, improving upon the asymptotic rate shown in previous works. 
    more » « less
  2. Stochastic gradient descent is the method of choice for large scale optimization of machine learning objective functions. Yet, its performance is greatly variable and heavily depends on the choice of the stepsizes. This has motivated a large body of research on adaptive stepsizes. However, there is currently a gap in our theoretical understanding of these methods, especially in the non-convex setting. In this paper, we start closing this gap: we theoretically analyze in the convex and non-convex settings a generalized version of the AdaGrad stepsizes. We show sufficient conditions for these stepsizes to achieve almost sure asymptotic convergence of the gradients to zero, proving the first guarantee for generalized AdaGrad stepsizes in the non-convex setting. Moreover, we show that these stepsizes allow to automatically adapt to the level of noise of the stochastic gradients in both the convex and non-convex settings, interpolating between O(1/T) and O(1/sqrt(T)), up to logarithmic terms. 
    more » « less
  3. We provide new adaptive first-order methods for constrained convex optimization. Our main algorithms AdaACSA and AdaAGD+ are accelerated methods, which are universal in the sense that they achieve nearly-optimal convergence rates for both smooth and non-smooth functions, even when they only have access to stochastic gradients. In addition, they do not require any prior knowledge on how the objective function is parametrized, since they automatically adjust their per-coordinate learning rate. These can be seen as truly accelerated Adagrad methods for constrained optimization. We complement them with a simpler algorithm AdaGrad+ which enjoys the same features, and achieves the standard non-accelerated convergence rate. We also present a set of new results involving adaptive methods for unconstrained optimization and variational inequalities arising from monotone operators. 
    more » « less
  4. null (Ed.)
    We provide new adaptive first-order methods for constrained convex optimization. Our main algorithms AdaACSA and AdaAGD+ are accelerated methods, which are universal in the sense that they achieve nearly-optimal convergence rates for both smooth and non-smooth functions, even when they only have access to stochastic gradients. In addition, they do not require any prior knowledge on how the objective function is parametrized, since they automatically adjust their per-coordinate learning rate. These can be seen as truly accelerated Adagrad methods for constrained optimization. We complement them with a simpler algorithm AdaGrad+ which enjoys the same features, and achieves the standard non-accelerated convergence rate. We also present a set of new results involving adaptive methods for unconstrained optimization and variational inequalities arising from monotone operators. 
    more » « less
  5. SGD with Momentum (SGDM) is a widely used family of algorithms for large-scale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with \emph{increasing momentum} and \emph{shrinking updates}. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems without projections onto bounded domains nor knowledge of $T$. Further, we show a variety of results for FTRL-based SGDM when used with adaptive stepsizes. Empirical results are shown as well. 
    more » « less