SGD with Momentum (SGDM) is a widely used family of algorithms for largescale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and nonadaptive) FollowTheRegularizedLeaderbased SGDM algorithms with \emph{increasing momentum} and \emph{shrinking updates}. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems without projections onto bounded domains nor knowledge of $T$. Further, we show a variety of results for FTRLbased SGDM when used with adaptive stepsizes. Empirical results are shown as well.
Last Iterate is Slower than Averaged Iterate in Smooth ConvexConcave Saddle Point Problems
In this paper we study the smooth convexconcave saddle point problem. Specifically, we analyze the last iterate convergence properties of the Extragradient (EG) algorithm. It is well known that the ergodic (averaged) iterates of EG converge at a rate of O(1/T) (Nemirovski, 2004). In this paper, we show that the last iterate of EG converges at a rate of O(1/T‾‾√). To the best of our knowledge, this is the first paper to provide a convergence rate guarantee for the last iterate of EG for the smooth convexconcave saddle point problem. Moreover, we show that this rate is tight by proving a lower bound of Ω(1/T‾‾√) for the last iterate. This lower bound therefore shows a quadratic separation of the convergence rates of ergodic and last iterates in smooth convexconcave saddle point problems.
 Award ID(s):
 1741137
 Publication Date:
 NSFPAR ID:
 10228229
 Journal Name:
 33nd Annual Conference on Learning Theory, COLT 2020
 Sponsoring Org:
 National Science Foundation
More Like this


SGD with Momentum (SGDM) is a widely used family of algorithms for largescale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and nonadaptive) FollowTheRegularizedLeaderbased SGDM algorithms with increasing momentum and shrinking updates. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems without projections onto bounded domains nor knowledge of $T$. Further, we show a variety of results for FTRLbased SGDM when used with adaptive stepsizes. Empirical results are shown as well.

https://arxiv.org/abs/2010.13724 We study the question of obtaining lastiterate convergence rates for noregret learning algorithms in multiplayer games. We show that the optimistic gradient (OG) algorithm with a constant stepsize, which is noregret, achieves a lastiterate rate of O(1/T‾‾√) with respect to the gap function in smooth monotone games. This result addresses a question of Mertikopoulos & Zhou (2018), who asked whether extragradient approaches (such as OG) can be applied to achieve improved guarantees in the multiagent learning setting. The proof of our upper bound uses a new technique centered around an adaptive choice of potential function at each iteration. We also show that the O(1/T‾‾√) rate is tight for all pSCLI algorithms, which includes OG as a special case. As a byproduct of our lower bound analysis we additionally present a proof of a conjecture of Arjevani et al. (2015) which is more direct than previous approaches.

Motivated by applications in Game Theory, Optimization, and Generative Adversarial Networks, recent work of Daskalakis et al~\cite{DISZ17} and followup work of Liang and Stokes~\cite{LiangS18} have established that a variant of the widely used Gradient Descent/Ascent procedure, called "Optimistic Gradient Descent/Ascent (OGDA)", exhibits lastiterate convergence to saddle points in {\em unconstrained} convexconcave minmax optimization problems. We show that the same holds true in the more general problem of {\em constrained} minmax optimization under a variant of the noregret MultiplicativeWeightsUpdate method called "Optimistic MultiplicativeWeights Update (OMWU)". This answers an open question of Syrgkanis et al~\cite{SALS15}. The proof of our result requires fundamentally different techniques from those that exist in noregret learning literature and the aforementioned papers. We show that OMWU monotonically improves the KullbackLeibler divergence of the current iterate to the (appropriately normalized) minmax solution until it enters a neighborhood of the solution. Inside that neighborhood we show that OMWU becomes a contracting map converging to the exact solution. We believe that our techniques will be useful in the analysis of the last iterate of other learning algorithms.

Motivated by applications in Game Theory, Optimization, and Generative Adversarial Networks, recent work of Daskalakis et al \cite{DISZ17} and followup work of Liang and Stokes \cite{LiangS18} have established that a variant of the widely used Gradient Descent/Ascent procedure, called "Optimistic Gradient Descent/Ascent (OGDA)", exhibits lastiterate convergence to saddle points in {\em unconstrained} convexconcave minmax optimization problems. We show that the same holds true in the more general problem of {\em constrained} minmax optimization under a variant of the noregret MultiplicativeWeightsUpdate method called "Optimistic MultiplicativeWeights Update (OMWU)". This answers an open question of Syrgkanis et al \cite{SALS15}. The proof of our result requires fundamentally different techniques from those that exist in noregret learning literature and the aforementioned papers. We show that OMWU monotonically improves the KullbackLeibler divergence of the current iterate to the (appropriately normalized) minmax solution until it enters a neighborhood of the solution. Inside that neighborhood we show that OMWU is locally (asymptotically) stable converging to the exact solution. We believe that our techniques will be useful in the analysis of the last iterate of other learning algorithms.