Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms—Nesterov’s accelerated gradient method for strongly convex functions (NAG-) and Polyak’s heavy-ball method—we study an alternative limiting process that yields
Convergence analysis of accelerated first-order methods for convex optimization problems are developed from the point of view of ordinary differential equation solvers. A new dynamical system, called Nesterov accelerated gradient (NAG) flow, is derived from the connection between acceleration mechanism and
- Award ID(s):
- 2012465
- PAR ID:
- 10376482
- Publisher / Repository:
- Springer Science + Business Media
- Date Published:
- Journal Name:
- Mathematical Programming
- Volume:
- 195
- Issue:
- 1-2
- ISSN:
- 0025-5610
- Page Range / eLocation ID:
- p. 735-781
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract high-resolution ODEs . We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG- and Polyak’s heavy-ball method, but they allow the identification of a term that we refer to as “gradient correction” that is present in NAG- but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov’s accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result—that NAG- minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG- for smooth convex functions. -
null (Ed.)Motivated by the fact that the gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs), here we derive an ODE representation of the accelerated triple momentum (TM) algorithm. For unconstrained optimization problems with strongly convex cost, the TM algorithm has a proven faster convergence rate than the Nesterov's accelerated gradient (NAG) method but with the same computational complexity. We show that similar to the NAG method, in order to accurately capture the characteristics of the TM method, we need to use a high-resolution modeling to obtain the ODE representation of the TM algorithm. We propose a Lyapunov analysis to investigate the stability and convergence behavior of the proposed high-resolution ODE representation of the TM algorithm. We compare the rate of the ODE representation of the TM method with that of the NAG method to confirm its faster convergence. Our study also leads to a tighter bound on the worst rate of convergence for the ODE model of the NAG method. In this paper, we also discuss the use of the integral quadratic constraint (IQC) method to establish an estimate on the rate of convergence of the TM algorithm. A numerical example verifies our results.more » « less
-
We provide a second-order stochastic differential equation (SDE), which characterizes the continuous-time dynamics of accelerated stochastic mirror descent (ASMD) for strongly convex functions. This SDE plays a central role in designing new discrete-time ASMD algorithms via numerical discretization, and providing neat analyses of their convergence rates based on Lyapunov functions. Our results suggest that the only existing ASMD algorithm, namely, AC-SA proposed in Ghadimi & Lan (2012) is one instance of its kind, and we can actually derive new instances of ASMD with fewer tuning parameters. This sheds light on revisiting accelerated stochastic optimization through the lens of SDEs, which can lead to a better understanding of acceleration in stochastic optimization, as well as new simpler algorithms. Numerical experiments on both synthetic and real data support our theory.more » « less
-
In this paper, we present an accelerated quasi-Newton proximal extragradient method for solving unconstrained smooth convex optimization problems. With access only to the gradients of the objective function, we prove that our method can achieve a convergence rate of ${\bigO}\bigl(\min\{\frac{1}{k^2}, \frac{\sqrt{d\log k}}{k^{2.5}}\}\bigr)$, where $d$ is the problem dimension and $k$ is the number of iterations. In particular, in the regime where $k = \bigO(d)$, our method matches the \emph{optimal rate} of $\mathcal{O}(\frac{1}{k^2})$ by Nesterov's accelerated gradient (NAG). Moreover, in the the regime where $k = \Omega(d \log d)$, it outperforms NAG and converges at a \emph{faster rate} of $\mathcal{O}\bigl(\frac{\sqrt{d\log k}}{k^{2.5}}\bigr)$. To the best of our knowledge, this result is the first to demonstrate a provable gain for a quasi-Newton-type method over NAG in the convex setting. To achieve such results, we build our method on a recent variant of the Monteiro-Svaiter acceleration framework and adopt an online learning perspective to update the Hessian approximation matrices, in which we relate the convergence rate of our method to the dynamic regret of a specific online convex optimization problem in the space of matrices.more » « less
-
Dasgupta, Sanjoy ; Mandt, Stephan ; Li, Yingzhen (Ed.)We study accelerated optimization methods in the Gaussian phase retrieval problem. In this setting, we prove that gradient methods with Polyak or Nesterov momentum have similar implicit regularization to gradient descent. This implicit regularization ensures that the algorithms remain in a nice region, where the cost function is strongly convex and smooth despite being nonconvex in general. This ensures that these accelerated methods achieve faster rates of convergence than gradient descent. Experimental evidence demonstrates that the accelerated methods converge faster than gradient descent in practice.more » « less