skip to main content


Title: High-Resolution Modeling of the Fastest First-Order Optimization Method for Strongly Convex Functions
Motivated by the fact that the gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs), here we derive an ODE representation of the accelerated triple momentum (TM) algorithm. For unconstrained optimization problems with strongly convex cost, the TM algorithm has a proven faster convergence rate than the Nesterov's accelerated gradient (NAG) method but with the same computational complexity. We show that similar to the NAG method, in order to accurately capture the characteristics of the TM method, we need to use a high-resolution modeling to obtain the ODE representation of the TM algorithm. We propose a Lyapunov analysis to investigate the stability and convergence behavior of the proposed high-resolution ODE representation of the TM algorithm. We compare the rate of the ODE representation of the TM method with that of the NAG method to confirm its faster convergence. Our study also leads to a tighter bound on the worst rate of convergence for the ODE model of the NAG method. In this paper, we also discuss the use of the integral quadratic constraint (IQC) method to establish an estimate on the rate of convergence of the TM algorithm. A numerical example verifies our results.  more » « less
Award ID(s):
1653838
NSF-PAR ID:
10219110
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2020 59th IEEE Conference on Decision and Control (CDC)
Page Range / eLocation ID:
4237 to 4242
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms—Nesterov’s accelerated gradient method for strongly convex functions (NAG-) and Polyak’s heavy-ball method—we study an alternative limiting process that yieldshigh-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG- and Polyak’s heavy-ball method, but they allow the identification of a term that we refer to as “gradient correction” that is present in NAG- but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov’s accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result—that NAG- minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG- for smooth convex functions.

     
    more » « less
  2. null (Ed.)
    Gradient descent-based optimization methods underpin the parameter training of neural networks, and hence comprise a significant component in the impressive test results found in a number of applications. Introducing stochasticity is key to their success in practical problems, and there is some understanding of the role of stochastic gradient descent in this context. Momentum modifications of gradient descent such as Polyak’s Heavy Ball method (HB) and Nesterov’s method of accelerated gradients (NAG), are also widely adopted. In this work our focus is on understanding the role of momentum in the training of neural networks, concentrating on the common situation in which the momentum contribution is fixed at each step of the algorithm. To expose the ideas simply we work in the deterministic setting. Our approach is to derive continuous time approximations of the discrete algorithms; these continuous time approximations provide insights into the mechanisms at play within the discrete algorithms. We prove three such approximations. Firstly we show that standard implementations of fixed momentum methods approximate a time-rescaled gradient descent flow, asymptotically as the learning rate shrinks to zero; this result does not distinguish momentum methods from pure gradient descent, in the limit of vanishing learning rate. We then proceed to prove two results aimed at understanding the observed practical advantages of fixed momentum methods over gradient descent, when implemented in the non-asymptotic regime with fixed small, but non-zero, learning rate. We achieve this by proving approximations to continuous time limits in which the small but fixed learning rate appears as a parameter; this is known as the method of modified equations in the numerical analysis literature, recently rediscovered as the high resolution ODE approximation in the machine learning context. In our second result we show that the momentum method is approximated by a continuous time gradient flow, with an additional momentum-dependent second order time-derivative correction, proportional to the learning rate; this may be used to explain the stabilizing effect of momentum algorithms in their transient phase. Furthermore in a third result we show that the momentum methods admit an exponentially attractive invariant manifold on which the dynamics reduces, approximately, to a gradient flow with respect to a modified loss function, equal to the original loss function plus a small perturbation proportional to the learning rate; this small correction provides convexification of the loss function and encodes additional robustness present in momentum methods, beyond the transient phase. 
    more » « less
  3. Abstract

    Convergence analysis of accelerated first-order methods for convex optimization problems are developed from the point of view of ordinary differential equation solvers. A new dynamical system, called Nesterov accelerated gradient (NAG) flow, is derived from the connection between acceleration mechanism andA-stability of ODE solvers, and the exponential decay of a tailored Lyapunov function along with the solution trajectory is proved. Numerical discretizations of NAG flow are then considered and convergence rates are established via a discrete Lyapunov function. The proposed differential equation solver approach can not only cover existing accelerated methods, such as FISTA, Güler’s proximal algorithm and Nesterov’s accelerated gradient method, but also produce new algorithms for composite convex optimization that possess accelerated convergence rates. Both the convex and the strongly convex cases are handled in a unified way in our approach.

     
    more » « less
  4. We present a GPU algorithm for deformable simulation. Our method offers good computational efficiency and penetration-free guarantee at the same time, which are not common with existing techniques. The main idea is an algorithmic integration of projective dynamics (PD) and incremental potential contact (IPC). PD is a position-based simulation framework, favored for its robust convergence and convenient implementation. We show that PD can be employed to handle the variational optimization with the interior point method e.g., IPC. While conceptually straightforward, this requires a dedicated rework over the collision resolution and the iteration modality to avoid incorrect collision projection with improved numerical convergence. IPC exploits a barrier-based formulation, which yields an infinitely large penalty when the constraint is on the verge of being violated. This mechanism guarantees intersection-free trajectories of deformable bodies during the simulation, as long as they are apart at the rest configuration. On the downside, IPC brings a large amount of nonlinearity to the system, making PD slower to converge. To mitigate this issue, we propose a novel GPU algorithm named A-Jacobi for faster linear solve at the global step of PD. A-Jacobi is based on Jacobi iteration, but it better harvests the computation capacity on modern GPUs by lumping several Jacobi steps into a single iteration. In addition, we also re-design the CCD root finding procedure by using a new minimum-gradient Newton algorithm. Those saved time budgets allow more iterations to accommodate stiff IPC barriers so that the result is both realistic and collision-free. Putting together, our algorithm simulates complicated models of both solids and shells on the GPU at an interactive rate or even in real time. 
    more » « less
  5. Network pruning is a widely used technique to reduce computation cost and model size for deep neural networks. However, the typical three-stage pipeline (i.e., training, pruning, and retraining (fine-tuning)) significantly increases the overall training time. In this article, we develop a systematic weight-pruning optimization approach based on surrogate Lagrangian relaxation (SLR), which is tailored to overcome difficulties caused by the discrete nature of the weight-pruning problem. We further prove that our method ensures fast convergence of the model compression problem, and the convergence of the SLR is accelerated by using quadratic penalties. Model parameters obtained by SLR during the training phase are much closer to their optimal values as compared to those obtained by other state-of-the-art methods. We evaluate our method on image classification tasks using CIFAR-10 and ImageNet with state-of-the-art multi-layer perceptron based networks such as MLP-Mixer; attention-based networks such as Swin Transformer; and convolutional neural network based models such as VGG-16, ResNet-18, ResNet-50, ResNet-110, and MobileNetV2. We also evaluate object detection and segmentation tasks on COCO, the KITTI benchmark, and the TuSimple lane detection dataset using a variety of models. Experimental results demonstrate that our SLR-based weight-pruning optimization approach achieves a higher compression rate than state-of-the-art methods under the same accuracy requirement and also can achieve higher accuracy under the same compression rate requirement. Under classification tasks, our SLR approach converges to the desired accuracy × faster on both of the datasets. Under object detection and segmentation tasks, SLR also converges 2× faster to the desired accuracy. Further, our SLR achieves high model accuracy even at the hardpruning stage without retraining, which reduces the traditional three-stage pruning into a two-stage process. Given a limited budget of retraining epochs, our approach quickly recovers the model’s accuracy.

     
    more » « less