skip to main content


Title: Accelerated Stochastic Mirror Descent: From Continuous-time Dynamics to Discrete-time Algorithms
We present a new framework to analyze accelerated stochastic mirror descent through the lens of continuous-time stochastic dynamic systems. It enables us to design new algorithms, and perform a unified and simple analysis of the convergence rates of these algorithms. More specifically, under this framework, we provide a Lyapunov function based analysis for the continuous-time stochastic dynamics, as well as several new discrete-time algorithms derived from the continuous-time dynamics. We show that for general convex objective functions, the derived discrete-time algorithms attain the optimal convergence rate. Empirical experiments corroborate our theory.  more » « less
Award ID(s):
1652539 1618948
NSF-PAR ID:
10063535
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
International Conference on Artificial Intelligence and Statistics
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We provide a second-order stochastic differential equation (SDE), which characterizes the continuous-time dynamics of accelerated stochastic mirror descent (ASMD) for strongly convex functions. This SDE plays a central role in designing new discrete-time ASMD algorithms via numerical discretization, and providing neat analyses of their convergence rates based on Lyapunov functions. Our results suggest that the only existing ASMD algorithm, namely, AC-SA proposed in Ghadimi & Lan (2012) is one instance of its kind, and we can actually derive new instances of ASMD with fewer tuning parameters. This sheds light on revisiting accelerated stochastic optimization through the lens of SDEs, which can lead to a better understanding of acceleration in stochastic optimization, as well as new simpler algorithms. Numerical experiments on both synthetic and real data support our theory. 
    more » « less
  2. null (Ed.)
    Gradient descent-based optimization methods underpin the parameter training of neural networks, and hence comprise a significant component in the impressive test results found in a number of applications. Introducing stochasticity is key to their success in practical problems, and there is some understanding of the role of stochastic gradient descent in this context. Momentum modifications of gradient descent such as Polyak’s Heavy Ball method (HB) and Nesterov’s method of accelerated gradients (NAG), are also widely adopted. In this work our focus is on understanding the role of momentum in the training of neural networks, concentrating on the common situation in which the momentum contribution is fixed at each step of the algorithm. To expose the ideas simply we work in the deterministic setting. Our approach is to derive continuous time approximations of the discrete algorithms; these continuous time approximations provide insights into the mechanisms at play within the discrete algorithms. We prove three such approximations. Firstly we show that standard implementations of fixed momentum methods approximate a time-rescaled gradient descent flow, asymptotically as the learning rate shrinks to zero; this result does not distinguish momentum methods from pure gradient descent, in the limit of vanishing learning rate. We then proceed to prove two results aimed at understanding the observed practical advantages of fixed momentum methods over gradient descent, when implemented in the non-asymptotic regime with fixed small, but non-zero, learning rate. We achieve this by proving approximations to continuous time limits in which the small but fixed learning rate appears as a parameter; this is known as the method of modified equations in the numerical analysis literature, recently rediscovered as the high resolution ODE approximation in the machine learning context. In our second result we show that the momentum method is approximated by a continuous time gradient flow, with an additional momentum-dependent second order time-derivative correction, proportional to the learning rate; this may be used to explain the stabilizing effect of momentum algorithms in their transient phase. Furthermore in a third result we show that the momentum methods admit an exponentially attractive invariant manifold on which the dynamics reduces, approximately, to a gradient flow with respect to a modified loss function, equal to the original loss function plus a small perturbation proportional to the learning rate; this small correction provides convexification of the loss function and encodes additional robustness present in momentum methods, beyond the transient phase. 
    more » « less
  3. Abstract

    Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms—Nesterov’s accelerated gradient method for strongly convex functions (NAG-) and Polyak’s heavy-ball method—we study an alternative limiting process that yieldshigh-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG- and Polyak’s heavy-ball method, but they allow the identification of a term that we refer to as “gradient correction” that is present in NAG- but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov’s accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result—that NAG- minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG- for smooth convex functions.

     
    more » « less
  4. In this paper, we study the “decoding” problem for discrete-time, stochastic hybrid systems with linear dynamics in each mode. Given an output trace of the system, the decoding problem seeks to construct a sequence of modes and states that yield a trace “as close as possible” to the original output trace. The decoding problem generalizes the state estimation problem, and is applicable to hybrid systems with non-determinism. The decoding problem is NP-complete, and can be reduced to solving a mixed-integer linear program (MILP). In this paper, we decompose the decoding problem into two parts: (a) finding a sequence of discrete modes and transitions; and (b) finding corresponding continuous states for the mode/transition sequence. In particular, once a sequence of modes/transitions is fixed, the problem of “filling in” the continuous states is performed by a linear programming problem. In order to support the decomposition, we “cover” the set of all possible mode/transition sequences by a finite subset. We use well-known probabilistic arguments to justify a choice of cover with high confidence and design randomized algorithms for finding such covers. Our approach is demonstrated on a series of benchmarks, wherein we observe that relatively tiny fraction of the possible mode/transition sequences can be used as a cover. Furthermore, we show that the resulting linear programs can be solved rapidly by exploiting the tree structure of the set cover. 
    more » « less
  5. Abstract In this paper, a higher order time-discretization scheme is proposed, where the iterates approximate the solution of the stochastic semilinear wave equation driven by multiplicative noise with general drift and diffusion. We employ variational method for its error analysis and prove an improved convergence order of $\frac 32$ for the approximates of the solution. The core of the analysis is Hölder continuity in time and moment bounds for the solutions of the continuous and the discrete problem. Computational experiments are also presented. 
    more » « less