Title: RMSPROP CONVERGES WITH PROPER HYPERPARAMETER
Despite the existence of divergence examples, RMSprop remains one of the most popular algorithms in machine learning. Towards closing the gap between theory and practice, we prove that RMSprop converges with a proper choice of hyperparameters under certain conditions. More specifically, we prove that when the hyperparameter $\beta_2$ is close enough to 1, RMSprop and its random shuffling version converge to a bounded region in general, and to critical points in the interpolation regime. It is worth mentioning that our results do not depend on the "bounded gradient" assumption, which is often the key assumption used by existing theoretical work on Adam-type adaptive gradient methods. Removing this assumption allows us to establish a phase transition from divergence to non-divergence for RMSprop. Finally, based on our theory, we conjecture that in practice there is a critical threshold $\beta_2^*$ such that RMSprop generates reasonably good results only if $1 > \beta_2 \ge \beta_2^*$. We provide empirical evidence for such a phase transition in our numerical experiments.
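For concreteness, below is a minimal sketch of the standard RMSprop update; `beta2` is the hyperparameter whose closeness to 1 the result above concerns. Function names and default values are illustrative, not the paper's code.

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=1e-3, beta2=0.999, eps=1e-8):
    """One RMSprop step on parameters w given a stochastic gradient grad.

    Per the conjecture above, sweeping beta2 toward 1 is what moves
    RMSprop from the divergent to the non-divergent regime.
    """
    v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of squared gradients
    w = w - lr * grad / (np.sqrt(v) + eps)    # per-coordinate adaptive step
    return w, v
```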
Award ID(s):
1727757
NSF-PAR ID:
10273075
Author(s) / Creator(s):
Date Published:
Journal Name:
International Conference on Learning Representations
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. As predictive models are increasingly deployed in high-stakes decision making (e.g., loan approvals), there has been growing interest in post-hoc techniques that provide recourse to affected individuals. These techniques generate recourses under the assumption that the underlying predictive model does not change. In practice, however, models are regularly updated for a variety of reasons (e.g., dataset shifts), rendering previously prescribed recourses ineffective. To address this problem, we propose a novel framework, RObust Algorithmic Recourse (ROAR), that leverages adversarial training to find recourses that are robust to model shifts. To the best of our knowledge, this work proposes the first solution to this critical problem. We also carry out a theoretical analysis which underscores the importance of constructing recourses that are robust to model shifts: 1) we quantify the probability of invalidation for recourses generated without accounting for model shifts; 2) we prove that the additional cost incurred due to the robust recourses output by our framework is bounded. Experimental evaluation on multiple synthetic and real-world datasets demonstrates the efficacy of the proposed framework.
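For intuition, here is a hedged sketch of the min-max idea for a linear score f(x) = w @ x, where the inner maximization over bounded model shifts ||delta|| <= delta_max has a closed form. The hinge loss, quadratic cost, and all names are illustrative assumptions, not the ROAR implementation.

```python
import numpy as np

def robust_recourse(x0, w, delta_max=0.1, lam=0.5, lr=0.05, steps=200):
    """Gradient descent on a robust-recourse objective:
    min_x  max(0, 1 - min_{||d|| <= delta_max} (w + d) @ x) + lam * ||x - x0||^2
    For a linear score, the worst shift is d = -delta_max * x / ||x||."""
    x = x0.astype(float).copy()
    for _ in range(steps):
        norm = np.linalg.norm(x) + 1e-12
        worst_score = w @ x - delta_max * norm   # closed-form inner max
        grad = 2.0 * lam * (x - x0)              # gradient of the cost term
        if worst_score < 1.0:                    # hinge is active
            grad -= w - delta_max * x / norm     # gradient of worst-case loss
        x -= lr * grad
    return x

# Hypothetical usage: nudge x0 across the worst-case decision boundary.
x_cf = robust_recourse(np.array([0.2, -0.4]), w=np.array([1.0, 1.0]))
```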
  2. Stochastic gradient descent (SGD) is the optimization algorithm of choice in many machine learning applications, such as regularized empirical risk minimization and training deep neural networks. The classical analysis of convergence of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is always violated for cases where the objective function is strongly convex. In (Bottou et al., 2016), a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. Here we show that for stochastic problems arising in machine learning, such a bound always holds. Moreover, we propose an alternative convergence analysis of SGD in the diminishing learning rate regime, which results in more relaxed conditions than those in (Bottou et al., 2016). We then move on to the asynchronous parallel setting and prove convergence of the Hogwild! algorithm in the same regime, obtaining the first convergence results for this method with diminishing learning rates.
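The relative growth condition referenced above can be written schematically as follows (a sketch with illustrative constants, replacing uniform boundedness of the stochastic gradient):

```latex
% g(w, \xi): stochastic gradient; F: objective; M, M_G: illustrative constants.
\mathbb{E}_{\xi}\!\left[ \| g(w,\xi) \|^{2} \right]
  \;\le\; M + M_G \, \| \nabla F(w) \|^{2}
```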
  3. Abstract

    We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on these choices. We identify a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss, but at which a new correction term appears that changes the phase diagram. About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples, including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena, including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations. At the same time, we demonstrate the benefit of overparametrization by showing that the latter probability goes to zero as the second-layer width grows.
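Schematically, the two kinds of limits described above take the following shape (a sketch with illustrative notation, not the paper's precise statements: u_t collects the summary statistics and Sigma a limiting covariance):

```latex
% Ballistic (ODE) limit for the summary statistics:
\frac{\mathrm{d}u}{\mathrm{d}t} = f(u),
\qquad
% Diffusive (SDE) limit, relevant near fixed points of the ODE:
\mathrm{d}u_t = f(u_t)\,\mathrm{d}t + \Sigma(u_t)^{1/2}\,\mathrm{d}B_t
```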

     
  4. Abstract

    Motivated by the challenge of sampling Gibbs measures with nonconvex potentials, we study a continuum birth–death dynamics. We improve results in previous works (Liu et al 2023 Appl. Math. Optim. 87 48; Lu et al 2019 arXiv:1905.09863) and provide weaker hypotheses under which the probability density of the birth–death dynamics governed by Kullback–Leibler divergence or by χ² divergence converges exponentially fast to the Gibbs equilibrium measure, with a universal rate that is independent of the potential barrier. To build a practical numerical sampler based on the pure birth–death dynamics, we consider an interacting particle system, which is inspired by the gradient flow structure and the classical Fokker–Planck equation and relies on kernel-based approximations of the measure. Using the technique of Γ-convergence of gradient flows, we show that, on the torus, smooth and bounded positive solutions of the kernelised dynamics converge on finite time intervals to the pure birth–death dynamics as the kernel bandwidth shrinks to zero. Moreover, we provide quantitative estimates on the bias of minimisers of the energy corresponding to the kernelised dynamics. Finally, we prove long-time asymptotic results on the convergence of the asymptotic states of the kernelised dynamics towards the Gibbs measure.
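For orientation, the KL-governed birth–death dynamics has the following schematic form (notation illustrative; π denotes the Gibbs measure, and subtracting the mean keeps the total mass of ρ_t equal to one):

```latex
\partial_t \rho_t(x) \;=\; -\rho_t(x)\left( \log\frac{\rho_t(x)}{\pi(x)}
  \;-\; \int \log\frac{\rho_t(y)}{\pi(y)}\,\rho_t(y)\,\mathrm{d}y \right)
```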

     
  5. Abstract

    We develop a general compactification framework to facilitate the analysis of nonautonomous ODEs whose nonautonomous terms decay asymptotically. The strategy is to compactify the problem: the phase space is augmented with a bounded but open dimension and then extended at one or both ends by gluing in flow-invariant subspaces that carry the autonomous dynamics of the limit systems from infinity. We derive the weakest decay conditions possible for the compactified system to be continuously differentiable on the extended phase space. This enables us to use equilibria and other compact invariant sets of the limit systems from infinity to analyze the original nonautonomous problem in the spirit of dynamical systems theory. Specifically, we prove that solutions of interest are contained in unique invariant manifolds of saddles for the limit systems when embedded in the extended phase space. The uniqueness holds in the general case, that is, even if the compactification gives rise to a centre direction and the manifolds become centre or centre-stable manifolds. A wide range of problems, including pullback attractors, rate-induced critical transitions (R-tipping), and nonlinear wave solutions, fit naturally into our framework, and their analysis can be greatly simplified by the compactification.
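As a toy instance of the strategy (the choice s = tanh(t) is an illustrative assumption, not the paper's general construction): time is mapped to a bounded coordinate, its autonomous equation is appended, and the glued-in subspaces s = ±1 carry the autonomous limit dynamics.

```latex
% Original nonautonomous problem and its compactified counterpart:
\dot{x} = f(x,t)
\;\longrightarrow\;
\begin{cases}
\dot{x} = f\big(x, \operatorname{artanh}(s)\big),\\[2pt]
\dot{s} = 1 - s^{2},
\end{cases}
\qquad s = \tanh(t) \in (-1,1)
```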