


Title: Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts
Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It performs comparably or favorably to AdamW but with greater memory efficiency. As might be expected of an algorithm found by program search, Lion incorporates elements from several existing methods, including signed momentum, decoupled weight decay, and Polyak and Nesterov momentum, yet it does not fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer for a wide range of tasks, its theoretical basis remains uncertain. This lack of theoretical clarity limits opportunities to further enhance and expand Lion's efficacy. This work aims to demystify Lion. Based on both continuous-time and discrete-time analysis, we demonstrate that Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ while enforcing a bound constraint $\|x\|_\infty \leq 1/\lambda$. Lion achieves this through the incorporation of decoupled weight decay, where $\lambda$ is the weight decay coefficient. Our analysis is made possible by the development of a new Lyapunov function for the Lion updates. It applies to a broader family of Lion-$\phi$ algorithms, where the $\text{sign}(\cdot)$ operator in Lion is replaced by the subgradient of a convex function $\phi$, leading to the solution of the general composite optimization problem $\min_x f(x) + \phi^*(x)$. Our findings provide valuable insights into the dynamics of Lion and pave the way for further improvements and extensions of Lion-related algorithms.
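For concreteness, the update that the abstract analyzes can be written in a few lines. The sketch below follows the published Lion update rule; the hyperparameter names and default values (lr, beta1, beta2, wd) are illustrative choices, not values taken from the paper.

```python
import numpy as np

def lion_step(x, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    """One Lion update on parameters x with momentum buffer m.

    The descent direction is the sign of an interpolation between the
    momentum buffer and the fresh gradient; decoupled weight decay
    (coefficient wd, the lambda of the abstract) is added separately.
    """
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)  # sign(c_t)
    x = x - lr * (update + wd * x)                      # decoupled weight decay
    m = beta2 * m + (1.0 - beta2) * grad                # momentum update
    return x, m
```

Since every coordinate of the sign term has magnitude 1 while the decay term grows like wd * |x_i|, any coordinate with |x_i| > 1/wd is pushed back toward the box, which is one way to see the bound constraint $\|x\|_\infty \leq 1/\lambda$ emerge.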
Award ID(s):
1846421 2037267
NSF-PAR ID:
10539920
Author(s) / Creator(s):
Publisher / Repository:
International Conference on Learning Representations
Date Published:
ISBN:
9781713872740
Format(s):
Medium: X
Location:
International Conference on Learning Representations
Sponsoring Org:
National Science Foundation
More Like this
  1. Differential privacy is a mathematical framework for developing statistical computations with provable guarantees of privacy and accuracy. In contrast to the privacy component of differential privacy, which has a clear mathematical and intuitive meaning, the accuracy component does not have a generally accepted definition; accuracy claims of differential privacy algorithms vary from algorithm to algorithm and are not instantiations of a general definition. We identify program discontinuity as a common theme in existing ad hoc definitions and introduce an alternative notion of accuracy parametrized by a quantity defined for an input x w.r.t. a deterministic computation f and a distance d: the minimal distance d(x, y) over all y such that f(y) ≠ f(x). We show that our notion of accuracy subsumes the definition used in theoretical computer science and captures known accuracy claims for differential privacy algorithms. In fact, our general notion of accuracy helps us prove better claims in some cases. Next, we study the decidability of accuracy. We first show that accuracy is in general undecidable. Then, we define a non-trivial class of probabilistic computations for which accuracy is decidable (unconditionally, or assuming Schanuel's conjecture). We implement our decision procedure and experimentally evaluate the effectiveness of our approach for generating proofs or counterexamples of accuracy for common algorithms from the literature.
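The quantity that parametrizes accuracy above, the smallest distance from x to an input on which f disagrees, can be computed by brute force on a small finite domain. A minimal sketch follows; the example function, domain, and Hamming metric are illustrative assumptions, not taken from the paper.

```python
from itertools import product

def min_disagreement_distance(f, x, domain, d):
    """Smallest d(x, y) over all y in `domain` with f(y) != f(x).

    Returns float('inf') if f is constant on the domain.
    """
    fx = f(x)
    dists = [d(x, y) for y in domain if f(y) != fx]
    return min(dists, default=float("inf"))

# Illustrative use: f thresholds the number of ones in a bit string,
# and d is the Hamming distance.
if __name__ == "__main__":
    n = 4
    domain = list(product([0, 1], repeat=n))
    f = lambda x: int(sum(x) >= 3)
    d = lambda x, y: sum(a != b for a, b in zip(x, y))
    print(min_disagreement_distance(f, (1, 1, 1, 0), domain, d))  # -> 1
```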
  2. We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally \emph{rank deficient}, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be \emph{overparameterized} with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$. 
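The abstract above does not spell out the preconditioner, so the step below is only a plausible sketch of the idea under stated assumptions: right-multiply the gradient of $f(X)=\phi(XX^{T})$ by $(X^{T}X+\theta I)^{-1}$. The damping theta and the function names (phi_grad, precond_gd_step) are illustrative.

```python
import numpy as np

def grad_f(X, phi_grad):
    """Gradient of f(X) = phi(X X^T); phi_grad(Y) returns the n x n gradient of phi at Y."""
    G = phi_grad(X @ X.T)
    return (G + G.T) @ X  # chain rule for the product X X^T

def precond_gd_step(X, phi_grad, lr=0.1, theta=1e-3):
    """One right-preconditioned gradient step on the n x r factor X.

    Scaling by (X^T X + theta * I)^{-1} compensates for the (near) rank
    deficiency of an overparameterized X; theta keeps the r x r solve well posed.
    """
    r = X.shape[1]
    M = X.T @ X + theta * np.eye(r)
    # G @ inv(M) computed via a linear solve, since M is symmetric
    return X - lr * np.linalg.solve(M, grad_f(X, phi_grad).T).T
```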
This work considers the minimization of a general convex function f(X) over the cone of positive semi-definite matrices whose optimal solution X* is of low rank. Standard first-order convex solvers require performing an eigenvalue decomposition in each iteration, severely limiting their scalability. A natural nonconvex reformulation of the problem factors the variable X into the product of a rectangular matrix with fewer columns and its transpose. For a special class of matrix sensing and completion problems with quadratic objective functions, local search algorithms applied to the factored problem have been shown to be much more efficient and, in spite of being nonconvex, to converge to the global optimum. The purpose of this work is to extend this line of study to general convex objective functions f(X) and to investigate the geometry of the resulting factored formulations. Specifically, we prove that when f(X) satisfies the restricted well-conditioned assumption, each critical point of the factored problem either corresponds to the optimal solution X* or to a strict saddle where the Hessian matrix has a strictly negative eigenvalue. Such a geometric structure of the factored formulation ensures that many local search algorithms can converge to the global optimum with random initializations.
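As a concrete illustration of the factored reformulation discussed in the abstract above, one can run gradient descent on g(U) = f(UU^T) instead of on the PSD variable X, so no eigenvalue decomposition is ever formed. The quadratic objective and the parameters in the demo are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def factored_grad_step(U, f_grad, lr=0.01):
    """Gradient step on g(U) = f(U U^T).

    f_grad(X) returns the (symmetric) gradient of the convex f at the n x n
    matrix X; the chain rule gives grad g(U) = 2 * f_grad(U U^T) @ U.
    """
    return U - lr * 2.0 * f_grad(U @ U.T) @ U

# Illustrative demo: f(X) = 0.5 * ||X - M||_F^2 with a rank-2 PSD target M,
# factored with search rank 3 (overparameterized).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 2))
    M = A @ A.T
    U = 0.1 * rng.standard_normal((5, 3))
    for _ in range(3000):
        U = factored_grad_step(U, lambda X: X - M)
    print(np.linalg.norm(U @ U.T - M))  # decreases toward 0 (slowly when overparameterized)
```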
We study the problem of certification: given queries to a function $f : \{0,1\}^n \to \{0,1\}$ with certificate complexity $\le k$ and an input $x^\star$, output a size-$k$ certificate for $f$'s value on $x^\star$. For monotone functions, a classic local search algorithm of Angluin accomplishes this task with $n$ queries, which we show is optimal for local search algorithms. Our main result is a new algorithm for certifying monotone functions with $O(k^8 \log n)$ queries, which comes close to matching the information-theoretic lower bound of $\Omega(k \log n)$. The design and analysis of our algorithm are based on a new connection to threshold phenomena in monotone functions. We further prove exponential-in-$k$ lower bounds when $f$ is non-monotone, and when $f$ is monotone but the algorithm is only given random examples of $f$. These lower bounds show that assumptions on the structure of $f$ and query access to it are both necessary for the polynomial dependence on $k$ that we achieve.
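For a monotone $f$ with $f(x^\star) = 1$, the local-search idea credited to Angluin above admits a compact sketch: greedily zero out 1-coordinates while the value stays 1; the surviving coordinates form a minterm of $f$, hence a certificate of size at most $k$. The query interface below is an illustrative assumption.

```python
def monotone_certificate(f, x):
    """Local-search certificate for a monotone f: {0,1}^n -> {0,1} at a 1-input x.

    Greedily zero out coordinates while the value stays 1; the coordinates
    that survive form a minterm, hence a valid 1-certificate. Uses at most
    n queries beyond the initial evaluation.
    """
    assert f(x) == 1, "sketch handles the f(x) = 1 case; the 0 case is symmetric"
    y = list(x)
    for i in range(len(y)):
        if y[i] == 1:
            y[i] = 0
            if f(tuple(y)) != 1:   # flipping i loses the value, so keep it
                y[i] = 1
    return {i for i, bit in enumerate(y) if bit == 1}  # certificate: these bits are 1

# Illustrative use: f(x) = 1 iff bits 0 and 2 are both 1 (a monotone AND).
if __name__ == "__main__":
    f = lambda x: int(x[0] == 1 and x[2] == 1)
    print(monotone_certificate(f, (1, 1, 1, 1)))  # -> {0, 2}
```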
  5. Consider the linear transport equation in 1D under an external confining potential $\Phi$:
    $$ \partial_t f + v \, \partial_x f - \Phi'(x) \, \partial_v f = 0. $$
    For $\Phi = \frac{x^2}{2} + \frac{\varepsilon x^4}{2}$ (with $\varepsilon > 0$ small), we prove phase mixing and quantitative decay estimates for $\partial_t \varphi := -\Delta^{-1} \int_{\mathbb{R}} \partial_t f \, \mathrm{d}v$, with an inverse polynomial decay rate $O(\langle t \rangle^{-2})$. In the proof, we develop a commuting vector field approach, suitably adapted to this setting. We will explain why we hope this is relevant for the nonlinear stability of the zero solution for the Vlasov–Poisson system in 1D under the external potential $\Phi$.