

Title: Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability.
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness S(θ), is bounded by 2/η, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full-batch or large-batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed progressive sharpening, is that the sharpness steadily increases throughout training until it reaches the instability cutoff 2/η. The second, dubbed edge of stability, is that the sharpness hovers at 2/η for the remainder of training while the loss continues decreasing, albeit non-monotonically. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in the direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored. This property, which we call self-stabilization, is a general property of gradient descent and explains its behavior at the edge of stability. A key consequence of self-stabilization is that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint S(θ) ≤ 2/η. Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions. Our analysis uncovers the mechanism for gradient descent's implicit bias towards stability.
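The cubic-term mechanism described in the abstract can be illustrated on a hypothetical two-parameter toy loss (a sketch of the idea only, not the paper's setting; this toy shows a single "catapult" below the cutoff rather than the sustained hovering that progressive sharpening produces):

```python
# Toy loss L(x, y) = x**2 * y**2 / 2.  Along y the curvature is S = x**2,
# so gradient descent with step size eta is unstable in y whenever
# S > 2/eta.  The coupling dL/dx = x * y**2 then shrinks x, and hence
# the sharpness, as the y-oscillation grows: self-stabilization in
# miniature.  (Hypothetical toy, not from the paper.)
eta = 0.1                  # instability cutoff: 2/eta = 20
x, y = 5.0, 0.01           # initial sharpness x**2 = 25 > 20

for _ in range(500):
    gx, gy = x * y**2, x**2 * y        # gradient of L
    x, y = x - eta * gx, y - eta * gy  # simultaneous GD update

print(x**2)  # final sharpness: now below the 2/eta cutoff
```

The oscillation in y grows geometrically while x² > 2/η, which forces x down until stability is restored and the oscillation decays.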
Journal Name: ICLR 2023
Sponsoring Org: National Science Foundation
More Like this
  1. Recently, researchers observed that gradient descent for deep neural networks operates in an "edge-of-stability" (EoS) regime: the sharpness (maximum eigenvalue of the Hessian) is often larger than the stability threshold 2/η (where η is the step size). Despite this, the loss oscillates yet converges in the long run, and the sharpness at the end is just slightly below 2/η. While many other well-understood nonconvex objectives, such as matrix factorization or two-layer networks, can also converge despite large sharpness, there is often a larger gap between the sharpness at the endpoint and 2/η. In this paper, we study the EoS phenomenon by constructing a simple function that exhibits the same behavior. We give a rigorous analysis of its training dynamics in a large local region and explain why the final converging point has sharpness close to 2/η. Globally, we observe that the training dynamics for our example have an interesting bifurcating behavior, which has also been observed in the training of neural nets.
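Both of the abstracts above track the sharpness S(θ), the top eigenvalue of the Hessian. A minimal way to estimate it given only a gradient oracle is power iteration on finite-difference Hessian-vector products (a generic sketch, not code from either paper; the quadratic example and its eigenvalues are made up for illustration):

```python
import numpy as np

def sharpness(grad, theta, iters=100, r=1e-4, seed=0):
    """Estimate the top Hessian eigenvalue at theta via power iteration.

    Hv is approximated by a central difference of gradients, so only a
    gradient oracle is needed; no Hessian is ever formed.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = (grad(theta + r * v) - grad(theta - r * v)) / (2 * r)
        v = hv / np.linalg.norm(hv)
    # Rayleigh quotient of the converged direction (v is unit-norm)
    return v @ (grad(theta + r * v) - grad(theta - r * v)) / (2 * r)

# Example: quadratic loss with Hessian diag(3, 1), so sharpness is 3.
grad = lambda th: np.array([3.0, 1.0]) * th
print(sharpness(grad, np.array([1.0, 1.0])))
```

For a quadratic the finite difference is exact, so the estimate converges to the true top eigenvalue.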
  2. Many machine learning problems can be abstracted as game-theoretic formulations and boil down to optimizing nested objectives, as in generative adversarial networks (GANs) and multi-agent reinforcement learning. Solving these games requires finding their stable fixed points or Nash equilibria. However, existing algorithms for solving games suffer from empirical instability, hence demanding heavy ad-hoc tuning in practice. To tackle these challenges, we resort to the emerging scheme of Learning to Optimize (L2O), which discovers problem-specific efficient optimization algorithms through data-driven training. Our customized L2O framework for differentiable game theory problems, dubbed "Learning to Play Games" (L2PG), seeks a stable fixed-point solution by predicting the fast update direction from the past trajectory, with a novel gradient-stability-aware, sign-based loss function. We further incorporate curriculum learning and self-learning to strengthen the empirical training stability and generalization of L2PG. On test problems including quadratic games and GANs, L2PG substantially accelerates convergence and exhibits a remarkably more stable trajectory. Code is available at
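The empirical instability of standard game solvers that the abstract above refers to is easy to reproduce on the classic bilinear game min_x max_y xy: simultaneous gradient descent-ascent spirals outward, while the extragradient method contracts. (A textbook illustration, not L2PG itself.)

```python
eta = 0.1

# Simultaneous gradient descent-ascent (GDA) on min_x max_y x*y:
# each step multiplies the squared radius by (1 + eta**2) -> diverges.
x, y = 1.0, 1.0
for _ in range(100):
    x, y = x - eta * y, y + eta * x

# Extragradient: take a lookahead half-step, then update using the
# lookahead gradients; the squared radius shrinks by a fixed factor.
u, v = 1.0, 1.0
for _ in range(100):
    uh, vh = u - eta * v, v + eta * u      # lookahead point
    u, v = u - eta * vh, v + eta * uh      # update with lookahead grads

print((x * x + y * y) ** 0.5, (u * u + v * v) ** 0.5)
```

The contrast is exact here: GDA's radius grows by √(1 + η²) per step, while extragradient's shrinks by √((1 − η²)² + η²) < 1.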
  3. In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle points and converge to local minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrization setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an ε-local-minimizer matches the corresponding deterministic rate of Õ(1/ε²). We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions and show that its oracle complexity to reach an ε-local-minimizer is Õ(1/ε^2.5). While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the Õ(1/ε^1.5) rate of the deterministic Cubic-Regularized Newton method. It seems further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.
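The core mechanism behind perturbation-based saddle escape in the abstract above can be sketched on a strict saddle (an illustrative toy, not the analyzed algorithm or its rates): plain GD started on the stable manifold converges to the saddle, while an occasional isotropic perturbation pushes the iterate off it so the unstable direction can carry it away.

```python
import numpy as np

# f(x, y) = (x**2 - y**2) / 2 has a strict saddle at the origin.
rng = np.random.default_rng(0)
eta = 0.1

def grad(p):                 # gradient of f
    return np.array([p[0], -p[1]])

p_gd = np.array([1.0, 0.0])    # start exactly on the stable manifold y = 0
p_psgd = np.array([1.0, 0.0])
for t in range(200):
    p_gd = p_gd - eta * grad(p_gd)
    p_psgd = p_psgd - eta * grad(p_psgd)
    if t % 50 == 0:            # occasional small isotropic perturbation
        p_psgd = p_psgd + 0.01 * rng.standard_normal(2)

# Plain GD stays pinned to y = 0 and converges to the saddle;
# the perturbed run leaves along the unstable -y direction.
print(p_gd[1], abs(p_psgd[1]))
```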
  4. In this study, we used optical spectroscopy to characterize the physical properties of microvesicles released from the thermoacidophilic archaeon Sulfolobus acidocaldarius (Sa-MVs). The most abundant proteins in Sa-MVs are the S-layer proteins, which self-assemble on the vesicle surface forming an array of crystalline structures. Lipids in Sa-MVs are exclusively bipolar tetraethers. We found that when excited at 275 nm, intrinsic protein fluorescence of Sa-MVs at 23 °C has an emission maximum at 303 nm (or 296 nm measured at 75 °C), which is unusually low for protein samples containing multiple tryptophans and tyrosines. In the presence of 10–11 mM of the surfactant n-tetradecyl-β-d-maltoside (TDM), Sa-MVs were disintegrated, the emission maximum of intrinsic protein fluorescence was shifted to 312 nm, and the excitation maximum was changed from 288 nm to 280.5 nm, in conjunction with a significant decrease (>2 times) in excitation band sharpness. These data suggest that most of the fluorescent amino acid residues in native Sa-MVs are in a tightly packed protein matrix and that the S-layer proteins may form J-aggregates. The membranes in Sa-MVs, as well as those of unilamellar vesicles (LUVs) made of the polar lipid fraction E (PLFE) tetraether lipids isolated from S. acidocaldarius (LUVPLFE), LUVs reconstituted from the tetraether lipids extracted from Sa-MVs (LUVMV) and LUVs made of the diester lipids, were investigated using the probe 6-dodecanoyl-2-dimethylaminonaphthalene (Laurdan). The generalized polarization (GP) values of Laurdan in tightly packed Sa-MVs, LUVMV, and LUVPLFE were found to be much lower than those obtained from less tightly packed DPPC gel state, which echoes the previous finding that the GP values from tetraether lipid membranes cannot be directly compared with the GP values from diester lipid membranes, due to differences in probe disposition. 
Laurdan’s GP and red-edge excitation shift (REES) values in Sa-MVs and LUVMV decrease monotonically with increasing temperature, with no sign of a lipid phase transition. Laurdan’s REES values are high (9.3–18.9 nm) in the tetraether lipid membrane systems (i.e., Sa-MVs, LUVMV and LUVPLFE) and low (0.4–5.0 nm) in diester liposomes. The high REES and low GP values suggest that Laurdan in tetraether lipid membranes, especially in the membrane of Sa-MVs, is in a very motionally restricted environment, that bound water molecules and the polar moieties in the tetraether lipid headgroups strongly interact with Laurdan’s excited-state dipole moment, and that “solvent” reorientation around Laurdan’s chromophore in tetraether lipid membranes occurs very slowly compared to Laurdan’s lifetime.
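The generalized polarization used throughout item 4 follows the standard Laurdan definition GP = (I_blue − I_red)/(I_blue + I_red); a minimal sketch (the 440/490 nm readout wavelengths are the conventional choices for Laurdan, not values reported in this study, and the example intensities are invented):

```python
def laurdan_gp(i_440, i_490):
    """Generalized polarization from emission intensities read near
    440 nm (ordered, gel-like environment) and 490 nm (fluid).
    Ranges from -1 (all red-shifted) to +1 (all blue-shifted)."""
    return (i_440 - i_490) / (i_440 + i_490)

print(laurdan_gp(60.0, 40.0))  # -> 0.2 (example intensities)
```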
  5. Batch normalization (BN) is a popular and ubiquitous method in deep learning that has been shown to decrease training time and improve the generalization performance of neural networks. Despite its success, BN is not theoretically well understood. It is not suitable for use with very small mini-batch sizes or online learning. In this paper, we propose a new method called Batch Normalization Preconditioning (BNP). Instead of applying normalization explicitly through a batch normalization layer as is done in BN, BNP applies normalization by conditioning the parameter gradients directly during training. This is designed to improve the Hessian matrix of the loss function and hence convergence during training. One benefit is that BNP is not constrained by the mini-batch size and works in the online learning setting. Furthermore, its connection to BN provides theoretical insights into how BN improves training and how BN is applied to special architectures such as convolutional neural networks. For a theoretical foundation, we also present a novel convergence theory based on the Hessian condition number for a locally convex but not strongly convex loss, which is applicable to networks with a scale-invariant property.
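The preconditioning idea in item 5 — transform the gradient so the effective Hessian is better conditioned and convergence speeds up — can be seen on an ill-conditioned quadratic (a generic sketch with an idealized diagonal preconditioner, not the paper's BNP construction):

```python
import numpy as np

H = np.diag([100.0, 1.0])        # Hessian of an ill-conditioned quadratic
P = np.diag(1.0 / np.diag(H))    # ideal diagonal preconditioner: P @ H = I

def run(precondition, steps=50):
    """GD on loss 0.5 * theta @ H @ theta, optionally preconditioned."""
    theta = np.array([1.0, 1.0])
    # Plain GD needs eta < 2/100 for stability; the preconditioned
    # system has unit curvature, so a large step is safe.
    eta = 0.5 if precondition else 0.01
    for _ in range(steps):
        g = H @ theta                              # gradient
        theta = theta - eta * ((P @ g) if precondition else g)
    return float(np.linalg.norm(theta))

print(run(False), run(True))  # preconditioned run lands far closer to 0
```

The slow direction (curvature 1) limits plain GD once the step size is capped by the stiff direction (curvature 100); preconditioning removes that gap, which is the condition-number improvement the abstract appeals to.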