Search for: All records

Creators/Authors contains: "Lee, Jason D"

Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability.

Damian, Alex ; Nichani, Eshaan ; Lee, Jason D. ( May 2023 , ICLR 2023)

Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness S(θ), is bounded by 2/η, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full-batch or large-batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed progressive sharpening, is that the sharpness steadily increases throughout training until it reaches the instability cutoff 2/η. The second, dubbed edge of stability, is that the sharpness hovers at 2/η for the remainder of training while the loss continues decreasing, albeit non-monotonically. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in the direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored. This property, which we call self-stabilization, is a general property of gradient descent and explains its behavior at the edge of stability. A key consequence of self-stabilization is that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint S(θ)≤2/η. Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions. Our analysis uncovers the mechanism for gradient descent's implicit bias towards stability.
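The classical 2/η threshold mentioned in this abstract can be checked on a toy problem. The sketch below is not taken from the paper: it assumes a one-dimensional quadratic loss L(θ) = ½·S·θ², whose sharpness is the constant S, and the function name `run_gd` and all numeric values are illustrative. On this loss each step multiplies θ by (1 − ηS), so the iterates shrink exactly when S < 2/η. A pure quadratic has no cubic term, so it shows only the instability cutoff, not the self-stabilization effect the paper describes.

```python
def run_gd(sharpness, lr, steps=50, theta0=1.0):
    """Run gradient descent on the toy loss L(theta) = 0.5 * sharpness * theta**2."""
    theta = theta0
    for _ in range(steps):
        grad = sharpness * theta      # dL/dtheta for the quadratic
        theta = theta - lr * grad     # equivalent to theta *= (1 - lr * sharpness)
    return theta


lr = 0.1                  # step size eta (illustrative value)
threshold = 2.0 / lr      # instability cutoff 2/eta

# Sharpness slightly below the cutoff: iterates oscillate in sign but shrink.
stable = run_gd(sharpness=0.9 * threshold, lr=lr)

# Sharpness slightly above the cutoff: iterates oscillate and grow.
unstable = run_gd(sharpness=1.1 * threshold, lr=lr)

print(abs(stable), abs(unstable))
```

With these values the per-step multipliers are about −0.8 and −1.2, so the first run contracts toward zero and the second grows geometrically.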
Free, publicly-accessible full text available May 28, 2024
Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence

https://doi.org/10.1137/21M1456789

Zhan, Wenhao ; Cen, Shicong ; Huang, Baihe ; Chen, Yuxin ; Lee, Jason D. ; Chi, Yuejie ( June 2023 , SIAM Journal on Optimization)

Free, publicly-accessible full text available June 30, 2024
Neural Networks can Learn Representations with Gradient Descent

Damian, Alexandru ; Lee, Jason D. ; Soltanolkotabi, Mahdi ( July 2022 , Proceedings of Thirty Fifth Conference on Learning Theory)
Loh, Po-ling ; Raginsky, Maxim (Ed.)
Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two layer neural network outside the kernel regime by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form $f^\star(x)=g(Ux)$ where $U:\mathbb{R}^d\to\mathbb{R}^r$ with $d\gg r$. When the degree of $f^\star$ is $p$, it is known that $n\asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n\asymp d^2$ and enables transfer learning with sample complexity independent of $d$.
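To make the target class in this abstract concrete, here is a minimal sketch of a function of the form f⋆(x) = g(Ux) that depends on only a few directions. The specific matrix `U`, the link polynomial `g`, and the dimensions below are hypothetical choices for illustration, not taken from the paper. The last two lines compare the growth rates d^p and d^2 quoted in the abstract, ignoring constants and any lower-order factors.

```python
def project(U, x):
    """Compute z = U x, with U given as a list of r rows of length d."""
    return [sum(u_i * x_i for u_i, x_i in zip(row, x)) for row in U]


def g(z):
    """A hypothetical degree-3 polynomial link function on R^2."""
    return z[0] ** 2 * z[1] + z[0]


def f_star(U, x):
    """Target of the form f*(x) = g(U x): only the r projected directions matter."""
    return g(project(U, x))


d, r, p = 5, 2, 3  # ambient dimension, relevant directions, polynomial degree

# Illustrative choice: the relevant directions are the first two coordinates.
U = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0, 0.0]]

x = [0.5, -1.0, 2.0, 3.0, -4.0]
x_shifted = [0.5, -1.0, 7.0, -2.0, 9.0]  # differs only in the irrelevant coordinates

# Changing x along directions orthogonal to the rows of U leaves f* unchanged.
same_value = f_star(U, x) == f_star(U, x_shifted)

kernel_scale = d ** p  # order of samples quoted for the kernel regime
gd_scale = d ** 2      # order of samples quoted for gradient descent
print(same_value, kernel_scale, gd_scale)
```

The gap between the two scales widens quickly with the degree p, since the second rate does not depend on p.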
Full Text Available
Provably Efficient Policy Optimization for Two-Player Zero-Sum Markov Games

Zhao, Yulai ; Tian, Yuandong ; Lee, Jason D ; Du, Simon S ( January 2022 , International Conference on Artificial Intelligence and Statistics (AISTATS))

Full Text Available
Distributed Estimation for Principal Component Analysis: An Enlarged Eigenspace Analysis

https://doi.org/10.1080/01621459.2021.1886937

Chen, Xi ; Lee, Jason D. ; Li, He ; Yang, Yun ( April 2021 , Journal of the American Statistical Association)
Full Text Available
How Important is the Train-Validation Split in Meta-Learning?

Bai, Yu ; Chen, Minshuo ; Zhou, Pan ; Zhao, Tuo ; Lee, Jason D. ; Kakade, Sham ; Wang, Huan ; Xiong, Caiming ( July 2021 , International Conference on Machine Learning)
Full Text Available
Optimal Gradient-based Algorithms for Non-concave Bandit Optimization

Huang, Baihe ; Huang, Kaixuan ; Kakade, Sham ; Lee, Jason D ; Lei, Qi ; Wang, Runzhe ; Yang, Jiaqi ( January 2021 , Advances in neural information processing systems)

Full Text Available
Going Beyond Linear RL: Sample Efficient Neural Function Approximation

Huang, Baihe ; Huang, Kaixuan ; Kakade, Sham ; Lee, Jason D ; Lei, Qi ; Wang, Runzhe ; Yang, Jiaqi ( January 2021 , Advances in neural information processing systems)

Full Text Available
SGD learns one-layer networks in WGANs

Lei, Q ; Lee, Jason D ; Dimakis, Alex G ; Daskalakis, D. ( July 2020 , Proceedings)

Full Text Available
Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy

Moroshko, Edward ; Gunasekar, Suriya ; Woodworth, Blake ; Lee, Jason D ; Srebro, Nathan ; Soudry, Daniel ( July 2020 , Advances in neural information processing systems)
We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks". This is the simplest model displaying a transition between "kernel" and non-kernel ("rich" or "active") regimes. We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss. Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies (well beyond $10^{-100}$). Moreover, the implicit bias at reasonable initialization scales and training accuracies is more complex and not captured by these limits.
Full Text Available