NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Transformers Provably Learn Two-Mixture of Linear Classification via Gradient Flow

Yang, Hongru; Wang, Zhangyang; Lee, Jason D; Liang, Yingbin (April 2025, Neural Information Processing Systems (NeurIPS))

Free, publicly-accessible full text available April 30, 2026
Transformers Provably Learn Two-Mixture of Linear Classification via Gradient Flow

Yang, Hongru; Wang, Zhangyang; Lee, Jason D; Liang, Yingbin (April 2025, International Conference on Learning Representations (ICLR))

Free, publicly-accessible full text available April 25, 2026
Transformers provably learn two-mixture of linear classification via gradient flow

Yang, Hongru; Wang, Zhangyang; Lee, Jason D; Liang, Yingbin (April 2025, International Conference on Learning Representations (ICLR))

Free, publicly-accessible full text available April 24, 2026
Scaling Laws in Linear Regression: Compute, Parameters, and Data

Lin, Licong; Wu, Jingfeng; Kakade, Sham M; Bartlett, Peter L; Lee, Jason D (December 2024, Advances in neural information processing systems)

Full Text Available
Learning Hierarchical Polynomials with Three-Layer Neural Networks

Wang, Zihao; Nichani, Eshaan; Lee, Jason D (May 2024, International Conference on Learning Representations)

We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form where is a degree polynomial and is a degree polynomial. This function class generalizes the single-index model, which corresponds to , and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree polynomials , a three-layer neural network trained via layerwise gradient descent on the square loss learns the target up to vanishing test error in samples and polynomial time. This is a strict improvement over kernel methods, which require samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of being a quadratic. When is indeed a quadratic, we achieve the information-theoretically optimal sample complexity , which is an improvement over prior work (Nichani et al., 2023) requiring a sample size of . Our proof proceeds by showing that during the initial stage of training the network performs feature learning to recover the feature with samples. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
more » « less
Full Text Available
Optimal Multi-Distribution Learning

Zhang, Zihan; Zhan, Wenhao; Chen, Yuxin; Du, Simon S; Lee, Jason D (July 2024, Conference on Learning Theory)

Full Text Available
Teaching Arithmetic to Small Transformers

Lee, Nayoung; Sreenivasan, Kartik; Lee, Jason D; Lee, Kangwook; Papailiopoulos, DImitris (May 2024, International Conference on Learning Representations)

Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how even small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and parameter scaling. Additionally, we discuss the challenges associated with length generalization. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction loss for rapidly eliciting arithmetic capabilities.
more » « less
Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking.

Lyu, Kaifeng; Jin, Jikai; Li, Zhiyuan; Du, S Simon; Lee, Jason D; Hu, Wei (May 2024, ICLR 2024)
Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability.

Wu, Jingfeng; Braverman, Vladimir; Lee, Jason D (December 2023, NeurIPS 2023)

Recent research has observed that in machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) [Cohen, et al., 2021], where the stepsizes are set to be large, resulting in non-monotonic losses induced by the GD iterates. This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime. Despite the presence of local oscillations, we prove that the logistic loss can be minimized by GD with \emph{any} constant stepsize over a long time scale. Furthermore, we prove that with \emph{any} constant stepsize, the GD iterates tend to infinity when projected to a max-margin direction (the hard-margin SVM direction) and converge to a fixed vector that minimizes a strongly convex potential when projected to the orthogonal complement of the max-margin direction. In contrast, we also show that in the EoS regime, GD iterates may diverge catastrophically under the exponential loss, highlighting the superiority of the logistic loss. These theoretical findings are in line with numerical simulations and complement existing theories on the convergence and implicit bias of GD for logistic regressio
more » « less
Provable Offline Preference-Based Reinforcement Learning

Zhan, Wenhao; Uehara, Masatoshi; Kallus, Nathan; Lee, Jason D; Sun, Wen (January 2024, International Conference on Learning Representations)

Full Text Available

« Prev Next »

Search for: All records