NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

A deep learning algorithm for computing mean field control problems via forward-backward score dynamics

https://doi.org/10.1007/s40687-025-00531-9

Zhou, Mo; Osher, Stanley; Li, Wuchen (September 2025, Research in the Mathematical Sciences)

Free, publicly-accessible full text available September 1, 2026
How does Gradient Descent Learn Features---A Local Analysis for Regularized Two-Layer Neural Networks

Zhou, Mo; Ge, Rong (December 2024, NeurIPS 2024)

The ability of learning useful features is one of the major advantages of neural networks. Although recent works show that neural network can operate in a neural tangent kernel (NTK) regime that does not allow feature learning, many works also demonstrate the potential for neural networks to go beyond NTK regime and perform feature learning. Recently, a line of work highlighted the feature learning capabilities of the early stages of gradient-based training. In this paper we consider another mechanism for feature learning via gradient descent through a local convergence analysis. We show that once the loss is below a certain threshold, gradient descent with a carefully regularized objective will capture ground-truth directions. We further strengthen this local convergence analysis by incorporating early-stage feature learning analysis. Our results demonstrate that feature learning not only happens at the initial gradient steps, but can also occur towards the end of training.
more » « less
Full Text Available
MixedNUTS: Training-Free Accuracy-Robustness Balance via Nonlinearly Mixed Classifiers

Bai, Yatong; Zhou, Mo; Patel, Vishal; Sojoudi, Somayeh (August 2024, Transactions on machine learning research)

Full Text Available
Single Timescale Actor-Critic Method to Solve the Linear Quadratic Regulator with Convergence Guarantees

Zhou, Mo; Lu, Jianfeng (July 2023, Journal of Machine Learning Research)

Full Text Available
Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression

Zhou, Mo; Ge, Rong (January 2023, International Conference on Machine Learning)

In deep learning, often the training process nds an interpolator (a solution with 0 training loss), but the test loss is still low. This phenomenon, known as benign overfitting, is a major mystery that received a lot of recent attention. One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator, often characterized by minimizing certain norms. However, even for a simple sparse linear regression problem y = Ax+ noise with sparse x , neither minimum l_1 orl_`2 norm interpolator gives the optimal test loss. In this work, we give a different parametrization of the model which leads to a new implicit regularization effect that combines the benefit of l_1 and l_2 interpolators. We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss. Our result is based on careful analysis of the training dynamics and provides another example of implicit regularization effect that goes beyond norm minimization.
more » « less
Full Text Available
Understanding Edge-of-Stability Training Dynamics with a Minimalist Example

Zhu, Xingyu; Wang, Zixuan; Wang, Xiang; Zhou, Mo; Ge, Rong (January 2023, International Conference on Learning Representations)

Recently, researchers observed that gradient descent for deep neural networks operates in an “edge-of-stability” (EoS) regime: the sharpness (maximum eigenvalue of the Hessian) is often larger than stability threshold 2/\eta (where \eta is the step size). Despite this, the loss oscillates and converges in the long run, and the sharpness at the end is just slightly below 2/\eta . While many other well-understood nonconvex objectives such as matrix factorization or two-layer networks can also converge despite large sharpness, there is often a larger gap between sharpness of the endpoint and 2/\eta . In this paper, we study EoS phenomenon by constructing a simple function that has the same behavior. We give rigorous analysis for its training dynamics in a large local region and explain why the fnal converging point has sharpness close to 2/\eta . Globally we observe that the training dynamics for our example have an interesting bifurcating behavior, which was also observed in the training of neural nets.
more » « less
Full Text Available
Understanding The Robustness of Self-supervised Learning Through Topic Modeling

Luo, Zeping; Weng, Cindy; Wu, Shiyou; Zhou, Mo; Ge, Rong (January 2023, International Conference on Learning Representations)

Self-supervised learning has signi ficantly improved the performance of many NLP tasks. In this paper, we highlight a key advantage of self-supervised learning - when applied to data generated by topic models, self-supervised learning can be oblivious to the specifi c model, and hence is less susceptible to model misspeci fication. In particular, we prove that commonly used self-supervised objectives based on reconstruction or contrastive samples can both recover useful posterior information for general topic models. Empirically, we show that the same objectives can perform competitively against posterior inference using the correct model, while outperforming posterior inference using misspecifi ed model.
more » « less
Full Text Available
A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network

Zhou, Mo; Ge, Rong; Jin, Chi (July 2021, COLT)

While over-parameterization is widely believed to be crucial for the success of optimization for the neural networks, most existing theories on over-parameterization do not fully explain the reason -- they either work in the Neural Tangent Kernel regime where neurons don't move much, or require an enormous number of neurons. In practice, when the data is generated using a teacher neural network, even mildly over-parameterized neural networks can achieve 0 loss and recover the directions of teacher neurons. In this paper we develop a local convergence theory for mildly over-parameterized two-layer neural net. We show that as long as the loss is already lower than a threshold (polynomial in relevant parameters), all student neurons in an over-parameterized two-layer neural network will converge to one of teacher neurons, and the loss will go to 0. Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons, and our convergence rate is independent of the number of student neurons. A key component of our analysis is the new characterization of local optimization landscape -- we show the gradient satisfies a special case of Lojasiewicz property which is different from local strong convexity or PL conditions used in previous work.
more » « less
Full Text Available
Actor-Critic Method for High Dimensional Static Hamilton--Jacobi--Bellman Partial Differential Equations based on Neural Networks

https://doi.org/10.1137/21M1402303

Zhou, Mo; Han, Jiequn; Lu, Jianfeng (January 2021, SIAM Journal on Scientific Computing)

Full Text Available
Towards Understanding the Importance of Shortcut Connections in Residual Networks

Liu, Tianyi Liu; Chen, Minshuo; Zhou, Mo; Du, Simon; Zhou, Enlu; Zhao, Tuo. (December 2019, Advances in neural information processing systems)

Full Text Available

« Prev Next »

Search for: All records