The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has nearoptimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finitedimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lie in an infinitedimensional space vs. when the data lie in a finitedimensional space with dimension that grows faster than the sample size.
more » « less NSFPAR ID:
 10146370
 Publisher / Repository:
 Proceedings of the National Academy of Sciences
 Date Published:
 Journal Name:
 Proceedings of the National Academy of Sciences
 ISSN:
 00278424
 Page Range / eLocation ID:
 Article No. 201907378
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find nearoptimal solutions to nonconvex optimization problems, and despite giving a nearperfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with twolayer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.more » « less

In deep learning, often the training process nds an interpolator (a solution with 0 training loss), but the test loss is still low. This phenomenon, known as benign overfitting, is a major mystery that received a lot of recent attention. One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator, often characterized by minimizing certain norms. However, even for a simple sparse linear regression problem y = Ax+ noise with sparse x , neither minimum l_1 orl_`2 norm interpolator gives the optimal test loss. In this work, we give a different parametrization of the model which leads to a new implicit regularization effect that combines the benefit of l_1 and l_2 interpolators. We show that training our new model via gradient descent leads to an interpolator with nearoptimal test loss. Our result is based on careful analysis of the training dynamics and provides another example of implicit regularization effect that goes beyond norm minimization.more » « less

Abstract In this paper, we aim to explore novel machine learning (ML) techniques to facilitate and accelerate the construction of universal equationOfState (EOS) models with a high accuracy while ensuring important thermodynamic consistency. When applying ML to fit a universal EOS model, there are two key requirements: (1) a high prediction accuracy to ensure precise estimation of relevant physics properties and (2) physical interpretability to support important physicsrelated downstream applications. We first identify a set of fundamental challenges from the accuracy perspective, including an extremely wide range of input/output space and highly sparse training data. We demonstrate that while a neural network (NN) model may fit the EOS data well, the blackbox nature makes it difficult to provide physically interpretable results, leading to weak accountability of prediction results outside the training range and lack of guarantee to meet important thermodynamic consistency constraints. To this end, we propose a principled deep regression model that can be trained following a metalearning style to predict the desired quantities with a high accuracy using scarce training data. We further introduce a uniquely designed kernelbased regularizer for accurate uncertainty quantification. An ensemble technique is leveraged to battle model overfitting with improved prediction stability. Autodifferentiation is conducted to verify that necessary thermodynamic consistency conditions are maintained. Our evaluation results show an excellent fit of the EOS table and the predicted values are ready to use for important physicsrelated tasks.

null (Ed.)We study the supervised clustering problem under the twocomponent anisotropic Gaussian mixture model in high dimensions in the nonasymptotic setting. We first derive a lower and a matching upper bound for the minimax risk of clustering in this framework. We also show that in the highdimensional regime, the linear discriminant analysis (LDA) classifier turns out to be suboptimal in a minimax sense. Next, we characterize precisely the risk of regularized supervised least squares classifiers under $\ell_2$ regularization. We deduce the fact that the interpolating solution (0 training error solution) may outperform the regularized classifier, under mild assumptions on the covariance structure of the noise. Our analysis also shows that interpolation can be robust to corruption in the covariance of the noise when the signal is aligned with the ``clean'' part of the covariance, for the properly defined notion of alignment. To the best of our knowledge, this peculiar phenomenon has not yet been investigated in the rapidly growing literature related to interpolation. We conclude that interpolation is not only benign but can also be optimal and in some cases robust.more » « less

We develop a convex analytic approach to analyze finite width twolayer ReLU networks. We first prove that an optimal solution to the regularized training problem can be characterized as extreme points of a convex set, where simple solutions are encouraged via its convex geometrical properties. We then leverage this characterization to show that an optimal set of parameters yield linear spline interpolation for regression problems involving one dimensional or rankone data. We also characterize the classification decision regions in terms of a kernel matrix and minimum `1norm solutions. This is in contrast to Neural Tangent Kernel which is unable to explain predictions of finite width networks. Our convex geometric characterization also provides intuitive explanations of hidden neurons as autoencoders. In higher dimensions, we show that the training problem can be cast as a finite dimensional convex problem with infinitely many constraints. Then, we apply certain convex relaxations and introduce a cuttingplane algorithm to globally optimize the network. We further analyze the exactness of the relaxations to provide conditions for the convergence to a global optimum. Our analysis also shows that optimal network parameters can be also characterized as interpretable closedform formulas in some practically relevant special cases.more » « less