
Title: Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression
In deep learning, the training process often finds an interpolator (a solution with zero training loss) whose test loss is nevertheless low. This phenomenon, known as benign overfitting, is a major mystery that has received a lot of recent attention. One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties of the interpolator, often characterized by minimizing certain norms. However, even for the simple sparse linear regression problem y = Ax + noise with sparse x, neither the minimum l_1 nor the minimum l_2 norm interpolator gives the optimal test loss. In this work, we give a different parametrization of the model which leads to a new implicit regularization effect that combines the benefits of the l_1 and l_2 interpolators. We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss. Our result is based on a careful analysis of the training dynamics and provides another example of an implicit regularization effect that goes beyond norm minimization.
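As a concrete illustration of the kind of reparametrization at play (not the paper's exact construction), one choice known in the implicit-regularization literature to produce a sparsity-friendly bias is the Hadamard-product form x = w ⊙ w − v ⊙ v with a small initialization scale. The sketch below runs plain gradient descent on the squared loss under this assumed parametrization; all names (w, v, alpha, lr) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse linear regression: y = A x* + noise, with k-sparse x*.
n, d, k = 100, 400, 5
A = rng.standard_normal((n, d)) / np.sqrt(n)
x_star = np.zeros(d)
x_star[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
y = A @ x_star + 0.05 * rng.standard_normal(n)

# Hypothetical reparametrization x = w*w - v*v; the paper's exact
# construction may differ. Small init alpha drives the sparsity bias.
alpha, lr = 1e-3, 0.01
w = alpha * np.ones(d)
v = alpha * np.ones(d)

for _ in range(20000):
    x = w * w - v * v
    g = A.T @ (A @ x - y)          # gradient of the loss w.r.t. x
    w -= lr * 2 * w * g            # chain rule through w*w
    v -= lr * 2 * v * (-g)         # chain rule through -v*v

x_hat = w * w - v * v
print("train loss:", np.mean((A @ x_hat - y) ** 2))
print("recovery error:", np.linalg.norm(x_hat - x_star))
```

In this toy run the off-support coordinates stay near alpha^2 while the support coordinates grow quickly, which is the sparsity-promoting effect the abstract refers to.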
Award ID(s):
1845171 2031849 1934964
Journal Name:
International Conference on Machine Learning
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings. 
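The minimum-norm implicit regularization this survey describes is easy to verify numerically: in an overparametrized least-squares problem, gradient descent started from the origin converges to the minimum l_2 norm interpolator, the same solution the pseudoinverse gives. A minimal sketch (sizes and step count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 200                      # overparametrized: d > n
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum l2-norm interpolator, computed directly via the pseudoinverse.
x_min = np.linalg.pinv(A) @ y

# Gradient descent on 0.5*||Ax - y||^2 started from zero.
x = np.zeros(d)
lr = 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(5000):
    x -= lr * A.T @ (A @ x - y)

# GD from the origin stays in the row space of A, so it converges
# to the same minimum-norm interpolator.
print(np.linalg.norm(A @ x - y))        # ~0: perfect fit to training data
print(np.linalg.norm(x - x_min))        # ~0: implicit norm minimization
```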
  2. Ridge-like regularization often leads to improved generalization performance of machine learning models by mitigating overfitting. While ridge-regularized machine learning methods are widely used in many important applications, direct training via optimization could become challenging in huge data scenarios with millions of examples and features. We tackle such challenges by proposing a general approach that achieves ridge-like regularization through implicit techniques named Minipatch Ridge (MPRidge). Our approach is based on taking an ensemble of coefficients of unregularized learners trained on many tiny, random subsamples of both the examples and features of the training data, which we call minipatches. We empirically demonstrate that MPRidge induces an implicit ridge-like regularizing effect and performs nearly the same as explicit ridge regularization for a general class of predictors including logistic regression, SVM, and robust regression. Embarrassingly parallelizable, MPRidge provides a computationally appealing alternative to inducing ridge-like regularization for improving generalization performance in challenging big-data settings.
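A minimal sketch of the minipatch idea for plain least squares, under assumed patch sizes m and f and ensemble size B (the paper's tuned choices may differ): average unregularized fits over tiny row/column subsamples, then compare against an explicit ridge fit.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 50
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = X @ beta + rng.standard_normal(n)

# Minipatch ensemble: average unregularized OLS fits on tiny random
# subsamples of both examples and features (m, f, B are illustrative).
m, f, B = 40, 10, 2000
coef = np.zeros(d)
for _ in range(B):
    rows = rng.choice(n, m, replace=False)
    cols = rng.choice(d, f, replace=False)
    b, *_ = np.linalg.lstsq(X[np.ix_(rows, cols)], y[rows], rcond=None)
    coef[cols] += b
coef /= B * (f / d)   # each feature appears in ~f/d of the minipatches

# Explicit ridge fit for comparison (lambda chosen loosely, not tuned).
lam = 10.0
ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.corrcoef(coef, ridge)[0, 1])   # close to 1 in this toy run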
  3. The main objective of Personalized Tour Recommendation (PTR) is to generate a sequence of points-of-interest (POIs) for a particular tourist, according to user-specific constraints such as duration, start and end points, the number of attractions planned to visit, and so on. Previous PTR solutions either use heuristics for solving the orienteering problem to maximize a global reward within a specified budget, or attempt to learn user visiting preferences and transition patterns with stochastic processes or recurrent neural networks. However, existing learning methodologies rely on historical trips to train the model and use the next visited POI as the supervised signal, which may not fully capture the coherence of preferences and thus recommend similar trips to different users, primarily due to the data sparsity problem and the long-tailed distribution of POI popularity. This work presents a novel tour recommendation model that distills knowledge and supervision signals from the trips in a self-supervised manner. We propose Contrastive Trajectory Learning for Tour Recommendation (CTLTR), which utilizes intrinsic POI dependencies and traveling intent to discover extra knowledge, and augments the sparse data via pre-training auxiliary self-supervised objectives. CTLTR provides a principled way to characterize the inherent data correlations while tackling the implicit feedback and weak supervision problems by learning robust representations applicable to tour planning. We introduce a hierarchical recurrent encoder-decoder to identify tourists' intentions and use a contrastive loss to discover subsequence semantics and their sequential patterns by maximizing mutual information. Additionally, we observe that a data augmentation step as a preliminary to contrastive learning can mitigate the overfitting caused by data sparsity. We conduct extensive experiments on a range of real-world datasets and demonstrate that our model significantly improves recommendation performance over state-of-the-art baselines in terms of both recommendation accuracy and visiting orders.
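The contrastive objective referred to here is typically an InfoNCE-style loss over augmented views of the same trajectory; below is a self-contained numpy sketch (the function name and the toy augmentation are illustrative, not CTLTR's actual code).

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss over a batch: each anchor's positive is the matching
    row of `positives`; all other rows act as in-batch negatives. This
    mirrors the kind of contrastive objective CTLTR-style models apply
    to augmented trajectory (sub)sequences."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))   # positives sit on the diagonal

# Toy usage: two augmented "views" of the same 8 trajectory embeddings.
rng = np.random.default_rng(3)
z = rng.standard_normal((8, 16))
z1 = z + 0.05 * rng.standard_normal(z.shape)   # e.g. subsequence crop
z2 = z + 0.05 * rng.standard_normal(z.shape)
print(info_nce(z1, z2))   # low loss: matching views are most similar
```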
    more » « less
  4. The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lie in an infinite-dimensional space vs. when the data lie in a finite-dimensional space with dimension that grows faster than the sample size.
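The two effective-rank notions in this characterization are commonly written as r_k(Σ) = (Σ_{i>k} λ_i)/λ_{k+1} and R_k(Σ) = (Σ_{i>k} λ_i)² / Σ_{i>k} λ_i². A small sketch computing both for a given spectrum (the definitions follow the published characterization; the helper name is ours):

```python
import numpy as np

def effective_ranks(eigs, k):
    """Two effective-rank notions for a covariance with eigenvalues
    `eigs` (sorted descending), as used in benign-overfitting analyses:
      r_k = (sum_{i>k} lambda_i) / lambda_{k+1}
      R_k = (sum_{i>k} lambda_i)^2 / sum_{i>k} lambda_i^2
    """
    tail = eigs[k:]                      # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return r_k, R_k

# Example: polynomially decaying spectrum lambda_i = 1/i, d = 10^4.
eigs = 1.0 / np.arange(1, 10_001)
print(effective_ranks(eigs, k=10))
```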

  5. Hard-to-predict bursts of the COVID-19 pandemic revealed the significance of statistical modeling that can resolve spatio-temporal correlations over geographical areas, for example the spread of infection over a city with census-tract granularity. In this manuscript, we provide algorithmic answers to the following two inter-related public health challenges of immense social impact which have not been adequately addressed. (1) Inference Challenge: assuming that there are N census blocks (nodes) in the city, and given an initial infection at any set of nodes, e.g. any of the N possible single-node infections, any of the $$N(N-1)/2$$ possible two-node infections, etc., what is the probability for a subset of census blocks to become infected by the time the spread of the infection burst is stabilized? (2) Prevention Challenge: what is the minimal control action one can take to minimize the infected part of the stabilized state footprint? To answer the challenges, we build a Graphical Model of the pandemic of the attractive Ising (pair-wise, binary) type, where each node represents a census tract, each edge factor represents the strength of the pairwise interaction between a pair of nodes (e.g. inter-node travel, road closures and the like), and each local bias/field represents the community's level of immunization, acceptance of social distancing, mask-wearing practice, etc. Resolving the Inference Challenge requires finding the Maximum-A-Posteriori (MAP), i.e. most probable, state of the Ising Model constrained to the set of initially infected nodes. (An infected node is in the $$+1$$ state and a node which remained safe is in the $$-1$$ state.) We show that almost all attractive Ising Models on dense graphs result in one of two possibilities (modes) for the MAP state: either all nodes which were not infected initially become infected, or all the initially uninfected nodes remain uninfected (susceptible). This bi-modal solution of the Inference Challenge allows us to re-state the Prevention Challenge as the following tractable convex programming problem: for the bare Ising Model with pair-wise and bias factors representing the system without prevention measures, such that the MAP state is fully infected for at least one of the initial infection patterns, find the closest (for example in the $$l_1$$, $$l_2$$ or any other convexity-preserving norm), and therefore prevention-optimal, set of factors such that all the MAP states of the Ising model with the optimal prevention measures applied become safe. We have illustrated the efficiency of the scheme on a quasi-realistic model of Seattle. Our experiments have also revealed useful features, such as sparsity of the prevention solution in the case of the $$l_1$$ norm, and also somewhat unexpected features, such as localization of the sparse prevention solution at pair-wise links which are NOT those that are most utilized/traveled.
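To make the bi-modal MAP claim concrete, here is a toy brute-force computation on a tiny attractive Ising model (the random couplings J and biases h are illustrative; realistic instances require scalable inference rather than enumeration).

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
N = 10
# Attractive Ising model: nonnegative pairwise couplings J, local biases h.
J = rng.uniform(0.0, 0.5, size=(N, N))
J = np.triu(J, 1)
J = J + J.T
h = rng.uniform(-1.0, 0.0, size=N)   # negative bias ~ immunization level

def map_state(infected):
    """Brute-force MAP over +/-1 states, constraining the initially
    infected nodes to +1. Feasible only for small N; the paper's setting
    calls for scalable inference instead of enumeration."""
    best, best_e = None, -np.inf
    for s in itertools.product([-1, 1], repeat=N):
        s = np.array(s)
        if np.any(s[infected] != 1):
            continue
        e = 0.5 * s @ J @ s + h @ s   # higher e = more probable state
        if e > best_e:
            best, best_e = s, e
    return best

s = map_state(infected=[0])
print(s)   # typically all +1 (fully infected) or only node 0 infected
```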