 NSFPAR ID:
 10409758
 Date Published:
 Journal Name:
 International Conference on Machine Learning
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find nearoptimal solutions to nonconvex optimization problems, and despite giving a nearperfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with twolayer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.more » « less

null (Ed.)Ridgelike regularization often leads to improved generalization performance of machine learning models by mitigating overfitting. While ridgeregularized machine learning methods are widely used in many important applications, direct training via optimization could become challenging in huge data scenarios with millions of examples and features. We tackle such challenges by proposing a general approach that achieves ridgelike regularization through implicit techniques named Minipatch Ridge (MPRidge). Our approach is based on taking an ensemble of coefficients of unregularized learners trained on many tiny, random subsamples of both the examples and features of the training data, which we call minipatches. We empirically demonstrate that MPRidge induces an implicit ridgelike regularizing effect and performs nearly the same as explicit ridge regularization for a general class of predictors including logistic regression, SVM, and robust regression. Embarrassingly parallelizable, MPRidge provides a computationally appealing alternative to inducing ridgelike regularization for improving generalization performance in challenging bigdata settings.more » « less

The main objective of Personalized Tour Recommendation (PTR) is to generate a sequence of pointofinterest (POIs) for a particular tourist, according to the userspecific constraints such as duration time, start and end points, the number of attractions planned to visit, and so on. Previous PTR solutions are based on either heuristics for solving the orienteering problem to maximize a global reward with a specified budget or approaches attempting to learn user visiting preferences and transition patterns with the stochastic process or recurrent neural networks. However, existing learning methodologies rely on historical trips to train the model and use the next visited POI as the supervised signal, which may not fully capture the coherence of preferences and thus recommend similar trips to different users, primarily due to the data sparsity problem and longtailed distribution of POI popularity. This work presents a novel tour recommendation model by distilling knowledge and supervision signals from the trips in a selfsupervised manner. We propose Contrastive Trajectory Learning for Tour Recommendation (CTLTR), which utilizes the intrinsic POI dependencies and traveling intent to discover extra knowledge and augments the sparse data via pretraining auxiliary selfsupervised objectives. CTLTR provides a principled way to characterize the inherent data correlations while tackling the implicit feedback and weak supervision problems by learning robust representations applicable for tour planning. We introduce a hierarchical recurrent encoderdecoder to identify touristsâ€™ intentions and use the contrastive loss to discover subsequence semantics and their sequential patterns through maximizing the mutual information. Additionally, we observe that a data augmentation step as the preliminary of contrastive learning can solve the overfitting issue resulting from data sparsity. We conduct extensive experiments on a range of realworld datasets and demonstrate that our model can significantly improve the recommendation performance over the stateoftheart baselines in terms of both recommendation accuracy and visiting orders.more » « less

The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has nearoptimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finitedimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lie in an infinitedimensional space vs. when the data lie in a finitedimensional space with dimension that grows faster than the sample size.

Abstract Hardtopredict bursts of COVID19 pandemic revealed significance of statistical modeling which would resolve spatiotemporal correlations over geographical areas, for example spread of the infection over a city with census tract granularity. In this manuscript, we provide algorithmic answers to the following two interrelated public health challenges of immense social impact which have not been adequately addressed (1) Inference Challenge assuming that there are N census blocks (nodes) in the city, and given an initial infection at any set of nodes, e.g. any N of possible single node infections, any $$N(N1)/2$$ N ( N  1 ) / 2 of possible two node infections, etc, what is the probability for a subset of census blocks to become infected by the time the spread of the infection burst is stabilized? (2) Prevention Challenge What is the minimal control action one can take to minimize the infected part of the stabilized state footprint? To answer the challenges, we build a Graphical Model of pandemic of the attractive Ising (pairwise, binary) type, where each node represents a census tract and each edge factor represents the strength of the pairwise interaction between a pair of nodes, e.g. representing the internode travel, road closure and related, and each local bias/field represents the community level of immunization, acceptance of the social distance and mask wearing practice, etc. Resolving the Inference Challenge requires finding the MaximumAPosteriory (MAP), i.e. most probable, state of the Ising Model constrained to the set of initially infected nodes. (An infected node is in the $$+ \, 1$$ + 1 state and a node which remained safe is in the $$ \, 1$$  1 state.) We show that almost all attractive Ising Models on dense graphs result in either of the two possibilities (modes) for the MAP state: either all nodes which were not infected initially became infected, or all the initially uninfected nodes remain uninfected (susceptible). This bimodal solution of the Inference Challenge allows us to restate the Prevention Challenge as the following tractable convex programming : for the bare Ising Model with pairwise and bias factors representing the system without prevention measures, such that the MAP state is fully infected for at least one of the initial infection patterns, find the closest, for example in $$l_1$$ l 1 , $$l_2$$ l 2 or any other convexitypreserving norm, therefore preventionoptimal, set of factors resulting in all the MAP states of the Ising model, with the optimal prevention measures applied, to become safe. We have illustrated efficiency of the scheme on a quasirealistic model of Seattle. Our experiments have also revealed useful features, such as sparsity of the prevention solution in the case of the $$l_1$$ l 1 norm, and also somehow unexpected features, such as localization of the sparse prevention solution at pairwise links which are NOT these which are most utilized/traveled.more » « less