Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

Mixup is a data augmentation technique that relies on training using random convex combinations of data points and their labels. In recent years, Mixup has become a standard primitive used in the training of stateoftheart image classification models due to its demonstrated benefits over empirical risk minimization with regards to generalization and robustness. In this work, we try to explain some of this success from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a nontrivial class of data distributions with two features per class, training a 2layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class. We also show empirically that these theoretical insights extend to the practical settings of image benchmarks modified to have additional synthetic features.more » « less

In deep learning, often the training process nds an interpolator (a solution with 0 training loss), but the test loss is still low. This phenomenon, known as benign overfitting, is a major mystery that received a lot of recent attention. One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator, often characterized by minimizing certain norms. However, even for a simple sparse linear regression problem y = Ax+ noise with sparse x , neither minimum l_1 orl_`2 norm interpolator gives the optimal test loss. In this work, we give a different parametrization of the model which leads to a new implicit regularization effect that combines the benefit of l_1 and l_2 interpolators. We show that training our new model via gradient descent leads to an interpolator with nearoptimal test loss. Our result is based on careful analysis of the training dynamics and provides another example of implicit regularization effect that goes beyond norm minimization.more » « less

Meanfield limit has been successfully applied to neural networks, leading to many results in optimizing overparametrized networks. However, existing works often focus on twolayer networks and/or require large number of neurons. We give a new framework for extending the meanfield limit to multilayer network, and show that a polynomialsize threelayer network in our framework can learn the function constructed by Safran et al. (2019) – which is known to be not approximable by any twolayer networksmore » « less

Recently, researchers observed that gradient descent for deep neural networks operates in an “edgeofstability” (EoS) regime: the sharpness (maximum eigenvalue of the Hessian) is often larger than stability threshold 2/\eta (where \eta is the step size). Despite this, the loss oscillates and converges in the long run, and the sharpness at the end is just slightly below 2/\eta . While many other wellunderstood nonconvex objectives such as matrix factorization or twolayer networks can also converge despite large sharpness, there is often a larger gap between sharpness of the endpoint and 2/\eta . In this paper, we study EoS phenomenon by constructing a simple function that has the same behavior. We give rigorous analysis for its training dynamics in a large local region and explain why the fnal converging point has sharpness close to 2/\eta . Globally we observe that the training dynamics for our example have an interesting bifurcating behavior, which was also observed in the training of neural nets.more » « less

Selfsupervised learning has significantly improved the performance of many NLP tasks. In this paper, we highlight a key advantage of selfsupervised learning  when applied to data generated by topic models, selfsupervised learning can be oblivious to the specific model, and hence is less susceptible to model misspecification. In particular, we prove that commonly used selfsupervised objectives based on reconstruction or contrastive samples can both recover useful posterior information for general topic models. Empirically, we show that the same objectives can perform competitively against posterior inference using the correct model, while outperforming posterior inference using misspecified model.more » « less

Sparse coding refers to modeling a signal as sparse linear combinations of the elements of a learned dictionary. Sparse coding has proven to be a successful and interpretable approach in many applications, such as signal processing, computer vision, and medical imaging. While this success has spurred much work on sparse coding with provable guarantees, work on the setting where the learned dictionary is larger (or overrealized) with respect to the ground truth is comparatively nascent. Existing theoretical results in the overrealized regime are limited to the case of noiseless data. In this paper, we show that for overrealized sparse coding in the presence of noise, minimizing the standard dictionary learning objective can fail to recover the groundtruth dictionary, regardless of the magnitude of the signal in the datagenerating process. Furthermore, drawing from the growing body of work on selfsupervised learning, we propose a novel masking objective and we prove that minimizing this new objective can recover the groundtruth dictionary. We corroborate our theoretical results with experiments across several parameter regimes, showing that our proposed objective enjoys better empirical performance than the standard reconstruction objective.more » « less

In the Mixup training paradigm, a model is trained using convex combinations of data points and their associated labels. Despite seeing very few true data points during training, models trained using Mixup seem to still minimize the original empirical risk and exhibit better generalization and robustness on various tasks when compared to standard training. In this paper, we investigate how these benefits of Mixup training rely on properties of the data in the context of classification. For minimizing the original empirical risk, we compute a closed form for the Mixupoptimal classification, which allows us to construct a simple dataset on which minimizing the Mixup loss can provably lead to learning a classifier that does not minimize the empirical loss on the data. On the other hand, we also give sufficient conditions for Mixup training to also minimize the original empirical risk. For generalization, we characterize the margin of a Mixup classifier, and use this to understand why the decision boundary of a Mixup classifier can adapt better to the full structure of the training data when compared to standard training. In contrast, we also show that, for a large class of linear models and linearly separable datasets, Mixup training leads to learning the same classifier as standard training.more » « less

A popular line of recent research incorporates ML advice in the design of online algorithms to improve their performance in typical instances. These papers treat the ML algorithm as a blackbox, and redesign online algorithms to take advantage of ML predictions. In this paper, we ask the complementary question: can we redesign ML algorithms to provide better predictions for online algorithms? We explore this question in the context of the classic rentorbuy problem, and show that incorporating optimization benchmarks directly in ML loss functions leads to significantly better performance, while maintaining a worstcase adversarial result when the advice is completely wrong. We support this finding both through theoretical bounds and numerical simulations, and posit that “learning for optimization” is a fertile area for future research.more » « less

Recently, many reinforcement learning techniques have been shown to have provable guarantees in the simple case of linear dynamics, especially in problems like linear quadratic regulators. However, in practice many tasks require learning a policy from rich, highdimensional features such as images, which are unlikely to be linear. We consider a setting where there is a hidden linear subspace of the highdimensional feature space in which the dynamics are linear. We design natural objectives based on forward and inverse dynamics models. We prove that these objectives can be efficiently optimized and their local optimizers extract the hidden linear subspace. We empirically verify our theoretical results with synthetic data and explore the effectiveness of our approach (generalized to nonlinear settings) in simple control tasks with rich observations.more » « less

We explore the connection between outlierrobust highdimensional statistics and nonconvex optimization in the presence of sparsity constraints, with a focus on the fundamental tasks of robust sparse mean estimation and robust sparse PCA. We develop novel and simple optimization formulations for these problems such that any approximate stationary point of the associated optimization problem yields a nearoptimal solution for the underlying robust estimation task. As a corollary, we obtain that any firstorder method that efficiently converges to stationarity yields an efficient algorithm for these tasks.1 The obtained algorithms are simple, practical, and succeed under broader distributional assumptions compared to prior work.more » « less