Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

Selfsupervised learning has significantly improved the performance of many NLP tasks. In this paper, we highlight a key advantage of selfsupervised learning  when applied to data generated by topic models, selfsupervised learning can be oblivious to the specific model, and hence is less susceptible to model misspecification. In particular, we prove that commonly used selfsupervised objectives based on reconstruction or contrastive samples can both recover useful posterior information for general topic models. Empirically, we show that the same objectives can perform competitively against posterior inference using the correct model, while outperforming posterior inference using misspecified model.more » « less

In the Mixup training paradigm, a model is trained using convex combinations of data points and their associated labels. Despite seeing very few true data points during training, models trained using Mixup seem to still minimize the original empirical risk and exhibit better generalization and robustness on various tasks when compared to standard training. In this paper, we investigate how these benefits of Mixup training rely on properties of the data in the context of classification. For minimizing the original empirical risk, we compute a closed form for the Mixupoptimal classification, which allows us to construct a simple dataset on which minimizing the Mixup loss can provably lead to learning a classifier that does not minimize the empirical loss on the data. On the other hand, we also give sufficient conditions for Mixup training to also minimize the original empirical risk. For generalization, we characterize the margin of a Mixup classifier, and use this to understand why the decision boundary of a Mixup classifier can adapt better to the full structure of the training data when compared to standard training. In contrast, we also show that, for a large class of linear models and linearly separable datasets, Mixup training leads to learning the same classifier as standard training.more » « less

A popular line of recent research incorporates ML advice in the design of online algorithms to improve their performance in typical instances. These papers treat the ML algorithm as a blackbox, and redesign online algorithms to take advantage of ML predictions. In this paper, we ask the complementary question: can we redesign ML algorithms to provide better predictions for online algorithms? We explore this question in the context of the classic rentorbuy problem, and show that incorporating optimization benchmarks directly in ML loss functions leads to significantly better performance, while maintaining a worstcase adversarial result when the advice is completely wrong. We support this finding both through theoretical bounds and numerical simulations, and posit that “learning for optimization” is a fertile area for future research.more » « less

Recently, many reinforcement learning techniques have been shown to have provable guarantees in the simple case of linear dynamics, especially in problems like linear quadratic regulators. However, in practice many tasks require learning a policy from rich, highdimensional features such as images, which are unlikely to be linear. We consider a setting where there is a hidden linear subspace of the highdimensional feature space in which the dynamics are linear. We design natural objectives based on forward and inverse dynamics models. We prove that these objectives can be efficiently optimized and their local optimizers extract the hidden linear subspace. We empirically verify our theoretical results with synthetic data and explore the effectiveness of our approach (generalized to nonlinear settings) in simple control tasks with rich observations.more » « less

Learningtolearn (using optimization algorithms to learn a new optimizer) has successfully trained efficient optimizers in practice. This approach relies on metagradient descent on a metaobjective based on the trajectory that the optimizer generates. However, there were few theoretical guarantees on how to avoid metagradient explosion/vanishing problems, or how to train an optimizer with good generalization performance. In this paper, we study the learningtolearn approach on a simple problem of tuning the step size for quadratic loss. Our results show that although there is a way to design the metaobjective so that the metagradient remain polynomially bounded, computing the metagradient directly using backpropagation leads to numerical issues that look similar to gradient explosion/vanishing problems. We also characterize when it is necessary to compute the metaobjective on a separate validation set instead of the original training set. Finally, we verify our results empirically and show that a similar phenomenon appears even for more complicated learned optimizers parametrized by neural networks.more » « less

While overparameterization is widely believed to be crucial for the success of optimization for the neural networks, most existing theories on overparameterization do not fully explain the reason  they either work in the Neural Tangent Kernel regime where neurons don't move much, or require an enormous number of neurons. In practice, when the data is generated using a teacher neural network, even mildly overparameterized neural networks can achieve 0 loss and recover the directions of teacher neurons. In this paper we develop a local convergence theory for mildly overparameterized twolayer neural net. We show that as long as the loss is already lower than a threshold (polynomial in relevant parameters), all student neurons in an overparameterized twolayer neural network will converge to one of teacher neurons, and the loss will go to 0. Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons, and our convergence rate is independent of the number of student neurons. A key component of our analysis is the new characterization of local optimization landscape  we show the gradient satisfies a special case of Lojasiewicz property which is different from local strong convexity or PL conditions used in previous work.more » « less

We give a algorithm for exact sampling from the Bingham distribution p(x)∝exp(x⊤Ax) on the sphere Sd−1 with expected runtime of poly(d,λmax(A)−λmin(A)). The algorithm is based on rejection sampling, where the proposal distribution is a polynomial approximation of the pdf, and can be sampled from by explicitly evaluating integrals of polynomials over the sphere. Our algorithm gives exact samples, assuming exact computation of an inverse function of a polynomial. This is in contrast with Markov Chain Monte Carlo algorithms, which are not known to enjoy rapid mixing on this problem, and only give approximate samples. As a direct application, we use this to sample from the posterior distribution of a rank1 matrix inference problem in polynomial time.more » « less

In this paper we study the training dynamics for gradient flow on overparametrized tensor decomposition problems. Empirically, such training process often first fits larger components and then discovers smaller components, which is similar to a tensor deflation process that is commonly used in tensor decomposition algorithms. We prove that for orthogonally decomposable tensor, a slightly modified version of gradient flow would follow a tensor deflation process and recover all the tensor components. Our proof suggests that for orthogonal tensors, gradient flow dynamics works similarly as greedy lowrank learning in the matrix setting, which is a first step towards understanding the implicit regularization effect of overparametrized models for lowrank tensors.more » « less

A popular line of recent research incorporates ML advice in the design of online algorithms to improve their performance in typical instances. These papers treat the ML algorithm as a blackbox, and redesign online algorithms to take advantage of ML predictions. In this paper, we ask the complementary question: can we redesign ML algorithms to provide better predictions for online algorithms? We explore this question in the context of the classic rentorbuy problem, and show that incorporating optimization benchmarks directly in ML loss functions leads to significantly better performance, while maintaining a worstcase adversarial result when the advice is completely wrong. We support this finding both through theoretical bounds and numerical simulations, and posit that “learning for optimization” is a fertile area for future research.more » « less

We study the problem of highdimensional robust mean estimation in the presence of a constant fraction of adversarial outliers. A recent line of work has provided sophisticated polynomialtime algorithms for this problem with dimensionindependent error guarantees for a range of natural distribution families. In this work, we show that a natural nonconvex formulation of the problem can be solved directly by gradient descent. Our approach leverages a novel structural lemma, roughly showing that any approximate stationary point of our nonconvex objective gives a nearoptimal solution to the underlying robust estimation task. Our work establishes an intriguing connection between algorithmic highdimensional robust statistics and nonconvex optimization, which may have broader applications to other robust estimation tasks.more » « less