Modern neural networks are often quite wide, causing large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can narrow networks have as strong expressivity as wide ones? If so, does the loss function exhibit a benign optimization landscape? In this work, we provide partially affirmative answers to both questions for 1hiddenlayer networks with fewer than n (sample size) neurons when the activation is smooth. First, we prove that as long as the width m>=2n=d (where d is the input dimension), its expressivity is strong, i.e., there exists at least one global minimizer with zero training loss. Second, we identify a nice local region with no localmin or saddle points. Nevertheless, it is not clear whether gradient descent can stay in this nice region. Third, we consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer. It is expected that projected gradient methods converge to KKT points under mild technical conditions, but we leave the rigorous convergence analysis to future work. Thorough numerical results showmore »
A Local Convergence Theory for Mildly OverParameterized TwoLayer Neural Network
While overparameterization is widely believed to be crucial for the success of optimization for the neural networks, most existing theories on overparameterization do not fully explain the reason  they either work in the Neural Tangent Kernel regime where neurons don't move much, or require an enormous number of neurons. In practice, when the data is generated using a teacher neural network, even mildly overparameterized neural networks can achieve 0 loss and recover the directions of teacher neurons. In this paper we develop a local convergence theory for mildly overparameterized twolayer neural net. We show that as long as the loss is already lower than a threshold (polynomial in relevant parameters), all student neurons in an overparameterized twolayer neural network will converge to one of teacher neurons, and the loss will go to 0. Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons, and our convergence rate is independent of the number of student neurons. A key component of our analysis is the new characterization of local optimization landscape  we show the gradient satisfies a special case of Lojasiewicz property which is different from more »
 Publication Date:
 NSFPAR ID:
 10231316
 Journal Name:
 COLT
 Sponsoring Org:
 National Science Foundation
More Like this


Daumé, H ; Singh, A (Ed.)An acknowledged weakness of neural networks is their vulnerability to adversarial perturbations to the inputs. To improve the robustness of these models, one of the most popular defense mechanisms is to alternatively maximize the loss over the constrained perturbations (or called adversaries) on the inputs using projected gradient ascent and minimize over weights. In this paper, we analyze the dynamics of the maximization step towards understanding the experimentally observed effectiveness of this defense mechanism. Specifically, we investigate the nonconcave landscape of the adversaries for a twolayer neural network with a quadratic loss. Our main result proves that projected gradient ascent finds a local maximum of this nonconcave problem in a polynomial number of iterations with high probability. To our knowledge, this is the first work that provides a convergence analysis of the firstorder adversaries. Moreover, our analysis demonstrates that, in the initial phase of adversarial training, the scale of the inputs matters in the sense that a smaller input scale leads to faster convergence of adversarial training and a “more regular” landscape. Finally, we show that these theoretical findings are in excellent agreement with a series of experiments.

Feature representations from pretrained deep neural networks have been known to exhibit excellent generalization and utility across a variety of related tasks. Finetuning is by far the simplest and most widely used approach that seeks to exploit and adapt these feature representations to novel tasks with limited data. Despite the effectiveness of finetuning, itis often suboptimal and requires very careful optimization to prevent severe overfitting to small datasets. The problem of suboptimality and overfitting, is due in part to the large number of parameters used in a typical deep convolutional neural network. To address these problems, we propose a simple yet effective regularization method for finetuning pretrained deep networks for the task of kshot learning. To prevent overfitting, our key strategy is to cluster the model parameters while ensuring intracluster similarity and intercluster diversity of the parameters, effectively regularizing the dimensionality of the parameter search space. In particular, we identify groups of neurons within each layer of a deep network that shares similar activation patterns. When the network is to be finetuned for a classification task using only k examples, we propagate a single gradient to all of the neuron parameters that belong to the same group. The grouping ofmore »

We give a new algorithm for learning a twolayer neural network under a general class of input distributions. Assuming there is a groundtruth twolayer network y = Aσ(Wx) + ξ, where A,W are weight matrices, ξ represents noise, and the number of neurons in the hidden layer is no larger than the input or output, our algorithm is guaranteed to recover the parameters A,W of the groundtruth network. The only requirement on the input x is that it is symmetric, which still allows highly complicated and structured input. Our algorithm is based on the methodofmoments framework and extends several results in tensor decompositions. We use spectral algorithms to avoid the complicated nonconvex optimization in learning neural networks. Experiments show that our algorithm can robustly learn the groundtruth neural network with a small number of samples for many symmetric input distributions.

We introduce a model for ant trail formation, building upon previous work on biologically feasible local algorithms that plausibly describe how ants maintain trail networks. The model is a variant of a reinforced random walk on a directed graph, where ants lay pheromone on edges as they traverse them and the next edge to traverse is chosen based on the level of pheromone; this pheromone decays with time. There is a bidirectional flow of ants in the network: the forward flow proceeds along forward edges from source (e.g. the nest) to sink (e.g. a food source), and the backward flow in the opposite direction. Some fraction of ants are lost as they pass through each node (modeling the loss of ants due to exploration observed in the field). We initiate a theoretical study of this model. We note that ant navigation has inspired the field of ant colony optimization, heuristics that have been applied to several combinatorial optimization problems; however the algorithms developed there are considerably more complex and not constrained to being biologically feasible. We first consider the linear decision rule, where the flow divides itself among the next set of edges in proportion to their pheromone level. Here,more »