Modern neural networks are often quite wide, causing large memory and computation
costs. It is thus of great interest to train a narrower network. However,
training narrow neural nets remains a challenging task. We ask two theoretical
questions: Can narrow networks have as strong expressivity as wide ones? If so,
does the loss function exhibit a benign optimization landscape? In this work, we
provide partially affirmative answers to both questions for 1-hidden-layer networks
with fewer than n (sample size) neurons when the activation is smooth. First, we
prove that as long as the width m ≥ 2n/d (where d is the input dimension), its
expressivity is strong, i.e., there exists at least one global minimizer with zero
training loss. Second, we identify a nice local region with no local minima or saddle
points. Nevertheless, it is not clear whether gradient descent can stay in this nice region.
Third, we consider a constrained optimization formulation where the feasible
region is the nice local region, and prove that every KKT point is a nearly global
minimizer. It is expected that projected gradient methods converge to KKT points
under mild technical conditions, but we leave the rigorous convergence analysis
to future work. Thorough numerical results show that projected gradient methods
on this constrained formulation significantly outperform SGD for training narrow
neural nets.
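As a purely illustrative sketch of the constrained approach described above, the toy example below runs projected gradient descent on a small 1-hidden-layer tanh network. The Frobenius-norm-ball feasible region, problem sizes, and step size are assumptions for demonstration only, not the paper's "nice local region":

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n samples in d dimensions, m hidden neurons (illustrative sizes).
n, d, m = 8, 4, 16
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

W = rng.standard_normal((m, d)) * 0.5   # hidden-layer weights
a = rng.standard_normal(m) * 0.5        # output weights

def loss(W, a):
    return 0.5 * np.mean((np.tanh(X @ W.T) @ a - y) ** 2)

def project(W, radius=5.0):
    # Hypothetical feasible region: a Frobenius-norm ball around the origin.
    norm = np.linalg.norm(W)
    return W if norm <= radius else W * (radius / norm)

loss0, lr = loss(W, a), 0.02
for _ in range(2000):
    H = np.tanh(X @ W.T)                          # n x m hidden activations
    r = H @ a - y                                 # residuals
    a -= lr * (H.T @ r) / n                       # gradient step on output weights
    grad_W = ((np.outer(r, a) * (1 - H ** 2)).T @ X) / n
    W = project(W - lr * grad_W)                  # gradient step, then projection

print(loss0, loss(W, a))
```

Every iterate stays inside the feasible region by construction, which is the point of the projected method: the optimization never leaves the region where the landscape is known to be benign.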
A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network
While over-parameterization is widely believed to be crucial for the success of optimization for neural networks, most existing theories on over-parameterization do not fully explain the reason: they either work in the Neural Tangent Kernel regime where neurons don't move much, or require an enormous number of neurons. In practice, when the data is generated using a teacher neural network, even mildly over-parameterized neural networks can achieve 0 loss and recover the directions of teacher neurons. In this paper we develop a local convergence theory for mildly over-parameterized two-layer neural nets. We show that as long as the loss is already lower than a threshold (polynomial in relevant parameters), all student neurons in an over-parameterized two-layer neural network will converge to one of the teacher neurons, and the loss will go to 0. Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons, and our convergence rate is independent of the number of student neurons. A key component of our analysis is a new characterization of the local optimization landscape: we show the gradient satisfies a special case of the Łojasiewicz property, which is different from the local strong convexity or PL conditions used in previous work.
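The teacher–student setting above can be sketched numerically. The sizes, tanh activation, and plain gradient descent below are illustrative assumptions, not the paper's exact construction or rates:

```python
import numpy as np

rng = np.random.default_rng(4)

# Teacher with 3 neurons; student mildly over-parameterized with 5.
d, n = 4, 64
W_teacher = rng.standard_normal((3, d))
X = rng.standard_normal((n, d))
y = np.tanh(X @ W_teacher.T).sum(axis=1)   # teacher labels (unit output weights)

W_student = rng.standard_normal((5, d)) * 0.3

def loss(W):
    return 0.5 * np.mean((np.tanh(X @ W.T).sum(axis=1) - y) ** 2)

loss0, lr = loss(W_student), 0.05
for _ in range(3000):
    H = np.tanh(X @ W_student.T)           # n x 5 student activations
    r = H.sum(axis=1) - y                  # residuals against teacher labels
    W_student -= lr * ((r[:, None] * (1 - H ** 2)).T @ X) / n

print(loss0, loss(W_student))
```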
 NSFPAR ID:
 10231316
 Date Published:
 Journal Name:
 COLT
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Daumé, H.; Singh, A. (Ed.) An acknowledged weakness of neural networks is their vulnerability to adversarial perturbations of the inputs. To improve the robustness of these models, one of the most popular defense mechanisms is to alternately maximize the loss over the constrained perturbations (also called adversaries) on the inputs using projected gradient ascent, and minimize over the weights. In this paper, we analyze the dynamics of the maximization step towards understanding the experimentally observed effectiveness of this defense mechanism. Specifically, we investigate the non-concave landscape of the adversaries for a two-layer neural network with a quadratic loss. Our main result proves that projected gradient ascent finds a local maximum of this non-concave problem in a polynomial number of iterations with high probability. To our knowledge, this is the first work that provides a convergence analysis of first-order adversaries. Moreover, our analysis demonstrates that, in the initial phase of adversarial training, the scale of the inputs matters, in the sense that a smaller input scale leads to faster convergence of adversarial training and a "more regular" landscape. Finally, we show that these theoretical findings are in excellent agreement with a series of experiments.
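A minimal numerical sketch of the maximization step (the "adversary") analyzed above, assuming a fixed two-layer tanh network, a quadratic loss, and an ℓ2 perturbation budget; all sizes, step counts, and the activation choice are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# A fixed two-layer network (weights are frozen during the inner maximization).
d, m = 5, 8
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)

def f(x):
    return a @ np.tanh(W @ x)

x0 = rng.standard_normal(d)        # clean input
y = f(x0) + 1.0                    # target chosen so the clean loss is nonzero

def loss(x):
    return 0.5 * (f(x) - y) ** 2   # quadratic loss

def grad(x):                       # d loss / d x
    t = np.tanh(W @ x)
    return (f(x) - y) * (W.T @ ((1 - t ** 2) * a))

eps, lr = 0.5, 0.05                # hypothetical budget and step size
delta = np.zeros(d)
for _ in range(100):
    delta = delta + lr * grad(x0 + delta)   # ascent step on the loss
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta *= eps / norm                 # project back onto the eps-ball

print(loss(x0), loss(x0 + delta))
```

The ascent direction increases the loss while the projection keeps the perturbation within budget, mirroring the alternating max step of adversarial training.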

A quantum neural network (QNN) is a parameterized mapping efficiently implementable on near-term Noisy Intermediate-Scale Quantum (NISQ) computers. It can be used for supervised learning when combined with classical gradient-based optimizers. Despite the existing empirical and theoretical investigations, the convergence of QNN training is not fully understood. Inspired by the success of neural tangent kernels (NTKs) in probing the dynamics of classical neural networks, a recent line of work proposes to study over-parameterized QNNs by examining a quantum version of tangent kernels. In this work, we study the dynamics of QNNs and show that, contrary to popular belief, it is qualitatively different from that of any kernel regression: due to the unitarity of quantum operations, there is a non-negligible deviation from the tangent kernel regression derived at the random initialization. As a result of this deviation, we prove at-most sublinear convergence for QNNs with Pauli measurements, which is beyond the explanatory power of any kernel regression dynamics. We then present the actual dynamics of QNNs in the limit of over-parameterization. The new dynamics capture the change of convergence rate during training, and imply that the range of measurements is crucial to fast QNN convergence.
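The role of unitarity can be illustrated with a deliberately tiny stand-in: a one-parameter "QNN" whose output is the Pauli-Z expectation cos(θ) after an RY(θ) rotation on |0⟩. Its tangent-kernel-style linearization at initialization escapes the physically allowed range [-1, 1], while the true model cannot. This is only a cartoon of the deviation the paper quantifies:

```python
import numpy as np

# Toy one-parameter "QNN": the Pauli-Z expectation after RY(theta) on |0>
# is cos(theta). Unitarity keeps this output bounded in [-1, 1].
def f(theta):
    return np.cos(theta)

# Tangent-kernel-style model: linearize f at a fixed initialization.
theta0 = 0.3
def f_lin(theta):
    return f(theta0) - np.sin(theta0) * (theta - theta0)

thetas = np.linspace(-4.0, 4.0, 201)
qnn_range = np.abs(f(thetas)).max()      # never exceeds 1 (unitarity)
lin_range = np.abs(f_lin(thetas)).max()  # the linear model leaves [-1, 1]
print(qnn_range, lin_range)
```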

Abstract We develop new theoretical results on matrix perturbation to shed light on the impact of architecture on the performance of a deep network. In particular, we explain analytically what deep learning practitioners have long observed empirically: the parameters of some deep architectures (e.g., residual networks, ResNets, and Dense networks, DenseNets) are easier to optimize than others (e.g., convolutional networks, ConvNets). Building on our earlier work connecting deep networks with continuous piecewise-affine splines, we develop an exact local linear representation of a deep network layer for a family of modern deep networks that includes ConvNets at one end of a spectrum and ResNets, DenseNets, and other networks with skip connections at the other. For regression and classification tasks that optimize the squared-error loss, we show that the optimization loss surface of a modern deep network is piecewise quadratic in the parameters, with local shape governed by the singular values of a matrix that is a function of the local linear representation. We develop new perturbation results for how the singular values of matrices of this sort behave as we add a fraction of the identity and multiply by certain diagonal matrices. A direct application of our perturbation results explains analytically why a network with skip connections (such as a ResNet or DenseNet) is easier to optimize than a ConvNet: thanks to its more stable singular values and smaller condition number, the local loss surface of such a network is less erratic, less eccentric, and features local minima that are more accommodating to gradient-based optimization. Our results also shed new light on the impact of different nonlinear activation functions on a deep network's singular values, regardless of its architecture.
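The qualitative claim about skip connections can be illustrated with a small numerical stand-in, assuming a random PSD matrix in place of a real layer's local linear representation (the paper's perturbation results also cover the diagonal multiplications): adding the identity provably shrinks the condition number.

```python
import numpy as np

rng = np.random.default_rng(2)

# PSD stand-in for a layer's local linear map (purely illustrative:
# the paper's matrices come from actual network layers).
n = 50
B = rng.standard_normal((n, n)) / np.sqrt(n)
A = B.T @ B                                  # eigenvalues = singular values >= 0

s = np.linalg.eigvalsh(A)                    # eigenvalues, ascending
cond_plain = s[-1] / s[0]                    # ConvNet-like map: A alone
cond_skip = (1.0 + s[-1]) / (1.0 + s[0])     # skip connection: I + A
print(cond_plain, cond_skip)
```

Since (1 + s_max) / (1 + s_min) < s_max / s_min whenever 0 < s_min < s_max, the shifted map is always better conditioned, matching the intuition for why skip-connected networks have more stable singular values.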

Feature representations from pre-trained deep neural networks have been known to exhibit excellent generalization and utility across a variety of related tasks. Fine-tuning is by far the simplest and most widely used approach that seeks to exploit and adapt these feature representations to novel tasks with limited data. Despite the effectiveness of fine-tuning, it is often suboptimal and requires very careful optimization to prevent severe overfitting to small datasets. The problem of suboptimality and overfitting is due in part to the large number of parameters used in a typical deep convolutional neural network. To address these problems, we propose a simple yet effective regularization method for fine-tuning pre-trained deep networks for the task of k-shot learning. To prevent overfitting, our key strategy is to cluster the model parameters while ensuring intra-cluster similarity and inter-cluster diversity of the parameters, effectively regularizing the dimensionality of the parameter search space. In particular, we identify groups of neurons within each layer of a deep network that share similar activation patterns. When the network is to be fine-tuned for a classification task using only k examples, we propagate a single gradient to all of the neuron parameters that belong to the same group. The grouping of neurons is non-trivial, as neuron activations depend on the distribution of the input data. To efficiently search for optimal groupings conditioned on the input data, we propose a reinforcement learning search strategy using recurrent networks to learn the optimal group assignments for each network layer. Experimental results show that our method can be easily applied to several popular convolutional neural networks and improves upon other state-of-the-art fine-tuning-based k-shot learning strategies by more than 10%.
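The group-wise update described above can be sketched as follows; the fixed grouping, layer size, and learning rate are stand-ins (in the paper, groups are found by a reinforcement learning search over activation patterns):

```python
import numpy as np

rng = np.random.default_rng(3)

# A layer with 6 neurons whose parameters fall into 2 hypothetical groups.
d = 4
W = rng.standard_normal((6, d))
groups = [np.array([0, 2, 4]), np.array([1, 3, 5])]

def grouped_gradient(grad_W, groups):
    # Propagate a single (averaged) gradient to every neuron in a group,
    # so all members of a group move together -- shrinking the effective
    # dimensionality of the parameter search space.
    out = np.empty_like(grad_W)
    for g in groups:
        out[g] = grad_W[g].mean(axis=0)
    return out

grad_W = rng.standard_normal(W.shape)   # per-neuron gradients from k examples
shared = grouped_gradient(grad_W, groups)
W -= 0.1 * shared                       # one regularized update step
```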