The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks
It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parametrizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest (nonlinear but regular networks), no tight characterization has yet been achieved, despite significant developments.
We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., a small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD learnability with O(d) sample complexity in a large ambient dimension d.
Our main results characterize a hierarchical property, the merged-staircase property, that is both necessary and nearly sufficient for learning in this setting. We further show that nonlinear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on …
 Award ID(s):
 2031883
 Publication Date:
 NSF-PAR ID:
 10344261
 Journal Name:
 Proceedings of Thirty Fifth Conference on Learning Theory
 Volume:
 178
 Page Range or eLocation ID:
 4782-4887
 Sponsoring Org:
 National Science Foundation
More Like this


Anandkumar, Animashree (Ed.) Neural networks provide a rich class of high-dimensional, non-convex optimization problems. Despite their non-convexity, gradient-descent methods often successfully optimize these models. This has motivated a recent surge of research attempting to characterize properties of their loss surface that may explain such success. In this paper, we address this phenomenon by studying a key topological property of the loss: the presence or absence of spurious valleys, defined as connected components of sublevel sets that do not include a global minimum. Focusing on a class of two-layer neural networks defined by smooth (but generally nonlinear) activation functions, we identify a notion of intrinsic dimension and show that it provides necessary and sufficient conditions for the absence of spurious valleys. More concretely, finite intrinsic dimension guarantees that for sufficiently overparametrised models no spurious valleys exist, independently of the data distribution. Conversely, infinite intrinsic dimension implies that spurious valleys do exist for certain data distributions, independently of model overparametrisation. Besides these positive and negative results, we show that, although spurious valleys may exist in general, they are confined to low risk levels and avoided with high probability on overparametrised models.
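As a hypothetical numeric illustration of the intrinsic-dimension idea (not the paper's own construction): for a polynomial activation such as σ(t) = t², the unit functions x ↦ σ(w·x) span a finite-dimensional function space — the space of quadratic forms, of dimension d(d+1)/2 — and a rank computation on randomly sampled units and evaluation points makes this visible:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                  # input dimension
n_units, n_points = 200, 200           # far more units/points than d(d+1)/2

W = rng.normal(size=(n_units, d))      # random hidden-unit weights
X = rng.normal(size=(n_points, d))     # random evaluation points

# Feature matrix: row i samples the function x -> (w_i . x)^2 at the points.
# Since (w.x)^2 = x^T (w w^T) x, these functions live in the span of
# symmetric d x d matrices, whatever the number of units.
F = (W @ X.T) ** 2

rank = np.linalg.matrix_rank(F)
expected = d * (d + 1) // 2            # dim of symmetric d x d matrices: 10 for d=4
```

However many units are added, the rank saturates at d(d+1)/2 — a finite intrinsic dimension, the regime where the result above rules out spurious valleys for sufficiently overparametrised models. A generic smooth nonpolynomial activation instead yields features whose rank keeps growing with the number of units, the infinite-intrinsic-dimension regime.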

We consider the problem of finding a two-layer neural network with sigmoid, rectified linear unit (ReLU), or binary step activation functions that "fits" a training data set as accurately as possible, as quantified by the training error, and study the following question: does a low training error guarantee that the norm of the output layer (outer norm) itself is small? We answer this question affirmatively for the case of nonnegative output weights. Using a simple covering number argument, we establish that, under quite mild distributional assumptions on the input/label pairs, any such network achieving a small training error on polynomially many data necessarily has a well-controlled outer norm. Notably, our results (a) have a polynomial (in d) sample complexity, (b) are independent of the number of hidden units (which can potentially be very high), (c) are oblivious to the training algorithm, and (d) require quite mild assumptions on the data (in particular, the input vector X ∈ ℝ^d need not have independent coordinates). We then leverage our bounds to establish generalization guarantees for such networks through the fat-shattering dimension, a scale-sensitive measure of the complexity of the class that the network architectures we investigate belong to. Notably, our generalization bounds also have good sample complexity …
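To fix terminology, here is a minimal sketch (all sizes and names are illustrative, and the setup below is a toy, not the paper's construction) of the two quantities the result relates: the training error of a two-layer ReLU network with nonnegative output weights, and its outer norm, the norm of the output layer:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 10, 500, 100                  # input dim, hidden width, sample count

X = rng.normal(size=(n, d))             # training inputs
W = rng.normal(size=(m, d))             # inner (hidden-layer) weights
a = np.abs(rng.normal(size=m)) / m      # nonnegative output weights, as in the result

relu = lambda z: np.maximum(z, 0.0)
preds = relu(X @ W.T) @ a               # network output: a . ReLU(W x)

# Labels close to the network's output, so the training error is small.
y = preds + 0.01 * rng.normal(size=n)

train_error = float(np.mean((preds - y) ** 2))
outer_norm = float(np.sum(a))           # l1 norm of the (nonnegative) output layer
```

The theorem runs in the reverse direction of this toy: it says that any network in this class achieving a small `train_error` on polynomially many samples must have a bounded `outer_norm`, regardless of the width m or how the weights were found.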


In this paper we prove that Local (S)GD (or FedAvg) can optimize deep neural networks with the Rectified Linear Unit (ReLU) activation function in polynomial time. Despite the established convergence theory of Local SGD for optimizing general smooth functions in communication-efficient distributed optimization, its convergence on nonsmooth ReLU networks still eludes full theoretical understanding. The key property used in many Local SGD analyses on smooth functions is gradient Lipschitzness, which ensures that the gradients on local models do not drift far from those on the averaged model. However, this desirable property does not hold in networks with the nonsmooth ReLU activation function. We show that, even though ReLU networks do not admit the gradient Lipschitzness property, the difference between the gradients on local models and on the averaged model does not change too much under the dynamics of Local SGD. We validate our theoretical results via extensive experiments. This work is the first to show the convergence of Local SGD on nonsmooth functions, and will shed light on the optimization theory of federated training of deep neural networks.
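The Local SGD / FedAvg loop the abstract analyzes can be sketched as follows. This is a minimal illustration on a smooth least-squares toy problem (the paper's contribution is precisely the harder nonsmooth ReLU case, which a snippet cannot capture); the shard sizes, learning rate, and round counts are made-up choices:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, K, H, rounds, lr = 5, 400, 4, 10, 30, 0.02

# A shared least-squares objective, with the data split across K workers.
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star
shards = np.array_split(np.arange(n), K)

def loss(w):
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(d)                            # global (averaged) model
loss0 = loss(w)
for _ in range(rounds):
    local_models = []
    for s in shards:                       # each worker starts from the global model
        wk = w.copy()
        for _ in range(H):                 # H local SGD steps on the worker's shard
            i = rng.choice(s)
            g = 2.0 * (X[i] @ wk - y[i]) * X[i]
            wk -= lr * g
        local_models.append(wk)
    w = np.mean(local_models, axis=0)      # FedAvg: average the local models
loss1 = loss(w)
```

The drift the paper controls is visible in this structure: between averaging steps, each `wk` evolves on its own shard, so its gradients are taken at a point that has moved away from `w`; gradient Lipschitzness is the standard tool for bounding that drift on smooth losses, and the paper's result is that an analogous bound survives for nonsmooth ReLU networks.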