We prove the first superpolynomial lower bounds for learning one-layer neural networks with respect to the Gaussian distribution using gradient descent. We show that any classifier trained using gradient descent with respect to square loss will fail to achieve small test error in polynomial time given access to samples labeled by a one-layer neural network. For classification, we give a stronger result, namely that any statistical query (SQ) algorithm (including gradient descent) will fail to achieve small test error in polynomial time. Prior work held only for gradient descent run with small batch sizes, required sharp activations, and applied to specific classes of queries. Our lower bounds hold for broad classes of activations including ReLU and sigmoid. The core of our result relies on a novel construction of a simple family of neural networks that are exactly orthogonal with respect to all spherically symmetric distributions.
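As a rough illustration of the training procedure the lower bounds apply to (not the paper's hardness construction itself), here is a minimal NumPy sketch of full-batch gradient descent on the square loss, with Gaussian inputs labeled by a one-hidden-layer ReLU network; all names and sizes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n = 10, 8, 256

# Target one-hidden-layer ReLU network that labels the Gaussian samples.
W_star = rng.standard_normal((width, d))
a_star = rng.standard_normal(width)

def forward(W, a, X):
    return np.maximum(X @ W.T, 0.0) @ a   # ReLU hidden layer, linear output

X = rng.standard_normal((n, d))           # Gaussian inputs
y = forward(W_star, a_star, X)            # labels from the target network

# Student network trained by full-batch gradient descent on the square loss.
W = rng.standard_normal((width, d)) * 0.1
a = rng.standard_normal(width) * 0.1
lr = 1e-3
loss0 = np.mean((forward(W, a, X) - y) ** 2)
for _ in range(500):
    H = np.maximum(X @ W.T, 0.0)                        # hidden activations
    r = H @ a - y                                       # residuals
    grad_a = H.T @ r / n
    grad_W = ((r[:, None] * (X @ W.T > 0)) * a).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W
mse = np.mean((forward(W, a, X) - y) ** 2)
```

In this small, benign instance the loss decreases; the paper's point is that for its carefully constructed target family, no polynomial number of such gradient steps (or SQ queries) can drive the test error down.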
The staircase property: How hierarchical structure can guide deep learning
This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically. We define the "staircase" property for functions over the Boolean hypercube, which posits that high-order Fourier coefficients are reachable from lower-order Fourier coefficients along increasing chains. We prove that functions satisfying this property can be learned in polynomial time using layerwise stochastic coordinate descent on regular neural networks, a class of network architectures and initializations that have homogeneity properties. Our analysis shows that for such staircase functions and neural networks, the gradient-based algorithm learns high-level features by greedily combining lower-level features along the depth of the network. We further back our theoretical results with experiments showing that staircase functions are learnable by more standard ResNet architectures with stochastic gradient descent. Both the theoretical and experimental results support the claim that the staircase property has a role to play in understanding the capabilities of gradient-based learning on regular networks, in contrast to general polynomial-size networks that can emulate any Statistical Query or PAC algorithm, as recently shown.
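To make the staircase property concrete, here is a small sketch (an illustrative example, not code from the paper) of the canonical staircase function f(x) = x1 + x1x2 + x1x2x3 + ..., whose Fourier support can be recovered by brute force and seen to form an increasing chain of monomials:

```python
import itertools
import numpy as np

def staircase(x):
    """The canonical staircase function f(x) = x1 + x1*x2 + x1*x2*x3 + ...
    evaluated on x in {-1, +1}^d."""
    return np.cumprod(x).sum()    # prefix products are the chained monomials

# Recover its Fourier support by brute force over {-1, +1}^d.
d = 4
coeffs = {}
for r in range(d + 1):
    for S in itertools.combinations(range(d), r):
        total = 0.0
        for x in itertools.product([-1, 1], repeat=d):
            chi = np.prod([x[i] for i in S]) if S else 1.0
            total += staircase(np.array(x)) * chi
        coeffs[S] = total / 2 ** d

support = sorted(S for S, c in coeffs.items() if abs(c) > 1e-9)
# The support is {1}, {1,2}, {1,2,3}, {1,2,3,4} (0-indexed below): each set
# extends the previous one by exactly one coordinate, which is exactly the
# reachability along increasing chains that the staircase property requires.
```

A parity like x1x2x3x4 alone lacks this property: its single high-order coefficient is not reachable from any lower-order one, which matches the intuition for why it is hard for gradient-based layerwise learning.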
 Award ID(s):
 2031883
 NSFPAR ID:
 10344265
 Date Published:
 Journal Name:
 Advances in Neural Information Processing Systems
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. A coarse gradient is generally not a gradient of any function but an artificial ascent direction. The weight update of BCGD goes by coarse gradient correction of a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-width such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we show a convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data and prove that the expected coarse gradient correlates positively with the underlying true gradient.
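The blended update described above can be sketched in a few lines. The following is a simplified stand-in, assuming binary (scaled-sign) weight quantization and a linear model, so the "coarse" gradient coincides with the true gradient; in a real QDNN with piecewise constant activations one would substitute a surrogate derivative inside `coarse_grad`. The function names are illustrative, not from the paper's code:

```python
import numpy as np

def quantize(w):
    """Binarize weights to a scaled sign vector (a common 1-bit scheme)."""
    return np.mean(np.abs(w)) * np.sign(w)

def coarse_grad(w_q, X, y):
    """Gradient of the square loss evaluated at the quantized weights.
    For this linear stand-in it is the true gradient; with piecewise
    constant activations a surrogate derivative would be used here."""
    return X.T @ (X @ w_q - y) / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 16))
w_true = rng.standard_normal(16)
y = X @ w_true

w = rng.standard_normal(16) * 0.1      # full-precision weights
rho, lr = 0.1, 0.05
loss0 = np.mean((X @ quantize(w) - y) ** 2)
for _ in range(200):
    w_q = quantize(w)
    # Blending: average the full-precision weights with their quantization,
    # then apply a coarse-gradient step evaluated at the quantized point.
    w = (1 - rho) * w + rho * w_q - lr * coarse_grad(w_q, X, y)
loss = np.mean((X @ quantize(w) - y) ** 2)
```

The blending term `(1 - rho) * w + rho * w_q` pulls the full-precision weights toward their quantization, which is what gives the sufficient-descent guarantee the abstract mentions.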

It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parametrizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest (nonlinear but regular networks) no tight characterization has yet been achieved, despite significant developments. We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., a small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with O(d) sample complexity in a large ambient dimension d. Our main results characterize a hierarchical property, the merged-staircase property, that is both necessary and nearly sufficient for learning in this setting. We further show that nonlinear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on a latent space of low dimension, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost-orthogonal functions.

In this paper we prove that Local (S)GD (or FedAvg) can optimize deep neural networks with the Rectified Linear Unit (ReLU) activation function in polynomial time. Despite the established convergence theory of Local SGD for optimizing general smooth functions in communication-efficient distributed optimization, its convergence on non-smooth ReLU networks still eludes full theoretical understanding. The key property used in many Local SGD analyses on smooth functions is gradient Lipschitzness, which ensures that gradients on local models do not drift far from those on the averaged model. However, this desirable property does not hold in networks with the non-smooth ReLU activation function. We show that, even though ReLU networks do not admit the gradient Lipschitzness property, the difference between gradients on local models and the averaged model does not change too much under the dynamics of Local SGD. We validate our theoretical results via extensive experiments. This work is the first to show the convergence of Local SGD on non-smooth functions, and will shed light on the optimization theory of federated training of deep neural networks.
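The Local SGD / FedAvg scheme the abstract analyzes alternates local gradient steps with server-side averaging. Here is a minimal sketch of that loop, using a smooth least-squares objective per worker as a stand-in for the ReLU networks the paper actually treats; names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, n_local, n_rounds = 8, 4, 5, 20

# Each worker holds its own least-squares problem: a smooth quadratic
# stand-in for the non-smooth ReLU networks analyzed in the paper.
w_true = rng.standard_normal(d)
shards = []
for _ in range(n_workers):
    X = rng.standard_normal((64, d))
    shards.append((X, X @ w_true))

def grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

w_avg = np.zeros(d)
lr = 0.1
for _ in range(n_rounds):
    local_models = []
    for X, y in shards:
        w = w_avg.copy()
        for _ in range(n_local):       # H local gradient steps per round
            w -= lr * grad(w, X, y)
        local_models.append(w)
    w_avg = np.mean(local_models, axis=0)   # FedAvg: server averages models

err = np.linalg.norm(w_avg - w_true)
```

For smooth objectives like this one, gradient Lipschitzness bounds how far local models drift between averaging rounds; the paper's contribution is showing that an analogous drift bound survives when the smoothness assumption is dropped for ReLU.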

We investigate the robustness of stochastic approximation approaches against data poisoning attacks. We focus on two-layer neural networks with ReLU activation and show that, under a specific notion of separability in the RKHS induced by the infinite-width network, training (finite-width) networks with stochastic gradient descent is robust against data poisoning attacks. Interestingly, we find that in addition to a lower bound on the width of the network, which is standard in the literature, we also require a distribution-dependent upper bound on the width for robust generalization. We provide extensive empirical evaluations that support and validate our theoretical results.