Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime.
Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form f⋆(x) = g(Ux) where U : ℝ^d → ℝ^r with d ≫ r. When the degree of f⋆ is p, it is known that n ≍ d^p samples are necessary to learn f⋆ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to f⋆. This results in an improved sample complexity of n ≍ d^2 r + d r^p. Furthermore, in a transfer learning setup where the data distributions in the source and target domains share the same representation U but have different polynomial heads, we show that a popular heuristic for transfer learning has a target sample complexity independent of d.
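The multi-index class above is easy to instantiate concretely. Below is a minimal numpy sketch (the particular choice of g, the dimensions, and all variable names are illustrative assumptions, not from the paper) of a target f⋆(x) = g(Ux) that depends on x only through r ≪ d directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, p = 50, 3, 3  # ambient dimension, latent dimension, polynomial degree

# Projection U : R^d -> R^r with orthonormal rows (the "relevant directions").
U = np.linalg.qr(rng.standard_normal((d, r)))[0].T  # shape (r, d)

def g(z):
    # An arbitrary degree-p polynomial head on the r latent coordinates.
    return z.sum(axis=-1) ** p + z[..., 0] * z[..., 1]

def f_star(x):
    # Target f*(x) = g(Ux): depends on x only through the r-dim projection Ux.
    return g(x @ U.T)

# Sanity check: perturbing x orthogonally to the rows of U leaves f* unchanged.
x = rng.standard_normal(d)
noise = rng.standard_normal(d)
noise -= U.T @ (U @ noise)  # project noise onto the orthogonal complement of row(U)
print(np.isclose(f_star(x), f_star(x + noise)))  # True
```

The final check verifies the defining property of the class: moving x within the orthogonal complement of the rows of U cannot change the label, which is exactly the structure a learned representation can exploit.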
This content will become publicly available on May 5, 2025
Learning Hierarchical Polynomials with Three-Layer Neural Networks
We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form h = g ∘ p, where g : ℝ → ℝ is a degree-k polynomial and p : ℝ^d → ℝ is a degree-p polynomial. This function class generalizes the single-index model, which corresponds to p(x) = ⟨v, x⟩, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree-p polynomials p, a three-layer neural network trained via layer-wise gradient descent on the square loss learns the target h up to vanishing test error in Õ(d^p) samples and polynomial time. This is a strict improvement over kernel methods, which require on the order of d^{kp} samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of p being a quadratic. When p is indeed a quadratic, we achieve the information-theoretically optimal sample complexity Õ(d^2), which is an improvement over prior work (Nichani et al., 2023) requiring a sample size of order d^4. Our proof proceeds by showing that during the initial stage of training the network performs feature learning to recover the feature p with Õ(d^p) samples. This work demonstrates the ability of three-layer neural networks to learn complex features and, as a result, learn a broad class of hierarchical functions.
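As a concrete illustration of the hierarchical class, here is a minimal numpy sketch (the specific g, p, and dimensions are illustrative assumptions) of a target h = g ∘ p with a quadratic inner feature, the case covered by the prior work of Nichani et al. (2023):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 30

# Inner feature p : R^d -> R, here a degree-2 polynomial.
A = rng.standard_normal((d, d))
A = (A + A.T) / 2  # symmetrize

def p(x):
    return x @ A @ x / d  # quadratic feature

def g(z):
    return z**3 - 2 * z  # degree-3 outer polynomial head

def h(x):
    # Hierarchical target h = g(p(x)): a degree-6 polynomial of x that
    # factors through the one-dimensional feature p(x).
    return g(p(x))

# Two inputs with the same feature value get the same label,
# even if they are far apart in R^d: p is even, so p(-x) = p(x).
x1 = rng.standard_normal(d)
print(np.isclose(h(x1), h(-x1)))  # True
```

Once the network has recovered p in the first training stage, fitting the scalar head g is a one-dimensional regression problem, which is the source of the sample-complexity gain over fitting the full degree-kp polynomial directly.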
 Award ID(s):
 2144994
 NSF-PAR ID:
 10511450
 Publisher / Repository:
 Openreview
 Date Published:
 Journal Name:
 International Conference on Learning Representations
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Loh, Po-Ling ; Raginsky, Maxim (Ed.) Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form f⋆(x) = g(Ux) where U : ℝ^d → ℝ^r with d ≫ r. When the degree of f⋆ is p, it is known that n ≍ d^p samples are necessary to learn f⋆ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to f⋆. This results in an improved sample complexity of n ≍ d^2 and enables transfer learning with sample complexity independent of d.

It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parametrizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest, nonlinear but regular networks, no tight characterization has yet been achieved, despite significant developments. We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., a small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with O(d) sample complexity in a large ambient dimension d. Our main results characterize a hierarchical property, the merged-staircase property, that is both necessary and nearly sufficient for learning in this setting. We further show that nonlinear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on a latent space of low dimension, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost-orthogonal functions.
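The merged-staircase property itself is combinatorial: writing the target as a sum of monomials in the binary coordinates, some ordering of the monomials must introduce at most one previously unseen coordinate at a time. A brute-force checker for this condition takes a few lines (this toy checker is our illustration, not the paper's code, and it ignores the analytic caveats behind "nearly sufficient"):

```python
from itertools import permutations

def merged_staircase(monomials):
    # Brute-force check: does some ordering of the monomials introduce
    # at most one previously unseen coordinate per monomial?
    for order in permutations(monomials):
        seen = set()
        for m in order:
            if len(set(m) - seen) > 1:
                break  # this ordering jumps by two or more new coordinates
            seen |= set(m)
        else:
            return True  # a valid staircase ordering was found
    return False

# x1 + x1*x2 + x1*x2*x3: each monomial adds exactly one new coordinate.
print(merged_staircase([{1}, {1, 2}, {1, 2, 3}]))  # True
# x1*x2 alone (an isolated degree-2 parity) has no staircase of lower-order terms.
print(merged_staircase([{1, 2}]))                  # False
```

The second example is the kind of function that, per the result above, cannot be learned with O(d) samples in this regime: the degree-2 parity appears without the lower-order "steps" that mean-field SGD climbs.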

Micciancio, Daniele ; Ristenpart, Thomas (Ed.) We present the first explicit construction of a non-malleable code that can handle tampering functions that are bounded-degree polynomials. Prior to our work, this was only known for degree-1 polynomials (affine tampering functions), due to Chattopadhyay and Li (STOC 2017). As a direct corollary, we obtain an explicit non-malleable code that is secure against tampering by bounded-size arithmetic circuits. We show applications of our non-malleable code in constructing non-malleable secret sharing schemes that are robust against bounded-degree polynomial tampering. In fact our result is stronger: we can handle adversaries that can adaptively choose the polynomial tampering function based on initial leakage of a bounded number of shares. Our results are derived from explicit constructions of seedless non-malleable extractors that can handle bounded-degree polynomial tampering functions. Prior to our work, no such result was known even for degree-2 (quadratic) polynomials.

We study deep neural networks with polynomial activations, particularly their expressive power. For a fixed architecture and activation degree, a polynomial neural network defines an algebraic map from weights to polynomials. The image of this map is the functional space associated to the network, and it is an irreducible algebraic variety upon taking closure. This paper proposes the dimension of this variety as a precise measure of the expressive power of polynomial neural networks. We obtain several theoretical results regarding this dimension as a function of architecture, including an exact formula for high activation degrees, as well as upper and lower bounds on layer widths in order for deep polynomial networks to fill the ambient functional space. We also present computational evidence that it is profitable in terms of expressiveness for layer widths to increase monotonically and then decrease monotonically. Finally, we link our study to favorable optimization properties when training weights, and we draw intriguing connections with tensor and polynomial decompositions.
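The proposed dimension can be probed numerically: for a fixed architecture, compute the Jacobian of the weights-to-coefficients map at a generic weight vector and take its rank. The numpy sketch below does this for the simplest polynomial network, one hidden layer with squaring activation (the architecture, dimensions, and all names are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 4, 2  # input dimension and hidden width (illustrative choices)

# One-hidden-layer network with squaring activation:
#   f(x) = sum_j c_j * (w_j . x)^2 = x^T M x,  with  M = sum_j c_j w_j w_j^T.
# The functional space is the set of reachable symmetric matrices M; its
# dimension equals the generic rank of the Jacobian of (W, c) -> M.
W = rng.standard_normal((m, d))
c = rng.standard_normal(m)

iu = np.triu_indices(d)  # the d(d+1)/2 independent entries of a symmetric M

def vech(M):
    return M[iu]

cols = []
for j in range(m):
    # derivative of M with respect to c_j
    cols.append(vech(np.outer(W[j], W[j])))
    # derivatives of M with respect to the entries of w_j
    for k in range(d):
        E = np.zeros((d, d))
        E[k] = W[j]  # E = e_k w_j^T
        cols.append(vech(c[j] * (E + E.T)))
J = np.column_stack(cols)

# Generic Jacobian rank = dimension of the functional variety.
print(np.linalg.matrix_rank(J))
```

For this toy architecture the reachable polynomials are the quadratic forms x ↦ x^T M x with rank(M) ≤ m, a variety of dimension m·d − m(m−1)/2, and the generic Jacobian rank should match it; the same recipe extends to deeper networks and higher activation degrees at the cost of a longer coefficient vector.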