Title: Learning Hierarchical Polynomials with Three-Layer Neural Networks
We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$, where $p : \mathbb{R}^d \to \mathbb{R}$ is a degree $k$ polynomial and $g : \mathbb{R} \to \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index model, which corresponds to $k = 1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{O}(d^k)$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\widetilde{\Theta}(d^{kq})$ samples, as well as over existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of $p$ being a quadratic. When $p$ is indeed a quadratic, we achieve the information-theoretically optimal sample complexity $\widetilde{O}(d^2)$, which improves over prior work (Nichani et al., 2023) requiring a sample size of $\widetilde{\Theta}(d^4)$. Our proof proceeds by showing that during the initial stage of training, the network performs feature learning to recover the feature $p$ with $\widetilde{O}(d^k)$ samples. This work demonstrates the ability of three-layer neural networks to learn complex features and, as a result, to learn a broad class of hierarchical functions.
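As a rough illustration of the setup (our own sketch, not the paper's exact algorithm): the target is $h(x) = g(p(x))$ on standard Gaussian inputs, and a three-layer network is trained in two layerwise stages on the square loss. The particular choices of $p$, $g$, the widths, the optimizer, and the step counts below are illustrative assumptions written in PyTorch, and $p$ here is an arbitrary quadratic rather than a member of the paper's specific subclass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 20, 4000

# Illustrative hierarchical target h(x) = g(p(x)) with p a degree-2 polynomial
# and g a degree-2 link; both are arbitrary choices for demonstration.
A = torch.randn(d, d) / d
X = torch.randn(n, d)                     # standard Gaussian inputs
p_feat = ((X @ A) * X).sum(dim=1)         # p(x) = x^T A x
y = p_feat ** 2 - 1.0                     # g(z) = z^2 - 1

class ThreeLayerNet(nn.Module):
    """Two hidden ReLU layers and a linear output (biases via nn.Linear)."""
    def __init__(self, d, m1=512, m2=512):
        super().__init__()
        self.V = nn.Linear(d, m1)
        self.W = nn.Linear(m1, m2)
        self.a = nn.Linear(m2, 1)
    def forward(self, x):
        return self.a(torch.relu(self.W(torch.relu(self.V(x))))).squeeze(-1)

net = ThreeLayerNet(d)
mse = nn.MSELoss()

def train(params, steps, lr=1e-3):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = mse(net(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# Stage 1 (feature learning): update only the first hidden layer.
train(net.V.parameters(), steps=1000)
# Stage 2: fit the remaining layers on top of the learned feature.
final = train(list(net.W.parameters()) + list(net.a.parameters()), steps=1000)
print(f"train MSE after layerwise training: {final:.4f}")
```

The two stages mirror the proof outline described in the abstract: the first stage is meant to recover the feature $p$, after which fitting the outer layers reduces to a low-dimensional regression on top of that feature.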
Award ID(s):
2144994
PAR ID:
10511450
Author(s) / Creator(s):
; ;
Publisher / Repository:
Openreview
Date Published:
Journal Name:
International Conference on Learning Representations
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime, by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form $f^\star(x) = g(Ux)$ where $U : \mathbb{R}^d \to \mathbb{R}^r$ with $d \gg r$ (a toy construction of this function class is sketched below). When the degree of $f^\star$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n \asymp d^2 r + d r^p$. Furthermore, in a transfer learning setup where the data distributions in the source and target domains share the same representation $U$ but have different polynomial heads, we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.
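A toy construction of this function class and of the shared-representation transfer setup (our own illustration; the dimensions and polynomial heads are made up):

```python
# f*(x) = g(Ux) with d >> r: only r directions of the input matter.
# Source and target tasks share the projection U but use different heads.
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 100, 3, 5000

# Shared representation: U has orthonormal rows.
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))   # Q: (d, r), orthonormal columns
U = Q.T                                            # U: (r, d)

X = rng.standard_normal((n, d))
Z = X @ U.T                                        # latent coordinates, shape (n, r)

# Different low-degree polynomial heads applied to the same latent Z.
y_source = Z[:, 0] * Z[:, 1] + Z[:, 2] ** 3        # degree-3 head
y_target = Z[:, 0] ** 2 - Z[:, 1] * Z[:, 2]        # degree-2 head
```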
  2. Micciancio, Daniele; Ristenpart, Thomas. (Ed.)
    We present the first explicit construction of a non-malleable code that can handle tampering functions that are bounded-degree polynomials. Prior to our work, this was only known for degree-1 polynomials (affine tampering functions), due to Chattopadhyay and Li (STOC 2017). As a direct corollary, we obtain an explicit non-malleable code that is secure against tampering by bounded-size arithmetic circuits. We show applications of our non-malleable code in constructing non-malleable secret sharing schemes that are robust against bounded-degree polynomial tampering. In fact our result is stronger: we can handle adversaries that can adaptively choose the polynomial tampering function based on initial leakage of a bounded number of shares. Our results are derived from explicit constructions of seedless non-malleable extractors that can handle bounded-degree polynomial tampering functions. Prior to our work, no such result was known even for degree-2 (quadratic) polynomials.
  3. Loh, Po-ling; Raginsky, Maxim (Ed.)
    Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime, by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form $f^\star(x) = g(Ux)$ where $U : \mathbb{R}^d \to \mathbb{R}^r$ with $d \gg r$. When the degree of $f^\star$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n \asymp d^2$ and enables transfer learning with sample complexity independent of $d$.
  4. It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parametrizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest (non-linear but regular networks), no tight characterization has yet been achieved, despite significant developments. We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., a small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with O(d) sample complexity in a large ambient dimension d. Our main results characterize a hierarchical property, the merged-staircase property, that is both necessary and nearly sufficient for learning in this setting (an illustrative example is given after this item). We further show that non-linear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on a latent space of low dimension, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost orthogonal functions.
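For intuition, an example of our own (not taken from the abstract above): the merged-staircase property roughly asks that the monomials of the target can be ordered so that each one introduces at most one coordinate not already used. On $x \in \{-1,+1\}^d$, the function $f(x) = x_1 + x_1 x_2 + x_1 x_2 x_3$ satisfies the property, since each successive monomial adds exactly one new coordinate, whereas $f(x) = x_1 x_2 x_3$ on its own does not, since its only monomial introduces three new coordinates at once; under the characterization above, this separates the two with respect to SGD-learnability with O(d) samples in this regime.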
  5. Ta-Shma, Amnon (Ed.)
    We study the fundamental challenge of exhibiting explicit functions that have small correlation with low-degree polynomials over 𝔽₂. Our main contributions include: 1) In STOC 2020, CHHLZ introduced a new technique to prove correlation bounds. Using their technique they established new correlation bounds for low-degree polynomials. They conjectured that their technique generalizes to higher degree polynomials as well. We give a counterexample to their conjecture, in fact ruling out weaker parameters and showing what they prove is essentially the best possible. 2) We propose a new approach for proving correlation bounds with the central "mod functions," consisting of two steps: (I) the polynomials that maximize correlation are symmetric and (II) symmetric polynomials have small correlation. Contrary to related results in the literature, we conjecture that (I) is true. We argue this approach is not affected by existing "barrier results." 3) We prove our conjecture for quadratic polynomials. Specifically, we determine the maximum possible correlation between quadratic polynomials modulo 2 and the functions (x_1,… ,x_n) → z^{∑ x_i} for any z on the complex unit circle, and show that it is achieved by symmetric polynomials. To obtain our results we develop a new proof technique: we express correlation in terms of directional derivatives and analyze it by slowly restricting the direction. 4) We make partial progress on the conjecture for cubic polynomials, in particular proving tight correlation bounds for cubic polynomials whose degree-3 part is symmetric. 
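To make the quantities in the last abstract concrete, here is a small brute-force sketch of our own (using one common normalization of correlation; conventions differ across papers): it computes the correlation of an 𝔽₂ polynomial with the mod function $x \mapsto z^{\sum x_i}$ for a point $z$ on the complex unit circle by direct enumeration.

```python
# Brute-force correlation of an F2 polynomial p with x -> z^{sum x_i}
# (our own illustration; small n only, since it enumerates all 2^n inputs).
import itertools
import cmath

def correlation(p, n, z):
    """|E_x (-1)^{p(x)} * z^{sum(x)}| over uniform x in {0,1}^n."""
    total = 0j
    for x in itertools.product((0, 1), repeat=n):
        total += (-1) ** p(x) * z ** sum(x)
    return abs(total) / 2 ** n

n = 8
z = cmath.exp(2j * cmath.pi / 3)   # a cube root of unity on the unit circle

# A symmetric quadratic: the elementary symmetric polynomial e_2 mod 2.
e2 = lambda x: sum(x[i] * x[j] for i in range(n) for j in range(i + 1, n)) % 2
# A non-symmetric quadratic for comparison.
q = lambda x: (x[0] * x[1] + x[2]) % 2

print("corr(e_2, z^sum):", correlation(e2, n, z))
print("corr(q,   z^sum):", correlation(q, n, z))
```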