

Title: Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective
We consider the problem of learning a single-index target function f∗ : R^d → R under spiked covariance data: f∗(x) = σ∗(⟨x, μ⟩ / √(1 + θ)), x ∼ N(0, I_d + θμμ^⊤), θ ≍ d^β for β ∈ [0, 1), where the link function σ∗ : R → R is a degree-p polynomial with information exponent k (defined as the lowest degree in the Hermite expansion of σ∗), and the target depends on the input x only through its projection onto the spike (signal) direction μ ∈ R^d. In the proportional asymptotic limit where the number of training examples n and the dimensionality d jointly diverge, n, d → ∞ with n/d → ψ ∈ (0, ∞), we ask the following question: how large should the spike magnitude θ be in order for (i) kernel methods and (ii) neural networks optimized by gradient descent to learn f∗? We show that for kernel ridge regression, β ≥ 1 − 1/p is both sufficient and necessary, whereas for two-layer neural networks trained with gradient descent, β > 1 − 1/k suffices. Our results demonstrate that both kernel methods and neural networks benefit from low-dimensional structure in the data; further, since k ≤ p by definition, neural networks can adapt to such structure more effectively.
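As a toy illustration of this data model (a minimal sketch, not the paper's experimental setup), the snippet below samples from the spiked covariance distribution, evaluates a single-index target with the degree-2 Hermite link σ∗(t) = t² − 1 (so k = p = 2), and fits kernel ridge regression; the values of d, n, β, the RBF kernel, and the ridge penalty are illustrative assumptions.

```python
# Illustrative sketch only: spiked-covariance single-index data + kernel ridge regression.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
d, n, beta = 200, 1000, 0.5            # dimension, sample size, spike exponent (assumed values)
theta = d ** beta                      # spike magnitude theta ~ d^beta
mu = rng.standard_normal(d)
mu /= np.linalg.norm(mu)               # unit-norm spike direction

def sample_x(m):
    # x ~ N(0, I_d + theta * mu mu^T): isotropic noise plus a rank-one spike
    z = rng.standard_normal((m, d))
    s = rng.standard_normal((m, 1))
    return z + np.sqrt(theta) * s * mu

def f_star(x):
    # single-index target with link sigma_*(t) = t^2 - 1 (second Hermite polynomial, k = 2)
    t = x @ mu / np.sqrt(1.0 + theta)
    return t ** 2 - 1.0

X_train, X_test = sample_x(n), sample_x(n)
y_train, y_test = f_star(X_train), f_star(X_test)

krr = KernelRidge(alpha=1e-3, kernel="rbf", gamma=1.0 / d)
krr.fit(X_train, y_train)
print("test MSE:", np.mean((krr.predict(X_test) - y_test) ** 2))
```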
Award ID(s):
2154099
PAR ID:
10540433
Author(s) / Creator(s):
Editor(s):
Oh, A; Naumann, T; Globerson, A; Saenko, K; Hardt, M; Levine, S
Publisher / Repository:
Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
Date Published:
Volume:
36
ISBN:
9781713871088
Page Range / eLocation ID:
20695--20728
Format(s):
Medium: X
Location:
New Orleans
Sponsoring Org:
National Science Foundation
More Like this
  1. Loh, Po-ling; Raginsky, Maxim (Ed.)
    Significant theoretical work has established that, in specific regimes, neural networks trained by gradient descent behave like kernel methods. In practice, however, neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime, by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form f⋆(x) = g(Ux) where U : R^d → R^r with d ≫ r. When the degree of f⋆ is p, it is known that n ≍ d^p samples are necessary to learn f⋆ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to f⋆. This results in an improved sample complexity of n ≍ d^2 and enables transfer learning with sample complexity independent of d.
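A minimal sketch of the setup in item 1, under assumed choices (the sizes d, r, the width, the degree-2 head g, and the learning rate are illustrative, not the authors'): train a two-layer ReLU network with plain gradient descent on a target of the form f⋆(x) = g(Ux).

```python
# Hedged sketch: two-layer ReLU network trained by gradient descent on f(x) = g(Ux), r << d.
import torch

torch.manual_seed(0)
d, r, n, width = 50, 2, 2000, 256               # assumed sizes
U = torch.linalg.qr(torch.randn(d, r))[0].T     # r x d matrix with orthonormal rows

def g(z):
    # degree-2 head acting only on the r relevant directions (illustrative choice)
    return (z ** 2).sum(dim=1) - z.prod(dim=1)

X = torch.randn(n, d)
y = g(X @ U.T)

net = torch.nn.Sequential(
    torch.nn.Linear(d, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

for step in range(2000):                        # full-batch gradient descent
    opt.zero_grad()
    loss = torch.mean((net(X).squeeze(-1) - y) ** 2)
    loss.backward()
    opt.step()

print("final train MSE:", loss.item())
```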
  2. Significant theoretical work has established that, in specific regimes, neural networks trained by gradient descent behave like kernel methods. In practice, however, neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime, by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form f⋆(x) = g(Ux) where U : R^d → R^r with d ≫ r. When the degree of f⋆ is p, it is known that n ≍ d^p samples are necessary to learn f⋆ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to f⋆. This results in an improved sample complexity of n ≍ d^2 r + d r^p. Furthermore, in a transfer learning setup where the data distributions in the source and target domains share the same representation U but have different polynomial heads, we show that a popular heuristic for transfer learning has a target sample complexity independent of d.
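The transfer-learning heuristic mentioned in item 2 can be sketched roughly as follows: keep the first-layer representation learned on the source task frozen and refit only the output head on a small target sample. All sizes and the placeholder target labels below are assumptions for illustration, not the paper's protocol.

```python
# Hedged sketch of a freeze-features / refit-head transfer heuristic (illustrative only).
import torch

torch.manual_seed(1)
d, width, n_target = 50, 256, 200      # assumed sizes; n_target is small and independent of d

# `source_net` stands in for a two-layer network already trained on the source task.
source_net = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
)

X_t = torch.randn(n_target, d)
y_t = torch.randn(n_target)            # placeholder target-task labels (would share the same U)

features = source_net[0:2]             # frozen first layer + ReLU
head = torch.nn.Linear(width, 1)       # fresh output head: the only trainable part
opt = torch.optim.SGD(head.parameters(), lr=1e-2)

with torch.no_grad():
    Phi = features(X_t)                # fixed representation of the target data

for step in range(1000):
    opt.zero_grad()
    loss = torch.mean((head(Phi).squeeze(-1) - y_t) ** 2)
    loss.backward()
    opt.step()

print("target-task MSE after refitting the head:", loss.item())
```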
  3. Over-parametrization is an important technique in training neural networks. In both theory and practice, training a larger network allows the optimization algorithm to avoid bad local optima. In this paper we study a closely related tensor decomposition problem: given an l-th order tensor in (R^d)^⊗l of rank r (where r ≪ d), can variants of gradient descent find a rank-m decomposition where m > r? We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least m = Ω(d^(l−1)), while a variant of gradient descent can find an approximate tensor when m = O∗(r^(2.5l) log d). Our results show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and exploit certain low-rank structure in the data.
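A rough sketch of the kind of objective studied in item 3, under illustrative assumptions (order l = 3, a symmetric decomposition, and arbitrary d, r, m, learning rate; this is not the paper's algorithm or parameter regime): gradient descent on the squared reconstruction loss of an over-parametrized rank-m decomposition.

```python
# Illustrative only: gradient descent on an over-parametrized order-3 tensor decomposition.
import torch

torch.manual_seed(0)
d, r, m = 10, 3, 20                    # ambient dimension, true rank, over-parametrized rank (assumed)

# Ground-truth symmetric tensor T = sum_i a_i ⊗ a_i ⊗ a_i
A = torch.randn(r, d)
T = torch.einsum('ip,iq,ir->pqr', A, A, A)

# Over-parametrized components, trained by plain gradient descent on the squared loss
B = (0.1 * torch.randn(m, d)).requires_grad_(True)
opt = torch.optim.SGD([B], lr=5e-4)    # step size and iteration count may need tuning

for step in range(10000):
    opt.zero_grad()
    T_hat = torch.einsum('ip,iq,ir->pqr', B, B, B)
    loss = torch.sum((T_hat - T) ** 2)
    loss.backward()
    opt.step()

print("relative reconstruction error:", (loss / torch.sum(T ** 2)).sqrt().item())
```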
  4. Numerous experimental and computational studies show that continuous hopper flows of granular materials obey the Beverloo equation relating the volume flow rate Q to the orifice width w: Q ∼ (w/σ_avg − k)^β, where σ_avg is the average particle diameter, kσ_avg is an offset at which Q ∼ 0, the power-law scaling exponent is β = d − 1/2, and d is the spatial dimension. Recent studies of hopper flows of deformable particles in different background fluids suggest that the particle stiffness and dissipation mechanism can also strongly affect the power-law scaling exponent β. We carry out computational studies of hopper flows of deformable particles with both kinetic friction and background fluid dissipation in two and three dimensions. We show that the exponent β varies continuously with the ratio of the viscous drag to the kinetic friction coefficient, λ = ζ/μ: β = d − 1/2 in the λ → 0 limit and β = d − 3/2 in the λ → ∞ limit, with a midpoint λ_c that depends on the hopper opening angle θ_w. We also characterize the spatial structure of the flows and associate changes in the spatial structure of the hopper flows with changes in the exponent β. The offset k increases with particle stiffness until k ∼ k_max in the hard-particle limit, where k_max ∼ 3.5 is larger for λ → ∞ than for λ → 0. Finally, we show that simulations of hopper flows of deformable particles in the λ → ∞ limit recapitulate the experimental results for quasi-2D hopper flows of oil droplets in water.
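The Beverloo form in item 4 can be fit directly to flow-rate measurements. The sketch below does this on synthetic data with scipy's curve_fit; the values of C, k, β, the noise level, and the parameter bounds are made up for illustration rather than taken from the paper.

```python
# Illustrative fit of Q = C * (w / sigma_avg - k)**beta to synthetic flow-rate data.
import numpy as np
from scipy.optimize import curve_fit

sigma_avg = 1.0                                   # average particle diameter (assumed units)

def beverloo(w, C, k, beta):
    return C * (w / sigma_avg - k) ** beta

# Synthetic "measurements" generated with made-up parameters (beta = 1.5, i.e. d - 1/2 in 2D)
rng = np.random.default_rng(0)
w = np.linspace(4.0, 12.0, 20)
Q = beverloo(w, C=2.0, k=2.5, beta=1.5) * (1 + 0.02 * rng.standard_normal(w.size))

popt, _ = curve_fit(beverloo, w, Q, p0=[1.0, 1.0, 1.0],
                    bounds=([0.1, 0.0, 0.5], [10.0, 3.5, 3.0]))
C_fit, k_fit, beta_fit = popt
print(f"C = {C_fit:.2f}, k = {k_fit:.2f}, beta = {beta_fit:.2f}")
```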
  5.
    We present approximation and exact algorithms for piecewise regression of univariate and bivariate data using fixed-degree polynomials. Specifically, given a set S of n data points (x_1, y_1), . . . , (x_n, y_n) ∈ R^d × R where d ∈ {1, 2}, the goal is to segment the x_i's into some (arbitrary) number of disjoint pieces P_1, . . . , P_k, where each piece P_j is associated with a fixed-degree polynomial f_j : R^d → R, so as to minimize the total loss λk + Σ_{i=1}^n (y_i − f(x_i))^2, where λ ≥ 0 is a regularization term that penalizes model complexity (the number of pieces) and f : ∪_{j=1}^k P_j → R is the piecewise polynomial function defined by f|_{P_j} = f_j. The pieces P_1, . . . , P_k are disjoint intervals of R in the case of univariate data and disjoint axis-aligned rectangles in the case of bivariate data. Our formulation allows the use of any fixed-degree polynomial on each piece, not just linear functions. Our main results are the following. For univariate data, we present a (1 + ε)-approximation algorithm with time complexity O((n/ε) log(1/ε)), assuming that the data is presented in sorted order of the x_i's. For bivariate data, we present three results: a sub-exponential exact algorithm with running time n^(O(√n)); a polynomial-time constant-factor approximation algorithm; and a quasi-polynomial time approximation scheme (QPTAS). The bivariate case is believed to be NP-hard as folklore, but we could not find a published record in the literature, so we also present a hardness proof for completeness.
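For intuition about the univariate objective λk + Σ_i (y_i − f(x_i))², here is a simple exact dynamic program over segment endpoints. It is a baseline sketch, not the paper's (1 + ε)-approximation algorithm, and it performs on the order of n² polynomial fits.

```python
# Baseline sketch: exact DP for univariate piecewise polynomial regression, penalty lambda per piece.
import numpy as np

def piecewise_fit(x, y, lam, degree=1):
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)

    def piece_cost(j, i):
        # Squared error of one degree-`degree` polynomial fit to points j..i (inclusive).
        deg = min(degree, i - j)                 # cap degree at (#points - 1) so the fit is well-posed
        coeffs = np.polyfit(x[j:i + 1], y[j:i + 1], deg=deg)
        return float(np.sum((np.polyval(coeffs, x[j:i + 1]) - y[j:i + 1]) ** 2))

    dp = np.full(n + 1, np.inf)                  # dp[i]: optimal objective on the first i points
    dp[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):                       # last piece covers points j..i-1
            dp[i] = min(dp[i], dp[j] + lam + piece_cost(j, i - 1))
    return dp[n]

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 60)
y = np.where(x < 5, 1.0 + 0.3 * x, 6.0 - 0.5 * x) + 0.1 * rng.standard_normal(60)
print("optimal objective:", piecewise_fit(x, y, lam=0.5, degree=1))
```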