Title: Neural Networks can Learn Representations with Gradient Descent
Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime, by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e., of the form $f^\star(x) = g(Ux)$ where $U: \mathbb{R}^d \to \mathbb{R}^r$ with $d \gg r$. When the degree of $f^\star$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n \asymp d^2 r + d r^p$. Furthermore, in a transfer learning setup where the data distributions in the source and target domains share the same representation $U$ but have different polynomial heads, we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.
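To make the setting concrete, here is a minimal PyTorch sketch of the problem described above: inputs with a target of the form $f^\star(x) = g(Ux)$, a two-layer network trained by plain gradient descent, and a check of how much the learned first-layer weights lie in the relevant subspace. The width, step size, number of steps, and the particular polynomial $g$ are illustrative assumptions rather than the paper's exact construction.

```python
# Minimal sketch (not the paper's exact setup): learn f*(x) = g(Ux) with a
# two-layer network trained by gradient descent, then measure how much of the
# first-layer weights lies in the low-dimensional relevant subspace.
import torch

torch.manual_seed(0)
d, r, n = 50, 3, 5000                       # ambient dim, relevant dim, samples
Q, _ = torch.linalg.qr(torch.randn(d, r))   # orthonormal basis of the subspace
U = Q.T                                     # U : R^d -> R^r  (r x d)

def g(z):                                   # illustrative low-degree polynomial head
    return (z ** 2).sum(dim=1) + z.prod(dim=1)

X = torch.randn(n, d)
y = g(X @ U.T)

width = 256
net = torch.nn.Sequential(
    torch.nn.Linear(d, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

for step in range(2000):                    # both layers move: outside the kernel regime
    opt.zero_grad()
    loss = torch.mean((net(X).squeeze(-1) - y) ** 2)
    loss.backward()
    opt.step()

# Fraction of the first-layer weight norm lying in the span of U's rows.
W = net[0].weight.detach()                  # width x d
alignment = (W @ U.T @ U).norm() / W.norm()
print(f"final loss {loss.item():.4f}, alignment with relevant subspace {alignment.item():.3f}")
```

In the kernel (lazy) regime the first layer would stay close to its random initialization, so checking this alignment is one simple way to see whether a task-relevant representation has been learned.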
Award ID(s): 1846369 1813877
PAR ID: 10398966
Author(s) / Creator(s): ; ;
Date Published:
Journal Name: Advances in Neural Information Processing Systems
ISSN: 1049-5258
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Loh, Po-ling; Raginsky, Maxim (Ed.)
    Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime, by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e., of the form $f^\star(x) = g(Ux)$ where $U: \mathbb{R}^d \to \mathbb{R}^r$ with $d \gg r$. When the degree of $f^\star$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n \asymp d^2$ and enables transfer learning with sample complexity independent of $d$.
  2. Oh, A; Naumann, T; Globerson, A; Saenko, K; Hardt, M; Levine, S (Ed.)
    We consider the problem of learning a single-index target function $f_*: \mathbb{R}^d \to \mathbb{R}$ under spiked covariance data: $f_*(x) = \sigma_*\big(\tfrac{1}{\sqrt{1+\theta}}\langle x, \mu \rangle\big)$ with $x \sim \mathcal{N}(0, I_d + \theta \mu\mu^\top)$ and $\theta \asymp d^\beta$ for $\beta \in [0, 1)$, where the link function $\sigma_*: \mathbb{R} \to \mathbb{R}$ is a degree-$p$ polynomial with information exponent $k$ (defined as the lowest degree in the Hermite expansion of $\sigma_*$), and the target depends on the projection of the input $x$ onto the spike (signal) direction $\mu \in \mathbb{R}^d$. In the proportional asymptotic limit where the number of training examples $n$ and the dimensionality $d$ jointly diverge, $n, d \to \infty$ with $n/d \to \psi \in (0, \infty)$, we ask the following question: how large should the spike magnitude $\theta$ be in order for (i) kernel methods, (ii) neural networks optimized by gradient descent, to learn $f_*$? We show that for kernel ridge regression, $\beta \ge 1 - \frac{1}{p}$ is both sufficient and necessary, whereas for two-layer neural networks trained with gradient descent, $\beta > 1 - \frac{1}{k}$ suffices. Our results demonstrate that both kernel methods and neural networks benefit from low-dimensional structures in the data. Further, since $k \le p$ by definition, neural networks can adapt to such structures more effectively.
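As a concrete reference for the data model above, the following NumPy sketch samples spiked covariance inputs $x \sim \mathcal{N}(0, I_d + \theta\mu\mu^\top)$ and evaluates a single-index target $\sigma_*(\langle x, \mu\rangle/\sqrt{1+\theta})$. The value of $\beta$ and the link $\sigma_*$ (the third Hermite polynomial, so here $k = p = 3$) are illustrative placeholders.

```python
# Sketch of the spiked-covariance single-index data model; the constants below
# (d, n, beta) and the link sigma_* are illustrative choices, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
d, n, beta = 200, 400, 0.5
theta = d ** beta                        # spike magnitude: theta ≍ d^beta

mu = rng.standard_normal(d)
mu /= np.linalg.norm(mu)                 # unit spike (signal) direction

# x ~ N(0, I_d + theta * mu mu^T): isotropic noise plus a rank-one spike.
z = rng.standard_normal((n, d))
s = rng.standard_normal((n, 1))
X = z + np.sqrt(theta) * s * mu

def sigma_star(t):                       # He_3(t): degree p = 3, information exponent k = 3
    return t ** 3 - 3 * t

y = sigma_star((X @ mu) / np.sqrt(1 + theta))
print("empirical Var(<x, mu>):", (X @ mu).var(), "expected:", 1 + theta)
```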
    The ability to learn useful features is one of the major advantages of neural networks. Although recent works show that neural networks can operate in a neural tangent kernel (NTK) regime that does not allow feature learning, many works also demonstrate the potential for neural networks to go beyond the NTK regime and perform feature learning. Recently, a line of work highlighted the feature learning capabilities of the early stages of gradient-based training. In this paper we consider another mechanism for feature learning via gradient descent through a local convergence analysis. We show that once the loss is below a certain threshold, gradient descent with a carefully regularized objective will capture ground-truth directions. We further strengthen this local convergence analysis by incorporating early-stage feature learning analysis. Our results demonstrate that feature learning not only happens during the initial gradient steps, but can also occur towards the end of training.
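A rough illustration of the quantity being tracked, not the paper's algorithm: a two-layer network is trained with weight decay as a generic stand-in for the carefully regularized objective, and the alignment of its first-layer weights with a ground-truth direction is printed at several points along training.

```python
# Illustrative only: a single ground-truth direction w*, a two-layer student
# trained with weight decay, and the cosine alignment of hidden-unit weights
# with w* reported over the course of training.
import torch

torch.manual_seed(0)
d, n, width = 30, 2000, 128
w_star = torch.nn.functional.normalize(torch.randn(d), dim=0)

X = torch.randn(n, d)
y = torch.relu(X @ w_star)                  # simple single-direction teacher

net = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
)
opt = torch.optim.SGD(net.parameters(), lr=5e-2, weight_decay=1e-3)

for step in range(3001):
    opt.zero_grad()
    loss = torch.mean((net(X).squeeze(-1) - y) ** 2)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        W = net[0].weight.detach()
        align = (torch.nn.functional.normalize(W, dim=1) @ w_star).abs().mean()
        print(f"step {step:4d}  loss {loss.item():.4f}  alignment {align.item():.3f}")
```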
    A key challenge in complex visuomotor control is learning abstract representations that are effective for specifying goals, planning, and generalization. To this end, we introduce universal planning networks (UPN). UPNs embed differentiable planning within a goal-directed policy. This planning computation unrolls a forward model in a latent space and infers an optimal action plan through gradient descent trajectory optimization. The plan-by-gradient-descent process and its underlying representations are learned end-to-end to directly optimize a supervised imitation learning objective. We find that the representations learned are not only effective for goal-directed visual imitation via gradient-based trajectory optimization, but can also provide a metric for specifying goals using images. The learned representations can be leveraged to specify distance-based rewards to reach new target states for model-free reinforcement learning, resulting in substantially more effective learning when solving new tasks described via image-based goals. We were able to achieve successful transfer of visuomotor planning strategies across robots with significantly different morphologies and actuation capabilities.
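The core plan-by-gradient-descent computation can be sketched as an inner optimization over an action sequence through a differentiable latent forward model, as below. The encoder and dynamics here are untrained placeholders (in UPN they are learned end-to-end from the imitation objective), and all dimensions, horizons, and step sizes are arbitrary.

```python
# Sketch of planning by gradient descent through a latent forward model.
# The encoder and dynamics are untrained stand-ins; UPN learns them end-to-end
# by backpropagating an imitation loss through this inner planning loop.
import torch

torch.manual_seed(0)
latent_dim, action_dim, horizon = 16, 4, 10

encoder = torch.nn.Linear(32, latent_dim)                        # stand-in observation encoder
dynamics = torch.nn.Linear(latent_dim + action_dim, latent_dim)  # latent forward model
encoder.requires_grad_(False)                                    # frozen for the inner loop
dynamics.requires_grad_(False)

obs, goal = torch.randn(32), torch.randn(32)
z0, z_goal = encoder(obs), encoder(goal)

actions = torch.zeros(horizon, action_dim, requires_grad=True)
plan_opt = torch.optim.SGD([actions], lr=0.1)

for it in range(100):                         # inner planning loop
    plan_opt.zero_grad()
    z = z0
    for t in range(horizon):                  # unroll the forward model
        z = dynamics(torch.cat([z, actions[t]]))
    plan_loss = torch.sum((z - z_goal) ** 2)  # distance to the goal encoding
    plan_loss.backward()                      # gradients flow only into the actions
    plan_opt.step()

print(f"planning loss after optimization: {plan_loss.item():.4f}")
```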
    Over-parametrization is an important technique in training neural networks. In both theory and practice, training a larger network allows the optimization algorithm to avoid bad locally optimal solutions. In this paper we study a closely related tensor decomposition problem: given an $l$-th order tensor in $(\mathbb{R}^d)^{\otimes l}$ of rank $r$ (where $r \ll d$), can variants of gradient descent find a rank-$m$ decomposition where $m > r$? We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least $m = \Omega(d^{l-1})$, while a variant of gradient descent can find an approximate tensor when $m = O^*(r^{2.5l}\log d)$. Our results show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
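A minimal sketch of the over-parametrized decomposition problem: a random symmetric order-3 tensor of rank $r$ is fit with $m > r$ rank-one components from a small initialization. The symmetric setup, the use of Adam (for robustness of the illustration), and the learning rate are assumptions of this sketch; the paper analyzes variants of gradient descent.

```python
# Over-parametrized CP decomposition sketch: fit a rank-r symmetric order-3
# tensor with m > r rank-one components, starting from a small initialization.
import torch

torch.manual_seed(0)
d, r, m = 20, 3, 30                          # dimension, true rank, fitted rank (m > r)

# Ground-truth tensor: sum of r symmetric rank-one terms a ⊗ a ⊗ a.
A = torch.nn.functional.normalize(torch.randn(r, d), dim=1)
T = torch.einsum('ki,kj,kl->ijl', A, A, A)

# Over-parametrized factor matrix, small initialization (non-lazy regime).
B = (0.01 * torch.randn(m, d)).requires_grad_(True)
opt = torch.optim.Adam([B], lr=1e-2)

for step in range(5001):
    opt.zero_grad()
    T_hat = torch.einsum('ki,kj,kl->ijl', B, B, B)
    loss = torch.mean((T_hat - T) ** 2)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        rel_err = torch.linalg.norm(T_hat - T) / torch.linalg.norm(T)
        print(f"step {step:5d}  relative reconstruction error {rel_err.item():.3e}")
```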