Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S. (Eds.)
We consider the problem of learning a single-index target function f∗ : R^d → R under spiked covariance data:

f∗(x) = σ∗( ⟨x, μ⟩ / √(1 + θ) ),   x ∼ N(0, I_d + θμμ^⊤),   θ ≍ d^β for β ∈ [0, 1),

where the link function σ∗ : R → R is a degree-p polynomial with information exponent k (defined as the lowest degree in the Hermite expansion of σ∗), and f∗ depends on the input x only through its projection onto the spike (signal) direction μ ∈ R^d. In
the proportional asymptotic limit where the number of training examples n and
the dimensionality d jointly diverge: n, d → ∞, n/d → ψ ∈ (0,∞), we ask the
following question: how large should the spike magnitude θ be, in order for (i)
kernel methods, (ii) neural networks optimized by gradient descent, to learn f∗?
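
As a minimal, illustrative sketch of this data model (not code from the paper; the concrete choices of d, β, and the example link σ∗(t) = t³ − 3t = He_3(t) are assumptions made here for illustration), one could sample from the spiked covariance distribution and evaluate f∗ as follows:

import numpy as np

rng = np.random.default_rng(0)

d = 1000                      # ambient dimension (illustrative choice)
beta = 0.5                    # spike-scaling exponent, beta in [0, 1)
theta = d ** beta             # spike magnitude: theta ≍ d^beta
mu = rng.standard_normal(d)
mu /= np.linalg.norm(mu)      # unit spike (signal) direction

def sample_x(n):
    # x ~ N(0, I_d + theta * mu mu^T): isotropic noise plus extra variance along mu
    z = rng.standard_normal((n, d))
    w = rng.standard_normal((n, 1))
    return z + np.sqrt(theta) * w * mu

def sigma_star(t):
    # example link: He_3(t) = t^3 - 3t, degree p = 3, information exponent k = 3
    return t ** 3 - 3 * t

def f_star(x):
    # f*(x) = sigma*( <x, mu> / sqrt(1 + theta) )
    return sigma_star(x @ mu / np.sqrt(1 + theta))

X = sample_x(4)
print(f_star(X))
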
We show that for kernel ridge regression, β ≥ 1 − 1/p is both sufficient and necessary, whereas for two-layer neural networks trained with gradient descent, β > 1 − 1/k suffices. Our results demonstrate that both kernel methods and neural
networks benefit from low-dimensional structures in the data. Further, since k ≤ p
by definition, neural networks can adapt to such structures more effectively.
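
To make the gap between k and p concrete, consider an illustrative link (not one studied in the paper) whose Hermite expansion starts below its degree:

σ∗(z) = He_2(z) + He_4(z) = (z² − 1) + (z⁴ − 6z² + 3),

which has degree p = 4 but information exponent k = 2. The kernel ridge regression threshold is then β ≥ 1 − 1/4 = 3/4, while gradient descent on a two-layer network only requires β > 1 − 1/2 = 1/2.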