This content will become publicly available on May 6, 2025
- Award ID(s):
- 1845360
- PAR ID:
- 10526177
- Publisher / Repository:
- International Conference on Learning Representations (ICLR)
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Sparse high-dimensional functions have arisen as a rich framework to study the behavior of gradient-descent methods using shallow neural networks, showcasing their ability to perform feature learning beyond linear models. Amongst those functions, the simplest are single-index models f(x) = φ(x · θ∗), where the labels are generated by an arbitrary non-linear scalar link function φ applied to an unknown one-dimensional projection θ∗ of the input data. By focusing on Gaussian data, several recent works have built a remarkable picture, where the so-called information exponent (related to the regularity of the link function) controls the required sample complexity. In essence, these tools exploit the stability and spherical symmetry of Gaussian distributions. In this work, building from the framework of [Ben Arous et al., 2021], we explore extensions of this picture beyond the Gaussian setting, where stability or symmetry might be violated. Focusing on the planted setting where φ is known, our main results establish that Stochastic Gradient Descent can efficiently recover the unknown direction θ∗ in the high-dimensional regime, under assumptions that extend previous works [Yehudai and Shamir, 2020, Wu, 2022].
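To make the planted single-index setting concrete, here is a minimal sketch (our own illustration, not the paper's algorithm) of mini-batch SGD on a correlation loss with Gaussian data; the cubic link, dimension, step size, batch size, and step count are all arbitrary choices:

```python
import numpy as np

# Planted single-index model: labels y = phi(<x, theta_star>) with Gaussian inputs.
rng = np.random.default_rng(0)
d = 30
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)

phi = lambda z: z ** 3          # example link; its derivative is 3 z^2
dphi = lambda z: 3 * z ** 2

# Spherical SGD on the correlation loss L(theta) = -E[phi(<x, theta_star>) phi(<x, theta>)]
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)
lr, batch, steps = 0.005, 512, 800

for _ in range(steps):
    X = rng.standard_normal((batch, d))     # fresh Gaussian inputs each step (online regime)
    y = phi(X @ theta_star)                 # planted labels
    z = X @ theta
    grad = -(y * dphi(z)) @ X / batch       # gradient of -mean(y * phi(z)) w.r.t. theta
    theta -= lr * grad
    theta /= np.linalg.norm(theta)          # project back onto the unit sphere

overlap = abs(theta @ theta_star)           # approaches 1 when the direction is recovered
```

With this particular link the Hermite expansion has a nonzero linear term, so the correlation drift is strong from a random start and the overlap climbs toward one well within the budget above.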
-
We focus on the task of learning a single-index model with respect to the isotropic Gaussian distribution in d dimensions. Prior work has shown that the sample complexity of learning the hidden direction is governed by the information exponent k of the link function. Ben Arous et al. showed that d^{k−1} samples suffice for learning and that this is tight for online SGD. However, the CSQ lower bound for gradient-based methods only shows that d^{k/2} samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns the hidden direction with the correct number of samples. We also draw connections to statistical analyses of tensor PCA and to the implicit regularization effects of minibatch SGD on empirical losses.
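The information exponent referenced here is the lowest degree carrying a nonzero coefficient in the Hermite expansion of the link function. A small sketch of computing it numerically, assuming the probabilists' Hermite basis and Gauss–Hermite quadrature (the function names, degree cap, and tolerance are our own choices):

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He   # probabilists' Hermite polynomials He_q

def hermite_coefficients(phi, max_deg=6, quad_deg=80):
    """c_q = E[phi(Z) He_q(Z)] / q! for Z ~ N(0, 1), via Gauss-HermiteE quadrature."""
    x, w = He.hermegauss(quad_deg)             # nodes/weights for weight exp(-x^2/2)
    w = w / np.sqrt(2 * np.pi)                 # renormalize to the standard Gaussian density
    coeffs = []
    for q in range(max_deg + 1):
        basis = np.zeros(q + 1)
        basis[q] = 1.0                         # coefficient vector selecting He_q
        coeffs.append(np.sum(w * phi(x) * He.hermeval(x, basis)) / math.factorial(q))
    return np.array(coeffs)

def information_exponent(phi, max_deg=6, tol=1e-8):
    """Smallest q >= 1 whose Hermite coefficient is nonzero (None if all vanish)."""
    c = hermite_coefficients(phi, max_deg)
    for q in range(1, max_deg + 1):
        if abs(c[q]) > tol:
            return q
    return None
```

For example, φ(z) = z³ has Hermite expansion He₃(z) + 3 He₁(z), so its information exponent is 1, while φ(z) = z² − 1 = He₂(z) has information exponent 2.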
-
Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S. (Eds.) We consider the problem of learning a single-index target function f∗ : R^d → R under spiked covariance data: f∗(x) = σ∗(⟨x, μ⟩ / √(1 + θ)), x ∼ N(0, I_d + θμμ⊤), θ ≍ d^β for β ∈ [0, 1), where the link function σ∗ : R → R is a degree-p polynomial with information exponent k (defined as the lowest degree in the Hermite expansion of σ∗), and it depends on the projection of the input x onto the spike (signal) direction μ ∈ R^d. In the proportional asymptotic limit where the number of training examples n and the dimensionality d jointly diverge, n, d → ∞ with n/d → ψ ∈ (0, ∞), we ask the following question: how large should the spike magnitude θ be in order for (i) kernel methods, or (ii) neural networks optimized by gradient descent, to learn f∗? We show that for kernel ridge regression, β ≥ 1 − 1/p is both sufficient and necessary, whereas for two-layer neural networks trained with gradient descent, β > 1 − 1/k suffices. Our results demonstrate that both kernel methods and neural networks benefit from low-dimensional structures in the data. Further, since k ≤ p by definition, neural networks can adapt to such structures more effectively.
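A quick sketch (our own, not the paper's code) of sampling from the spiked-covariance model and forming the target; the dimension, spike exponent β, and the quadratic link below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 200, 50_000
beta = 0.5
theta = d ** beta                           # spike magnitude theta ≍ d^beta
mu = rng.standard_normal(d)
mu /= np.linalg.norm(mu)                    # unit spike (signal) direction

# x = z + sqrt(theta) * xi * mu  gives  Cov(x) = I_d + theta * mu mu^T
z = rng.standard_normal((n, d))
xi = rng.standard_normal(n)
X = z + np.sqrt(theta) * xi[:, None] * mu

proj = X @ mu                               # distributed as N(0, 1 + theta) along the spike
sigma_star = lambda t: t ** 2 - 1           # example degree-2 link (information exponent k = 2)
y = sigma_star(proj / np.sqrt(1 + theta))   # target f*(x); the argument is standard normal
```

The 1/√(1 + θ) normalization keeps the link's argument unit-variance regardless of the spike magnitude, which is why the empirical variance of the raw projection should match 1 + θ.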
-
We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form h = g ∘ p, where p : R^d → R is a degree-k polynomial and g : R → R is a degree-q polynomial. This function class generalizes the single-index model, which corresponds to k = 1, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree-k polynomials p, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target up to vanishing test error in Õ(d^k) samples and polynomial time. This is a strict improvement over kernel methods, which require Θ̃(d^{kq}) samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of p being a quadratic. When p is indeed a quadratic, we achieve the information-theoretically optimal sample complexity Õ(d^2), which is an improvement over prior work (Nichani et al., 2023) requiring a sample size of Θ̃(d^4). Our proof proceeds by showing that during the initial stage of training, the network performs feature learning to recover the feature p with Õ(d^k) samples. This work demonstrates the ability of three-layer neural networks to learn complex features and, as a result, learn a broad class of hierarchical functions.
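A hypothetical instance of such a hierarchical target h = g ∘ p, with a quadratic inner feature and a cubic outer link; the matrix A, the degrees, and the 1/d scaling are our own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20
A = rng.standard_normal((d, d))
A = (A + A.T) / (2 * d)                 # symmetric matrix defining the quadratic; scaled so p(x) is O(1)

def p(x):
    """Inner degree-2 polynomial feature, R^d -> R (works on batches too)."""
    return np.einsum('...i,ij,...j->...', x, A, x)

def g(u):
    """Outer degree-3 polynomial link, R -> R."""
    return u ** 3 - 2 * u

def h(x):
    """Hierarchical target h = g(p(x)); with p linear this reduces to a single-index model."""
    return g(p(x))

X = rng.standard_normal((1000, d))      # standard Gaussian training inputs
y = h(X)
```

In the three-layer picture, the first two layers are responsible for learning the scalar feature p(x), after which fitting g is a one-dimensional problem.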
-
We study the classical ranking and selection problem, where the ultimate goal is to find the unknown best alternative in terms of the probability of correct selection or expected opportunity cost. However, this paper adopts an alternative sampling approach to achieve this goal, where sampling decisions are made with the objective of maximizing information about the unknown best alternative, or equivalently, minimizing its Shannon entropy. This adaptive learning is formulated via a Bayesian stochastic dynamic programming problem, by which several properties of the learning problem are presented, including the monotonicity of the optimal value function in an information-seeking setting. Since the state space of the stochastic dynamic program is unbounded in the Gaussian setting, a one-step look-ahead approach is used to develop a policy. The proposed policy seeks to maximize the one-step information gain about the unknown best alternative, and therefore, it is called information gradient (IG). It is also proved that the IG policy is consistent, that is, as the sampling budget grows to infinity, the IG policy finds the true best alternative almost surely. Later, a computationally efficient estimate of the proposed policy, called approximated information gradient (AIG), is introduced, and in the numerical experiments its performance is tested against recent benchmarks alongside several sensitivity analyses. Results show that AIG performs competitively against other algorithms from the literature.
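As a rough illustration of the one-step look-ahead idea, the sketch below implements a simplified information-gradient-style policy for independent Gaussian alternatives with known sampling variance; the priors, budget, Monte Carlo sizes, and true means are invented for the example, and the paper's exact IG/AIG derivations are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 4
true_means = np.array([0.0, 1.0, 4.0, 2.0])   # hypothetical alternatives; index 2 is best
obs_var = 1.0                                  # known sampling variance

post_mean = np.zeros(K)                        # independent conjugate Gaussian priors N(0, 4)
post_var = np.full(K, 4.0)

def best_entropy(mean, var, n_mc=1000):
    """Shannon entropy of P(alternative i is best), estimated by Monte Carlo."""
    draws = rng.normal(mean, np.sqrt(var), size=(n_mc, K))
    p = np.bincount(draws.argmax(axis=1), minlength=K) / n_mc
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def ig_choice(mean, var, n_obs=10):
    """One-step look-ahead: pick the alternative whose hypothetical next sample
    minimizes the expected posterior entropy of the best-alternative distribution."""
    expected = np.zeros(K)
    for i in range(K):
        for _ in range(n_obs):
            y = rng.normal(mean[i], np.sqrt(var[i] + obs_var))   # predictive draw
            v_new = 1.0 / (1.0 / var[i] + 1.0 / obs_var)         # conjugate Gaussian update
            m, v = mean.copy(), var.copy()
            m[i] = v_new * (mean[i] / var[i] + y / obs_var)
            v[i] = v_new
            expected[i] += best_entropy(m, v) / n_obs
    return int(expected.argmin())

for _ in range(60):                            # small sampling budget
    i = ig_choice(post_mean, post_var)
    y = rng.normal(true_means[i], np.sqrt(obs_var))
    v_new = 1.0 / (1.0 / post_var[i] + 1.0 / obs_var)
    post_mean[i] = v_new * (post_mean[i] / post_var[i] + y / obs_var)
    post_var[i] = v_new

final_entropy = best_entropy(post_mean, post_var, n_mc=4000)
```

Minimizing the expected one-step posterior entropy is equivalent to maximizing the expected one-step information gain about which alternative is best; the Monte Carlo estimates here stand in for the closed-form or approximated quantities the paper develops.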