Title: Statistical-Query Lower Bounds via Functional Gradients
We give the first statistical-query lower bounds for agnostically learning any non-polynomial activation with respect to Gaussian marginals (e.g., ReLU, sigmoid, sign). For the specific problem of ReLU regression (equivalently, agnostically learning a ReLU), we show that any statistical-query algorithm with tolerance n^{−(1/ϵ)^b} must use at least 2^{n^c}ϵ queries for some constants b, c > 0, where n is the dimension and ϵ is the accuracy parameter. Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems. Our techniques involve a gradient boosting procedure for "amplifying" recent lower bounds due to Diakonikolas et al. (COLT 2020) and Goel et al. (ICML 2020) on the SQ dimension of functions computed by two-layer neural networks. The crucial new ingredient is the use of a nonstandard convex functional during the boosting procedure. This also yields a best-possible reduction between two commonly studied models of learning: agnostic learning and probabilistic concepts.
Award ID(s):
1909204
NSF-PAR ID:
10297553
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Advances in neural information processing systems
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We give superpolynomial statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ. Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from label queries. 
  2. We prove several hardness results for training depth-2 neural networks with the ReLU activation function; these networks are simply weighted sums (that may include negative coefficients) of ReLUs. Our goal is to output a depth-2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NP-hard already for a network with a single ReLU. We also prove NP-hardness for outputting a weighted sum of k ReLUs minimizing the squared error (for k>1) even in the realizable setting (i.e., when the labels are consistent with an unknown depth-2 ReLU network). We are also able to obtain lower bounds on the running time in terms of the desired additive error ϵ. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (Gap-ETH) as well as a new hypothesis regarding the hardness of approximating the well-known Densest k-Subgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best-fitting ReLU must run in time exponential in (1/ϵ)^2. Together with previous work on improperly learning a ReLU (Goel et al., COLT'17), this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth-2 network of ReLUs with bounded weights, giving new (worst-case) upper bounds on the running time needed to learn such networks in both the realizable and agnostic settings. Our upper bounds on the running time essentially match our lower bounds in terms of the dependency on ϵ.
  3. We prove the first superpolynomial lower bounds for learning one-layer neural networks with respect to the Gaussian distribution using gradient descent. We show that any classifier trained using gradient descent with respect to square-loss will fail to achieve small test error in polynomial time given access to samples labeled by a one-layer neural network. For classification, we give a stronger result, namely that any statistical query (SQ) algorithm (including gradient descent) will fail to achieve small test error in polynomial time. Prior work held only for gradient descent run with small batch sizes, required sharp activations, and applied to specific classes of queries. Our lower bounds hold for broad classes of activations including ReLU and sigmoid. The core of our result relies on a novel construction of a simple family of neural networks that are exactly orthogonal with respect to all spherically symmetric distributions. 
  4. The approximate degree of a Boolean function f is the least degree of a real polynomial that approximates f pointwise to error at most 1/3. The approximate degree of f is known to be a lower bound on the quantum query complexity of f (Beals et al., FOCS 1998 and J. ACM 2001). We find tight or nearly tight bounds on the approximate degree and quantum query complexities of several basic functions. Specifically, we show the following. k-Distinctness: For any constant k, the approximate degree and quantum query complexity of the k-distinctness function is Ω(n^{3/4−1/(2k)}). This is nearly tight for large k, as Belovs (FOCS 2012) has shown that for any constant k, the approximate degree and quantum query complexity of k-distinctness is O(n^{3/4−1/(2^{k+2}−4)}). Image size testing: The approximate degree and quantum query complexity of testing the size of the image of a function [n]→[n] is Ω̃(n^{1/2}). This proves a conjecture of Ambainis et al. (SODA 2016), and it implies tight lower bounds on the approximate degree and quantum query complexity of the following natural problems. k-Junta testing: A tight Ω̃(k^{1/2}) lower bound for k-junta testing, answering the main open question of Ambainis et al. (SODA 2016). Statistical distance from uniform: A tight Ω̃(n^{1/2}) lower bound for approximating the statistical distance of a distribution from uniform, answering the main question left open by Bravyi et al. (STACS 2010 and IEEE Trans. Inf. Theory 2011). Shannon entropy: A tight Ω̃(n^{1/2}) lower bound for approximating Shannon entropy up to a certain additive constant, answering a question of Li and Wu (2017). Surjectivity: The approximate degree of the surjectivity function is Ω̃(n^{3/4}). The best prior lower bound was Ω(n^{2/3}). Our result matches an upper bound of Õ(n^{3/4}) due to Sherstov (STOC 2018), which we reprove using different techniques. The quantum query complexity of this function is known to be Θ(n) (Beame and Machmouchi, Quantum Inf. Comput. 2012 and Sherstov, FOCS 2015). 
Our upper bound for surjectivity introduces new techniques for approximating Boolean functions by low-degree polynomials. Our lower bounds are proved by significantly refining techniques recently introduced by Bun and Thaler (FOCS 2017). 
  5. Given a graph and an integer k, Densest k-Subgraph is the algorithmic task of finding the subgraph on k vertices with the maximum number of edges. This is a fundamental problem that has been subject to intense study for decades, with applications spanning a wide variety of fields. The state-of-the-art algorithm is an O(n^{1/4+ϵ})-factor approximation (for any ϵ>0) due to Bhaskara et al. [STOC '10]. Moreover, the so-called log-density framework predicts that this is optimal, i.e., it is impossible for an efficient algorithm to achieve an O(n^{1/4−ϵ})-factor approximation. In the average case, Densest k-Subgraph is a prototypical noisy inference task which is conjectured to exhibit a statistical-computational gap. In this work, we provide the strongest evidence yet of hardness for Densest k-Subgraph by showing matching lower bounds against the powerful Sum-of-Squares (SoS) algorithm, a meta-algorithm based on convex programming that achieves state-of-the-art algorithmic guarantees for many optimization and inference problems. For k ≤ n^{1/2}, we obtain a degree n^δ SoS lower bound for the hard regime as predicted by the log-density framework. To show this, we utilize the modern framework for proving SoS lower bounds on average-case problems pioneered by Barak et al. [FOCS '16]. A key issue is that small denser-than-average subgraphs in the input will greatly affect the value of the candidate pseudo-expectation operator around the subgraph. To handle this challenge, we devise a novel matrix factorization scheme based on the positive minimum vertex separator. We then prove an intersection tradeoff lemma to show that the error terms when using this separator are indeed small.