We give superpolynomial statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ.
Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition.
To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from label queries.
On the Power of Learning from $k$-Wise Queries
Several well-studied models of access to data samples, including statistical queries, local differential privacy, and low-communication algorithms, rely on queries that provide information about a function of a single sample. (For example, a statistical query (SQ) gives an estimate of $\mathbb{E}_{x\sim D}[q(x)]$ for any choice of the query function $q:X\rightarrow \mathbb{R}$, where $D$ is an unknown data distribution.) Yet some data analysis algorithms rely on properties of functions that depend on multiple samples. Such algorithms would be naturally implemented using $k$-wise queries, each of which is specified by a function $q:X^k\rightarrow \mathbb{R}$. Hence it is natural to ask whether algorithms using $k$-wise queries can solve learning problems more efficiently, and by how much.
Blum, Kalai, and Wasserman (2003) showed that for any weak PAC learning problem over a fixed distribution, the complexity of learning with $k$-wise SQs is smaller than the (unary) SQ complexity by a factor of at most $2^k$. We show that for more general problems over distributions the picture is substantially richer. For every $k$, the complexity of distribution-independent PAC learning with $k$-wise queries can be exponentially larger than learning with $(k+1)$-wise queries. We then give two approaches for simulating a $k$-wise query using unary queries. The first approach exploits the structure of the problem that needs to be solved. It generalizes and strengthens (exponentially) the results of Blum et al. (2003). It allows us to derive strong lower bounds for learning DNF formulas and stochastic constraint satisfaction problems that hold against algorithms using $k$-wise queries. The second approach exploits the $k$-party communication complexity of the $k$-wise query function.
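To make the distinction between unary and $k$-wise queries concrete, the following is a minimal Python sketch, not taken from the paper. It assumes the oracle is simulated from i.i.d. samples by an empirical mean plus a bounded perturbation of size at most `tau`; in the SQ model the oracle may answer with any value within the tolerance, so this is only one possible instantiation. The names `unary_sq`, `kwise_sq`, and `tau` are hypothetical.

```python
import random
from typing import Callable, Sequence


def unary_sq(samples: Sequence, q: Callable[..., float],
             tau: float, rng: random.Random) -> float:
    """Simulated answer to a unary query: an estimate of E_{x~D}[q(x)].

    Assumption: the oracle returns the empirical mean perturbed by at
    most tau. A true SQ oracle may return any value within tolerance.
    """
    mean = sum(q(x) for x in samples) / len(samples)
    return mean + rng.uniform(-tau, tau)


def kwise_sq(samples: Sequence, q: Callable[..., float], k: int,
             tau: float, rng: random.Random) -> float:
    """Simulated answer to a k-wise query: an estimate of
    E[q(x_1, ..., x_k)] with x_1, ..., x_k drawn i.i.d. from D.

    The samples are split into disjoint blocks of size k so that the
    k arguments of q are independent draws.
    """
    if len(samples) < k:
        raise ValueError("need at least k samples")
    blocks = [tuple(samples[i:i + k])
              for i in range(0, len(samples) - k + 1, k)]
    mean = sum(q(*block) for block in blocks) / len(blocks)
    return mean + rng.uniform(-tau, tau)


def demo() -> None:
    rng = random.Random(0)
    # D is uniform over {0, 1, 2, 3} (an illustrative choice).
    samples = [rng.randrange(4) for _ in range(20000)]
    # Unary query: probability mass of the point 0.
    p0 = unary_sq(samples, lambda x: 1.0 if x == 0 else 0.0, 0.01, rng)
    # 2-wise query: collision probability, a property of pairs of samples.
    coll = kwise_sq(samples, lambda x, y: 1.0 if x == y else 0.0, 2, 0.01, rng)
    print(f"Pr[x = 0] estimate: {p0:.3f}")
    print(f"collision probability estimate: {coll:.3f}")
```

Calling `demo()` prints both estimates. The collision probability is a simple example of a quantity that a single 2-wise query expresses directly, whereas a unary query only sees a function of one sample at a time.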
 NSF-PAR ID: 10026311
 Journal Name: Innovations in Theoretical Computer Science (ITCS)
 Format(s): Medium: X
 Sponsoring Org: National Science Foundation
More Like this


We give the first statistical-query lower bounds for agnostically learning any non-polynomial activation with respect to Gaussian marginals (e.g., ReLU, sigmoid, sign). For the specific problem of ReLU regression (equivalently, agnostically learning a ReLU), we show that any statistical-query algorithm with tolerance $n^{-(1/\epsilon)^b}$ must use at least $2^{n^c}\epsilon$ queries for some constants $b, c > 0$, where $n$ is the dimension and $\epsilon$ is the accuracy parameter. Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems. Our techniques involve a gradient boosting procedure for "amplifying" recent lower bounds due to Diakonikolas et al. (COLT 2020) and Goel et al. (ICML 2020) on the SQ dimension of functions computed by two-layer neural networks. The crucial new ingredient is the use of a non-standard convex functional during the boosting procedure. This also yields a best-possible reduction between two commonly studied models of learning: agnostic learning and probabilistic concepts.

We show a simple reduction which demonstrates the cryptographic hardness of learning a single periodic neuron over isotropic Gaussian distributions in the presence of noise. More precisely, our reduction shows that any polynomial-time algorithm (not necessarily gradient-based) for learning such functions under small noise implies a polynomial-time quantum algorithm for solving worst-case lattice problems, whose hardness forms the foundation of lattice-based cryptography. Our core hard family of functions, which are well-approximated by one-layer neural networks, takes the general form of a univariate periodic function applied to an affine projection of the data. These functions have appeared in previous seminal works which demonstrate their hardness against gradient-based algorithms (Shamir’18) and Statistical Query (SQ) algorithms (Song et al.’17). We show that if (polynomially) small noise is added to the labels, the intractability of learning these functions applies to all polynomial-time algorithms, beyond gradient-based and SQ algorithms, under the aforementioned cryptographic assumptions. Moreover, we demonstrate the necessity of noise in the hardness result by designing a polynomial-time algorithm for learning certain families of such functions under exponentially small adversarial noise. Our proposed algorithm is not a gradient-based or an SQ algorithm, but is rather based on the celebrated Lenstra-Lenstra-Lovász (LLL) lattice basis reduction algorithm. Furthermore, in the absence of noise, this algorithm can be directly applied to solve CLWE detection (Bruna et al.’21) and phase retrieval with an optimal sample complexity of $d + 1$ samples. In the former case, this improves upon the quadratic-in-$d$ sample complexity required in (Bruna et al.’21).

Kraus, Andreas (Ed.) In this paper we study the fundamental problems of maximizing a continuous non-monotone submodular function over the hypercube, both with and without coordinate-wise concavity. This family of optimization problems has several applications in machine learning, economics, and communication systems. Our main result is the first 1/2-approximation algorithm for continuous submodular function maximization; this approximation factor of 1/2 is the best possible for algorithms that only query the objective function at polynomially many points. For the special case of DR-submodular maximization, i.e. when the submodular function is also coordinate-wise concave along all coordinates, we provide a different 1/2-approximation algorithm that runs in quasilinear time. Both these results improve upon prior work (Bian et al., 2017a,b; Soma and Yoshida, 2017). Our first algorithm uses novel ideas such as reducing the guaranteed approximation problem to analyzing a zero-sum game for each coordinate, and incorporates the geometry of this zero-sum game to fix the value at this coordinate. Our second algorithm exploits coordinate-wise concavity to identify a monotone equilibrium condition sufficient for getting the required approximation guarantee, and hunts for the equilibrium point using binary search. We further run experiments to verify the performance of our proposed algorithms in related machine learning applications.

Single-Index Models are high-dimensional regression problems with planted structure, whereby labels depend on an unknown one-dimensional projection of the input via a generic, non-linear, and potentially non-deterministic transformation. As such, they encompass a broad class of statistical inference tasks, and provide a rich template to study statistical and computational trade-offs in the high-dimensional regime. While the information-theoretic sample complexity to recover the hidden direction is linear in the dimension $d$, we show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples, where $k^\star$ is a “generative” exponent associated with the model that we explicitly characterize. Moreover, we show that this sample complexity is also sufficient, by establishing matching upper bounds using a partial-trace algorithm. Therefore, our results provide evidence of a sharp computational-to-statistical gap (under both the SQ and LDP class) whenever $k^\star > 2$. To complete the study, we construct smooth and Lipschitz deterministic target functions with arbitrarily large generative exponents $k^\star$.