We prove several hardness results for training depth-2 neural networks with the ReLU activation function; these networks are simply weighted sums (that may include negative coefficients) of ReLUs. Our goal is to output a depth-2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NP-hard already for a network with a single ReLU. We also prove NP-hardness for outputting a weighted sum of k ReLUs minimizing the squared error (for k>1) even in the realizable setting (i.e., when the labels are consistent with an unknown depth-2 ReLU network). We are also able to obtain lower bounds on the running time in terms of the desired additive error ϵ. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (Gap-ETH) as well as a new hypothesis regarding the hardness of approximating the well-known Densest k-Subgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best-fitting ReLU must run in time exponential in (1/ϵ)^2. Together with previous work on improperly learning a ReLU (Goel et al., COLT'17), this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth-2 network of ReLUs with bounded weights, giving new (worst-case) upper bounds on the running time needed to learn such networks in both the realizable and agnostic settings. Our upper bounds on the running time essentially match our lower bounds in terms of the dependency on ϵ.
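For concreteness, here is a minimal numpy sketch of the training objective the hardness results concern: the square loss of a single-ReLU hypothesis on a given training set. The data and weight vector below are illustrative placeholders of our choosing, not anything taken from the paper.

```python
import numpy as np

def relu_square_loss(w, X, y):
    """Square loss of the single-ReLU hypothesis x -> max(0, <w, x>)
    on a training set (X, y); X has shape (m, d), y has shape (m,)."""
    preds = np.maximum(0.0, X @ w)
    return np.sum((preds - y) ** 2)

# Proper learning asks for a weight vector w (approximately) minimizing
# this loss; the NP-hardness result concerns exactly this minimization.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)
w = rng.standard_normal(5)
print(relu_square_loss(w, X, y))
```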
Low-Depth Arithmetic Circuit Lower Bounds: Bypassing Set-Multilinearization
A recent breakthrough work of Limaye, Srinivasan and Tavenas [Nutan Limaye et al., 2021] proved superpolynomial lower bounds for low-depth arithmetic circuits via a "hardness escalation" approach: they proved lower bounds for low-depth set-multilinear circuits and then lifted the bounds to low-depth general circuits. In this work, we prove superpolynomial lower bounds for low-depth circuits by bypassing the hardness escalation, i.e., the set-multilinearization, step. As set-multilinearization comes with an exponential blow-up in circuit size, our direct proof opens up the possibility of proving an exponential lower bound for low-depth homogeneous circuits by evading a crucial bottleneck. Our bounds hold for the iterated matrix multiplication and the Nisan-Wigderson design polynomials. We also define a subclass of unrestricted depth homogeneous formulas which we call unique parse tree (UPT) formulas, and prove superpolynomial lower bounds for these. This significantly generalizes the superpolynomial lower bounds for regular formulas [Neeraj Kayal et al., 2014; Hervé Fournier et al., 2015].
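For reference, the iterated matrix multiplication polynomial mentioned above has a standard definition (our notation, not quoted from the paper):

$$\mathrm{IMM}_{n,d} = \bigl(X^{(1)} X^{(2)} \cdots X^{(d)}\bigr)_{1,1},$$

where each $X^{(k)}$ is an $n \times n$ matrix whose entries are distinct variables. Since $\mathrm{IMM}_{n,d}$ is computable by small circuits, superpolynomial lower bounds against restricted circuit classes computing it are meaningful.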
- Award ID(s):
- 2152413
- NSF-PAR ID:
- 10484523
- Editor(s):
- Etessami, Kousha; Feige, Uriel; Puppis, Gabriele
- Publisher / Repository:
- Schloss Dagstuhl – Leibniz-Zentrum für Informatik
- Date Published:
- Journal Name:
- 50th International Colloquium on Automata, Languages, and Programming
- Subject(s) / Keyword(s):
- arithmetic circuits; low-depth circuits; lower bounds; shifted partials; Theory of computation → Algebraic complexity theory
- Format(s):
- Medium: X
- Location:
- Paderborn, Germany
- Sponsoring Org:
- National Science Foundation
More Like this
-
We give superpolynomial statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ. Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from label queries.
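For concreteness, a minimal sketch of the real-valued function class this lower bound concerns: a two-hidden-layer ReLU network evaluated on a Gaussian input. The architecture is standard; the dimensions and weights below are illustrative placeholders.

```python
import numpy as np

def two_hidden_layer_relu(x, W1, W2, a):
    """f(x) = <a, ReLU(W2 @ ReLU(W1 @ x))> -- a two-hidden-layer
    ReLU network, the hypothesis class of the SQ lower bound."""
    relu = lambda z: np.maximum(0.0, z)
    return a @ relu(W2 @ relu(W1 @ x))

rng = np.random.default_rng(0)
d, k1, k2 = 10, 8, 6  # input dimension and hidden widths (illustrative)
W1 = rng.standard_normal((k1, d))
W2 = rng.standard_normal((k2, k1))
a = rng.standard_normal(k2)
x = rng.standard_normal(d)  # the standard Gaussian input model
print(two_hidden_layer_relu(x, W1, W2, a))
```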
-
The classical Beardwood-Halton-Hammersley theorem (1959) asserts the existence of an asymptotic formula of the form $\beta \sqrt n$ for the minimum length of a Traveling Salesperson Tour through $n$ random points in the unit square, and in the decades since it was proved, the existence of such formulas has been shown for other such \emph{Euclidean functionals} on random points in the unit square as well. Despite more than 50 years of attention, however, it remained unknown whether the minimum length TSP through $n$ random points in $[0,1]^2$ was asymptotically distinct from its natural lower bounds, such as the minimum length spanning tree, the minimum length 2-factor, or, as raised by Goemans and Bertsimas, from its linear programming relaxation. We prove that the TSP on random points in Euclidean space is indeed asymptotically distinct from these and other natural lower bounds, and show that this separation implies that branch-and-bound algorithms based on these natural lower bounds must take nearly exponential ($e^{\tilde \Omega(n)}$) time to solve the TSP to optimality, even in the average case. This is the first average-case superpolynomial lower bound for these branch-and-bound algorithms (a lower bound as strong as $e^{\tilde \Omega (n)}$ had not previously been known even in worst-case analysis).
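For context, the Beardwood-Halton-Hammersley asymptotic can be stated in its standard form (notation ours, not quoted from the paper): if $X_1, X_2, \ldots$ are i.i.d. uniform points in $[0,1]^2$ and $\mathrm{TSP}(X_1,\ldots,X_n)$ denotes the length of the shortest tour through the first $n$ of them, then there is a constant $\beta > 0$ such that

$$\frac{\mathrm{TSP}(X_1,\ldots,X_n)}{\sqrt{n}} \longrightarrow \beta \quad \text{almost surely.}$$

In this notation, the separation result above amounts to saying that the analogous constants for the spanning tree, 2-factor, and LP lower bounds are strictly smaller than $\beta$.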
-
We study the properties of output distributions of noisy random circuits. We obtain upper and lower bounds on the expected distance of the output distribution from the “useless” uniform distribution. These bounds are tight with respect to the dependence on circuit depth. Our proof techniques also allow us to make statements about the presence or absence of anticoncentration for both noisy and noiseless circuits. We uncover a number of interesting consequences for hardness proofs of sampling schemes that aim to show a quantum computational advantage over classical computation. Specifically, we discuss recent barrier results for depth-agnostic and/or noise-agnostic proof techniques. We show that in certain depth regimes, noise-agnostic proof techniques might still work in order to prove an often-conjectured claim in the literature on quantum computational advantage, contrary to what has been thought prior to this work.
-
One powerful theme in complexity theory and pseudorandomness in the past few decades has been the use of lower bounds to give pseudorandom generators (PRGs). However, the general results using this hardness vs. randomness paradigm suffer from a quantitative loss in parameters, and hence do not give nontrivial implications for models where we don’t know super-polynomial lower bounds but do know lower bounds of a fixed polynomial. We show that when such lower bounds are proved using random restrictions, we can construct PRGs which are essentially best possible without in turn improving the lower bounds. More specifically, say that a circuit family has shrinkage exponent $\Gamma$ if a random restriction leaving a $p$ fraction of variables unset shrinks the size of any circuit in the family by a factor of $p^{\Gamma + o(1)}$. Our PRG uses a seed of length $s^{1/(\Gamma + 1) + o(1)}$ to fool circuits in the family of size $s$. By using this generic construction, we get PRGs with polynomially small error for the following classes of circuits of size $s$ and with the following seed lengths: (1) for de Morgan formulas, seed length $s^{1/3 + o(1)}$; (2) for formulas over an arbitrary basis, seed length $s^{1/2 + o(1)}$; (3) for read-once de Morgan formulas, seed length $s^{0.234...}$; (4) for branching programs of size $s$, seed length $s^{1/2 + o(1)}$. The previous best PRGs known for these classes used seeds of length bigger than $n/2$ to output $n$ bits, and worked only for size $s = O(n)$ [8].
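To see where the listed seed lengths come from, one can plug the known shrinkage exponents (standard values, stated here for orientation rather than taken from the text) into the generic $s^{1/(\Gamma + 1) + o(1)}$ bound:

$$\text{de Morgan formulas: } \Gamma = 2 \;\Rightarrow\; s^{1/3 + o(1)}, \qquad \text{arbitrary basis: } \Gamma = 1 \;\Rightarrow\; s^{1/2 + o(1)},$$

$$\text{read-once de Morgan formulas: } \Gamma \approx 3.27 \;\Rightarrow\; s^{0.234\ldots}, \qquad \text{branching programs: } \Gamma = 1 \;\Rightarrow\; s^{1/2 + o(1)}.$$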