Title: Tight Hardness Results for Training Depth-2 ReLU Networks
We prove several hardness results for training depth-2 neural networks with the ReLU activation function; these networks are simply weighted sums (that may include negative coefficients) of ReLUs. Our goal is to output a depth-2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NP-hard already for a network with a single ReLU. We also prove NP-hardness for outputting a weighted sum of k ReLUs minimizing the squared error (for k>1) even in the realizable setting (i.e., when the labels are consistent with an unknown depth-2 ReLU network). We are also able to obtain lower bounds on the running time in terms of the desired additive error ϵ. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (Gap-ETH) as well as a new hypothesis regarding the hardness of approximating the well-known Densest k-Subgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best fitting ReLU must run in time exponential in (1/ϵ)^2. Together with a previous work regarding improperly learning a ReLU (Goel et al., COLT'17), this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth-2 network of ReLUs with bounded weights, giving new (worst-case) upper bounds on the running time needed to learn such networks in both the realizable and agnostic settings. Our upper bounds on the running time essentially match our lower bounds in terms of the dependence on ϵ.
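To make the objective in the abstract concrete, here is a minimal sketch of the quantity being minimized: the square loss of a weighted sum of k ReLUs on a training set. The function names, the absence of bias terms, and the choice to average (rather than sum) the squared errors are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def relu(z: float) -> float:
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def depth2_relu(x: Vector, weights: Sequence[Vector], coeffs: Sequence[float]) -> float:
    """Weighted sum of k ReLUs: sum_j coeffs[j] * relu(<weights[j], x>).

    Coefficients may be negative. No bias terms are used (an assumption).
    """
    return sum(
        a * relu(sum(wi * xi for wi, xi in zip(w, x)))
        for a, w in zip(coeffs, weights)
    )


def square_loss(
    samples: Sequence[Tuple[Vector, float]],
    weights: Sequence[Vector],
    coeffs: Sequence[float],
) -> float:
    """Mean squared error of the network on (x, y) pairs (averaging is an assumption)."""
    return sum(
        (depth2_relu(x, weights, coeffs) - y) ** 2 for x, y in samples
    ) / len(samples)


# Toy training set that a network with two ReLUs fits exactly (realizable case).
samples = [((1.0, 0.0), 1.0), ((0.0, 1.0), -1.0)]
loss_two_relus = square_loss(samples, [(1.0, 0.0), (0.0, 1.0)], [1.0, -1.0])
loss_one_relu = square_loss(samples, [(1.0, 0.0)], [1.0])
```

In this toy instance the two-ReLU network has zero loss, while the single-ReLU network shown does not. The hardness results concern finding parameters that minimize this loss, not evaluating it, which is straightforward.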
Award ID(s):
1909204
NSF-PAR ID:
10478033
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
ITCS 2021
Date Published:
Journal Name:
ITCS 2021
Format(s):
Medium: X
Location:
Online
Sponsoring Org:
National Science Foundation
More Like this
  1. We study the identity testing problem in the context of spin systems or undirected graphical models, where it takes the following form: given the parameter specification of the model M and a sampling oracle for the distribution \mu_{M^*} of an unknown model M^*, can we efficiently determine if the two models M and M^* are the same? We consider identity testing for both soft-constraint and hard-constraint systems. In particular, we prove hardness results in two prototypical cases, the Ising model and proper colorings, and explore whether identity testing is any easier than structure learning. For the ferromagnetic (attractive) Ising model, Daskalakis et al. (2018) presented a polynomial time algorithm for identity testing. We prove hardness results in the antiferromagnetic (repulsive) setting in the same regime of parameters where structure learning is known to require a super-polynomial number of samples. In particular, for n-vertex graphs of maximum degree d, we prove that if |\beta| d = \omega(\log n) (where \beta is the inverse temperature parameter), then there is no identity testing algorithm for the antiferromagnetic Ising model that runs in polynomial time unless RP = NP. We also establish computational lower bounds for a broader set of parameters under the (randomized) exponential time hypothesis. In our proofs, we use random graphs as gadgets; this is inspired by similar constructions in seminal works on the hardness of approximate counting. In the hard-constraint setting, we present hardness results for identity testing for proper colorings. Our results are based on the presumed hardness of #BIS, the problem of (approximately) counting independent sets in bipartite graphs. In particular, we prove that identity testing for colorings is hard in the same range of parameters where structure learning is known to be hard, which in turn matches the parameter regime for NP-hardness of the corresponding decision problem.
  2. Etessami, Kousha ; Feige, Uriel ; Puppis, Gabriele (Ed.)
    We study the time complexity of the discrete k-center problem and related (exact) geometric set cover problems when k or the size of the cover is small. We obtain a plethora of new results:
    - We give the first subquadratic algorithm for rectilinear discrete 3-center in 2D, running in Õ(n^{3/2}) time.
    - We prove a lower bound of Ω(n^{4/3-δ}) for rectilinear discrete 3-center in 4D, for any constant δ > 0, under a standard hypothesis about triangle detection in sparse graphs.
    - Given n points and n weighted axis-aligned unit squares in 2D, we give the first subquadratic algorithm for finding a minimum-weight cover of the points by 3 unit squares, running in Õ(n^{8/5}) time. We also prove a lower bound of Ω(n^{3/2-δ}) for the same problem in 2D, under the well-known APSP Hypothesis. For arbitrary axis-aligned rectangles in 2D, our upper bound is Õ(n^{7/4}).
    - We prove a lower bound of Ω(n^{2-δ}) for Euclidean discrete 2-center in 13D, under the Hyperclique Hypothesis. This lower bound nearly matches the straightforward upper bound of Õ(n^ω), if the matrix multiplication exponent ω is equal to 2.
    - We similarly prove an Ω(n^{k-δ}) lower bound for Euclidean discrete k-center in O(k) dimensions for any constant k ≥ 3, under the Hyperclique Hypothesis. This lower bound again nearly matches known upper bounds if ω = 2.
    - We also prove an Ω(n^{2-δ}) lower bound for the problem of finding 2 boxes to cover the largest number of points, given n points and n boxes in 12D. This matches the straightforward near-quadratic upper bound.
  3. We give superpolynomial statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ. Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from label queries. 
  4. The Exponential-Time Hypothesis (\(\mathtt{ETH}\)) is a strengthening of the \(\mathcal{P} \ne \mathcal{NP}\) conjecture, stating that \(3\text{-}\mathtt{SAT}\) on \(n\) variables cannot be solved in (uniform) time \(2^{\epsilon \cdot n}\), for some \(\epsilon > 0\). In recent years, analogous hypotheses that are “exponentially-strong” forms of other classical complexity conjectures (such as \(\mathcal{NP} \nsubseteq \mathcal{BPP}\) or \(co\mathcal{NP} \nsubseteq \mathcal{NP}\)) have also been introduced, and have become widely influential. In this work, we focus on the interaction of exponential-time hypotheses with the fundamental and closely-related questions of derandomization and circuit lower bounds. We show that even relatively-mild variants of exponential-time hypotheses have far-reaching implications to derandomization, circuit lower bounds, and the connections between the two. Specifically, we prove that: (1) The Randomized Exponential-Time Hypothesis (\(\mathsf{rETH}\)) implies that \(\mathcal{BPP}\) can be simulated on “average-case” in deterministic (nearly-)polynomial-time (i.e., in time \(2^{\tilde{O}(\log(n))} = n^{\mathrm{loglog}(n)^{O(1)}}\)). The derandomization relies on a conditional construction of a pseudorandom generator with near-exponential stretch (i.e., with seed length \(\tilde{O}(\log(n))\)); this significantly improves the state-of-the-art in uniform “hardness-to-randomness” results, which previously only yielded pseudorandom generators with sub-exponential stretch from such hypotheses. (2) The Non-Deterministic Exponential-Time Hypothesis (\(\mathsf{NETH}\)) implies that derandomization of \(\mathcal{BPP}\) is completely equivalent to circuit lower bounds against \(\mathcal{E}\), and in particular that pseudorandom generators are necessary for derandomization.
    In fact, we show that the foregoing equivalence follows from a very weak version of \(\mathsf{NETH}\), and we also show that this very weak version is necessary to prove a slightly stronger conclusion that we deduce from it. Lastly, we show that disproving certain exponential-time hypotheses requires proving breakthrough circuit lower bounds. In particular, if \(\mathtt{CircuitSAT}\) for circuits over \(n\) bits of size \(\mathrm{poly}(n)\) can be solved by probabilistic algorithms in time \(2^{n/\mathrm{polylog}(n)}\), then \(\mathcal{BPE}\) does not have circuits of quasilinear size.
  5. We study the problem of inverting a deep generative model with ReLU activations. Inversion corresponds to finding a latent code vector that explains observed measurements as much as possible. In most prior works this is performed by attempting to solve a non-convex optimization problem involving the generator. In this paper we obtain several novel theoretical results for the inversion problem. We show that for the realizable case, single layer inversion can be performed exactly in polynomial time, by solving a linear program. Further, we show that for multiple layers, inversion is NP-hard and the pre-image set can be non-convex. For generative models of arbitrary depth, we show that exact recovery is possible in polynomial time with high probability, if the layers are expanding and the weights are randomly selected. Very recent work analyzed the same problem for gradient descent inversion. Their analysis requires significantly higher expansion (logarithmic in the latent dimension) while our proposed algorithm can provably reconstruct even with constant factor expansion. We also provide provable error bounds for different norms for reconstructing noisy observations. Our empirical validation demonstrates that we obtain better reconstructions when the latent dimension is large. 