Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

We consider the wellstudied problem of learning a linear combination of k ReLU activations with respect to a Gaussian distribution on inputs in d dimensions. We give the first polynomialtime algorithm that succeeds whenever k is a constant. All prior polynomialtime learners require additional assumptions on the network, such as positive combining coefficients or the matrix of hidden weight vectors being wellconditioned. Our approach is based on analyzing random contractions of higherorder moment tensors. We use a multiscale analysis to argue that sufficiently close neurons can be collapsed together, sidestepping the conditioning issues present in prior work. This allows us to design an iterative procedure to discover individual neurons.more » « lessFree, publiclyaccessible full text available July 12, 2024

We give superpolynomial statistical query (SQ) lower bounds for learning twohiddenlayer ReLU networks with respect to Gaussian inputs in the standard (noisefree) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ. Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any realvalued family of functions that satisfies a simple nondegeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many wellstudied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning twohiddenlayer ReLU networks, as well as new lower bounds for learning constantdepth ReLU networks from label queries.more » « less

We prove several hardness results for training depth2 neural networks with the ReLU activation function; these networks are simply weighted sums (that may include negative coefficients) of ReLUs. Our goal is to output a depth2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NPhard already for a network with a single ReLU. We also prove NPhardness for outputting a weighted sum of k ReLUs minimizing the squared error (for k>1) even in the realizable setting (i.e., when the labels are consistent with an unknown depth2 ReLU network). We are also able to obtain lower bounds on the running time in terms of the desired additive error ϵ. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (GapETH) as well as a new hypothesis regarding the hardness of approximating the well known Densest kSubgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best fitting ReLU must run in time exponential in (1/epsilon)^2. Together with a previous work regarding improperly learning a ReLU (Goel et al., COLT'17), this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth2 network of ReLUs with bounded weights giving new (worstcase) upper bounds on the running time needed to learn such networks both in the realizable and agnostic settings. Our upper bounds on the running time essentially matches our lower bounds in terms of the dependency on epsilon.more » « less

null (Ed.)Graphical models are powerful tools for modeling highdimensional data, but learning graphical models in the presence of latent variables is wellknown to be difficult. In this work we give new results for learning Restricted Boltzmann Machines, probably the most wellstudied class of latent variable models. Our results are based on new connections to learning twolayer neural networks under ℓ∞ bounded input; for both problems, we give nearly optimal results under the conjectured hardness of sparse parity with noise. Using the connection between RBMs and feedforward networks, we also initiate the theoretical study of supervised RBMs [Hinton, 2012], a version of neuralnetwork learning that couples distributional assumptions induced from the underlying graphical model with the architecture of the unknown function class. We then give an algorithm for learning a natural class of supervised RBMs with better runtime than what is possible for its related class of networks without distributional assumptions.more » « less

null (Ed.)We give the first statisticalquery lower bounds for agnostically learning any nonpolynomial activation with respect to Gaussian marginals (e.g., ReLU, sigmoid, sign). For the specific problem of ReLU regression (equivalently, agnostically learning a ReLU), we show that any statisticalquery algorithm with tolerance n−(1/ϵ)b must use at least 2ncϵ queries for some constant b,c>0, where n is the dimension and ϵ is the accuracy parameter. Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for realvalued learning problems. Our techniques involve a gradient boosting procedure for "amplifying" recent lower bounds due to Diakonikolas et al. (COLT 2020) and Goel et al. (ICML 2020) on the SQ dimension of functions computed by twolayer neural networks. The crucial new ingredient is the use of a nonstandard convex functional during the boosting procedure. This also yields a bestpossible reduction between two commonly studied models of learning: agnostic learning and probabilistic concepts.more » « less

null (Ed.)We give the first polynomialtime algorithm for robust regression in the listdecodable setting where an adversary can corrupt a greater than 1/2 fraction of examples. For any our algorithm takes as input a sample of n linear equations where (1 \alpha) n of the equations are {\em arbitrarily} chosen. It outputs a list that contains a linear function that is close to the truth. Our algorithm succeeds whenever the inliers are chosen from a \emph{certifiably} anticoncentrated distribution D. In particular, this gives an efficient algorithm to find an optimal size list when the inlier distribution is standard Gaussian. For discrete product distributions that are anticoncentrated only in \emph{regular} directions, we give an algorithm that achieves similar guarantee under the promise that the true linear function has all coordinates of the same magnitude. To complement our result, we prove that the anticoncentration assumption on the inliers is informationtheoretically necessary. Our algorithm is based on a new framework for listdecodable learning that strengthens the `identifiability to algorithms' paradigm based on the sumofsquares method.more » « less

null (Ed.)We give the first efficient algorithm for learning the structure of an Ising model that tolerates independent failures; that is, each entry of the observed sample is missing with some unknown probability 𝑝. Our algorithm matches the essentially optimal runtime and sample complexity bounds of recent work for learning Ising models due to Klivans and Meka (2017). We devise a novel unbiased estimator for the gradient of the Interaction Screening Objective (ISO) due to Vuffray et al. (2016) and apply a stochastic multiplicative gradient descent algorithm to minimize this objective. Solutions to this minimization recover the neighborhood information of the underlying Ising model on a node by node basis.more » « less