

Title: Eigenvalue Decay Implies Polynomial-Time Learnability of Neural Networks
We consider the problem of learning function classes computed by neural networks with various activations (e.g. ReLU or Sigmoid), a task believed to be computationally intractable in the worst case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs). We make no assumptions on the structure of the network or the labels. Given sufficiently strong polynomial eigenvalue decay, we obtain fully polynomial-time algorithms in all the relevant parameters with respect to square loss. Milder decay assumptions also lead to improved algorithms. This is the first purely distributional assumption that leads to polynomial-time algorithms for networks of ReLUs, even with one hidden layer. Further, unlike prior distributional assumptions (e.g., that the marginal distribution is Gaussian), eigenvalue decay has been observed in practice on common data sets.
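As a rough, self-contained illustration of the quantity the abstract conditions on (not the paper's algorithm), the sketch below forms a kernel Gram matrix from a sample and fits a polynomial decay rate to its spectrum; the kernel choice and the helper names (rbf_kernel_gram, eigenvalue_decay_exponent) are assumptions made for the example.

    # Minimal sketch (not the paper's algorithm): estimate how quickly the
    # eigenvalues of a kernel Gram matrix decay on a given sample.
    import numpy as np

    def rbf_kernel_gram(X, gamma=1.0):
        """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        return np.exp(-gamma * np.maximum(d2, 0.0))

    def eigenvalue_decay_exponent(K):
        """Fit lambda_i ~ C * i^(-alpha) by least squares in log-log space."""
        lam = np.linalg.eigvalsh(K)[::-1]      # eigenvalues, largest first
        lam = lam[lam > 1e-12]                 # drop numerically zero modes
        idx = np.arange(1, len(lam) + 1)
        slope, _ = np.polyfit(np.log(idx), np.log(lam), 1)
        return -slope                          # larger value = faster decay

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.standard_normal((200, 10))     # placeholder data set
        K = rbf_kernel_gram(X) / len(X)
        print("fitted polynomial decay exponent:", eigenvalue_decay_exponent(K))

A larger fitted exponent corresponds to the stronger polynomial eigenvalue decay under which the abstract's fully polynomial-time guarantees apply.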
Award ID(s):
1717896
NSF-PAR ID:
10080188
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Neural Information Processing - Letters and Reviews
ISSN:
1738-2564
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. There are many important high-dimensional function classes that have fast agnostic learning algorithms when strong assumptions on the distribution of examples can be made, such as Gaussianity or uniformity over the domain. But how can one be sufficiently confident that the data indeed satisfies the distributional assumption, so that one can trust the output quality of the agnostic learning algorithm? We propose a model by which to systematically study the design of tester-learner pairs (A, T), such that if the distribution on examples in the data passes the tester T then one can safely trust the output of the agnostic learner A on the data. To demonstrate the power of the model, we apply it to the classical problem of agnostically learning halfspaces under the standard Gaussian distribution and present a tester-learner pair with a combined run-time of n^{Õ(1/ε^4)}. This qualitatively matches that of the best known ordinary agnostic learning algorithms for this task. In contrast, finite-sample Gaussian distribution testers do not exist for the L_1 and EMD distance measures. Previously it was known that halfspaces are well-approximated by low-degree polynomials relative to the Gaussian distribution. A key step in our analysis is showing that this is the case even relative to distributions whose low-degree moments approximately match those of a Gaussian. We also go beyond spherically-symmetric distributions, and give a tester-learner pair for halfspaces under the uniform distribution on {0,1}^n with combined run-time of n^{Õ(1/ε^4)}. This is achieved using polynomial approximation theory and the critical-index machinery of [Diakonikolas, Gopalan, Jaiswal, Servedio, and Viola 2009]. Can one design agnostic learning algorithms under distributional assumptions and count on future technical work to produce, as a matter of course, tester-learner pairs with similar run-time? Our answer is a resounding no, as we show there exist some well-studied settings for which 2^{Õ(√n)}-time agnostic learning algorithms are available, yet the combined run-times of tester-learner pairs must be as high as 2^{Ω(n)}. On that account, the design of tester-learner pairs is a research direction in its own right, independent of standard agnostic learning. To be specific, our lower bounds apply to the problems of agnostically learning convex sets under the Gaussian distribution and monotone Boolean functions under the uniform distribution over {0,1}^n.
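    As a toy companion to the tester-learner idea (not the paper's construction), the sketch below implements only the tester side: it accepts a sample when its low-degree empirical moments are close to those of a standard Gaussian, the property the abstract identifies as the key step of the analysis. The degree, tolerance, and function names are illustrative assumptions.

        # Toy tester sketch: accept if all empirical moments up to `degree`
        # are within `tol` of the standard Gaussian moments (illustrative only).
        import numpy as np
        from itertools import combinations_with_replacement

        def gaussian_moment(multi_index):
            """E[prod_i x_i^{a_i}] for x ~ N(0, I): product of 1-d moments 1, 0, 1, 0, 3."""
            one_d = {0: 1.0, 1: 0.0, 2: 1.0, 3: 0.0, 4: 3.0}
            counts = {}
            for i in multi_index:
                counts[i] = counts.get(i, 0) + 1
            return float(np.prod([one_d[c] for c in counts.values()]))

        def low_degree_moment_tester(X, degree=4, tol=0.2):
            """True iff every empirical moment of degree <= `degree` is tol-close to Gaussian."""
            _, d = X.shape
            for k in range(1, degree + 1):
                for idx in combinations_with_replacement(range(d), k):
                    emp = np.mean(np.prod(X[:, list(idx)], axis=1))
                    if abs(emp - gaussian_moment(idx)) > tol:
                        return False
            return True

        if __name__ == "__main__":
            rng = np.random.default_rng(1)
            print(low_degree_moment_tester(rng.standard_normal((50000, 3))))      # expect True
            print(low_degree_moment_tester(rng.uniform(-1, 1, size=(50000, 3))))  # expect False

    A learner run after such a tester accepts can only be trusted to the extent the paper's analysis covers moment-matching distributions; the code above sketches the interface, not the guarantees.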
  2. We give a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU). We make no assumptions on the structure of the network, and the algorithm succeeds with respect to any distribution on the unit ball in n dimensions (hidden weight vectors also have unit norm). This is the first assumption-free, provably efficient algorithm for learning neural networks with two nonlinear layers. Our algorithm, Alphatron, is a simple, iterative update rule that combines isotonic regression with kernel methods. It outputs a hypothesis that yields efficient oracle access to interpretable features. It also suggests a new approach to Boolean learning problems via real-valued conditional-mean functions, sidestepping traditional hardness results from computational learning theory. Along these lines, we subsume and improve many longstanding results for PAC learning Boolean functions to the more general, real-valued setting of probabilistic concepts, a model that (unlike PAC learning) requires non-i.i.d. noise-tolerance.
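    A minimal sketch in the spirit of the update rule described above (a kernelized, GLM-like iterative scheme) follows; the kernel, step size, iteration count, and the toy target are assumptions for illustration, not the paper's exact specification of Alphatron.

        # Sketch of a kernelized iterative update: nudge dual weights toward the
        # residual of the current predictor passed through an outer activation u.
        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def alphatron_sketch(K, y, u=sigmoid, step=1.0, iters=200):
            """K: m x m Gram matrix on the training points, y: targets in [0, 1]."""
            m = len(y)
            alpha = np.zeros(m)
            for _ in range(iters):
                pred = u(K @ alpha)               # current hypothesis on the sample
                alpha += (step / m) * (y - pred)  # move each weight toward its residual
            return alpha

        def predict(alpha, K_new_train, u=sigmoid):
            """K_new_train[i, j] = K(x_new_i, x_train_j)."""
            return u(K_new_train @ alpha)

        if __name__ == "__main__":
            rng = np.random.default_rng(2)
            X = rng.standard_normal((300, 5))
            X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit ball, as in the abstract
            w = rng.standard_normal(5)
            y = sigmoid(3.0 * X @ (w / np.linalg.norm(w)))  # toy single-sigmoid target
            K = (1.0 + X @ X.T) ** 3                        # a simple polynomial kernel
            a = alphatron_sketch(K, y)
            print("train MSE:", np.mean((predict(a, K) - y) ** 2))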
  3. Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite recent efforts to relax these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against the offline data. With proper regularization, the algorithm enjoys polynomial sample complexity under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
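    The following is a rough sketch, under simplifying tabular assumptions, of what a regularized primal-dual objective of this kind can look like: the dual table w plays the role of a density ratio against the data distribution, Q is the primal variable, and plain gradient descent-ascent with a quadratic regularizer stands in for the paper's actual estimator and analysis.

        # Rough sketch (not the paper's estimator): regularized primal-dual
        # objective for offline policy evaluation in a tabular MDP.
        import numpy as np

        def primal_dual_ope_sketch(data, pi, mu0, n_states, n_actions,
                                   gamma=0.9, lr=0.05, iters=2000, reg=1.0):
            """data = (s, a, r, s2) arrays of offline transitions; pi[s, a] = target policy."""
            s, a, r, s2 = data
            n = len(s)
            Q = np.zeros((n_states, n_actions))
            w = np.ones((n_states, n_actions))   # density-ratio table, kept nonnegative
            for _ in range(iters):
                bellman = r + gamma * (pi[s2] * Q[s2]).sum(axis=1) - Q[s, a]
                # ascent on w: follow the Bellman residual, regularize w toward 1
                grad_w = np.zeros_like(w)
                np.add.at(grad_w, (s, a), (bellman - reg * (w[s, a] - 1.0)) / n)
                w = np.maximum(w + lr * grad_w, 0.0)
                # descent on Q: initial-state term plus the w-weighted residual terms
                grad_Q = (1.0 - gamma) * mu0[:, None] * pi
                np.add.at(grad_Q, (s, a), -w[s, a] / n)
                np.add.at(grad_Q, s2, gamma * (w[s, a][:, None] * pi[s2]) / n)
                Q -= lr * grad_Q
            # if w approximates d^pi / d^data, E_data[w * r] / (1 - gamma) estimates J(pi)
            return float(np.mean(w[s, a] * r) / (1.0 - gamma))

    In use one would roll out a behavior policy to collect the (s, a, r, s') arrays, pass the target policy's action probabilities as pi and the initial-state distribution as mu0, and read off the value estimate; none of the constants above are tuned, and single-policy concentrability corresponds to such a ratio being bounded on the data distribution.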
  4. The Acceptance Probability Estimation Problem (APEP) is to additively approximate the acceptance probability of a Boolean circuit. This problem admits a probabilistic approximation scheme. A central question is whether we can design a pseudodeterministic approximation algorithm for this problem: a probabilistic polynomial-time algorithm that outputs a canonical approximation with high probability. Recently, it was shown that such an algorithm would imply that every approximation algorithm can be made pseudodeterministic (Dixon, Pavan, Vinodchandran; ITCS 2021). The main conceptual contribution of this work is to establish that the existence of a pseudodeterministic algorithm for APEP is fundamentally connected to the relationship between probabilistic promise classes and the corresponding standard complexity classes. In particular, we show the following equivalence: every promise problem in PromiseBPP has a solution in BPP if and only if APEP has a pseudodeterministic algorithm. Based on this intuition, we show that pseudodeterministic algorithms for APEP can shed light on a few central topics in complexity theory such as circuit lower bounds, probabilistic hierarchy theorems, and multi-pseudodeterminism.
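    The probabilistic approximation scheme mentioned at the start has a textbook form: sample uniform inputs and average, with a Hoeffding bound fixing the sample count. The sketch below (helper names are illustrative) does exactly that; its output changes with the coins it tosses, which is precisely why it is not pseudodeterministic.

        # Monte Carlo additive approximation of a circuit's acceptance probability.
        # The estimate depends on the random samples, so it is not pseudodeterministic.
        import math
        import random

        def estimate_acceptance_probability(circuit, n_inputs, eps=0.05, delta=0.01, rng=None):
            """Additive eps-approximation with failure probability delta (Hoeffding bound)."""
            rng = rng or random.Random()
            samples = math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))
            accepted = sum(
                circuit([rng.random() < 0.5 for _ in range(n_inputs)])
                for _ in range(samples)
            )
            return accepted / samples

        if __name__ == "__main__":
            # toy "circuit": accepts iff a majority of its 9 input bits are 1
            majority = lambda bits: sum(bits) > len(bits) // 2
            print(estimate_acceptance_probability(majority, 9))   # true value is 0.5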
  5. The absence of (complex) zeros property is at the heart of the interpolation method developed by Barvinok for designing deterministic approximation algorithms for various graph counting and related problems. An earlier method used for the same problems is based on the correlation decay property. Remarkably, the classes of graphs for which the two methods apply often coincide or nearly coincide. In this article we show that this is not a coincidence. We establish that if the interpolation method is valid for a family of graphs, then this family exhibits a form of the correlation decay property, namely asymptotic strong spatial mixing at superlogarithmic distances. Our proof is based on a certain graph polynomial representation of the associated partition function. This representation is at the heart of the design of the polynomial-time algorithms underlying the interpolation method itself. We conjecture that our result holds for all graphs, not just amenable ones; indeed, this conjecture was recently confirmed by Regts. See the body of the article for details.
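    As a toy, self-contained demonstration of the principle behind the interpolation method (not the article's argument), the sketch below brute-forces the independence polynomial of a small graph and checks that a truncated Taylor expansion of log Z at 0 already approximates log Z well at a small activity inside the zero-free region; the graph, the activity, and the helper names are assumptions made for the example.

        # Interpolation-method principle on a toy instance: zero-freeness makes a
        # low-order Taylor truncation of log Z accurate.  Brute force throughout.
        import math
        from itertools import combinations

        def independence_polynomial(n, edges):
            """Coefficients a[k] = number of independent sets of size k."""
            adj = {(u, v) for u, v in edges} | {(v, u) for u, v in edges}
            coeffs = [0] * (n + 1)
            for k in range(n + 1):
                for subset in combinations(range(n), k):
                    if all(p not in adj for p in combinations(subset, 2)):
                        coeffs[k] += 1
            return coeffs

        def log_taylor_coefficients(a, order):
            """Taylor coefficients b of log p(z) from those of p(z), assuming a[0] = 1."""
            get = lambda j: a[j] if j < len(a) else 0.0
            b = [0.0] * (order + 1)
            for m in range(1, order + 1):
                b[m] = (m * get(m) - sum(k * b[k] * get(m - k) for k in range(1, m))) / m
            return b

        if __name__ == "__main__":
            # 5-cycle, small activity lam (inside the zero-free region for this graph)
            n, edges, lam = 5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)], 0.1
            a = independence_polynomial(n, edges)
            exact = math.log(sum(c * lam ** k for k, c in enumerate(a)))
            for order in (1, 2, 3, 4):
                b = log_taylor_coefficients(a, order)
                approx = sum(b[k] * lam ** k for k in range(1, order + 1))
                print(f"order {order}: truncated={approx:.6f}  exact={exact:.6f}")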

     