We give a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU). We make no assumptions on the structure of the network, and the algorithm succeeds with respect to {\em any} distribution on the unit ball in n dimensions (hidden weight vectors also have unit norm). This is the first assumption-free, provably efficient algorithm for learning neural networks with two nonlinear layers. Our algorithm, Alphatron, is a simple, iterative update rule that combines isotonic regression with kernel methods. It outputs a hypothesis that yields efficient oracle access to interpretable features. It also suggests a new approach to Boolean learning problems via real-valued conditional-mean functions, sidestepping traditional hardness results from computational learning theory. Along these lines, we subsume and improve many longstanding results for PAC learning Boolean functions to the more general, real-valued setting of {\em probabilistic concepts}, a model that (unlike PAC learning) requires non-i.i.d. noise tolerance.
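For intuition, here is a minimal sketch of the kind of kernelized, residual-driven update the abstract describes, assuming an RBF kernel, a sigmoid outer activation, and illustrative hyperparameters; it is not the paper's exact Alphatron procedure or analysis.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def alphatron_sketch(X, y, u, lam=1.0, T=200, gamma=1.0):
    """Kernel hypothesis h(x) = u(sum_j alpha_j K(x_j, x)), updated by residuals."""
    m = len(y)
    K = rbf_kernel(X, X, gamma)                  # Gram matrix on training points
    alpha = np.zeros(m)
    for _ in range(T):
        preds = u(K @ alpha)                     # current predictions on the sample
        alpha = alpha + (lam / m) * (y - preds)  # move alpha toward the residuals
    # A paper-style analysis would select the best iterate on held-out data;
    # for brevity this sketch just returns the final one.
    return lambda X_new: u(rbf_kernel(X_new, X, gamma) @ alpha)

# Toy usage (all data and parameter choices here are purely illustrative):
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = sigmoid(X @ np.ones(5) / np.sqrt(5))         # a toy conditional-mean target
h = alphatron_sketch(X, y, u=sigmoid)
```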
Eigenvalue Decay Implies Polynomial-Time Learnability of Neural Networks
We consider the problem of learning function classes computed by neural networks with various activations (e.g., ReLU or sigmoid), a task believed to be computationally intractable in the worst case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to {\em eigenvalue decay} of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g., feed-forward networks of ReLUs). We make no assumptions on the structure of the network or the labels. Given sufficiently strong polynomial eigenvalue decay, we obtain {\em fully} polynomial-time algorithms in {\em all} the relevant parameters with respect to square loss. Milder decay assumptions also lead to improved algorithms. This is the first purely distributional assumption that leads to polynomial-time algorithms for networks of ReLUs, even with one hidden layer. Further, unlike prior distributional assumptions (e.g., the marginal distribution is Gaussian), eigenvalue decay has been observed in practice on common data sets.
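Since the key assumption is empirical, a quick way to see whether a data set exhibits polynomial eigenvalue decay is to examine the spectrum of its kernel Gram matrix. The sketch below is a hedged illustration, assuming an RBF kernel and a simple log-log fit of $\lambda_i \approx C\, i^{-p}$; the kernel choice and the fitting procedure are assumptions, not the paper's method.

```python
import numpy as np

def rbf_gram_eigenvalues(X, gamma=1.0):
    """Eigenvalues of the RBF Gram matrix on X, sorted in decreasing order."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq)
    eigs = np.linalg.eigvalsh(K)[::-1]           # eigvalsh returns ascending order
    return np.clip(eigs, 1e-12, None)            # guard against tiny negative values

def polynomial_decay_exponent(eigs):
    """Fit log(eig_i) ~ log(C) - p * log(i) and return the decay exponent p."""
    i = np.arange(1, len(eigs) + 1)
    slope, _ = np.polyfit(np.log(i), np.log(eigs), deg=1)
    return -slope

# Illustrative check on synthetic data (replace X with a real data matrix):
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
print("estimated decay exponent p:", polynomial_decay_exponent(rbf_gram_eigenvalues(X)))
```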
- Award ID(s):
- 1717896
- Publication Date:
- NSF-PAR ID:
- 10080188
- Journal Name:
- Neural Information Processing - Letters and Reviews
- ISSN:
- 1738-2564
- Sponsoring Org:
- National Science Foundation
More Like this
-
Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite recent efforts to relax these assumptions, existing works relax only one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer this question in the affirmative. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
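For orientation, a hedged rendering of the primal-dual view the abstract refers to (notation is illustrative; the paper's exact objective and regularizer may differ): the dual linear program over discounted occupancy measures is

$$\max_{d \ge 0}\; \sum_{s,a} d(s,a)\, r(s,a) \quad \text{s.t.} \quad \sum_{a} d(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s,$$

and modeling the dual variable through a density ratio $w(s,a) = d(s,a)/d^D(s,a)$ against the offline data distribution $d^D$ lets every expectation under $d$ be estimated by reweighting offline samples; a suitable (e.g., strongly convex) regularizer on $w$ is the kind of "proper regularization" the abstract mentions.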
-
The {\sc Acceptance Probability Estimation Problem} (APEP) is to additively approximate the acceptance probability of a Boolean circuit. This problem admits a probabilistic approximation scheme. A central question is whether we can design a {\em pseudodeterministic} approximation algorithm for this problem: a probabilistic polynomial-time algorithm that outputs a canonical approximation with high probability. Recently, it was shown that such an algorithm would imply that {\em every approximation algorithm can be made pseudodeterministic} (Dixon, Pavan, Vinodchandran; ITCS 2021). The main conceptual contribution of this work is to establish that the existence of a pseudodeterministic algorithm for APEP is fundamentally connected to the relationship between probabilistic promise classes and the corresponding standard complexity classes. In particular, we show the following equivalence: {\em every promise problem in PromiseBPP has a solution in BPP if and only if APEP has a pseudodeterministic algorithm}. Based on this intuition, we show that pseudodeterministic algorithms for APEP can shed light on a few central topics in complexity theory, such as circuit lower bounds, probabilistic hierarchy theorems, and multi-pseudodeterminism.
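The probabilistic approximation scheme mentioned in the abstract is the standard Chernoff-bound sampling estimator; the sketch below (names and constants are illustrative) also makes clear why it is not pseudodeterministic: its output varies with the random samples rather than being canonical.

```python
import math
import random

def estimate_acceptance_probability(circuit, n_inputs, eps=0.05, delta=0.01, seed=None):
    """Additively approximate Pr_x[circuit(x) = 1] over uniform n-bit inputs.

    By a Chernoff bound, O(log(1/delta) / eps^2) samples give an additive
    eps-approximation with probability at least 1 - delta.  The output depends
    on the sampled inputs, so different runs may return different (eps-close)
    values -- the non-canonical behavior a pseudodeterministic algorithm avoids.
    """
    rng = random.Random(seed)
    m = math.ceil(3 * math.log(2 / delta) / eps ** 2)
    hits = sum(circuit(tuple(rng.randint(0, 1) for _ in range(n_inputs)))
               for _ in range(m))
    return hits / m

# Toy "circuit": majority of 9 input bits (purely illustrative).
majority = lambda bits: sum(bits) > len(bits) // 2
print(estimate_acceptance_probability(majority, n_inputs=9))
```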
-
In the robust submodular partitioning problem, we aim to allocate a set of items into m blocks so that the value of the minimum block, as measured by a submodular function, is maximized. Robust submodular partitioning promotes the diversity of every block in the partition. It has many applications in machine learning, e.g., partitioning data for distributed training so that the gradients computed on every block are consistent. We study an extension of the robust submodular partitioning problem with additional constraints (e.g., cardinality, multiple matroids, and/or knapsack) on every block. For example, when partitioning data for distributed training, we can add a constraint that the number of samples of each class is the same in each partition block, ensuring data balance. We present two classes of algorithms, namely Min-Block Greedy-based algorithms (with an Ω(1/m) bound) and Round-Robin Greedy-based algorithms (with a constant bound), and show that under various constraints they still have good approximation guarantees. Interestingly, while normally the latter runs in only weakly polynomial time, we show that using the two together yields strongly polynomial running time while preserving the approximation guarantee. Lastly, we apply the algorithms to a real-world machine learning data partitioning problem, with good results.
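As a rough illustration of the Round-Robin Greedy idea (a sketch under assumptions: a single per-block cardinality constraint and a coverage-style toy objective; no claim to reproduce the paper's exact algorithms or bounds):

```python
def round_robin_greedy(items, m, f, block_capacity=None):
    """Round-Robin Greedy sketch for robust submodular partitioning.

    Blocks take turns; on its turn, a block adds the unassigned item with the
    largest marginal gain f(block | {item}) - f(block), subject to an optional
    per-block cardinality constraint.  The aim is a partition whose minimum
    block value min_j f(block_j) is large.
    """
    blocks = [set() for _ in range(m)]
    remaining = set(items)
    j = 0
    while remaining:
        if block_capacity is not None and all(len(b) >= block_capacity for b in blocks):
            break                                  # every block is full
        if block_capacity is None or len(blocks[j]) < block_capacity:
            best = max(remaining, key=lambda e: f(blocks[j] | {e}) - f(blocks[j]))
            blocks[j].add(best)
            remaining.remove(best)
        j = (j + 1) % m                            # pass the turn to the next block
    return blocks

# Toy usage with a coverage-style submodular function (all details illustrative):
universe = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}, 3: {"a", "d"}, 4: {"d", "e"}, 5: {"e"}}
f = lambda S: len(set().union(*(universe[i] for i in S))) if S else 0
print(round_robin_greedy(list(universe), m=2, f=f, block_capacity=3))
```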
-
Error-correcting codes that admit {\em local} decoding and correcting algorithms have been the focus of much recent research due to their numerous theoretical and practical applications. An important goal is to obtain the best possible tradeoffs between the number of queries the algorithm makes to its oracle (the {\em locality} of the task) and the amount of redundancy in the encoding (the {\em information rate}). In Hamming's classical adversarial channel model, the current tradeoffs are dramatic, allowing either small locality but superpolynomial blocklength, or small blocklength but high locality. However, in the computationally bounded, adversarial channel model, proposed by Lipton (STACS 1994), constructions of locally decodable codes suddenly exhibit small locality and small blocklength, but these constructions require strong trusted setup assumptions; e.g., Ostrovsky, Pandey and Sahai (ICALP 2007) construct private locally decodable codes in the setting where the sender and receiver already share a symmetric key. We study variants of locally decodable and locally correctable codes in computationally bounded, adversarial channels, in a setting with no public-key or private-key cryptographic setup. The only setup assumption we require is the selection of the {\em public} parameters (seed) for a collision-resistant hash function. Specifically, we provide constructions of {\em relaxed locally} ...
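For context on the locality/blocklength tradeoff in the Hamming model, the classical 2-query Hadamard local decoder below illustrates the "small locality but superpolynomial (here exponential) blocklength" end of the spectrum; it is a textbook example, not a construction from this work.

```python
import random

def hadamard_encode(x):
    """Hadamard codeword of x in F_2^n: the bit <x, a> mod 2 for every a in F_2^n."""
    n = len(x)
    return [sum(xi & ((a >> i) & 1) for i, xi in enumerate(x)) % 2
            for a in range(2 ** n)]

def local_decode_bit(codeword, n, i, rng=random):
    """2-query local decoder for bit x_i: query positions a and a XOR e_i.

    If at most a delta < 1/4 fraction of positions are corrupted, each call
    returns x_i with probability at least 1 - 2*delta.
    """
    a = rng.randrange(2 ** n)
    return (codeword[a] + codeword[a ^ (1 << i)]) % 2

# Toy usage: blocklength 2^n = 8 for a 3-bit message, with one corrupted position.
x = [1, 0, 1]
c = hadamard_encode(x)
c[5] ^= 1                      # adversarially flip one codeword bit (1/8 < 1/4)
print([local_decode_bit(c, len(x), i) for i in range(len(x))])
```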