We give a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU). We make no assumptions on the structure of the network, and the algorithm succeeds with respect to {\em any} distribution on the unit ball in n dimensions (hidden weight vectors also have unit norm). This is the first assumption-free, provably efficient algorithm for learning neural networks with two nonlinear layers. Our algorithm, Alphatron, is a simple, iterative update rule that combines isotonic regression with kernel methods. It outputs a hypothesis that yields efficient oracle access to interpretable features. It also suggests a new approach to Boolean learning problems via real-valued conditional-mean functions, sidestepping traditional hardness results from computational learning theory. Along these lines, we subsume and improve many longstanding results for PAC learning Boolean functions to the more general, real-valued setting of {\em probabilistic concepts}, a model that (unlike PAC learning) requires non-i.i.d. noise tolerance.
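The iterative update described above combines a kernel expansion with an isotonic-regression-style correction. A minimal sketch of such an update rule, assuming an RBF kernel and a sigmoid transfer function purely for illustration (the actual kernel and parameters in the paper differ):

```python
import numpy as np

def alphatron(X, y, T=50, lam=0.1):
    """Sketch of an Alphatron-style update: maintain one coefficient
    alpha_i per training point and nudge each toward its residual.
    Kernel choice (RBF) and step size here are illustrative assumptions."""
    # RBF Gram matrix over the sample
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq)
    u = lambda z: 1.0 / (1.0 + np.exp(-z))  # monotone, Lipschitz transfer
    alpha = np.zeros(len(y))
    for _ in range(T):
        # additive residual update; monotonicity of u drives the loss down
        alpha += (lam / len(y)) * (y - u(K @ alpha))
    return alpha, K
```

The hypothesis is h(x) = u(Σ_i α_i K(x_i, x)), so each learned coefficient directly weights one kernel feature — the "oracle access to interpretable features" mentioned above.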
Eigenvalue Decay Implies Polynomial-Time Learnability of Neural Networks
We consider the problem of learning function classes computed by neural networks with various activations (e.g. ReLU or sigmoid), a task believed to be computationally intractable in the worst case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to {\em eigenvalue decay} of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feedforward networks of ReLUs). We make no assumptions on the structure of the network or the labels. Given sufficiently strong polynomial eigenvalue decay, we obtain {\em fully}-polynomial time algorithms in {\em all} the relevant parameters with respect to square loss. Milder decay assumptions also lead to improved algorithms. This is the first purely distributional assumption that leads to polynomial-time algorithms for networks of ReLUs, even with one hidden layer. Further, unlike prior distributional assumptions (e.g., the marginal distribution is Gaussian), eigenvalue decay has been observed in practice on common data sets.
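The distributional quantity at issue — the eigenvalue spectrum of the kernel Gram matrix on a sample — is easy to inspect empirically. A small sketch, assuming an RBF kernel for illustration (the relevant kernel in the paper is tied to the network class):

```python
import numpy as np

def gram_eigenvalues(X, gamma=1.0):
    """Eigenvalues, in descending order, of the RBF Gram matrix on a
    sample X. The kernel and gamma here are illustrative assumptions."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)           # symmetric PSD Gram matrix
    return np.linalg.eigvalsh(K)[::-1]  # eigvalsh returns ascending order

def fits_polynomial_decay(eigs, C, p):
    """Check the decay condition eig_i <= C * i^(-p) for all i (1-indexed)."""
    i = np.arange(1, len(eigs) + 1)
    return bool(np.all(eigs <= C * i ** (-float(p))))
```

Plotting `gram_eigenvalues` on log-log axes against the index is the usual way to eyeball whether a data set exhibits the polynomial decay the result requires.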
 Award ID(s):
 1717896
 Publication Date:
 NSFPAR ID:
 10080188
 Journal Name:
 Neural Information Processing – Letters and Reviews
 ISSN:
 1738-2564
 Sponsoring Org:
 National Science Foundation
More Like this


Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
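The primal-dual formulation mentioned above can be sketched via the standard linear program over discounted occupancy measures (notation here is ours, not necessarily the paper's):

```latex
\begin{aligned}
\max_{d \ge 0} \quad & \sum_{s,a} d(s,a)\, r(s,a) \\
\text{s.t.} \quad & \sum_{a} d(s,a)
  = (1-\gamma)\,\mu_0(s) + \gamma \sum_{s',a'} P(s \mid s', a')\, d(s',a')
  \qquad \forall s .
\end{aligned}
```

The dual variable $d$ is the discounted state-action occupancy; modeling it as a density ratio means parameterizing $w(s,a) = d(s,a)/d^D(s,a)$ against the offline data distribution $d^D$, so that expectations under $d$ can be estimated by reweighting offline samples.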

The {\sc Acceptance Probability Estimation Problem} (APEP) is to additively approximate the acceptance probability of a Boolean circuit. This problem admits a probabilistic approximation scheme. A central question is whether we can design a {\em pseudodeterministic} approximation algorithm for this problem: a probabilistic polynomial-time algorithm that outputs a canonical approximation with high probability. Recently, it was shown that such an algorithm would imply that {\em every approximation algorithm can be made pseudodeterministic} (Dixon, Pavan, Vinodchandran; ITCS 2021). The main conceptual contribution of this work is to establish that the existence of a pseudodeterministic algorithm for APEP is fundamentally connected to the relationship between probabilistic promise classes and the corresponding standard complexity classes. In particular, we show the following equivalence: {\em every promise problem in PromiseBPP has a solution in BPP if and only if APEP has a pseudodeterministic algorithm}. Based on this intuition, we show that pseudodeterministic algorithms for APEP can shed light on a few central topics in complexity theory such as circuit lower bounds, probabilistic hierarchy theorems, and multi-pseudodeterminism.
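The probabilistic approximation scheme the abstract refers to is the standard sampling estimator: draw random inputs and return the empirical acceptance frequency, with the sample count set by a Chernoff bound. A minimal sketch (and precisely {\em not} pseudodeterministic — different random seeds yield different outputs within the ±ε band):

```python
import math
import random

def estimate_acceptance_probability(circuit, n_inputs, eps=0.05, delta=0.01, seed=0):
    """Additively approximate Pr_x[circuit(x) = 1] to within eps, with
    failure probability delta, by uniform sampling. The Chernoff bound
    gives the sample count m = ceil(ln(2/delta) / (2 eps^2))."""
    rng = random.Random(seed)
    m = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
    hits = 0
    for _ in range(m):
        x = [rng.randrange(2) for _ in range(n_inputs)]  # uniform input
        hits += 1 if circuit(x) else 0
    return hits / m
```

A pseudodeterministic algorithm would instead have to output one {\em canonical} value with high probability over its randomness — the property whose existence the work above ties to PromiseBPP versus BPP.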

In the robust submodular partitioning problem, we aim to allocate a set of items into m blocks, so that the evaluation of the minimum block according to a submodular function is maximized. Robust submodular partitioning promotes the diversity of every block in the partition. It has many applications in machine learning, e.g., partitioning data for distributed training so that the gradients computed on every block are consistent. We study an extension of the robust submodular partitioning problem with additional constraints (e.g., cardinality, multiple matroids, and/or knapsack) on every block. For example, when partitioning data for distributed training, we can add a constraint that the number of samples of each class is the same in each partition block, ensuring data balance. We present two classes of algorithms, i.e., Min-Block Greedy based algorithms (with an Ω(1/m) bound), and Round-Robin Greedy based algorithms (with a constant bound) and show that under various constraints, they still have good approximation guarantees. Interestingly, while normally the latter runs in only weakly polynomial time, we show that using the two together yields strongly polynomial running time while preserving the approximation guarantee. Lastly, we apply the algorithms on a real-world machine learning data partitioning problem showing good results.
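The round-robin idea above is simple to sketch: the m blocks take turns, and on its turn each block grabs the remaining item with the largest marginal gain for that block. A minimal sketch with a coverage function as the submodular objective (the constraint handling and guarantees in the paper are of course richer):

```python
def coverage(S):
    """Submodular coverage function: number of elements covered by a
    block S, where each item is a frozenset of covered elements."""
    return len(set().union(*S)) if S else 0

def round_robin_greedy(items, m, f):
    """Round-Robin Greedy sketch for robust submodular partitioning:
    maximize the minimum of f over the m blocks by letting blocks take
    turns picking their best remaining item."""
    blocks = [set() for _ in range(m)]
    remaining = list(items)
    t = 0
    while remaining:
        S = blocks[t % m]
        # marginal gain of v with respect to the current block S
        best = max(remaining, key=lambda v: f(S | {v}) - f(S))
        S.add(best)
        remaining.remove(best)
        t += 1
    return blocks
```

Because every block gets a pick each round, no block is starved — the intuition behind the constant-factor bound on the minimum block, versus the Ω(1/m) bound for the min-block variant.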

Error-correcting codes that admit {\em local} decoding and correcting algorithms have been the focus of much recent research due to their numerous theoretical and practical applications. An important goal is to obtain the best possible trade-offs between the number of queries the algorithm makes to its oracle (the {\em locality} of the task), and the amount of redundancy in the encoding (the {\em information rate}). In Hamming's classical adversarial channel model, the current trade-offs are dramatic, allowing either small locality, but superpolynomial blocklength, or small blocklength, but high locality. However, in the computationally bounded, adversarial channel model, proposed by Lipton (STACS 1994), constructions of locally decodable codes suddenly exhibit small locality and small blocklength, but these constructions require strong trusted setup assumptions, e.g., Ostrovsky, Pandey and Sahai (ICALP 2007) construct private locally decodable codes in the setting where the sender and receiver already share a symmetric key. We study variants of locally decodable and locally correctable codes in computationally bounded, adversarial channels, in a setting with no public-key or private-key cryptographic setup. The only setup assumption we require is the selection of the {\em public} parameters (seed) for a collision-resistant hash function. Specifically, we provide constructions of {\em relaxed locally} …
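The locality/rate trade-off described above is starkest in the classical Hadamard code, a textbook local decoder (not the construction of this work): it achieves locality 2 at the cost of blocklength 2^k. A small sketch over GF(2):

```python
import random

def hadamard_encode(x):
    """Encode x in {0,1}^k as all 2^k inner products <x, a> mod 2.
    Locality 2, but exponential blocklength: the extreme end of the
    trade-off mentioned in the abstract."""
    k = len(x)
    return [sum(x[i] * ((a >> i) & 1) for i in range(k)) % 2
            for a in range(2 ** k)]

def local_decode_bit(word, k, i, rng):
    """2-query local decoder for message bit x_i: pick a random position
    a and read word[a] and word[a XOR e_i]; their sum is x_i whenever
    neither queried position is corrupted."""
    a = rng.randrange(2 ** k)
    return (word[a] + word[a ^ (1 << i)]) % 2
```

Since each query pair is individually uniform, an adversary corrupting a δ fraction of positions fools the decoder with probability at most 2δ — the guarantee that the constructions above improve upon in the computationally bounded channel model.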