This content will become publicly available on November 6, 2024

Title: Agnostic Proper Learning of Monotone Functions: Beyond the Black-box Correction Barrier
We give the first agnostic, efficient, proper learning algorithm for monotone Boolean functions. Given $2^{\tilde{O}(\sqrt{n}/\varepsilon)}$ uniformly random examples of an unknown function $f:\{\pm 1\}^n \to \{\pm 1\}$, our algorithm outputs a hypothesis $g:\{\pm 1\}^n \to \{\pm 1\}$ that is monotone and $(\mathrm{opt}+\varepsilon)$-close to $f$, where opt is the distance from $f$ to the closest monotone function. The running time of the algorithm (and consequently the size and evaluation time of the hypothesis) is also $2^{\tilde{O}(\sqrt{n}/\varepsilon)}$, nearly matching the lower bound of [13]. We also give an algorithm for estimating, up to additive error $\varepsilon$, the distance of an unknown function $f$ to monotone, using a run-time of $2^{\tilde{O}(\sqrt{n}/\varepsilon)}$. Previously, for both of these problems, sample-efficient algorithms were known, but these algorithms were not run-time efficient. Our work thus closes this gap in our knowledge between the run-time and sample complexity. This work builds upon the improper learning algorithm of [17] and the proper semiagnostic learning algorithm of [40], which obtains a non-monotone Boolean-valued hypothesis, then “corrects” it to monotone using query-efficient local computation algorithms on graphs. This black-box correction approach can achieve no error better than $2\,\mathrm{opt}+\varepsilon$ information-theoretically; we bypass this barrier by (a) augmenting the improper learner with a convex optimization step, and (b) learning and correcting a real-valued function before rounding its values to Boolean. Our real-valued correction algorithm solves the “poset sorting” problem of [40] for functions over general posets with non-Boolean labels.
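To make the quantity opt from the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the paper's algorithm: it enumerates all $2^{2^n}$ Boolean functions, so it is only feasible for very small $n$. The function names (`points`, `is_monotone`, `distance`, `closest_monotone`) and the example function are hypothetical and chosen only for illustration.

```python
from itertools import product

def points(n):
    """All points of the hypercube {-1, +1}^n."""
    return list(product((-1, 1), repeat=n))

def is_monotone(g, pts):
    """g is monotone if x <= y coordinate-wise implies g(x) <= g(y)."""
    return all(
        g[x] <= g[y]
        for x in pts
        for y in pts
        if all(a <= b for a, b in zip(x, y))
    )

def distance(f, g, pts):
    """Fraction of inputs on which f and g disagree (uniform distribution)."""
    return sum(f[x] != g[x] for x in pts) / len(pts)

def closest_monotone(f, n):
    """Exhaustive search over all Boolean functions on n bits (tiny n only)."""
    pts = points(n)
    best_g, best_d = None, float("inf")
    for values in product((-1, 1), repeat=len(pts)):
        g = dict(zip(pts, values))
        if is_monotone(g, pts):
            d = distance(f, g, pts)
            if d < best_d:
                best_g, best_d = g, d
    return best_g, best_d

# Illustrative input (an assumption, not from the paper): f(x) = -x_1 on n = 2 bits.
n = 2
f = {x: -x[0] for x in points(n)}
g, opt = closest_monotone(f, n)
print(opt)
```

Here `opt` is the distance from `f` to the closest monotone function, and `g` is one monotone function attaining it. A proper agnostic learner in the sense of the abstract must output a monotone hypothesis whose distance to `f` is at most opt plus $\varepsilon$, using only random examples rather than the full truth table used in this sketch.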
Award ID(s):
2310818
NSF-PAR ID:
10494236
Author(s) / Creator(s):
;
Publisher / Repository:
64th IEEE Symposium on Foundations of Computer Science, FOCS 2023
Date Published:
Journal Name:
64th IEEE Symposium on Foundations of Computer Science, FOCS 2023
Page Range / eLocation ID:
1149-1170
Format(s):
Medium: X
Location:
Santa Cruz, California
Sponsoring Org:
National Science Foundation
More Like this
  1. There are many important high-dimensional function classes that have fast agnostic learning algorithms when strong assumptions on the distribution of examples can be made, such as Gaussianity or uniformity over the domain. But how can one be sufficiently confident that the data indeed satisfies the distributional assumption, so that one can trust in the output quality of the agnostic learning algorithm? We propose a model by which to systematically study the design of tester-learner pairs (A,T), such that if the distribution on examples in the data passes the tester T then one can safely trust the output of the agnostic learner A on the data. To demonstrate the power of the model, we apply it to the classical problem of agnostically learning halfspaces under the standard Gaussian distribution and present a tester-learner pair with a combined run-time of $n^{\tilde{O}(1/\epsilon^4)}$. This qualitatively matches that of the best known ordinary agnostic learning algorithms for this task. In contrast, finite sample Gaussian distribution testers do not exist for the L1 and EMD distance measures. Previously it was known that halfspaces are well-approximated with low-degree polynomials relative to the Gaussian distribution. A key step in our analysis is showing that this is the case even relative to distributions whose low-degree moments approximately match those of a Gaussian. We also go beyond spherically-symmetric distributions, and give a tester-learner pair for halfspaces under the uniform distribution on $\{0,1\}^n$ with combined run-time of $n^{\tilde{O}(1/\epsilon^4)}$. This is achieved using polynomial approximation theory and critical index machinery of [Diakonikolas, Gopalan, Jaiswal, Servedio, and Viola 2009]. Can one design agnostic learning algorithms under distributional assumptions and count on future technical work to produce, as a matter of course, tester-learner pairs with similar run-time? Our answer is a resounding no, as we show there exist some well-studied settings for which $2^{\tilde{O}(\sqrt{n})}$ run-time agnostic learning algorithms are available, yet the combined run-times of tester-learner pairs must be as high as $2^{\Omega(n)}$. On that account, the design of tester-learner pairs is a research direction in its own right independent of standard agnostic learning. To be specific, our lower bounds apply to the problems of agnostically learning convex sets under the Gaussian distribution and for monotone Boolean functions under the uniform distribution over $\{0,1\}^n$.
  2. We give a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU). We make no assumptions on the structure of the network, and the algorithm succeeds with respect to any distribution on the unit ball in n dimensions (hidden weight vectors also have unit norm). This is the first assumption-free, provably efficient algorithm for learning neural networks with two nonlinear layers. Our algorithm, Alphatron, is a simple, iterative update rule that combines isotonic regression with kernel methods. It outputs a hypothesis that yields efficient oracle access to interpretable features. It also suggests a new approach to Boolean learning problems via real-valued conditional-mean functions, sidestepping traditional hardness results from computational learning theory. Along these lines, we subsume and improve many longstanding results for PAC learning Boolean functions to the more general, real-valued setting of probabilistic concepts, a model that (unlike PAC learning) requires non-i.i.d. noise-tolerance.
  3. We design a nonadaptive algorithm that, given a Boolean function $f: \{0,1\}^n \to \{0,1\}$ which is α-far from monotone, makes poly(n, 1/α) queries and returns an estimate that, with high probability, is an $\tilde{O}(\sqrt{n})$-approximation to the distance of f to monotonicity. Furthermore, we show that for any constant k > 0, approximating the distance to monotonicity up to an $n^{1/2-k}$-factor requires $2^{n^k}$ nonadaptive queries, thereby ruling out a poly(n, 1/α)-query nonadaptive algorithm for such approximations. This answers a question of Seshadhri (Property Testing Review, 2014) for the case of nonadaptive algorithms. Approximating the distance to a property is closely related to tolerantly testing that property. Our lower bound stands in contrast to standard (non-tolerant) testing of monotonicity that can be done nonadaptively with $\tilde{O}(n/\varepsilon^2)$ queries. We obtain our lower bound by proving an analogous bound for erasure-resilient testers. An α-erasure-resilient tester for a desired property gets oracle access to a function that has at most an α fraction of values erased. The tester has to accept (with probability at least 2/3) if the erasures can be filled in to ensure that the resulting function has the property and to reject (with probability at least 2/3) if every completion of erasures results in a function that is ε-far from having the property. Our method yields the same lower bounds for unateness and being a k-junta. These lower bounds improve exponentially on the existing lower bounds for these properties.
  4. We give the first reconstruction algorithm for decision trees: given queries to a function f that is opt-close to a size-s decision tree, our algorithm provides query access to a decision tree T where: (i) T has size $S := s^{O((\log s)^2/\varepsilon^3)}$; (ii) dist(f,T) ≤ O(opt)+ε; (iii) every query to T is answered with poly((log s)/ε)⋅log n queries to f and in poly((log s)/ε)⋅n log n time. This yields a tolerant tester that distinguishes functions that are close to size-s decision trees from those that are far from size-S decision trees. The polylogarithmic dependence on s in the efficiency of our tester is exponentially smaller than that of existing testers. Since decision tree complexity is well known to be related to numerous other Boolean function properties, our results also provide a new algorithm for reconstructing and testing these properties.
  5. We prove several hardness results for training depth-2 neural networks with the ReLU activation function; these networks are simply weighted sums (that may include negative coefficients) of ReLUs. Our goal is to output a depth-2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NP-hard already for a network with a single ReLU. We also prove NP-hardness for outputting a weighted sum of k ReLUs minimizing the squared error (for k>1) even in the realizable setting (i.e., when the labels are consistent with an unknown depth-2 ReLU network). We are also able to obtain lower bounds on the running time in terms of the desired additive error ϵ. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (Gap-ETH) as well as a new hypothesis regarding the hardness of approximating the well-known Densest k-Subgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best fitting ReLU must run in time exponential in (1/ϵ)^2. Together with a previous work regarding improperly learning a ReLU (Goel et al., COLT'17), this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth-2 network of ReLUs with bounded weights, giving new (worst-case) upper bounds on the running time needed to learn such networks both in the realizable and agnostic settings. Our upper bounds on the running time essentially match our lower bounds in terms of the dependency on ϵ.