skip to main content


This content will become publicly available on June 20, 2024

Title: Testing Distributional Assumptions of Learning Algorithms
There are many important high dimensional function classes that have fast agnostic learning algorithms when strong assumptions on the distribution of examples can be made, such as Gaussianity or uniformity over the domain. But how can one be sufficiently confident that the data indeed satisfies the distributional assumption, so that one can trust in the output quality of the agnostic learning algorithm? We propose a model by which to systematically study the design of tester-learner pairs (A,T), such that if the distribution on examples in the data passes the tester T then one can safely trust the output of the agnostic learner A on the data. To demonstrate the power of the model, we apply it to the classical problem of agnostically learning halfspaces under the standard Gaussian distribution and present a tester-learner pair with a combined run-time of nÕ(1/є4). This qualitatively matches that of the best known ordinary agnostic learning algorithms for this task. In contrast, finite sample Gaussian distribution testers do not exist for the L1 and EMD distance measures. Previously it was known that half-spaces are well-approximated with low-degree polynomials relative to the Gaussian distribution. A key step in our analysis is showing that this is the case even relative to distributions whose low-degree moments approximately match those of a Gaussian. We also go beyond spherically-symmetric distributions, and give a tester-learner pair for halfspaces under the uniform distribution on {0,1}n with combined run-time of nÕ(1/є4). This is achieved using polynomial approximation theory and critical index machinery of [Diakonikolas, Gopalan, Jaiswal, Servedio, and Viola 2009]. Can one design agnostic learning algorithms under distributional assumptions and count on future technical work to produce, as a matter of course, tester-learner pairs with similar run-time? Our answer is a resounding no, as we show there exist some well-studied settings for which 2Õ(√n) run-time agnostic learning algorithms are available, yet the combined run-times of tester-learner pairs must be as high as 2Ω(n). On that account, the design of tester-learner pairs is a research direction in its own right independent of standard agnostic learning. To be specific, our lower bounds apply to the problems of agnostically learning convex sets under the Gaussian distribution and for monotone Boolean functions under the uniform distribution over {0,1}n.  more » « less
Award ID(s):
2310818
NSF-PAR ID:
10494077
Author(s) / Creator(s):
;
Publisher / Repository:
55th Annual ACM Symposium on Theory of Computing, STOC 2023
Date Published:
Journal Name:
55th Annual ACM Symposium on Theory of Computing, STOC 2023
Page Range / eLocation ID:
1643-1656
Format(s):
Medium: X
Location:
Orlando, Florida
Sponsoring Org:
National Science Foundation
More Like this
  1. We show how any PAC learning algorithm that works under the uniform distribution can be transformed, in a blackbox fashion, into one that works under an arbitrary and unknown distribution ‍D. The efficiency of our transformation scales with the inherent complexity of ‍D, running in (n, (md)d) time for distributions over n whose pmfs are computed by depth-d decision trees, where m is the sample complexity of the original algorithm. For monotone distributions our transformation uses only samples from ‍D, and for general ones it uses subcube conditioning samples. A key technical ingredient is an algorithm which, given the aforementioned access to D, produces an optimal decision tree decomposition of D: an approximation of D as a mixture of uniform distributions over disjoint subcubes. With this decomposition in hand, we run the uniform-distribution learner on each subcube and combine the hypotheses using the decision tree. This algorithmic decomposition lemma also yields new algorithms for learning decision tree distributions with runtimes that exponentially improve on the prior state of the art—results of independent interest in distribution learning. 
    more » « less
  2. We give the first tester-learner for halfspaces that succeeds universally over a wide class of structured distributions. Our universal tester-learner runs in fully polynomial time and has the following guarantee: the learner achieves error O(opt)+ϵ on any labeled distribution that the tester accepts, and moreover, the tester accepts whenever the marginal is any distribution that satisfies a Poincare inequality. In contrast to prior work on testable learning, our tester is not tailored to any single target distribution but rather succeeds for an entire target class of distributions. The class of Poincare distributions includes all strongly log-concave distributions, and, assuming the Kannan--Lovasz--Simonovits (KLS) conjecture, includes all log-concave distributions. In the special case where the label noise is known to be Massart, our tester-learner achieves error opt+ϵ while accepting all log-concave distributions unconditionally (without assuming KLS).Our tests rely on checking hypercontractivity of the unknown distribution using a sum-of-squares (SOS) program, and crucially make use of the fact that Poincare distributions are certifiably hypercontractive in the SOS framework. 
    more » « less
  3. Abstract We show that a very simple pseudorandom generator fools intersections of k linear threshold functions (LTFs) and arbitrary functions of k LTFs over n-dimensional Gaussian space. The two analyses of our PRG (for intersections versus arbitrary functions of LTFs) are quite different from each other and from previous analyses of PRGs for functions of halfspaces. Our analysis for arbitrary functions of LTFs establishes bounds on the Wasserstein distance between Gaussian random vectors with similar covariance matrices, and combines these bounds with a conversion from Wasserstein distance to "union-of-orthants" distance from [Xi Chen et al., 2014]. Our analysis for intersections of LTFs uses extensions of the classical Sudakov-Fernique type inequalities, which give bounds on the difference between the expectations of the maxima of two Gaussian random vectors with similar covariance matrices. For all values of k, our generator has seed length O(log n) + poly(k) for arbitrary functions of k LTFs and O(log n) + poly(log k) for intersections of k LTFs. The best previous result, due to [Gopalan et al., 2010], only gave such PRGs for arbitrary functions of k LTFs when k=O(log log n) and for intersections of k LTFs when k=O((log n)/(log log n)). Thus our PRG achieves an O(log n) seed length for values of k that are exponentially larger than previous work could achieve. By combining our PRG over Gaussian space with an invariance principle for arbitrary functions of LTFs and with a regularity lemma, we obtain a deterministic algorithm that approximately counts satisfying assignments of arbitrary functions of k general LTFs over {0,1}^n in time poly(n) * 2^{poly(k,1/epsilon)} for all values of k. This algorithm has a poly(n) runtime for k =(log n)^c for some absolute constant c>0, while the previous best poly(n)-time algorithms could only handle k = O(log log n). For intersections of LTFs, by combining these tools with a recent PRG due to [R. O'Donnell et al., 2018], we obtain a deterministic algorithm that can approximately count satisfying assignments of intersections of k general LTFs over {0,1}^n in time poly(n) * 2^{poly(log k, 1/epsilon)}. This algorithm has a poly(n) runtime for k =2^{(log n)^c} for some absolute constant c>0, while the previous best poly(n)-time algorithms for intersections of k LTFs, due to [Gopalan et al., 2010], could only handle k=O((log n)/(log log n)). 
    more » « less
  4. We consider the problem of learning function classes computed by neural networks with various activations (e.g. ReLU or Sigmoid), a task believed to be computationally intractable in the worst-case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to {\em eigenvalue decay} of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs). We make no assumptions on the structure of the network or the labels. Given sufficiently-strong polynomial eigenvalue decay, we obtain {\em fully}-polynomial time algorithms in {\em all} the relevant parameters with respect to square-loss. Milder decay assumptions also lead to improved algorithms. This is the first purely distributional assumption that leads to polynomial-time algorithms for networks of ReLUs, even with one hidden layer. Further, unlike prior distributional assumptions (e.g., the marginal distribution is Gaussian), eigenvalue decay has been observed in practice on common data sets. 
    more » « less
  5. null (Ed.)
    Abstract Estimating the mean of a probability distribution using i.i.d. samples is a classical problem in statistics, wherein finite-sample optimal estimators are sought under various distributional assumptions. In this paper, we consider the problem of mean estimation when independent samples are drawn from $d$-dimensional non-identical distributions possessing a common mean. When the distributions are radially symmetric and unimodal, we propose a novel estimator, which is a hybrid of the modal interval, shorth and median estimators and whose performance adapts to the level of heterogeneity in the data. We show that our estimator is near optimal when data are i.i.d. and when the fraction of ‘low-noise’ distributions is as small as $\varOmega \left (\frac{d \log n}{n}\right )$, where $n$ is the number of samples. We also derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we extend our theory to linear regression. In both the mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points. 
    more » « less