Recent work of Klivans, Stavropoulos, and Vasilyan initiated the study of testable learning with distribution shift (TDS learning), where a learner is given labeled samples from training distribution D, unlabeled samples from test distribution D′, and the goal is to output a classifier with low error on D′ whenever the training samples pass a corresponding test. Their model deviates from all prior work in that no assumptions are made on D′. Instead, the test must accept (with high probability) when the marginals of the training and test distributions are equal. Here we focus on the fundamental case of intersections of halfspaces with respect to Gaussian training distributions and prove a variety of new upper bounds including a 2(k/ϵ)O(1)poly(d)-time algorithm for TDS learning intersections of k homogeneous halfspaces to accuracy ϵ (prior work achieved d(k/ϵ)O(1)). We work under the mild assumption that the Gaussian training distribution contains at least an ϵ fraction of both positive and negative examples (ϵ-balanced). We also prove the first set of SQ lower-bounds for any TDS learning problem and show (1) the ϵ-balanced assumption is necessary for poly(d,1/ϵ)-time TDS learning for a single halfspace and (2) a dΩ~(log1/ϵ) lower bound for the intersection of two general halfspaces, even with the ϵ-balanced assumption. Our techniques significantly expand the toolkit for TDS learning. We use dimension reduction and coverings to give efficient algorithms for computing a localized version of discrepancy distance, a key metric from the domain adaptation literature.
more »
« less
Testable Learning with Distribution Shift
We revisit the fundamental problem of learning with distribution shift, in which a learner is given labeled samples from training distribution D, unlabeled samples from test distribution D’ and is asked to output a classifier with low test error. The standard approach in this setting is to bound the loss of a classifier in terms of some notion of distance between D and D’. These distances, however, seem difficult to compute and do not lead to efficient algorithms. We depart from this paradigm and define a new model called testable learning with distribution shift, where we can obtain provably efficient algorithms for certifying the performance of a classifier on a test distribution. In this model, a learner outputs a classifier with low test error whenever samples from D and D’ pass an associated test; moreover, the test must accept (with high probability) if the marginal of D equals the marginal of D’. We give several positive results for learning well-studied concept classes such as halfspaces, intersections of halfspaces, and decision trees when the marginal of D is Gaussian or uniform on the hypercube. Prior to our work, no efficient algorithms for these basic cases were known without strong assumptions on D’. For halfspaces in the realizable case (where there exists a halfspace consistent with both D and D’), we combine a moment-matching approach with ideas from active learning to simulate an efficient oracle for estimating disagreement regions. To extend to the non-realizable setting, we apply recent work from testable (agnostic) learning. More generally, we prove that any function class with low-degree L2-sandwiching polynomial approximators can be learned in our model. Since we require L2- sandwiching (instead of the usual L1 loss), we cannot directly appeal to convex duality and instead apply constructions from the pseudorandomness literature to obtain the required approximators. We also provide lower bounds to show that the guarantees we obtain on the performance of our output hypotheses are best possible up to constant factors, as well as a separation showing that realizable learning in our model is incomparable to (ordinary) agnostic learning.
more »
« less
- Award ID(s):
- 2310818
- PAR ID:
- 10577199
- Publisher / Repository:
- The Thirty Seventh Annual Conference on Learning Theory (COLT 2024)
- Date Published:
- Page Range / eLocation ID:
- 2887-2943
- Format(s):
- Medium: X
- Location:
- Edmonton, Canada
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We give the first efficient algorithm for learning halfspaces in the testable learning model recently defined by Rubinfeld and Vasilyan [2022]. In this model, a learner certifies that the accuracy of its output hypothesis is near optimal whenever the training set passes an associated test, and training sets drawn from some target distribution must pass the test. This model is more challenging than distribution-specific agnostic or Massart noise models where the learner is allowed to fail arbitrarily if the distributional assumption does not hold. We consider the setting where the target distribution is the standard Gaussian in dimensions and the label noise is either Massart or adversarial (agnostic). For Massart noise, our tester-learner runs in polynomial time and outputs a hypothesis with (information-theoretically optimal) error (and extends to any fixed strongly log-concave target distribution). For adversarial noise, our tester-learner obtains error in polynomial time. Prior work on testable learning ignores the labels in the training set and checks that the empirical moments of the covariates are close to the moments of the base distribution. Here we develop new tests of independent interest that make critical use of the labels and combine them with the moment-matching approach of Gollakota et al. [2022]. This enables us to implement a testable variant of the algorithm of Diakonikolas et al. [2020a, 2020b] for learning noisy halfspaces using nonconvex SGD.more » « less
-
We give the first efficient algorithm for learning halfspaces in the testable learning model recently defined by Rubinfeld and Vasilyan [2022]. In this model, a learner certifies that the accuracy of its output hypothesis is near optimal whenever the training set passes an associated test, and training sets drawn from some target distribution must pass the test. This model is more challenging than distribution-specific agnostic or Massart noise models where the learner is allowed to fail arbitrarily if the distributional assumption does not hold. We consider the setting where the target distribution is the standard Gaussian in dimensions and the label noise is either Massart or adversarial (agnostic). For Massart noise, our tester-learner runs in polynomial time and outputs a hypothesis with (information-theoretically optimal) error (and extends to any fixed strongly log-concave target distribution). For adversarial noise, our tester-learner obtains error in polynomial time. Prior work on testable learning ignores the labels in the training set and checks that the empirical moments of the covariates are close to the moments of the base distribution. Here we develop new tests of independent interest that make critical use of the labels and combine them with the moment-matching approach of Gollakota et al. [2022]. This enables us to implement a testable variant of the algorithm of Diakonikolas et al. [2020a, 2020b] for learning noisy halfspaces using nonconvex SGD.more » « less
-
We study the problem of learning under arbitrary distribution shift, where the learner is trained on a labeled set from one distribution but evaluated on a different, potentially adversarially generated test distribution. We focus on two frameworks: PQ learning [GKKM'20], allowing abstention on adversarially generated parts of the test distribution, and TDS learning [KSV'23], permitting abstention on the entire test distribution if distribution shift is detected. All prior known algorithms either rely on learning primitives that are computationally hard even for simple function classes, or end up abstaining entirely even in the presence of a tiny amount of distribution shift. We address both these challenges for natural function classes, including intersections of halfspaces and decision trees, and standard training distributions, including Gaussians. For PQ learning, we give efficient learning algorithms, while for TDS learning, our algorithms can tolerate moderate amounts of distribution shift. At the core of our approach is an improved analysis of spectral outlier-removal techniques from learning with nasty noise. Our analysis can (1) handle arbitrarily large fraction of outliers, which is crucial for handling arbitrary distribution shifts, and (2) obtain stronger bounds on polynomial moments of the distribution after outlier removal, yielding new insights into polynomial regression under distribution shifts. Lastly, our techniques lead to novel results for tolerant testable learning [RV'23], and learning with nasty noise.more » « less
-
There are many important high dimensional function classes that have fast agnostic learning algorithms when strong assumptions on the distribution of examples can be made, such as Gaussianity or uniformity over the domain. But how can one be sufficiently confident that the data indeed satisfies the distributional assumption, so that one can trust in the output quality of the agnostic learning algorithm? We propose a model by which to systematically study the design of tester-learner pairs (A,T), such that if the distribution on examples in the data passes the tester T then one can safely trust the output of the agnostic learner A on the data. To demonstrate the power of the model, we apply it to the classical problem of agnostically learning halfspaces under the standard Gaussian distribution and present a tester-learner pair with a combined run-time of nÕ(1/є4). This qualitatively matches that of the best known ordinary agnostic learning algorithms for this task. In contrast, finite sample Gaussian distribution testers do not exist for the L1 and EMD distance measures. Previously it was known that half-spaces are well-approximated with low-degree polynomials relative to the Gaussian distribution. A key step in our analysis is showing that this is the case even relative to distributions whose low-degree moments approximately match those of a Gaussian. We also go beyond spherically-symmetric distributions, and give a tester-learner pair for halfspaces under the uniform distribution on {0,1}n with combined run-time of nÕ(1/є4). This is achieved using polynomial approximation theory and critical index machinery of [Diakonikolas, Gopalan, Jaiswal, Servedio, and Viola 2009]. Can one design agnostic learning algorithms under distributional assumptions and count on future technical work to produce, as a matter of course, tester-learner pairs with similar run-time? Our answer is a resounding no, as we show there exist some well-studied settings for which 2Õ(√n) run-time agnostic learning algorithms are available, yet the combined run-times of tester-learner pairs must be as high as 2Ω(n). On that account, the design of tester-learner pairs is a research direction in its own right independent of standard agnostic learning. To be specific, our lower bounds apply to the problems of agnostically learning convex sets under the Gaussian distribution and for monotone Boolean functions under the uniform distribution over {0,1}n.more » « less