

Title: A new similarity measure for covariate shift with applications to nonparametric regression
We study covariate shift in the context of nonparametric regression. We introduce a new measure of distribution mismatch between the source and target distributions using the integrated ratio of probabilities of balls at a given radius. We use the scaling of this measure with respect to the radius to characterize the minimax rate of estimation over a family of Hölder continuous functions under covariate shift. In comparison to the recently proposed notion of transfer exponent, this measure leads to a sharper rate of convergence and is more fine-grained. We accompany our theory with concrete instances of covariate shift that illustrate this sharp difference.
Award ID(s):
2015454 1955450
NSF-PAR ID:
10343723
Journal Name:
Proceedings of the International Conference on Machine Learning
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Predicting sets of outcomes, instead of unique outcomes, is a promising solution to uncertainty quantification in statistical learning. Despite a rich literature on constructing prediction sets with statistical guarantees, adapting to unknown covariate shift, a prevalent issue in practice, poses a serious unsolved challenge. In this article, we show that prediction sets with a finite-sample coverage guarantee are uninformative and propose a novel flexible distribution-free method, PredSet-1Step, to efficiently construct prediction sets with an asymptotic coverage guarantee under unknown covariate shift. We formally show that our method is asymptotically probably approximately correct, having well-calibrated coverage error with high confidence for large samples. We illustrate that it achieves nominal coverage in a number of experiments and a data set concerning HIV risk prediction in a South African cohort study. Our theory hinges on a new bound for the convergence rate of the coverage of Wald confidence intervals based on general asymptotically linear estimators.
  2. In the problem of domain adaptation for binary classification, the learner is presented with labeled examples from a source domain, and must correctly classify unlabeled examples from a target domain, which may differ from the source. Previous work on this problem has assumed that the performance measure of interest is the expected value of some loss function. We study a Neyman-Pearson-like criterion and argue that, for this optimality criterion, stronger domain adaptation results are possible than what has previously been established. In particular, we study a class of domain adaptation problems that generalizes both the covariate shift assumption and a model for feature-dependent label noise, and establish optimal classification on the target domain despite not having access to labelled data from this domain. 
  3. Covariate shift is a major roadblock in the reliability of image classifiers in the real world. Work on covariate shift has been focused on training classifiers to adapt or generalize to unseen domains. However, for transparent decision making, it is equally desirable to develop covariate shift detection methods that can indicate whether or not a test image belongs to an unseen domain. In this paper, we introduce a benchmark for covariate shift detection (CSD) that builds upon and complements previous work on domain generalization. We use state-of-the-art OOD detection methods as baselines and find them to be worse than simple confidence-based methods on our CSD benchmark. We propose an interpolation-based technique, Domain Interpolation Sensitivity (DIS), based on the simple hypothesis that interpolation between the test input and randomly sampled inputs from the training domain offers sufficient information to distinguish between the training domain and unseen domains under covariate shift. DIS surpasses all OOD detection baselines for CSD on multiple domain generalization benchmarks.
  4. Covariate shift is a prevalent setting for supervised learning in the wild when the training and test data are drawn from different time periods, different but related domains, or via different sampling strategies. This paper addresses a transfer learning setting, with covariate shift between source and target domains. Most existing methods for correcting covariate shift exploit density ratios of the features to reweight the source-domain data, and when the features are high-dimensional, the estimated density ratios may suffer large estimation variances, leading to poor performance of prediction under covariate shift. In this work, we investigate the dependence of covariate shift correction performance on the dimensionality of the features, and propose a correction method that finds a low-dimensional representation of the features, which takes into account features relevant to the target Y, and exploits the density ratio of this representation for importance reweighting. We discuss the factors that affect the performance of our method, and demonstrate its capabilities on both pseudo-real data and real-world applications.
  5. Abstract We measure the projected number density profiles of galaxies and the splashback feature in clusters selected by the Sunyaev–Zel’dovich effect from the Advanced Atacama Cosmology Telescope (AdvACT) survey using galaxies observed by the Dark Energy Survey (DES). The splashback radius is consistent with CDM-only simulations and is located at $2.4^{+0.3}_{-0.4}$ Mpc $h^{-1}$. We split the galaxies on color and find significant differences in their profile shapes. Red and green-valley galaxies show a splashback-like minimum in their slope profile consistent with theory, while the bluest galaxies show a weak feature at a smaller radius. We develop a mapping of galaxies to subhalos in simulations and assign colors based on infall time onto their hosts. We find that the shift in location of the steepest slope and different profile shapes can be mapped to the average time of infall of galaxies of different colors. The steepest slope traces a discontinuity in the phase space of dark matter halos. By relating spatial profiles to infall time, we can use splashback as a clock to understand galaxy quenching. We find that red galaxies have on average been in clusters over 3.2 Gyr, green galaxies about 2.2 Gyr, while blue galaxies have been accreted most recently and have not reached apocenter. Using the full radial profiles, we fit a simple quenching model and find that the onset of galaxy quenching occurs after a delay of about a gigayear and that galaxies quench rapidly thereafter with an exponential timescale of 0.6 Gyr.