Title: High-dimensional semi-supervised learning: in search of optimal inference of the mean
Summary A fundamental challenge in semi-supervised learning lies in the disparity between the size of the observed data and that of the much larger dataset collected with missing outcomes. An implicit understanding is that the dataset with missing outcomes, being significantly larger, ought to improve estimation and inference. However, it is unclear to what extent this is correct. We illustrate one clear benefit: root-$n$ inference of the outcome's mean is possible while requiring only consistent estimation of the outcome, possibly at a rate slower than root-$n$. This is achieved by a novel $k$-fold, cross-fitted, double robust estimator. We discuss both linear and nonlinear outcomes. Such an estimator is particularly suited to models that naturally do not admit root-$n$ consistency, such as high-dimensional, nonparametric or semiparametric models. We apply our methods to estimating heterogeneous treatment effects.
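The mechanics of such an estimator can be illustrated with a minimal sketch. The linear working model for the outcome and the constant (missing-completely-at-random) labeling propensity below are simple stand-ins for the flexible nuisance estimators the paper allows, and all function names are illustrative, not the paper's implementation:

```python
import numpy as np

def cross_fitted_dr_mean(X, Y, R, K=5, seed=0):
    """K-fold cross-fitted double robust estimate of E[Y].

    X : (n, p) covariates; R : (n,) labeling indicator (1 = outcome observed);
    Y : (n,) outcomes -- entries with R == 0 may hold any finite placeholder,
    since they are multiplied by R and never enter the fit.
    """
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % K          # balanced random fold labels
    pi_hat = R.mean()                       # constant (MCAR) propensity estimate
    psi = np.empty(n)
    for k in range(K):
        train, test = folds != k, folds == k
        lab = train & (R == 1)              # labeled points outside fold k
        Xl = np.column_stack([np.ones(lab.sum()), X[lab]])
        beta, *_ = np.linalg.lstsq(Xl, Y[lab], rcond=None)
        Xt = np.column_stack([np.ones(test.sum()), X[test]])
        m_hat = Xt @ beta                   # outcome model, fit out-of-fold
        psi[test] = m_hat + R[test] / pi_hat * (Y[test] - m_hat)
    return psi.mean()
```

Cross-fitting ensures the outcome model is always evaluated on data it was not trained on, which is the device that lets a slower-than-root-$n$ nuisance rate still yield root-$n$ inference for the mean.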
Award ID(s):
1712481
PAR ID:
10345775
Author(s) / Creator(s):
Date Published:
Journal Name:
Biometrika
Volume:
109
Issue:
2
ISSN:
0006-3444
Page Range / eLocation ID:
387 to 403
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose the use of U-statistics to reduce variance for gradient estimation in importance-weighted variational inference. The key observation is that, given a base gradient estimator that requires m > 1 samples and a total of n > m samples available for estimation, lower variance is achieved by averaging the base estimator over overlapping batches of size m than over disjoint batches, as is currently done. We use classical U-statistic theory to analyze the variance reduction, and propose novel approximations with theoretical guarantees to ensure computational efficiency. We find empirically that U-statistic variance reduction can lead to modest to significant improvements in inference performance on a range of models, with little computational cost.
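The overlapping-versus-disjoint batching idea can be shown concretely. As a stand-in for a base gradient estimator with m = 2, the sketch below uses the classical variance kernel $h(x_1, x_2) = (x_1 - x_2)^2 / 2$; function names are illustrative:

```python
import itertools
import numpy as np

def disjoint_estimate(x, h, m):
    """Baseline: average the m-sample kernel h over disjoint batches."""
    n = len(x)
    batches = [x[i:i + m] for i in range(0, n - n % m, m)]
    return np.mean([h(b) for b in batches])

def u_statistic(x, h, m):
    """U-statistic: average h over all overlapping size-m subsets."""
    return np.mean([h(np.asarray(s)) for s in itertools.combinations(x, m)])
```

Both estimators are unbiased for the same quantity; classical U-statistic theory says the overlapping average has variance no larger than any disjoint-batch average of the same kernel, which is the source of the reduction exploited here.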
  2.
    Summary The net reclassification improvement (NRI) and the integrated discrimination improvement (IDI) were originally proposed to characterize accuracy improvement in predicting a binary outcome when new biomarkers are added to regression models. These two indices have been extended from binary outcomes to multi-categorical and survival outcomes. Working on an AIDS study where the onset of cognitive impairment is subject to competing-risk censoring by death, we extend the NRI and the IDI to competing risk outcomes, by using cumulative incidence functions to quantify cumulative risks of competing events, and adopting the definitions of the two indices for multi-category outcomes. The “missing” category due to independent censoring is handled through inverse probability weighting. Various competing risk models are considered, such as the Fine and Gray, multistate, and multinomial logistic models. Estimation methods for the NRI and the IDI from competing risk data are presented. Inference for the NRI is constructed based on the asymptotic normality of its estimator, and the bias-corrected and accelerated bootstrap procedure is used for the IDI. Simulations demonstrate that the proposed inferential procedures perform very well. The Multicenter AIDS Cohort Study is used to illustrate the practical utility of the extended NRI and IDI for competing risk outcomes.
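For intuition, the category-based NRI that this extension builds on can be sketched as below, with optional inverse-probability weights standing in for the handling of the censored "missing" category. This is an illustrative sketch of the reclassification mechanics for a binary event, not the paper's full cumulative-incidence-based estimator:

```python
import numpy as np

def categorical_nri(event, old_cat, new_cat, weights=None):
    """Category-based net reclassification improvement for a binary event.

    event   : (n,) 0/1 event indicator
    old_cat : (n,) risk category under the old model (higher = riskier)
    new_cat : (n,) risk category under the new model
    weights : optional (n,) inverse-probability weights (e.g., to account
              for independent censoring); uniform if omitted.
    """
    event = np.asarray(event, float)
    up = (np.asarray(new_cat) > np.asarray(old_cat)).astype(float)
    down = (np.asarray(new_cat) < np.asarray(old_cat)).astype(float)
    w = np.ones_like(event) if weights is None else np.asarray(weights, float)
    we, wn = w * event, w * (1 - event)
    nri_event = (we * (up - down)).sum() / we.sum()       # events should move up
    nri_nonevent = (wn * (down - up)).sum() / wn.sum()    # non-events, down
    return nri_event + nri_nonevent
```

A new model that moves every event up one category and every non-event down one category attains the maximal NRI of 2; a model that reclassifies no one scores 0.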
  3. Abstract Semi-supervised (SS) inference has received much attention in recent years. Apart from a moderate-sized labeled dataset, $\mathcal{L}$, the SS setting is characterized by an additional, much larger unlabeled dataset, $\mathcal{U}$. The setting $|\mathcal{U}| \gg |\mathcal{L}|$ makes SS inference unique and different from standard missing data problems, owing to a natural violation of the so-called ‘positivity’ or ‘overlap’ assumption. However, most of the SS literature implicitly assumes $\mathcal{L}$ and $\mathcal{U}$ to be identically distributed, i.e., no selection bias in the labeling. Inferential challenges under missing-at-random-type labeling that allows for selection bias are inevitably exacerbated by the decaying nature of the propensity score (PS). We address this gap for a prototype problem, estimation of the response's mean. We propose a double robust SS mean estimator and give a complete characterization of its asymptotic properties. The proposed estimator is consistent as long as either the outcome or the PS model is correctly specified. When both models are correctly specified, we provide inference results with a nonstandard consistency rate that depends on the smaller size $|\mathcal{L}|$. The results are also extended to causal inference with imbalanced treatment groups. Further, we provide several novel choices of models and estimators of the decaying PS, including a novel offset logistic model and a stratified labeling model. We present their properties under both high- and low-dimensional settings; these may be of independent interest. Lastly, we present extensive simulations and a real data application.
  4. Summary For decades, $N$-of-1 experiments, where a unit serves as its own control and treatment in different time windows, have been used in certain medical contexts. However, due to effects that accumulate over long time windows and interventions that have complex evolution, a lack of robust inference tools has limited the widespread applicability of such $N$-of-1 designs. This work combines techniques from experimental design in causal inference and system identification from control theory to provide such an inference framework. We derive a model of the dynamic interference effect that arises in linear time-invariant dynamical systems. We show that a family of causal estimands analogous to those studied in potential outcomes are estimable via a standard estimator derived from the method of moments. We derive formulae for higher moments of this estimator and describe conditions under which $N$-of-1 designs may provide faster ways to estimate the effects of interventions in dynamical systems. We also provide conditions under which our estimator is asymptotically normal and derive valid confidence intervals for this setting.
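The flavor of such a method-of-moments estimator can be sketched in the simplest case: a linear time-invariant system driven by i.i.d. Bernoulli($p$) treatment assignments, where each impulse-response coefficient is identified by a covariance moment, $h_k = \mathrm{E}[y_t (u_{t-k} - p)] / \{p(1-p)\}$. This is an illustrative reduction under strong assumptions, not the paper's general framework:

```python
import numpy as np

def moment_impulse_response(y, u, p, max_lag):
    """Method-of-moments estimate of the impulse response of an LTI system
    driven by i.i.d. Bernoulli(p) treatments u, observed through outcomes y:
        h_k ~= mean( y_t * (u_{t-k} - p) ) / (p * (1 - p)).
    """
    h = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        yk = y[k:]                    # align y_t ...
        uk = u[: len(u) - k]          # ... with u_{t-k}
        h[k] = np.mean(yk * (uk - p)) / (p * (1 - p))
    return h
```

Because the treatments are independent across time, each lag-$k$ moment isolates $h_k$ even though earlier treatments' effects accumulate in $y_t$; this is the dynamic-interference point made by the abstract.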
  5. Abstract Calibration weighting has been widely used to correct selection biases in nonprobability sampling, missing data and causal inference. The main idea is to calibrate the biased sample to the benchmark by adjusting the subject weights. However, hard calibration can produce enormous weights when an exact calibration is enforced on a large set of extraneous covariates. This article proposes a soft calibration scheme, where the outcome and the selection indicator follow mixed-effect models. The scheme imposes an exact calibration on the fixed effects and an approximate calibration on the random effects. On the one hand, our soft calibration has an intrinsic connection with best linear unbiased prediction, which results in a more efficient estimation compared to hard calibration. On the other hand, soft calibration weighting estimation can be envisioned as penalized propensity score weight estimation, with the penalty term motivated by the mixed-effect structure. The asymptotic distribution and a valid variance estimator are derived for soft calibration. We demonstrate the superiority of the proposed estimator over other competitors in simulation studies and using a real-world data application on the effect of BMI screening on childhood obesity. 
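The contrast with hard calibration can be sketched as a constrained least-squares problem: exact calibration equations for the fixed-effect covariates, and a penalized (approximate) fit for the random-effect covariates, solved through its KKT system. This is a schematic of the idea with illustrative names, not the paper's mixed-model estimator:

```python
import numpy as np

def soft_calibration_weights(X_fix, X_rand, t_fix, t_rand, d, lam):
    """Soft calibration sketch: stay close to design weights d, calibrate
    X_fix exactly and X_rand only approximately.

    Solves   min_w ||w - d||^2 + (1/lam) * ||X_rand.T @ w - t_rand||^2
             s.t.  X_fix.T @ w = t_fix
    via the KKT linear system; lam > 0 relaxes the random-effect equations
    (lam -> 0 recovers hard calibration on X_rand, lam -> inf drops it).
    """
    n, q = X_fix.shape
    M = 2 * np.eye(n) + (2 / lam) * X_rand @ X_rand.T
    K = np.block([[M, X_fix],
                  [X_fix.T, np.zeros((q, q))]])     # saddle-point system
    rhs = np.concatenate([2 * d + (2 / lam) * X_rand @ t_rand, t_fix])
    return np.linalg.solve(K, rhs)[:n]
```

Relaxing the random-effect equations is what prevents the enormous weights that hard calibration can produce when many extraneous covariates are calibrated exactly.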