skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on April 21, 2026

Title: The Decaying Missing-at-Random Framework: Model Doubly Robust Causal Inference with Partially Labeled Data
In modern large-scale observational studies, data collection constraints often result in partially labeled datasets, posing challenges for reliable causal inference, especially due to potential labeling bias and relatively small size of the labeled data. This paper introduces a decaying missing-at-random (decaying MAR) framework and associated approaches for doubly robust causal inference on treatment effects in such semi-supervised (SS) settings. This simultaneously addresses selection bias in the labeling mechanism and the extreme imbalance between labeled and unlabeled groups, bridging the gap between the standard SS and missing data literatures, while throughout allowing for confounded treatment assignment and high-dimensional confounders under appropriate sparsity conditions. To ensure robust causal conclusions, we propose a bias-reduced SS (BRSS) estimator for the average treatment effect, a type of 'model doubly robust' estimator appropriate for such settings, establishing asymptotic normality at the appropriate rate under decaying labeling propensity scores, provided that at least one nuisance model is correctly specified. Our approach also relaxes sparsity conditions beyond those required in existing methods, including standard supervised approaches. Recognizing the asymmetry between labeling and treatment mechanisms, we further introduce a de-coupled BRSS (DC-BRSS) estimator, which integrates inverse probability weighting (IPW) with bias-reducing techniques in nuisance estimation. This refinement further weakens model specification and sparsity requirements. Numerical experiments confirm the effectiveness and adaptability of our estimators in addressing labeling bias and model misspecification.  more » « less
Award ID(s):
2113768
PAR ID:
10650010
Author(s) / Creator(s):
; ;
Publisher / Repository:
arXiv.org
Date Published:
Journal Name:
arXiv.org
Edition / Version:
arXiv preprint arXiv:2305.12789v3
ISSN:
2331-8422
Page Range / eLocation ID:
1-72
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Semi-supervised (SS) inference has received much attention in recent years. Apart from a moderate-sized labeled data, $$\mathcal L$$, the SS setting is characterized by an additional, much larger sized, unlabeled data, $$\mathcal U$$. The setting of $$|\mathcal U\ |\gg |\mathcal L\ |$$, makes SS inference unique and different from the standard missing data problems, owing to natural violation of the so-called ‘positivity’ or ‘overlap’ assumption. However, most of the SS literature implicitly assumes $$\mathcal L$$ and $$\mathcal U$$ to be equally distributed, i.e., no selection bias in the labeling. Inferential challenges in missing at random type labeling allowing for selection bias, are inevitably exacerbated by the decaying nature of the propensity score (PS). We address this gap for a prototype problem, the estimation of the response’s mean. We propose a double robust SS mean estimator and give a complete characterization of its asymptotic properties. The proposed estimator is consistent as long as either the outcome or the PS model is correctly specified. When both models are correctly specified, we provide inference results with a non-standard consistency rate that depends on the smaller size $$|\mathcal L\ |$$. The results are also extended to causal inference with imbalanced treatment groups. Further, we provide several novel choices of models and estimators of the decaying PS, including a novel offset logistic model and a stratified labeling model. We present their properties under both high- and low-dimensional settings. These may be of independent interest. Lastly, we present extensive simulations and also a real data application. 
    more » « less
  2. In this article, we aim to provide a general and complete understanding of semi-supervised (SS) causal inference for treatment effects. Specifically, we consider two such estimands: (a) the average treatment effect and (b) the quantile treatment effect, as prototype cases, in an SS setting, characterized by two available data sets: (i) a labeled data set of size n, providing observations for a response and a set of high dimensional covariates, as well as a binary treatment indicator; and (ii) an unlabeled data set of size N, much larger than n, but without the response observed. Using these two data sets, we develop a family of SS estimators which are ensured to be: (1) more robust and (2) more efficient than their supervised counterparts based on the labeled data set only. Beyond the 'standard' double robustness results (in terms of consistency) that can be achieved by supervised methods as well, we further establish root-n consistency and asymptotic normality of our SS estimators whenever the propensity score in the model is correctly specified, without requiring specific forms of the nuisance functions involved. Such an improvement of robustness arises from the use of the massive unlabeled data, so it is generally not attainable in a purely supervised setting. In addition, our estimators are shown to be semi-parametrically efficient as long as all the nuisance functions are correctly specified. Moreover, as an illustration of the nuisance estimators, we consider inverse-probability-weighting type kernel smoothing estimators involving unknown covariate transformation mechanisms, and establish in high dimensional scenarios novel results on their uniform convergence rates, which should be of independent interest. Numerical results on both simulated and real data validate the advantage of our methods over their supervised counterparts with respect to both robustness and efficiency. 
    more » « less
  3. Inference in semi-supervised (SS) settings has gained substantial attention in recent years due to increased relevance in modern big-data problems. In a typical SS setting, there is a much larger-sized unlabeled data, containing only observations of predictors, and a moderately sized labeled data containing observations for both an outcome and the set of predictors. Such data naturally arises when the outcome, unlike the predictors, is costly or difficult to obtain. One of the primary statistical objectives in SS settings is to explore whether parameter estimation can be improved by exploiting the unlabeled data. We propose a novel Bayesian method for estimating the population mean in SS settings. The approach yields estimators that are both efficient and optimal for estimation and inference. The method itself has several interesting artifacts. The central idea behind the method is to model certain summary statistics of the data in a targeted manner, rather than the entire raw data itself, along with a novel Bayesian notion of debiasing. Specifying appropriate summary statistics crucially relies on a debiased representation of the population mean that incorporates unlabeled data through a flexible nuisance function while also learning its estimation bias. Combined with careful usage of sample splitting, this debiasing approach mitigates the effect of bias due to slow rates or misspecification of the nuisance parameter from the posterior of the final parameter of interest, ensuring its robustness and efficiency. Concrete theoretical results, via Bernstein--von Mises theorems, are established, validating all claims, and are further supported through extensive numerical studies. To our knowledge, this is possibly the first work on Bayesian inference in SS settings, and its central ideas also apply more broadly to other Bayesian semi-parametric inference problems. 
    more » « less
  4. Gustau Camps-Valls; Francisco J. R. Ruiz; Isabel Valera (Ed.)
    Robins et al. (2008) introduced a class of influence functions (IFs) which could be used to obtain doubly robust moment functions for the corresponding parameters. However, that class does not include the IF of parameters for which the nuisance functions are solutions to integral equations. Such parameters are particularly important in the field of causal inference, specifically in the recently proposed proximal causal inference framework of Tchetgen Tchetgen et al. (2020), which allows for estimating the causal effect in the presence of latent confounders. In this paper, we first extend the class of Robins et al. to include doubly robust IFs in which the nuisance functions are solutions to integral equations. Then we demonstrate that the double robustness property of these IFs can be leveraged to construct estimating equations for the nuisance functions, which enables us to solve the integral equations without resorting to parametric models. We frame the estimation of the nuisance functions as a minimax optimization problem. We provide convergence rates for the nuisance functions and conditions required for asymptotic linearity of the estimator of the parameter of interest. The experiment results demonstrate that our proposed methodology leads to robust and high-performance estimators for average causal effect in the proximal causal inference framework. 
    more » « less
  5. We propose a semiparametric Bayesian methodology for estimating the average treatment effect (ATE) within the potential outcomes framework using observational data with high-dimensional nuisance parameters. Our method introduces a Bayesian debiasing procedure that corrects for bias arising from nuisance estimation and employs a targeted modeling strategy based on summary statistics rather than the full data. These summary statistics are identified in a debiased manner, enabling the estimation of nuisance bias via weighted observables and facilitating hierarchical learning of the ATE. By combining debiasing with sample splitting, our approach separates nuisance estimation from inference on the target parameter, reducing sensitivity to nuisance model specification. We establish that, under mild conditions, the marginal posterior for the ATE satisfies a Bernstein-von Mises theorem when both nuisance models are correctly specified and remains consistent and robust when only one is correct, achieving Bayesian double robustness. This ensures asymptotic efficiency and frequentist validity. Extensive simulations confirm the theoretical results, demonstrating accurate point estimation and credible intervals with nominal coverage, even in high-dimensional settings. The proposed framework can also be extended to other causal estimands, and its key principles offer a general foundation for advancing Bayesian semiparametric inference more broadly. 
    more » « less