skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Nonparametric Mass Imputation for Data Integration
Abstract Data integration combining a probability sample with another nonprobability sample is an emerging area of research in survey sampling. We consider the case when the study variable of interest is measured only in the nonprobability sample, but comparable auxiliary information is available for both data sources. We consider mass imputation for the probability sample using the nonprobability data as the training set for imputation. The parametric mass imputation is sensitive to parametric model assumptions. To develop improved and robust methods, we consider nonparametric mass imputation for data integration. In particular, we consider kernel smoothing for a low-dimensional covariate and generalized additive models for a relatively high-dimensional covariate for imputation. Asymptotic theories and variance estimation are developed. Simulation studies and real applications show the benefits of our proposed methods over parametric counterparts.  more » « less
Award ID(s):
1811245 1733572
PAR ID:
10361995
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of Survey Statistics and Methodology
Volume:
10
Issue:
1
ISSN:
2325-0984
Page Range / eLocation ID:
p. 1-24
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Imputation is a popular technique for handling item nonresponse. Parametric imputation is based on a parametric model for imputation and is not robust against the failure of the imputation model. Nonparametric imputation is fully robust but is not applicable when the dimension of covariates is large due to the curse of dimensionality. Semiparametric imputation is another robust imputation based on a flexible model where the number of model parameters can increase with the sample size. In this paper, we propose a new semiparametric imputation based on a more flexible model assumption than the Gaussian mixture model. In the proposed mixture model, we assume a conditional Gaussian model for the study variable given the auxiliary variables, but the marginal distribution of the auxiliary variables is not necessarily Gaussian. The proposed mixture model is more flexible and achieves a better approximation than the Gaussian mixture models. The proposed method is applicable to high‐dimensional covariate problem by including a penalty function in the conditional log‐likelihood function. The proposed method is applied to the 2017 Korean Household Income and Expenditure Survey conducted by Statistics Korea. 
    more » « less
  2. Abstract Analysis of non-probability survey samples requires auxiliary information at the population level. Such information may also be obtained from an existing probability survey sample from the same finite population. Mass imputation has been used in practice for combining non-probability and probability survey samples and making inferences on the parameters of interest using the information collected only in the non-probability sample for the study variables. Under the assumption that the conditional mean function from the non-probability sample can be transported to the probability sample, we establish the consistency of the mass imputation estimator and derive its asymptotic variance formula. Variance estimators are developed using either linearization or bootstrap. Finite sample performances of the mass imputation estimator are investigated through simulation studies. We also address important practical issues of the method through the analysis of a real-world non-probability survey sample collected by the Pew Research Centre. 
    more » « less
  3. Abstract With advances in biomedical research, biomarkers are becoming increasingly important prognostic factors for predicting overall survival, while the measurement of biomarkers is often censored due to instruments' lower limits of detection. This leads to two types of censoring: random censoring in overall survival outcomes and fixed censoring in biomarker covariates, posing new challenges in statistical modeling and inference. Existing methods for analyzing such data focus primarily on linear regression ignoring censored responses or semiparametric accelerated failure time models with covariates under detection limits (DL). In this paper, we propose a quantile regression for survival data with covariates subject to DL. Comparing to existing methods, the proposed approach provides a more versatile tool for modeling the distribution of survival outcomes by allowing covariate effects to vary across conditional quantiles of the survival time and requiring no parametric distribution assumptions for outcome data. To estimate the quantile process of regression coefficients, we develop a novel multiple imputation approach based on another quantile regression for covariates under DL, avoiding stringent parametric restrictions on censored covariates as often assumed in the literature. Under regularity conditions, we show that the estimation procedure yields uniformly consistent and asymptotically normal estimators. Simulation results demonstrate the satisfactory finite‐sample performance of the method. We also apply our method to the motivating data from a study of genetic and inflammatory markers of Sepsis. 
    more » « less
  4. We consider quantile estimation in a semi-supervised setting, characterized by two available data sets: (i) a small or moderate sized labeled data set containing observations for a response and a set of possibly high dimensional covariates, and (ii) a much larger unlabeled data set where only the covariates are observed. We propose a family of semi-supervised estimators for the response quantile(s) based on the two data sets, to improve the estimation accuracy compared to the supervised estimator, i.e., the sample quantile from the labeled data. These estimators use a flexible imputation strategy applied to the estimating equation along with a debiasing step that allows for full robustness against misspecification of the imputation model. Further, a one-step update strategy is adopted to enable easy implementation of our method and handle the complexity from the non-linear nature of the quantile estimating equation. Under mild assumptions, our estimators are fully robust to the choice of the nuisance imputation model, in the sense of always maintaining root-n consistency and asymptotic normality, while having improved efficiency relative to the supervised estimator. They also attain semi-parametric optimality if the relation between the response and the covariates is correctly specified via the imputation model. As an illustration of estimating the nuisance imputation function, we consider kernel smoothing type estimators on lower dimensional and possibly estimated transformations of the high dimensional covariates, and we establish novel results on their uniform convergence rates in high dimensions, involving responses indexed by a function class and usage of dimension reduction techniques. These results may be of independent interest. Numerical results on both simulated and real data confirm our semi-supervised approach's improved performance, in terms of both estimation and inference. 
    more » « less
  5. Abstract The generalized semiparametric mixed varying‐coefficient effects model for longitudinal data can accommodate a variety of link functions and flexibly model different types of covariate effects, including time‐constant, time‐varying and covariate‐varying effects. The time‐varying effects are unspecified functions of time and the covariate‐varying effects are nonparametric functions of a possibly time‐dependent exposure variable. A semiparametric estimation procedure is developed that uses local linear smoothing and profile weighted least squares, which requires smoothing in the two different and yet connected domains of time and the time‐dependent exposure variable. The asymptotic properties of the estimators of both nonparametric and parametric effects are investigated. In addition, hypothesis testing procedures are developed to examine the covariate effects. The finite‐sample properties of the proposed estimators and testing procedures are examined through simulations, indicating satisfactory performances. The proposed methods are applied to analyze the AIDS Clinical Trial Group 244 clinical trial to investigate the effects of antiretroviral treatment switching in HIV‐infected patients before and after developing the T215Y antiretroviral drug resistance mutation.The Canadian Journal of Statistics47: 352–373; 2019 © 2019 Statistical Society of Canada 
    more » « less