Title: Nearest Neighbor Imputation for General Parameter Estimation in Survey Sampling
Nearest neighbor imputation has a long tradition for handling item nonresponse in survey sampling. In this article, we study the asymptotic properties of the nearest neighbor imputation estimator for general population parameters, including population means, proportions and quantiles. For variance estimation, we propose novel replication variance estimation, which is asymptotically valid and straightforward to implement. The main idea is to construct replicates of the estimator directly based on its asymptotically linear terms, instead of individual records of variables. The simulation results show that nearest neighbor imputation and the proposed variance estimation provide valid inferences for general population parameters.
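As a rough illustration of the point estimator described in the abstract, the following minimal sketch imputes each missing value from the respondent closest in an auxiliary variable and then computes a weighted mean and a weighted quantile from the completed data. The function names (`nn_impute`, `weighted_mean`, `weighted_quantile`), the Euclidean distance, and the weights `w` are assumptions made for this example and are not taken from the article. The replication variance estimator proposed in the article is not reproduced here.

```python
import numpy as np


def nn_impute(x, y, observed):
    """Fill each missing y with the y of the nearest respondent in x.

    x        : (n,) or (n, p) auxiliary variables, fully observed
    y        : (n,) study variable; entries where observed is False are ignored
    observed : (n,) boolean response indicator
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    if x.ndim == 1:
        x = x[:, None]
    donors = np.flatnonzero(observed)
    y_imp = y.copy()
    for i in np.flatnonzero(~observed):
        # Euclidean distance from nonrespondent i to every respondent (donor).
        dist = np.linalg.norm(x[donors] - x[i], axis=1)
        y_imp[i] = y[donors[np.argmin(dist)]]
    return y_imp


def weighted_mean(y, w):
    """Weighted mean of the completed data."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    return float(np.sum(w * y) / np.sum(w))


def weighted_quantile(y, w, q):
    """Smallest y whose weighted empirical CDF is at least q."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    order = np.argsort(y)
    cdf = np.cumsum(w[order]) / np.sum(w)
    idx = min(int(np.searchsorted(cdf, q)), len(y) - 1)
    return float(y[order][idx])


# Toy data: units 1 and 3 did not respond.
x = np.array([0.0, 0.9, 2.0, 3.0])
y = np.array([10.0, np.nan, 30.0, np.nan])
observed = np.array([True, False, True, False])
w = np.ones(4)  # hypothetical weights; equal here for simplicity

y_completed = nn_impute(x, y, observed)
mean_hat = weighted_mean(y_completed, w)
median_hat = weighted_quantile(y_completed, w, 0.5)
```

In a survey setting, `w` would typically hold the design weights, and a proportion can be estimated by applying `weighted_mean` to an indicator of the completed values.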
Corder, Nathan; Yang, Shu
(Journal of Causal Inference)
Abstract The problem of missingness in observational data is ubiquitous. When the confounders are missing at random, multiple imputation is commonly used; however, the method requires congeniality conditions for valid inferences, which may not be satisfied when estimating average causal treatment effects. Alternatively, fractional imputation, proposed by Kim (2011), has been used to handle missing values in the regression context. In this article, we develop fractional imputation methods for estimating the average treatment effects with confounders missing at random. We show that the fractional imputation estimator of the average treatment effect is asymptotically normal, which permits a consistent variance estimate. Via a simulation study, we compare fractional imputation’s accuracy and precision with that of multiple imputation.
Demirkaya, Emre; Fan, Yingying; Gao, Lan; Lv, Jinchi; Vossler, Patrick; Wang, Jingbo
(Journal of the American Statistical Association)
The weighted nearest neighbors (WNN) estimator has been popularly used as a flexible and easy-to-implement nonparametric tool for mean regression estimation. The bagging technique is an elegant way to form WNN estimators with weights automatically generated to the nearest neighbors (Steele, 2009; Biau et al., 2010); we name the resulting estimator as the distributional nearest neighbors (DNN) for easy reference. Yet, there is a lack of distributional results for such an estimator, limiting its application to statistical inference. Moreover, when the mean regression function has higher-order smoothness, DNN does not achieve the optimal nonparametric convergence rate, mainly because of the bias issue. In this work, we provide an in-depth technical analysis of the DNN, based on which we suggest a bias reduction approach for the DNN estimator by linearly combining two DNN estimators with different subsampling scales, resulting in the novel two-scale DNN (TDNN) estimator. The two-scale DNN estimator has an equivalent representation of WNN with weights admitting explicit forms and some being negative. We prove that, thanks to the use of negative weights, the two-scale DNN estimator enjoys the optimal nonparametric rate of convergence in estimating the regression function under the fourth order smoothness condition. We further go beyond estimation and establish that the DNN and two-scale DNN are both asymptotically normal as the subsampling scales and sample size diverge to infinity. For the practical implementation, we also provide variance estimators and a distribution estimator using the jackknife and bootstrap techniques for the two-scale DNN. These estimators can be exploited for constructing valid confidence intervals for nonparametric inference of the regression function. The theoretical results and appealing finite-sample performance of the suggested two-scale DNN method are illustrated with several simulation examples and a real data application.
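The weighted nearest neighbor form mentioned in this abstract can be sketched as follows. The sketch assumes that bagging means averaging the 1-nearest-neighbor prediction over all subsamples of size `s` drawn without replacement, in which case the i-th nearest neighbor receives weight C(n-i, s-1)/C(n, s). The two-scale combination further assumes that the leading bias of a DNN estimator with scale `s` is proportional to `s**(-2/dim)`, so the two weights are chosen to sum to one and cancel that term. Both assumptions, and all function names, are illustrative and are not quoted from the paper.

```python
from math import comb

import numpy as np


def dnn_weights(n, s):
    """Weight on the i-th nearest neighbor (i = 1..n) for subsampling scale s."""
    total = comb(n, s)
    return np.array([comb(n - i, s - 1) / total for i in range(1, n + 1)])


def dnn_predict(x_train, y_train, x0, s):
    """DNN estimate of the regression function at x0 (assumed closed form)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    if x_train.ndim == 1:
        x_train = x_train[:, None]
    dist = np.linalg.norm(x_train - np.atleast_1d(x0), axis=1)
    order = np.argsort(dist, kind="stable")  # nearest neighbor first
    return float(dnn_weights(len(y_train), s) @ y_train[order])


def two_scale_dnn(x_train, y_train, x0, s1, s2, dim):
    """Linear combination of two DNN estimates with scales s1 != s2.

    Solves w1 + w2 = 1 and w1 * s1**(-2/dim) + w2 * s2**(-2/dim) = 0,
    which is an assumption about the leading bias term.
    """
    a, b = s1 ** (-2.0 / dim), s2 ** (-2.0 / dim)
    w1 = b / (b - a)
    w2 = 1.0 - w1
    return (w1 * dnn_predict(x_train, y_train, x0, s1)
            + w2 * dnn_predict(x_train, y_train, x0, s2))


# Toy check on a small one-dimensional sample.
x_train = np.array([0.1, 0.4, 0.5, 0.8, 1.0])
y_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
estimate = two_scale_dnn(x_train, y_train, x0=0.45, s1=2, s2=4, dim=1)
```

With `s = 1` the weights are uniform and the estimate is the sample mean; with `s = n` all weight falls on the single nearest neighbor. One of the two combination weights is always negative, which is consistent with the abstract's remark about negative weights.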
Kim, Jae Kwang; Park, Seho; Chen, Yilin; Wu, Changbao
(Journal of the Royal Statistical Society Series A: Statistics in Society)
Abstract Analysis of non-probability survey samples requires auxiliary information at the population level. Such information may also be obtained from an existing probability survey sample from the same finite population. Mass imputation has been used in practice for combining non-probability and probability survey samples and making inferences on the parameters of interest using the information collected only in the non-probability sample for the study variables. Under the assumption that the conditional mean function from the non-probability sample can be transported to the probability sample, we establish the consistency of the mass imputation estimator and derive its asymptotic variance formula. Variance estimators are developed using either linearization or bootstrap. Finite sample performances of the mass imputation estimator are investigated through simulation studies. We also address important practical issues of the method through the analysis of a real-world non-probability survey sample collected by the Pew Research Centre.
Abstract While researchers commonly use the bootstrap to quantify the uncertainty of an estimator, it has been noticed that the standard bootstrap, in general, does not work for Chatterjee’s rank correlation. In this paper, we provide proof of this issue under an additional independence assumption, and complement our theory with simulation evidence for general settings. Chatterjee’s rank correlation thus falls into a category of statistics that are asymptotically normal, but bootstrap inconsistent. Valid inferential methods in this case are Chatterjee’s original proposal for testing independence and the analytic asymptotic variance estimator of Lin & Han (2022) for more general purposes. [Received on 5 April 2023. Editorial decision on 10 January 2024]
Fairfax, Sarah R; Yang, Shu
(Statistics in Medicine)
Longitudinal clinical trials for which recurrent event endpoints are of interest are commonly subject to missing event data. Primary analyses in such trials are often performed assuming events are missing at random, and sensitivity analyses are necessary to assess robustness of primary analysis conclusions to missing data assumptions. Control‐based imputation is an attractive approach in superiority trials for imposing conservative assumptions on how data may be missing not at random. A popular approach to implementing control‐based assumptions for recurrent events is multiple imputation (MI), but Rubin's variance estimator is often biased for the true sampling variability of the point estimator in the control‐based setting. We propose distributional imputation (DI) with a corresponding wild bootstrap variance estimation procedure for control‐based sensitivity analyses of recurrent events. We apply control‐based DI to a type I diabetes trial. In the application and simulation studies, DI produced more reasonable standard error estimates than MI with Rubin's combining rules in control‐based sensitivity analyses of recurrent events.
Yang, Shu, and Kim, Jae Kwang. Nearest Neighbor Imputation for General Parameter Estimation in Survey Sampling. Retrieved from https://par.nsf.gov/biblio/10089567. Advances in econometrics 39. Web. doi:10.1108/S0731-905320190000039012.
Yang, Shu, & Kim, Jae Kwang. Nearest Neighbor Imputation for General Parameter Estimation in Survey Sampling. Advances in econometrics, 39. Retrieved from https://par.nsf.gov/biblio/10089567. https://doi.org/10.1108/S0731-905320190000039012
@article{osti_10089567,
place = {Country unknown/Code not available},
title = {Nearest Neighbor Imputation for General Parameter Estimation in Survey Sampling},
url = {https://par.nsf.gov/biblio/10089567},
DOI = {10.1108/S0731-905320190000039012},
abstractNote = {Nearest neighbor imputation has a long tradition for handling item nonresponse in survey sampling. In this article, we study the asymptotic properties of the nearest neighbor imputation estimator for general population parameters, including population means, proportions and quantiles. For variance estimation, we propose novel replication variance estimation, which is asymptotically valid and straightforward to implement. The main idea is to construct replicates of the estimator directly based on its asymptotically linear terms, instead of individual records of variables. The simulation results show that nearest neighbor imputation and the proposed variance estimation provide valid inferences for general population parameters.},
journal = {Advances in econometrics},
volume = {39},
author = {Yang, Shu and Kim, Jae Kwang},
}