Title: Improving Survey Inference Using Administrative Records Without Releasing Individual‐Level Continuous Data
ABSTRACT Probability surveys are challenged by increasing nonresponse rates, resulting in biased statistical inference. Auxiliary information about populations can be used to reduce bias in estimation. Often continuous auxiliary variables in administrative records are first discretized before releasing to the public to avoid confidentiality breaches. This may weaken the utility of the administrative records in improving survey estimates, particularly when there is a strong relationship between continuous auxiliary information and the survey outcome. In this paper, we propose a two‐step strategy, where the confidential continuous auxiliary data in the population are first utilized to estimate the response propensity score of the survey sample by statistical agencies, which is then included in a modified population data for data users. In the second step, data users who do not have access to confidential continuous auxiliary data conduct predictive survey inference by including discretized continuous variables and the propensity score as predictors using splines in a Bayesian model. We show by simulation that the proposed method performs well, yielding more efficient estimates of population means with 95% credible intervals providing better coverage than alternative approaches. We illustrate the proposed method using the Ohio Army National Guard Mental Health Initiative (OHARNG‐MHI). The methods developed in this work are readily available in the R package AuxSurvey.
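The two‐step strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the AuxSurvey package. The synthetic population, the coefficient values, and the helper names `fit_logistic` and `user_design` are all hypothetical. The sketch also makes two simplifications that should be read as assumptions: every population unit is treated as sampled, so only nonresponse is modeled, and an ordinary least-squares fit with a piecewise-linear spline in the propensity score stands in for the Bayesian spline model of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic population (all values are illustrative assumptions).
N = 5000
x = rng.normal(size=N)                              # confidential continuous auxiliary variable
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=N)   # outcome, strongly related to x
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + x)))          # true response probability
responded = rng.random(N) < p_true                  # y is observed only where True


def fit_logistic(X, r, n_iter=25):
    """Fit logistic regression by Newton-Raphson and return the coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (r - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta


# Step 1 (agency side): estimate the response propensity from the continuous x,
# then release only a discretized x and the estimated propensity score.
X_agency = np.column_stack([np.ones(N), x])
beta_hat = fit_logistic(X_agency, responded.astype(float))
ps = 1.0 / (1.0 + np.exp(-X_agency @ beta_hat))
x_cat = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))  # categories 0..3


def user_design(x_cat, ps, knots):
    """Design matrix from released data only: category dummies plus a linear spline in ps."""
    dummies = (x_cat[:, None] == np.arange(1, 4)).astype(float)
    hinges = np.maximum(ps[:, None] - knots[None, :], 0.0)
    return np.column_stack([np.ones(len(ps)), dummies, ps, hinges])


# Step 2 (data user side): fit the outcome model on respondents,
# then predict the outcome for nonrespondents.
knots = np.quantile(ps, np.linspace(0.1, 0.9, 9))
D = user_design(x_cat, ps, knots)
coef, *_ = np.linalg.lstsq(D[responded], y[responded], rcond=None)
y_filled = np.where(responded, y, D @ coef)

pop_mean_est = y_filled.mean()      # predictive estimate of the population mean
naive_mean = y[responded].mean()    # respondent-only mean, ignores nonresponse
true_mean = y.mean()                # known here only because the data are simulated
print(pop_mean_est, naive_mean, true_mean)
```

In this toy setting the data user never touches `x` directly in step 2, yet the released propensity score carries much of the continuous information that the discretized categories lose, which is the intuition behind the proposal.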
Award ID(s):
2217456
PAR ID:
10637216
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Statistics in Medicine
Date Published:
Journal Name:
Statistics in Medicine
Volume:
43
Issue:
30
ISSN:
0277-6715
Page Range / eLocation ID:
5803 to 5813
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Many variables of interest in agricultural or economical surveys have skewed distributions and can equal zero. Our data are measures of sheet and rill erosion called Revised Universal Soil Loss Equation‐2 (RUSLE2). Small area estimates of mean RUSLE2 erosion are of interest. We use a zero‐inflated lognormal mixed effects model for small area estimation. The model combines a unit‐level lognormal model for the positive RUSLE2 responses with a unit‐level logistic mixed effects model for the binary indicator that the response is nonzero. In the Conservation Effects Assessment Project (CEAP) data, counties with a higher probability of nonzero responses also tend to have a higher mean among the positive RUSLE2 values. We capture this property of the data through an assumption that the pair of random effects for a county are correlated. We develop empirical Bayes (EB) small area predictors and a bootstrap estimator of the mean squared error (MSE). In simulations, the proposed predictor is superior to simpler alternatives. We then apply the method to construct EB predictors of mean RUSLE2 erosion for South Dakota counties. To obtain auxiliary variables for the population of cropland in South Dakota, we integrate a satellite‐derived land cover map with a geographic database of soil properties. We provide an R Shiny application called viscover (available at https://lyux.shinyapps.io/viscover/) to visualize the overlay operations required to construct the covariates. On the basis of bootstrap estimates of the mean square error, we conclude that the EB predictors of mean RUSLE2 erosion are superior to direct estimators.
  2. Abstract Often, government agencies and survey organizations know the population counts or percentages for some of the variables in a survey. These may be available from auxiliary sources, for example administrative databases or other high-quality surveys. We present and illustrate a model-based framework for leveraging such auxiliary marginal information when handling unit and item nonresponse. We show how one can use the margins to specify different missingness mechanisms for each type of nonresponse. We use the framework to impute missing values in voter turnout in a subset of data from the US Current Population Survey. In doing so, we examine the sensitivity of results to different assumptions about the unit and item nonresponse. 
  3. Abstract Propensity score weighting is a tool for causal inference to adjust for measured confounders in observational studies. In practice, data often present complex structures, such as clustering, which make propensity score modeling and estimation challenging. In addition, for clustered data, there may be unmeasured cluster-level covariates that are related to both the treatment assignment and outcome. When such unmeasured cluster-specific confounders exist and are omitted in the propensity score model, the subsequent propensity score adjustment may be biased. In this article, we propose a calibration technique for propensity score estimation under the latent ignorable treatment assignment mechanism, i.e., the treatment-outcome relationship is unconfounded given the observed covariates and the latent cluster-specific confounders. We impose novel balance constraints which imply exact balance of the observed confounders and the unobserved cluster-level confounders between the treatment groups. We show that the proposed calibrated propensity score weighting estimator is doubly robust in that it is consistent for the average treatment effect if either the propensity score model is correctly specified or the outcome follows a linear mixed effects model. Moreover, the proposed weighting method can be combined with sampling weights for an integrated solution to handle confounding and sampling designs for causal inference with clustered survey data. In simulation studies, we show that the proposed estimator is superior to other competitors. We estimate the effect of School Body Mass Index Screening on prevalence of overweight and obesity for elementary schools in Pennsylvania.
  4. Abstract Resource selection functions (RSFs) are among the most commonly used statistical tools in both basic and applied animal ecology. They are typically parameterized using animal tracking data, and advances in animal tracking technology have led to increasing levels of autocorrelation between locations in such data sets. Because RSFs assume that data are independent and identically distributed, such autocorrelation can cause misleadingly narrow confidence intervals and biased parameter estimates. Data thinning, generalized estimating equations and step selection functions (SSFs) have been suggested as techniques for mitigating the statistical problems posed by autocorrelation, but these approaches have notable limitations that include statistical inefficiency, unclear or arbitrary targets for adequate levels of statistical independence, constraints in input data and (in the case of SSFs) scale‐dependent inference. To remedy these problems, we introduce a method for likelihood weighting of animal locations to mitigate the negative consequences of autocorrelation on RSFs. In this study, we demonstrate that this method weights each observed location in an animal's movement track according to its level of non‐independence, expanding confidence intervals and reducing bias that can arise when there are missing data in the movement track. Ecologists and conservation biologists can use this method to improve the quality of inferences derived from RSFs. We also provide a complete, annotated analytical workflow to help new users apply our method to their own animal tracking data using the ctmm R package.
  5. Nonparametric model-assisted estimators have been proposed to improve estimates of finite population parameters. Flexible nonparametric models provide more reliable estimators when a parametric model is misspecified. In this article, we propose an information criterion to select appropriate auxiliary variables to use in an additive model-assisted method. We approximate the additive nonparametric components using polynomial splines and extend the Bayesian Information Criterion (BIC) for finite populations. By removing irrelevant auxiliary variables, our method reduces model complexity and decreases estimator variance. We establish that the proposed BIC is asymptotically consistent in selecting the important explanatory variables when the true model is additive without interactions, a result supported by our numerical study. Our proposed method is easier to implement and better justified theoretically than the existing method proposed in the literature. 