Title: Multiple Imputation with Massive Data: An Application to the Panel Study of Income Dynamics
Abstract
Multiple imputation (MI) is a popular and well-established method for handling missing data in multivariate data sets, but its practicality for use in massive and complex data sets has been questioned. One such data set is the Panel Study of Income Dynamics (PSID), a longstanding and extensive survey of household income and wealth in the United States. Missing data for this survey are currently handled using traditional hot deck methods because of its simple implementation; however, the univariate hot deck results in large random wealth fluctuations. MI is effective but faces operational challenges. We use a sequential regression/chained-equation approach, using the software IVEware, to multiply impute cross-sectional wealth data in the 2013 PSID, and compare analyses of the resulting imputed data with those from the current hot deck approach. Practical difficulties, such as non-normally distributed variables, skip patterns, categorical variables with many levels, and multicollinearity, are described together with our approaches to overcoming them. We evaluate the imputation quality and validity with internal diagnostics and external benchmarking data. MI produces improvements over the existing hot deck approach by helping preserve correlation structures, such as the associations between PSID wealth components and the relationships between household net worth and sociodemographic factors, and facilitates general-purpose completed-data analyses. MI incorporates highly predictive covariates into imputation models and increases efficiency. We recommend the practical implementation of MI and expect greater gains when the fraction of missing information is large.
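To make the sequential regression/chained-equation idea concrete, the sketch below runs chained-equation multiple imputation on a small synthetic table. It is a minimal Python illustration using scikit-learn's experimental IterativeImputer as a stand-in for IVEware; the variable names, distributions, and missingness pattern are hypothetical, not PSID fields or the authors' models.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500
# Hypothetical, skewed wealth-style components (not actual PSID variables).
wealth = pd.DataFrame({
    "home_equity": rng.lognormal(11.0, 1.0, n),
    "stocks": rng.lognormal(9.0, 1.5, n),
    "age_head": rng.integers(25, 85, n).astype(float),
})
wealth_missing = wealth.mask(rng.random((n, 3)) < 0.2)  # ~20% missing at random

m = 5  # number of completed data sets
completed_sets = []
for k in range(m):
    # sample_posterior=True draws from each conditional model, so the m
    # imputations differ, as multiple imputation requires.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=k)
    completed = pd.DataFrame(imputer.fit_transform(wealth_missing),
                             columns=wealth.columns)
    completed_sets.append(completed)

# Each completed data set is analyzed separately; the results are then
# combined with Rubin's rules.
print(np.mean([d["stocks"].mean() for d in completed_sets]))
```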
Award ID(s):
2042875
NSF-PAR ID:
10335599
Author(s) / Creator(s):
Date Published:
Journal Name:
Journal of Survey Statistics and Methodology
ISSN:
2325-0984
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Machine learning (ML) advancements hinge upon data, the vital ingredient for training. Statistically curing missing data is called imputation, and there are many imputation theories and tools, but they often require difficult statistical and/or discipline-specific assumptions, and general tools capable of curing large data are lacking. Fractional hot deck imputation (FHDI) can cure data by filling nonresponses with observed values (thus, hot deck) without resorting to such assumptions. This review paper summarizes how FHDI has evolved into an ultra-data-oriented parallel version (UP-FHDI). Here, ultra data have concurrently large instances (big-n) and high dimensionality (big-p). The evolution is made possible by specialized parallelism and a fast variance estimation technique. Validations with scientific and engineering data confirm that UP-FHDI can cure ultra data (p > 10,000 and n > 1M), and the cured data sets can improve the prediction accuracy of subsequent ML. The evolved FHDI will help promote reliable ML with cured big data.
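As a rough illustration of the hot-deck idea behind FHDI, the toy sketch below assigns each nonrespondent several observed donor values with equal fractional weights. It assumes simple random donor selection and is not the UP-FHDI software, which forms imputation cells and provides variance estimation.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(50.0, 10.0, 200)            # hypothetical survey variable
observed = rng.random(200) > 0.25          # ~25% nonresponse
donor_pool = y[observed]

k = 3                                      # donors per recipient (illustrative choice)
fractional_imputations = []                # (recipient index, donor values, weights)
for i in np.where(~observed)[0]:
    donors = rng.choice(donor_pool, size=k, replace=False)
    weights = np.full(k, 1.0 / k)          # equal fractional weights
    fractional_imputations.append((i, donors, weights))

# Weighted completed-data mean: observed values carry weight 1,
# each donated value carries its fractional weight.
total = y[observed].sum() + sum((d * w).sum() for _, d, w in fractional_imputations)
count = observed.sum() + sum(w.sum() for _, _, w in fractional_imputations)
print(total / count)
```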

     
  2. Assessing several individuals intensively over time yields intensive longitudinal data (ILD). Even though ILD provide rich information, they also bring other data analytic challenges. One of these is the increased occurrence of missingness with increased study length, possibly under non-ignorable missingness scenarios. Multiple imputation (MI) handles missing data by creating several imputed data sets and pooling the estimation results across imputed data sets to yield final estimates for inferential purposes. In this article, we introduce dynr.mi(), a function in the R package Dynamic Modeling in R (dynr). The package dynr provides a suite of fast and accessible functions for estimating and visualizing the results from fitting linear and nonlinear dynamic systems models in discrete as well as continuous time. By integrating the estimation functions in dynr and the MI procedures available from the R package Multivariate Imputation by Chained Equations (MICE), the dynr.mi() routine is designed to handle possibly non-ignorable missingness in the dependent variables and/or covariates in a user-specified dynamic systems model via MI, with convergence diagnostic checks. We utilized dynr.mi() to examine, in the context of a vector autoregressive model, the relationships among individuals' ambulatory physiological measures and self-reported affect valence and arousal. The results from MI were compared to those from listwise deletion of entries with missingness in the covariates. When we determined the number of iterations based on the convergence diagnostics available from dynr.mi(), differences in the statistical significance of the covariate parameters were observed between the listwise deletion and MI approaches. These results underscore the importance of considering diagnostic information in the implementation of MI procedures.
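The combining step that an MI workflow performs after estimating the model on each imputed data set follows Rubin's rules. A minimal Python sketch, assuming hypothetical per-imputation estimates and variances:

```python
import numpy as np

# Hypothetical per-imputation point estimates and their squared standard errors.
q_hat = np.array([0.42, 0.45, 0.40, 0.47, 0.44])
u_hat = np.array([0.010, 0.011, 0.009, 0.012, 0.010])

m = len(q_hat)
q_bar = q_hat.mean()                 # pooled point estimate
u_bar = u_hat.mean()                 # average within-imputation variance
b = q_hat.var(ddof=1)                # between-imputation variance
t = u_bar + (1.0 + 1.0 / m) * b      # total variance (Rubin, 1987)

print(f"pooled estimate = {q_bar:.3f}, pooled SE = {np.sqrt(t):.3f}")
```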
  3. Abstract

    The Rocky Mountain Biological Laboratory (RMBL; Colorado, USA) is the site for many research projects spanning decades, taxa, and research fields from ecology to evolutionary biology to hydrology and beyond. Climate is the focus of much of this work and provides important context for the rest. There are five major sources of data on climate in the RMBL vicinity, each with unique variables, formats, and temporal coverage. These data sources include (1) RMBL resident billy barr, (2) the National Oceanic and Atmospheric Administration (NOAA), (3) the United States Geological Survey (USGS), (4) the United States Department of Agriculture (USDA), and (5) Oregon State University's PRISM Climate Group. Both the NOAA and the USGS have automated meteorological stations in Crested Butte, CO, ~10 km from the RMBL, while the USDA has an automated meteorological station on Snodgrass Mountain, ~2.5 km from the RMBL. Each of these data sets has unique spatial and temporal coverage and formats. Despite the wealth of work on climate‐related questions using data from the RMBL, previous researchers have each had to access and format their own climate records, make decisions about handling missing data, and recreate data summaries. Here we provide a single curated climate data set of daily observations covering the years 1975–2022 that blends information from all five sources and includes annotated scripts documenting decisions for handling data. These synthesized climate data will facilitate future research, reduce duplication of effort, and increase our ability to compare results across studies. The data set includes information on precipitation (water and snow), snowmelt date, temperature, wind speed, soil moisture and temperature, and stream flows, all publicly available from a combination of sources. In addition to the formatted raw data, we provide several new variables that are commonly used in ecological analyses, including growing degree days, growing season length, a cold severity index, hard frost days, an index of El Niño‐Southern Oscillation, and aridity (standardized precipitation evapotranspiration index). These new variables are calculated from the daily weather records. As appropriate, data are also presented as minima, maxima, means, residuals, and cumulative measures for various time scales including days, months, seasons, and years. The RMBL is a global research hub. Scientists on site at the RMBL come from many countries and produce about 50 peer‐reviewed publications each year. Researchers from around the world also routinely use data from the RMBL for synthetic work, and educators around the United States use data from the RMBL for teaching modules. This curated and combined data set will be useful to a wide audience. Along with the synthesized combined data set we include the raw data and the R code for cleaning the raw data and creating the monthly and yearly data sets, which facilitate adding additional years or data using the same standardized protocols. No copyright or proprietary restrictions are associated with using this data set; please cite this data paper when the data are used in publications or scientific events.
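As one example of the derived variables, growing degree days can be accumulated from the daily temperature records. The sketch below uses a common simple-average formulation with an assumed 5 °C base temperature; the data set documents its own choices in the accompanying R scripts, and this Python sketch is only illustrative.

```python
import numpy as np

# Hypothetical daily maxima and minima in degrees Celsius.
t_max = np.array([12.0, 15.5, 8.0, 20.1])
t_min = np.array([2.0, 4.5, -1.0, 9.3])
t_base = 5.0  # assumed base temperature

daily_gdd = np.maximum((t_max + t_min) / 2.0 - t_base, 0.0)
print(daily_gdd.cumsum()[-1])  # accumulated growing degree days
```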

     
  4. Social science approaches to missing values predict avoided, unrequested, or lost information from dense data sets, typically surveys. The authors propose a matrix factorization approach to missing data imputation that (1) identifies underlying factors to model similarities across respondents and responses and (2) regularizes across factors to reduce their overinfluence for optimal data reconstruction. This approach may enable social scientists to draw new conclusions from sparse data sets with a large number of features, for example, historical or archival sources, online surveys with high attrition rates, or data sets created from Web scraping, which confound traditional imputation techniques. The authors introduce matrix factorization techniques and detail their probabilistic interpretation, and they demonstrate these techniques’ consistency with Rubin’s multiple imputation framework. The authors show via simulations using artificial data and data from real-world subsets of the General Social Survey and National Longitudinal Study of Youth cases for which matrix factorization techniques may be preferred. These findings recommend the use of matrix factorization for data reconstruction in several settings, particularly when data are Boolean and categorical and when large proportions of the data are missing.
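A minimal sketch of the underlying idea, regularized low-rank matrix factorization fit to the observed entries only and then used to reconstruct the missing cells, is shown below. The rank, penalty, and optimization settings are arbitrary illustrative choices rather than the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
n_resp, n_items, rank = 100, 20, 3
truth = rng.normal(size=(n_resp, rank)) @ rng.normal(size=(rank, n_items))
observed = rng.random(truth.shape) > 0.4       # ~40% of entries missing
x = np.where(observed, truth, np.nan)

u = 0.1 * rng.normal(size=(n_resp, rank))      # respondent factors
v = 0.1 * rng.normal(size=(n_items, rank))     # response (item) factors
lam, lr = 0.1, 0.01                            # L2 penalty and step size

for _ in range(500):
    resid = np.where(observed, x - u @ v.T, 0.0)   # error on observed cells only
    u += lr * (resid @ v - lam * u)                # gradient step with shrinkage
    v += lr * (resid.T @ u - lam * v)

x_completed = np.where(observed, x, u @ v.T)       # fill the missing cells
print(np.sqrt(np.mean((x_completed[~observed] - truth[~observed]) ** 2)))
```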

     
  5. A key issue for panel surveys is the relationship between changes in respondent burden and resistance or attrition in future waves. In this chapter, the authors use data from multiple waves of the Panel Study of Income Dynamics (PSID) from 1997 to 2015 to examine the effects of being invited to take part in a major supplemental study to PSID, namely the 1997 PSID Child Development Supplement (CDS), on attrition and on various other measures of respondent cooperation. They describe their conceptual framework and previous research, as well as the data and methods. The PSID is the world's longest-running household panel survey. PSID has a number of supplemental studies, which began in 1997 with the original CDS. To describe and analyze the effects of CDS on sample attrition in PSID, the authors use survival curves and univariate and multivariate discrete-time hazard models.
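A schematic sketch of a univariate discrete-time hazard model of attrition is shown below: households are expanded to household-wave (person-period) records, and a logistic regression of the attrition indicator on wave dummies and an invitation flag is fit. The data and variable names are hypothetical, not the PSID/CDS analysis itself.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for hh in range(300):
    invited = rng.random() < 0.5               # hypothetical CDS-style invitation flag
    for wave in range(1, 6):                   # follow each household up to five waves
        p = max(0.05 + 0.02 * wave - (0.03 if invited else 0.0), 0.01)
        attrit = rng.random() < p
        rows.append({"wave": wave, "invited": int(invited), "attrit": int(attrit)})
        if attrit:
            break                              # no rows after the household attrites

person_period = pd.DataFrame(rows)
model = smf.logit("attrit ~ C(wave) + invited", data=person_period).fit(disp=False)
print(model.params)
```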