
Title: Multiple Imputation with Massive Data: An Application to the Panel Study of Income Dynamics
Abstract Multiple imputation (MI) is a popular and well-established method for handling missing data in multivariate data sets, but its practicality for use in massive and complex data sets has been questioned. One such data set is the Panel Study of Income Dynamics (PSID), a longstanding and extensive survey of household income and wealth in the United States. Missing data for this survey are currently handled using traditional hot deck methods because of the simple implementation; however, the univariate hot deck results in large random wealth fluctuations. MI is effective but faced with operational challenges. We use a sequential regression/chained-equation approach, using the software IVEware, to multiply impute cross-sectional wealth data in the 2013 PSID, and compare analyses of the resulting imputed data with those from the current hot deck approach. Practical difficulties, such as non-normally distributed variables, skip patterns, categorical variables with many levels, and multicollinearity, are described together with our approaches to overcoming them. We evaluate the imputation quality and validity with internal diagnostics and external benchmarking data. MI produces improvements over the existing hot deck approach by helping preserve correlation structures, such as the associations between PSID wealth components and the relationships between the household net worth and sociodemographic factors, and facilitates completed data analyses with general purposes. MI incorporates highly predictive covariates into imputation models and increases efficiency. We recommend the practical implementation of MI and expect greater gains when the fraction of missing information is large.
Award ID(s):
2042875
NSF-PAR ID:
10335599
Journal Name:
Journal of Survey Statistics and Methodology
ISSN:
2325-0984
Sponsoring Org:
National Science Foundation
More Like this
  1. Assessing several individuals intensively over time yields intensive longitudinal data (ILD). Even though ILD provide rich information, they also bring other data analytic challenges. One of these is the increased occurrence of missingness with increased study length, possibly under non-ignorable missingness scenarios. Multiple imputation (MI) handles missing data by creating several imputed data sets, and pooling the estimation results across imputed data sets to yield final estimates for inferential purposes. In this article, we introduce dynr.mi(), a function in the R package, Dynamic Modeling in R (dynr). The package dynr provides a suite of fast and accessible functions for estimating and visualizing the results from fitting linear and nonlinear dynamic systems models in discrete as well as continuous time. By integrating the estimation functions in dynr and the MI procedures available from the R package, Multivariate Imputation by Chained Equations (MICE), the dynr.mi() routine is designed to handle possibly non-ignorable missingness in the dependent variables and/or covariates in a user-specified dynamic systems model via MI, with convergence diagnostic check. We utilized dynr.mi() to examine, in the context of a vector autoregressive model, the relationships among individuals’ ambulatory physiological measures, and self-report affect valence and arousal. The results from MI were compared to those from listwise deletion of entries with missingness in the covariates. When we determined the number of iterations based on the convergence diagnostics available from dynr.mi(), differences in the statistical significance of the covariate parameters were observed between the listwise deletion and MI approaches. These results underscore the importance of considering diagnostic information in the implementation of MI procedures.
  2. Social science approaches to missing values predict avoided, unrequested, or lost information from dense data sets, typically surveys. The authors propose a matrix factorization approach to missing data imputation that (1) identifies underlying factors to model similarities across respondents and responses and (2) regularizes across factors to reduce their overinfluence for optimal data reconstruction. This approach may enable social scientists to draw new conclusions from sparse data sets with a large number of features, for example, historical or archival sources, online surveys with high attrition rates, or data sets created from Web scraping, which confound traditional imputation techniques. The authors introduce matrix factorization techniques and detail their probabilistic interpretation, and they demonstrate these techniques’ consistency with Rubin’s multiple imputation framework. The authors show via simulations using artificial data and data from real-world subsets of the General Social Survey and National Longitudinal Study of Youth cases for which matrix factorization techniques may be preferred. These findings recommend the use of matrix factorization for data reconstruction in several settings, particularly when data are Boolean and categorical and when large proportions of the data are missing.

  3. A key issue for panel surveys is the relationship between changes in respondent burden and resistance or attrition in future waves. In this chapter, the authors use data from multiple waves of the Panel Study of Income Dynamics (PSID) from 1997 to 2015 to examine the effects on attrition and on various other measures of respondent cooperation of being invited to take part in a major supplemental study to PSID, namely the 1997 PSID Child Development Supplement (CDS). They describe their conceptual framework and previous research. The authors also describe the data and methods. The PSID is the world’s longest-running household panel survey. PSID has a number of supplemental studies, which began in 1997 with the original CDS. To describe and analyse the effects of CDS on sample attrition in PSID, the authors also use survival curves and univariate and multivariate discrete time hazard models.
  4. Abstract In recent years, household surveys have expended significant effort to counter well-documented increases in direct refusals and greater difficulty contacting survey respondents. A substantial amount of fieldwork effort in panel surveys using telephone interviewing is devoted to the task of contacting the respondent to schedule the day and time of the interview. Higher fieldwork effort leads to greater costs and is associated with lower response rates. A new approach was experimentally evaluated in the 2017 wave of the Panel Study of Income Dynamics (PSID) Transition into Adulthood Supplement (TAS) that allowed a randomly selected subset of respondents to choose their own day and time of their telephone interview through the use of an online appointment scheduler. TAS is a nationally representative study of US young adults aged 18–28 years embedded within the world’s longest-running panel study, the PSID. This paper experimentally evaluates the effect of offering the online appointment scheduler on fieldwork outcomes, including number of interviewer contact attempts and interview sessions, number of days to complete the interview, and response rates. We describe panel study members’ characteristics associated with uptake of the online scheduler and examine differences in the effectiveness of the treatment across subgroups. Finally, potential cost savings of fieldwork effort due to the online appointment scheduler are evaluated.
  5. Abstract

    As researchers collect large amounts of data in the social sciences through household surveys, challenges may arise in how best to analyze such datasets, especially where motivating theories are unclear or conflicting. New analytical methods may be necessary to extract information from these datasets. Machine learning techniques are promising methods for identifying patterns in large datasets, but have not yet been widely used to identify important variables in social surveys with many questions. To demonstrate the potential of machine learning to analyze large social datasets, we apply machine learning techniques to the study of migration in Bangladesh. The complexity of migration decisions makes them suitable for analysis with machine learning techniques, which enable pattern identification in large datasets with many covariates. In this paper, we apply random forest methods to analyzing a large survey which captures approximately 2000 variables from approximately 1700 households in southwestern Bangladesh. Our analysis ranked the covariates in the dataset in terms of their predictive power for migration decisions. The results identified the most important covariates, but there exists a tradeoff between predictive ability and interpretability. To address this tradeoff, random forests and other machine learning algorithms may be especially useful in combination with more traditional regression methods. To develop insights into how the important variables identified by the random forest algorithm impact migration, we performed a survival analysis of household time to first migration. With this combined analysis, we found that variables related to wealth and household composition are important predictors of migration. Such multi-methods approaches may help to shed light on factors contributing to migration and non-migration.
