Machine learning (ML) advancements hinge upon data, the vital ingredient for training. Statistically curing missing data is called imputation, and many imputation theories and tools exist. But they often require difficult statistical and/or discipline-specific assumptions, and general tools capable of curing large data are lacking. Fractional hot deck imputation (FHDI) can cure data by filling nonresponses with observed values (thus, hot deck) without resorting to such assumptions. This review paper summarizes how FHDI has evolved into an ultra-data-oriented parallel version (UP-FHDI). Here, ultra data have concurrently large instances (big-n) and high dimensionality (big-p). The evolution is made possible by specialized parallelism and a fast variance estimation technique. Validations with scientific and engineering data confirm that UP-FHDI can cure ultra data (p > 10,000 and n > 1M), and the cured data sets can improve the prediction accuracy of subsequent ML. The evolved FHDI will help promote reliable ML with cured big data.
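The hot-deck idea at the core of FHDI can be illustrated with a minimal sketch. The code below is an illustrative random hot deck in Python (the function name is hypothetical), not the FHDI/UP-FHDI algorithm itself, which additionally assigns multiple donors per recipient with fractional weights and estimates variance:

```python
import numpy as np

def simple_hot_deck(X, rng=None):
    """Minimal random hot deck: fill each missing cell in a column with a
    value drawn from that column's observed (donor) values. Illustrative
    only; FHDI assigns multiple donors with fractional weights."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        donors = col[~missing]          # observed values act as donors
        if missing.any() and donors.size:
            col[missing] = rng.choice(donors, size=missing.sum())
    return X

# toy example: two columns with nonresponse
X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 7.0]])
X_cured = simple_hot_deck(X, rng=0)
assert not np.isnan(X_cured).any()      # every nonresponse is filled
```

Because every imputed value is an observed value, the cured data stay within the empirical support of each variable, which is the practical appeal of hot-deck methods.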
                            Multiple Imputation with Massive Data: An Application to the Panel Study of Income Dynamics
                        
                    
    
Abstract
Multiple imputation (MI) is a popular and well-established method for handling missing data in multivariate data sets, but its practicality for use in massive and complex data sets has been questioned. One such data set is the Panel Study of Income Dynamics (PSID), a longstanding and extensive survey of household income and wealth in the United States. Missing data for this survey are currently handled using traditional hot deck methods because of their simple implementation; however, the univariate hot deck results in large random wealth fluctuations. MI is effective but faces operational challenges. We use a sequential regression/chained-equation approach, implemented in the software IVEware, to multiply impute cross-sectional wealth data in the 2013 PSID, and compare analyses of the resulting imputed data with those from the current hot deck approach. Practical difficulties, such as non-normally distributed variables, skip patterns, categorical variables with many levels, and multicollinearity, are described together with our approaches to overcoming them. We evaluate the imputation quality and validity with internal diagnostics and external benchmarking data. MI improves on the existing hot deck approach by helping preserve correlation structures, such as the associations between PSID wealth components and the relationships between household net worth and sociodemographic factors, and facilitates completed-data analyses for general purposes. MI incorporates highly predictive covariates into imputation models and increases efficiency. We recommend the practical implementation of MI and expect greater gains when the fraction of missing information is large.
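The sequential regression/chained-equation idea can be sketched as one sweep over the variables, each regressed on the others. The code below is an illustrative linear-model version (names and hyperparameters hypothetical); IVEware additionally supports model types for categorical variables, bounds, and skip patterns, and iterates the sweep to convergence:

```python
import numpy as np

def chained_equation_pass(X, rng=None):
    """One sweep of a sequential-regression (chained-equations) imputer:
    each column with missing values is regressed on the other columns
    and missing entries are replaced by prediction plus Gaussian noise.
    Illustrative sketch only, using ordinary least squares throughout."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    # start from column-mean fills so every regression has predictors
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])
    for j in range(X.shape[1]):
        rows = miss[:, j]
        if not rows.any():
            continue
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])   # add intercept
        beta, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
        resid = X[~rows, j] - A[~rows] @ beta
        # draw imputations from the fitted model, not just the mean,
        # so between-imputation variability is preserved
        X[rows, j] = A[rows] @ beta + rng.normal(0, resid.std(), rows.sum())
    return X
```

Drawing from the predictive distribution rather than plugging in the point prediction is what lets repeated passes produce distinct completed data sets for MI.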
- Award ID(s): 2042875
- PAR ID: 10335599
- Date Published:
- Journal Name: Journal of Survey Statistics and Methodology
- ISSN: 2325-0984
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- A key issue for panel surveys is the relationship between changes in respondent burden and resistance or attrition in future waves. In this chapter, the authors use data from multiple waves of the Panel Study of Income Dynamics (PSID) from 1997 to 2015 to examine the effects on attrition, and on various other measures of respondent cooperation, of being invited to take part in a major supplemental study to PSID, namely the 1997 PSID Child Development Supplement (CDS). They describe their conceptual framework and previous research, as well as their data and methods. The PSID is the world's longest-running household panel survey. PSID has a number of supplemental studies, which began in 1997 with the original CDS. To describe and analyse the effects of CDS on sample attrition in PSID, the authors also use survival curves and univariate and multivariate discrete time hazard models.
- Longitudinal clinical trials for which recurrent events endpoints are of interest are commonly subject to missing event data. Primary analyses in such trials are often performed assuming events are missing at random, and sensitivity analyses are necessary to assess robustness of primary analysis conclusions to missing data assumptions. Control-based imputation is an attractive approach in superiority trials for imposing conservative assumptions on how data may be missing not at random. A popular approach to implementing control-based assumptions for recurrent events is multiple imputation (MI), but Rubin's variance estimator is often biased for the true sampling variability of the point estimator in the control-based setting. We propose distributional imputation (DI) with a corresponding wild bootstrap variance estimation procedure for control-based sensitivity analyses of recurrent events. We apply control-based DI to a type I diabetes trial. In the application and simulation studies, DI produced more reasonable standard error estimates than MI with Rubin's combining rules in control-based sensitivity analyses of recurrent events.
- Social science approaches to missing values predict avoided, unrequested, or lost information from dense data sets, typically surveys. The authors propose a matrix factorization approach to missing data imputation that (1) identifies underlying factors to model similarities across respondents and responses and (2) regularizes across factors to reduce their overinfluence for optimal data reconstruction. This approach may enable social scientists to draw new conclusions from sparse data sets with a large number of features, for example, historical or archival sources, online surveys with high attrition rates, or data sets created from Web scraping, which confound traditional imputation techniques. The authors introduce matrix factorization techniques and detail their probabilistic interpretation, and they demonstrate these techniques' consistency with Rubin's multiple imputation framework. The authors show via simulations using artificial data and data from real-world subsets of the General Social Survey and National Longitudinal Study of Youth cases for which matrix factorization techniques may be preferred. These findings recommend the use of matrix factorization for data reconstruction in several settings, particularly when data are Boolean and categorical and when large proportions of the data are missing.
- Deployment of the recently licensed tetravalent dengue vaccine based on a chimeric yellow fever virus, CYD-TDV, requires understanding of how the risk of dengue disease in vaccine recipients depends jointly on a host biomarker measured after vaccination (neutralization titre of neutralizing antibodies) and on a 'mark' feature of the dengue disease failure event (the amino acid sequence distance of the dengue virus to the dengue sequence represented in the vaccine). The CYD14 phase 3 trial of CYD-TDV measured neutralizing antibodies via case-cohort sampling and the mark in dengue disease failure events, with about a third of marks missing. We addressed the question of interest by developing inferential procedures for the stratified mark-specific proportional hazards model with missing covariates and missing marks. Two hybrid approaches are investigated that leverage both augmented inverse probability weighting and nearest neighbourhood hot deck multiple imputation. The two approaches differ in how the imputed marks are pooled in estimation. Our investigation shows that nearest neighbourhood hot deck imputation can lead to biased estimation without properly selected neighbourhoods. Simulations show that the hybrid methods developed perform well with unbiased nearest neighbourhood hot deck imputations from proper neighbourhood selection. The new methods applied to CYD14 show that neutralizing antibody level is strongly inversely associated with the risk of dengue disease in vaccine recipients, more strongly against dengue viruses with shorter distances.
 An official website of the United States government