Title: Review: Evolution of Fractional Hot Deck Imputation for Curing Incomplete Data-From Small to Ultra Large Sizes
Machine learning (ML) advancements hinge upon data, the vital ingredient for training. Statistically curing missing data is called imputation, and there are many imputation theories and tools. But they often require difficult statistical and/or discipline-specific assumptions, and general tools capable of curing large data are lacking. Fractional hot deck imputation (FHDI) can cure data by filling nonresponses with observed values (thus, hot deck) without resorting to assumptions. This review paper summarizes how FHDI evolved into an ultra data-oriented parallel version (UP-FHDI). Here, ultra data have concurrently large instances (big-n) and high dimensionality (big-p). The evolution is made possible by specialized parallelism and a fast variance estimation technique. Validations with scientific and engineering data confirm that UP-FHDI can cure ultra data (p > 10,000 and n > 1M), and the cured data sets can improve the prediction accuracy of subsequent ML. The evolved FHDI will help promote reliable ML with cured big data.
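To make the idea of "filling nonresponses with observed values" concrete, here is a minimal toy sketch in Python. It is not the authors' FHDI or UP-FHDI implementation: it assumes a single numeric variable, a pre-defined imputation cell for each record, and equal fractional weights across donors. The function names `fractional_hot_deck` and `weighted_mean` are hypothetical. The multivariate handling, parallelism, and variance estimation mentioned in the abstract are not attempted.

```python
from collections import defaultdict


def fractional_hot_deck(records):
    """Toy fractional hot deck imputation (illustrative assumption, not UP-FHDI).

    records: list of (cell, y) pairs, where y is None for a nonresponse.
    Returns rows of (record_index, cell, value, fractional_weight).
    """
    # Collect observed values (donors) within each imputation cell.
    donors = defaultdict(list)
    for cell, y in records:
        if y is not None:
            donors[cell].append(y)

    rows = []
    for i, (cell, y) in enumerate(records):
        if y is not None:
            # Respondents keep their own value with full weight.
            rows.append((i, cell, y, 1.0))
            continue
        pool = donors.get(cell, [])
        if not pool:
            raise ValueError(f"no donors available in cell {cell!r}")
        # Each donor value is assigned an equal fraction of the
        # recipient's weight, so the fractions sum to 1 per recipient.
        w = 1.0 / len(pool)
        for d in pool:
            rows.append((i, cell, d, w))
    return rows


def weighted_mean(rows):
    """Mean of the imputed data set using the fractional weights."""
    total_weight = sum(r[3] for r in rows)
    return sum(r[2] * r[3] for r in rows) / total_weight


if __name__ == "__main__":
    data = [("a", 1.0), ("a", 3.0), ("a", None), ("b", 10.0), ("b", None)]
    imputed = fractional_hot_deck(data)
    for row in imputed:
        print(row)
    print("weighted mean:", weighted_mean(imputed))
```

In this sketch each missing record is replaced by several donor rows rather than one randomly chosen value, which is what distinguishes a fractional hot deck from a single-donor hot deck.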
Award ID(s):
1931380
PAR ID:
10447116
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
International Conference on Computer Science and Information Technology
Page Range / eLocation ID:
191 to 204
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Multiple imputation (MI) is a popular and well-established method for handling missing data in multivariate data sets, but its practicality for use in massive and complex data sets has been questioned. One such data set is the Panel Study of Income Dynamics (PSID), a longstanding and extensive survey of household income and wealth in the United States. Missing data for this survey are currently handled using traditional hot deck methods because of the simple implementation; however, the univariate hot deck results in large random wealth fluctuations. MI is effective but faced with operational challenges. We use a sequential regression/chained-equation approach, using the software IVEware, to multiply impute cross-sectional wealth data in the 2013 PSID, and compare analyses of the resulting imputed data with those from the current hot deck approach. Practical difficulties, such as non-normally distributed variables, skip patterns, categorical variables with many levels, and multicollinearity, are described together with our approaches to overcoming them. We evaluate the imputation quality and validity with internal diagnostics and external benchmarking data. MI produces improvements over the existing hot deck approach by helping preserve correlation structures, such as the associations between PSID wealth components and the relationships between the household net worth and sociodemographic factors, and facilitates completed data analyses with general purposes. MI incorporates highly predictive covariates into imputation models and increases efficiency. We recommend the practical implementation of MI and expect greater gains when the fraction of missing information is large. 
  2. Summary Deployment of the recently licensed tetravalent dengue vaccine based on a chimeric yellow fever virus, CYD-TDV, requires understanding of how the risk of dengue disease in vaccine recipients depends jointly on a host biomarker measured after vaccination (neutralization titre—neutralizing antibodies) and on a ‘mark’ feature of the dengue disease failure event (the amino acid sequence distance of the dengue virus to the dengue sequence represented in the vaccine). The CYD14 phase 3 trial of CYD-TDV measured neutralizing antibodies via case–cohort sampling and the mark in dengue disease failure events, with about a third missing marks. We addressed the question of interest by developing inferential procedures for the stratified mark-specific proportional hazards model with missing covariates and missing marks. Two hybrid approaches are investigated that leverage both augmented inverse probability weighting and nearest neighbourhood hot deck multiple imputation. The two approaches differ in how the imputed marks are pooled in estimation. Our investigation shows that nearest neighbourhood hot deck imputation can lead to biased estimation without properly selected neighbourhoods. Simulations show that the hybrid methods developed perform well with unbiased nearest neighbourhood hot deck imputations from proper neighbourhood selection. The new methods applied to CYD14 show that neutralizing antibody level is strongly inversely associated with the risk of dengue disease in vaccine recipients, more strongly against dengue viruses with shorter distances. 
  3. Ozden, O. (Ed.)
    Adhesively bonded composite joints can help reduce weight in structures and avoid material damage from fastener holes, but stress concentrations formed at the edges of the adhesive bond line are a main cause of failure. Stress concentrations within the adhesive can be reduced by lowering the stiffness at these edges and increasing the stiffness in the center of the joint. This may be achieved using a dual-cure adhesive system, where conventional curing is first used to bond a lap joint, after which high-energy radiation is applied to the joint to induce additional crosslinking in specific regions. Anhydride-cured epoxy resins have been formulated to include a radiation sensitizer enabling the desired cure behavior. Tensile testing was performed on cured systems containing varying levels of radiation sensitizer in order to evaluate its effects on Young’s modulus as a function of radiation dose.
  4. Abstract This study investigates the effect of autoclave curing variables on the glass transition temperature and degree of cure of epoxy film adhesive, and on the strength of single lap joints (SLJs) under static tensile shear loading. Studied autoclave variables include the cure temperature, cure pressure, temperature and pressure ramp rates, and cure time duration. Test joints are made of aluminum substrates that are autoclave-bonded using epoxy film adhesive (AF163-2k). For each variable combination of the autoclave process, the corresponding glass transition temperature of the cured epoxy film adhesive is obtained using Dynamic Mechanical Analysis (DMA-Q800). Test data are generated for both baseline (uncycled) joints and joints that have been heat-cycled in an environmental chamber after initial autoclave bonding. Results show a strong correlation between the autoclave process variable combinations and the corresponding glass transition temperature, bond strength, and failure mode of test joints.
  5. Abstract. The Vacuum-Assisted Resin Infusion Molding (VARIM) process is widely used in wind turbine blade manufacturing due to its cost-effectiveness and reliability. However, challenges such as prolonged curing cycles and defects caused by non-uniform cure remain persistent. To address these issues, multizone heating systems have been developed to enable independent temperature control across blade sections. Yet, optimizing the temperature profile in each zone is computationally intensive, requiring detailed modelling of curing kinetics and heat transfer mechanisms. To overcome these challenges, in this work, a machine learning (ML) based digital twin of the VARIM process was developed using a time-distributed long short-term memory (LSTM) network trained on data generated by a high-fidelity multiphysics solver. The model achieved a predictive accuracy of 96.7% in replicating the resin curing behavior. Its time-distributed architecture effectively captures the spatial-temporal dependencies across multiple zones, allowing precise prediction of the degree-of-cure evolution. Paired with a gradient-free optimization algorithm, the digital twin reduced curing time by 12.5% while improving cure uniformity. This AI-driven framework eliminates costly trial-and-error experimentation and provides a scalable, adaptive solution for improving both quality and productivity in wind turbine blade manufacturing, with strong potential for extension to other composite manufacturing processes.