Title: REMIAN: Real-Time and Error-Tolerant Missing Value Imputation
Missing value (MV) imputation is a critical preprocessing step for data mining. Nevertheless, existing MV imputation methods are mostly designed for batch processing and are thus not applicable to streaming data, especially poor-quality streaming data. In this article, we propose a framework, called Real-time and Error-tolerant Missing vAlue ImputatioN (REMAIN), to impute MVs in poor-quality streaming data. Instead of imputing MVs based on all the observed data, REMAIN first initializes the MV imputation model based on a-RANSAC, which is capable of detecting and rejecting anomalies efficiently, and then incrementally updates the model parameters upon the arrival of new data to support real-time MV imputation. As the correlations among attributes of the data may change over time in unforeseeable ways, we devise a deterioration detection mechanism to capture the deterioration of the imputation model and further improve the imputation accuracy. Finally, we conduct an extensive evaluation of the proposed algorithms using real-world and synthetic datasets. Experimental results demonstrate that REMAIN achieves significantly higher imputation accuracy than existing solutions. Meanwhile, REMAIN reduces time cost by up to one order of magnitude compared with existing approaches.
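The three stages named in the abstract (anomaly-rejecting initialization, incremental parameter updates, and deterioration detection) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the article: plain RANSAC stands in for a-RANSAC, the imputation model is assumed to be a linear regression updated by recursive least squares, and the deterioration check is a hypothetical sliding-window threshold on imputation error. All function, class, and parameter names are invented here.

```python
from collections import deque

import numpy as np


def ransac_fit(X, y, n_iter=50, tol=1.0, seed=0):
    """Plain RANSAC for linear regression (a stand-in, NOT the article's a-RANSAC)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(n_iter):
        # Fit a candidate model on a minimal random sample of d rows.
        idx = rng.choice(n, size=d, replace=False)
        w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        inliers = np.abs(X @ w - y) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the consensus set only, so rejected anomalies do not affect the model.
    w, *_ = np.linalg.lstsq(X[best_inliers], y[best_inliers], rcond=None)
    return w, best_inliers


class IncrementalImputer:
    """Linear imputation model updated by recursive least squares (assumed update rule)."""

    def __init__(self, w, X_init, ridge=1e-6):
        self.w = np.asarray(w, dtype=float).copy()
        d = X_init.shape[1]
        # Inverse Gram matrix of the initial (inlier) data, lightly regularized.
        self.P = np.linalg.inv(X_init.T @ X_init + ridge * np.eye(d))

    def impute(self, x):
        """Predict the missing attribute from the observed attributes x."""
        return float(x @ self.w)

    def update(self, x, y):
        """Fold one complete record (x, y) into the model without refitting."""
        Px = self.P @ x
        gain = Px / (1.0 + x @ Px)
        self.w = self.w + gain * (y - x @ self.w)
        self.P = self.P - np.outer(gain, Px)


class DeteriorationDetector:
    """Hypothetical check: flag when the windowed mean absolute error is too high."""

    def __init__(self, window=20, threshold=1.0):
        self.errors = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, error):
        self.errors.append(abs(error))
        full = len(self.errors) == self.errors.maxlen
        return full and float(np.mean(self.errors)) > self.threshold
```

In a streaming loop under these assumptions, each complete record would be passed to `update`, each incomplete record to `impute`, and a `True` result from `observe` would trigger re-initialization with `ransac_fit` on recent data. Each feature vector is assumed to carry a trailing constant 1 so that the model includes an intercept.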
Journal Name: ACM Transactions on Knowledge Discovery from Data
Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Motivation

    The human microbiome, which growing evidence links to various diseases, has a profound impact on human health. Since changes in the composition of the microbiome over time are associated with disease and clinical outcomes, microbiome analysis should be performed in a longitudinal study. However, due to limited sample sizes and differing numbers of timepoints for different subjects, a significant amount of data cannot be utilized, directly affecting the quality of analysis results. Deep generative models have been proposed to address this data scarcity. Specifically, a generative adversarial network (GAN) has been successfully utilized for data augmentation to improve prediction tasks. Recent studies have also shown improved performance of GAN-based models for missing value imputation in a multivariate time series dataset compared with traditional imputation methods.

    Results

    This work proposes DeepMicroGen, a bidirectional recurrent neural network-based GAN model, trained on the temporal relationship between the observations, to impute the missing microbiome samples in longitudinal studies. DeepMicroGen outperforms standard baseline imputation methods, showing the lowest mean absolute error for both simulated and real datasets. Finally, the proposed model improved the predicted clinical outcome for allergies by providing imputation for an incomplete longitudinal dataset used to train the classifier.

    Availability and implementation

    DeepMicroGen is publicly available at

  2. Abstract

    Assessment of mitral valve (MV) function is important in many diagnostic, prognostic, and surgical planning applications for treatment of MV disease. Yet, to date, there are no accepted noninvasive methods for determination of MV leaflet deformation, which is a critical metric of MV function. In this study, we present a novel, completely noninvasive computational method to estimate MV leaflet in‐plane strains from clinical‐quality real‐time three‐dimensional echocardiography (rt‐3DE) images. The images were first segmented to produce meshed medial‐surface leaflet geometries of the open and closed states. To establish material point correspondence between the two states, an image‐based morphing pipeline was implemented within a finite element (FE) modeling framework in which MV closure was simulated by pressurizing the open‐state geometry, and local corrective loads were applied to enforce the actual MV closed shape. This resulted in a complete map of local systolic leaflet membrane strains, obtained from the final FE mesh configuration. To validate the method, we utilized an extant in vitro database of fiducially labeled MVs, imaged in conditions mimicking both the healthy and diseased states. Our method estimated local anisotropic in vivo strains with less than 10% error and proved to be robust to changes in boundary conditions similar to those observed in ischemic MV disease. Next, we applied our methodology to ovine MVs imaged in vivo with rt‐3DE and compared our results to previously published findings of in vivo MV strains in the same type of animal as measured using surgically sutured fiducial marker arrays. In regions encompassed by fiducial markers, we found no significant differences in circumferential (P = 0.240) or radial (P = 0.808) strain estimates between the marker‐based measurements and our novel noninvasive method. This method can thus be used for model validation as well as for studies of MV disease and repair.

  3. Abstract

    The edge computing paradigm has recently drawn significant attention from industry and academia. Due to the advantages in quality-of-service metrics, namely, latency, bandwidth, energy efficiency, privacy, and security, deploying artificial intelligence (AI) models at the network edge has attracted widespread interest. Edge-AI has seen applications in diverse domains that involve large amounts of data. However, poor dataset quality plagues this compute regime owing to numerous data corruption sources, including missing data. As such systems are increasingly being deployed in mission-critical applications, mitigating the effects of corrupted data becomes important. In this work, we propose a strategy based on data imputation using neural inversion, DINI. It trains a surrogate model and runs data imputation in an interleaved fashion. Unlike previous works, DINI is a model-agnostic framework applicable to diverse deep learning architectures. DINI outperforms state-of-the-art methods by at least 10.7% in average imputation error. Applying DINI to mission-critical applications can increase prediction accuracy to up to 99% (F1 score of 0.99), resulting in significant gains compared to baseline methods.

  4. Multivariate time-series data are frequently observed in critical care settings and are typically characterized by sparsity (missing information) and irregular time intervals. Existing approaches for learning representations in this domain handle these challenges by either aggregation or imputation of values, which in turn suppresses the fine-grained information and adds undesirable noise/overhead into the machine learning model. To tackle this problem, we propose a Self-supervised Transformer for Time-Series (STraTS) model, which overcomes these pitfalls by treating time-series as a set of observation triplets instead of using the standard dense matrix representation. It employs a novel Continuous Value Embedding technique to encode continuous time and variable values without the need for discretization. It is composed of a Transformer component with multi-head attention layers, which enable it to learn contextual triplet embeddings while avoiding the problems of recurrence and vanishing gradients that occur in recurrent architectures. In addition, to tackle the problem of limited availability of labeled data (which is typical of many healthcare applications), STraTS utilizes self-supervision by leveraging unlabeled data to learn better representations, using time-series forecasting as an auxiliary proxy task. Experiments on real-world multivariate clinical time-series benchmark datasets demonstrate that STraTS has better prediction performance than state-of-the-art methods for mortality prediction, especially when labeled data is limited. Finally, we also present an interpretable version of STraTS, which can identify important measurements in the time-series data. Our data preprocessing and model implementation codes are available at .
  5. Spatial classification with limited observations is important in geographical applications where only a subset of sensors are deployed at certain spots or partial responses are collected in field surveys. For example, in observation-based flood inundation mapping, there is a need to map the full flood extent on geographic terrains based on earth imagery that partially covers a region. Existing research mostly focuses on addressing incomplete or missing data through data cleaning and imputation or modeling missing values as hidden variables in the EM algorithm. These methods, however, assume that missing feature observations are rare and thus are ineffective in problems where the vast majority of feature observations are missing. To address this issue, we recently proposed a new approach that incorporates a physics-aware structural constraint into the model representation. We design efficient learning and inference algorithms. This paper extends our recent approach by allowing feature values of samples in each class to follow a multi-modal distribution. Evaluations on real-world flood mapping applications show that our approach significantly outperforms baseline methods in classification accuracy, and the multi-modal extension is more robust than our earlier single-modal version. Computational experiments show that the proposed solution is computationally efficient on large datasets.