Missing value (MV) imputation is a critical preprocessing means for data mining. Nevertheless, existing MV imputation methods are mostly designed for batch processing, and thus are not applicable to streaming data, especially those with poor quality. In this article, we propose a framework, called Real-time and Error-tolerant Missing vAlue ImputatioN (REMAIN), to impute MVs in poor-quality streaming data. Instead of imputing MVs based on all the observed data, REMAIN first initializes the MV imputation model based on a-RANSAC which is capable of detecting and rejecting anomalies in an efficient manner, and then incrementally updates the model parameters upon the arrival of new data to support real-time MV imputation. As the correlations among attributes of the data may change over time in unforseenable ways, we devise a deterioration detection mechanism to capture the deterioration of the imputation model to further improve the imputation accuracy. Finally, we conduct an extensive evaluation on the proposed algorithms using real-world and synthetic datasets. Experimental results demonstrate that REMAIN achieves significantly higher imputation accuracy over existing solutions. Meanwhile, REMAIN improves up to one order of magnitude in time cost compared with existing approaches.
more »
« less
SpectraNet: multivariate forecasting and imputation under distribution shifts and missing data
In this work, we tackle two widespread challenges in real applications for time series forecasting that have been largely understudied: distribution shifts and missing data. We propose SpectraNet, a novel multivariate time-series forecasting model that dynamically infers a latent space spectral decomposition to capture current temporal dynamics and correlations on the recent observed history. A Convolution Neural Network maps the learned representation by sequentially mixing its components and refining the output. Our proposed approach can simultaneously produce forecasts and interpolate past observations and can, therefore, greatly simplify production systems by unifying imputation and forecasting tasks into a single model. SpectraNet achieves SoTA performance simultaneously on both tasks on five benchmark datasets, compared to forecasting and imputation models, with up to 92% fewer parameters and comparable training times. On settings with up to 80% missing data, SpectraNet has average performance improvements of almost 50% over the second-best alternative.
more »
« less
- Award ID(s):
- 2015577
- PAR ID:
- 10469355
- Publisher / Repository:
- Workshop on Machine Learning for IoT: Datasets, Perception, and Understanding, International Conference on Learning Representations (ICLR 2023)
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract MotivationThe human microbiome, which is linked to various diseases by growing evidence, has a profound impact on human health. Since changes in the composition of the microbiome across time are associated with disease and clinical outcomes, microbiome analysis should be performed in a longitudinal study. However, due to limited sample sizes and differing numbers of timepoints for different subjects, a significant amount of data cannot be utilized, directly affecting the quality of analysis results. Deep generative models have been proposed to address this lack of data issue. Specifically, a generative adversarial network (GAN) has been successfully utilized for data augmentation to improve prediction tasks. Recent studies have also shown improved performance of GAN-based models for missing value imputation in a multivariate time series dataset compared with traditional imputation methods. ResultsThis work proposes DeepMicroGen, a bidirectional recurrent neural network-based GAN model, trained on the temporal relationship between the observations, to impute the missing microbiome samples in longitudinal studies. DeepMicroGen outperforms standard baseline imputation methods, showing the lowest mean absolute error for both simulated and real datasets. Finally, the proposed model improved the predicted clinical outcome for allergies, by providing imputation for an incomplete longitudinal dataset used to train the classifier. Availability and implementationDeepMicroGen is publicly available at https://github.com/joungmin-choi/DeepMicroGen.more » « less
-
Multivariate time series with missing values are common in areas such as healthcare and finance, and have grown in number and complexity over the years. This raises the question whether deep learning methodologies can outperform classical data imputation methods in this domain. However, naïve applications of deep learning fall short in giving reliable confidence estimates and lack interpretability. We propose a new deep sequential latent variable model for dimensionality reduction and data imputation. Our modeling assumption is simple and interpretable: the high dimensional time series has a lower-dimensional representation which evolves smoothly in time according to a Gaussian process. The nonlinear dimensionality reduction in the presence of missing data is achieved using a VAE approach with a novel structured variational approximation. We demonstrate that our approach outperforms several classical and deep learning-based data imputation methods on high-dimensional data from the domains of computer vision and healthcare, while additionally improving the smoothness of the imputations and providing interpretable uncertainty estimates.more » « less
-
Multivariate time series with missing values are common in areas such as healthcare and finance, and have grown in number and complexity over the years. This raises the question whether deep learning methodologies can outperform classical data imputation methods in this domain. However, naïve applications of deep learning fall short in giving reliable confidence estimates and lack interpretability. We propose a new deep sequential latent variable model for dimensionality reduction and data imputation. Our modeling assumption is simple and interpretable: the high dimensional time series has a lower-dimensional representation which evolves smoothly in time according to a Gaussian process. The nonlinear dimensionality reduction in the presence of missing data is achieved using a VAE approach with a novel structured variational approximation. We demonstrate that our approach outperforms several classical and deep learning-based data imputation methods on high-dimensional data from the domains of computer vision and healthcare, while additionally improving the smoothness of the imputations and providing interpretable uncertainty estimates.more » « less
-
Multivariate time-series data are frequently observed in critical care settings and are typically characterized by sparsity (missing information) and irregular time intervals. Existing approaches for learning representations in this domain handle these challenges by either aggregation or imputation of values, which in-turn suppresses the fine-grained information and adds undesirable noise/overhead into the machine learning model. To tackle this problem, we propose aSelf-supervisedTransformer forTime-Series (STraTS) model, which overcomes these pitfalls by treating time-series as a set of observation triplets instead of using the standard dense matrix representation. It employs a novel Continuous Value Embedding technique to encode continuous time and variable values without the need for discretization. It is composed of a Transformer component with multi-head attention layers, which enable it to learn contextual triplet embeddings while avoiding the problems of recurrence and vanishing gradients that occur in recurrent architectures. In addition, to tackle the problem of limited availability of labeled data (which is typically observed in many healthcare applications), STraTS utilizes self-supervision by leveraging unlabeled data to learn better representations by using time-series forecasting as an auxiliary proxy task. Experiments on real-world multivariate clinical time-series benchmark datasets demonstrate that STraTS has better prediction performance than state-of-the-art methods for mortality prediction, especially when labeled data is limited. Finally, we also present an interpretable version of STraTS, which can identify important measurements in the time-series data. Our data preprocessing and model implementation codes are available athttps://github.com/sindhura97/STraTS.more » « less
An official website of the United States government
