This content will become publicly available on May 2, 2024
- Award ID(s):
- 2015577
- NSF-PAR ID:
- 10469355
- Publisher / Repository:
- Workshop on Machine Learning for IoT: Datasets, Perception, and Understanding, International Conference on Learning Representations (ICLR 2023)
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Multivariate time-series data are frequently observed in critical care settings and are typically characterized by sparsity (missing information) and irregular time intervals. Existing approaches for learning representations in this domain handle these challenges by either aggregating or imputing values, which in turn suppresses fine-grained information and adds undesirable noise and overhead to the machine learning model. To tackle this problem, we propose a Self-supervised Transformer for Time-Series (STraTS) model, which overcomes these pitfalls by treating time-series as a set of observation triplets instead of using the standard dense matrix representation. It employs a novel Continuous Value Embedding technique to encode continuous time and variable values without the need for discretization. It is composed of a Transformer component with multi-head attention layers, which enable it to learn contextual triplet embeddings while avoiding the problems of recurrence and vanishing gradients that occur in recurrent architectures. In addition, to tackle the limited availability of labeled data (typical of many healthcare applications), STraTS uses self-supervision: it leverages unlabeled data to learn better representations, with time-series forecasting as an auxiliary proxy task. Experiments on real-world multivariate clinical time-series benchmark datasets demonstrate that STraTS has better prediction performance than state-of-the-art methods for mortality prediction, especially when labeled data is limited. Finally, we also present an interpretable version of STraTS, which can identify important measurements in the time-series data. Our data preprocessing and model implementation code is available at https://github.com/sindhura97/STraTS.
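The triplet representation described in this abstract can be illustrated with a minimal sketch. It is not taken from the STraTS repository: the function name `dense_to_triplets`, the variable names, and the toy values are all hypothetical, and missing entries are assumed to be stored as NaN in a dense time-by-variable matrix.

```python
import numpy as np

def dense_to_triplets(times, values, variable_names):
    """Turn a dense (time x variable) matrix into (time, variable, value) triplets.

    Missing entries are assumed to be NaN and are simply skipped, so no
    aggregation or imputation is needed. Illustrative only; not the
    authors' preprocessing code.
    """
    values = np.asarray(values, dtype=float)
    rows, cols = np.nonzero(~np.isnan(values))  # indices of observed entries
    return [
        (float(times[r]), variable_names[c], float(values[r, c]))
        for r, c in zip(rows, cols)
    ]

# Hypothetical irregularly sampled record with two variables.
times = [0.0, 0.5, 3.2]
names = ["heart_rate", "lactate"]
dense = [
    [82.0, np.nan],
    [np.nan, 1.9],
    [77.0, np.nan],
]
triplets = dense_to_triplets(times, dense, names)
for triplet in triplets:
    print(triplet)
```

Only the three observed measurements survive as triplets, so the sequence length scales with the number of observations rather than with the size of the time-by-variable grid.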
-
Abstract
Motivation: The human microbiome, which growing evidence links to various diseases, has a profound impact on human health. Since changes in the composition of the microbiome over time are associated with disease and clinical outcomes, microbiome analysis should be performed in a longitudinal study. However, due to limited sample sizes and differing numbers of timepoints for different subjects, a significant amount of data cannot be utilized, directly affecting the quality of analysis results. Deep generative models have been proposed to address this lack of data. Specifically, a generative adversarial network (GAN) has been successfully utilized for data augmentation to improve prediction tasks. Recent studies have also shown improved performance of GAN-based models for missing value imputation in a multivariate time series dataset compared with traditional imputation methods.
Results: This work proposes DeepMicroGen, a bidirectional recurrent neural network-based GAN model, trained on the temporal relationship between the observations, to impute the missing microbiome samples in longitudinal studies. DeepMicroGen outperforms standard baseline imputation methods, showing the lowest mean absolute error for both simulated and real datasets. Finally, the proposed model improved the predicted clinical outcome for allergies by providing imputation for an incomplete longitudinal dataset used to train the classifier.
Availability and implementation: DeepMicroGen is publicly available at https://github.com/joungmin-choi/DeepMicroGen.
-
We propose an algorithm to impute and forecast a time series by transforming the observed time series into a matrix, utilizing matrix estimation to recover missing values and de-noise observed entries, and performing linear regression to make predictions. At the core of our analysis is a representation result, which states that for a large class of models, the transformed time series matrix is (approximately) low-rank. In effect, this generalizes the widely used Singular Spectrum Analysis (SSA) in the time series literature, and allows us to establish a rigorous link between time series analysis and matrix estimation. The key to establishing this link is constructing a Page matrix with non-overlapping entries rather than a Hankel matrix as is commonly done in the literature (e.g., SSA). This particular matrix structure allows us to provide finite sample analysis for imputation and prediction, and prove the asymptotic consistency of our method. Another salient feature of our algorithm is that it is model agnostic with respect to both the underlying time dynamics and the noise distribution in the observations. The noise agnostic property of our approach allows us to recover the latent states when only given access to noisy and partial observations à la a Hidden Markov Model; e.g., recovering the time-varying parameter of a Poisson process without knowing that the underlying process is Poisson. Furthermore, since our forecasting algorithm requires regression with noisy features, our approach suggests a matrix estimation based method (coupled with a novel, non-standard matrix estimation error metric) to solve the error-in-variable regression problem, which could be of interest in its own right. Through synthetic and real-world datasets, we demonstrate that our algorithm outperforms standard software packages (including R libraries) in the presence of missing data as well as high levels of noise.
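To make the distinction between a Page matrix and a Hankel matrix concrete, here is a minimal NumPy sketch. The function names are hypothetical, and `impute_low_rank` is a generic stand-in (zero-fill, rescale by the observed fraction, keep the top singular values), assumed here purely for illustration; it is not claimed to be the estimator analyzed in the paper.

```python
import numpy as np

def page_matrix(series, n_rows):
    """Columns are consecutive NON-overlapping segments of length n_rows."""
    series = np.asarray(series, dtype=float)
    n_cols = len(series) // n_rows  # any incomplete trailing segment is dropped
    return series[: n_rows * n_cols].reshape(n_cols, n_rows).T

def hankel_matrix(series, n_rows):
    """Columns are overlapping windows shifted by one step (as in SSA)."""
    series = np.asarray(series, dtype=float)
    n_cols = len(series) - n_rows + 1
    return np.stack([series[j : j + n_rows] for j in range(n_cols)], axis=1)

def impute_low_rank(matrix, rank):
    """Generic hard-thresholded SVD imputation (an assumed stand-in).

    NaNs are set to zero, the result is rescaled by the observed fraction,
    and only the `rank` largest singular values are kept.
    """
    observed = ~np.isnan(matrix)
    observed_fraction = observed.mean()
    filled = np.where(observed, matrix, 0.0)
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    s[rank:] = 0.0
    return (u * s) @ vt / observed_fraction

# Each series value appears exactly once in the Page matrix,
# but several times in the Hankel matrix.
toy = np.arange(12)
print(page_matrix(toy, 3))
print(hankel_matrix(toy, 3))

# A noiseless sinusoid gives a rank-2 Page matrix; hide some entries and impute.
rng = np.random.default_rng(0)
signal = np.sin(0.3 * np.arange(200))
noisy_partial = page_matrix(signal, 10)
noisy_partial[rng.random(noisy_partial.shape) < 0.2] = np.nan
estimate = impute_low_rank(noisy_partial, rank=2)
```

Because every observation occupies a single entry of the Page matrix, independent noise on the series stays independent across matrix entries, which is the structural property the abstract credits for enabling the finite-sample analysis.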
-
Abstract Ecological analyses typically involve many interacting variables. Ecologists often specify lagged interactions in community dynamics (i.e. vector‐autoregressive models) or simultaneous interactions (e.g. structural equation models), but there is less familiarity with dynamic structural equation models (DSEM) that can include any simultaneous or lagged effect in multivariate time‐series analysis.
We propose a novel approach to parameter estimation for DSEM, which involves constructing a Gaussian Markov random field (GMRF) representing simultaneous and lagged path coefficients, and then fitting this as a generalized linear mixed model to missing and/or non‐normal data. We provide a new R‐package dsem, which extends the ‘arrow interface’ from path analysis to represent user‐specified lags when constructing the GMRF. We also outline how the resulting nonseparable precision matrix can generalize existing separable models, for example, for time‐series and species interactions in a vector‐autoregressive model.
We first demonstrate dsem by simulating a two‐species vector‐autoregressive model based on wolf–moose interactions on Isle Royale. We show that DSEM has improved precision when data are missing relative to a conventional dynamic linear model. We then demonstrate DSEM via two contrasting case studies. The first identifies a trophic cascade where decreased sunflower starfish has increased urchin and decreased kelp densities, while sea otters have a simultaneous positive effect on kelp in the California Current from 1999 to 2018. The second estimates how declining sea ice has decreased cold‐water habitats, driving a decreased density for fall copepod predation and inhibiting early‐life survival for Alaska pollock from 1963 to 2023.
We conclude that DSEM can be fitted efficiently as a GLMM involving missing data, while allowing users to specify both simultaneous and lagged effects in a time‐series structural model. DSEM then allows conceptual models (developed with stakeholder input or from ecological expertise) to be fitted to incomplete time series and provides a simple interface for granular control over the number of estimated time‐series parameters. Finally, computational methods are sufficiently simple that DSEM can be embedded as a component within larger (e.g. integrated population) models. We therefore recommend greater exploration and performance testing for DSEM relative to familiar time‐series forecasting methods.
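The idea of encoding simultaneous and lagged path coefficients in a GMRF can be sketched in a few lines of Python. This is a hedged illustration and not the dsem package's interface: it assumes the standard structural-equation identity that if x = P x + e with e ~ N(0, Σ), then x has precision Q = (I - P)ᵀ Σ⁻¹ (I - P). The function names, the variable-major stacking of the time series, and the wolf and moose coefficients are all hypothetical.

```python
import numpy as np

def path_matrix(paths, variables, n_times):
    """Build the joint path matrix P for all variables and time steps.

    Each path is (source, target, lag, coefficient); lag 0 is a simultaneous
    effect. Series are stacked variable by variable, so the entry for
    variable j at time t has index j * n_times + t.
    """
    n_vars = len(variables)
    index = {name: i for i, name in enumerate(variables)}
    P = np.zeros((n_vars * n_times, n_vars * n_times))
    for source, target, lag, coef in paths:
        selector = np.zeros((n_vars, n_vars))
        selector[index[target], index[source]] = 1.0
        # np.eye(n, k=-lag) links time t to time t - lag.
        P += coef * np.kron(selector, np.eye(n_times, k=-lag))
    return P

def gmrf_precision(P, variances, n_times):
    """Precision Q = (I - P)^T Sigma^{-1} (I - P) for x = P x + e."""
    n = P.shape[0]
    sigma_inv = np.diag(np.repeat(1.0 / np.asarray(variances, dtype=float), n_times))
    A = np.eye(n) - P
    return A.T @ sigma_inv @ A

# Hypothetical two-species lag-1 vector-autoregressive structure.
paths = [
    ("wolf", "wolf", 1, 0.8),
    ("moose", "moose", 1, 0.9),
    ("moose", "wolf", 1, 0.1),   # more moose, more wolves next year
    ("wolf", "moose", 1, -0.2),  # more wolves, fewer moose next year
]
n_times = 5
P = path_matrix(paths, ["wolf", "moose"], n_times)
Q = gmrf_precision(P, variances=[0.1, 0.1], n_times=n_times)
print(Q.shape)
```

The resulting Q is sparse and does not factor into a Kronecker product of a time-only and a species-only matrix once cross-species lags are present, which is one way to read the abstract's remark about a nonseparable precision matrix. Missing observations can then be treated as latent values of x under this Gaussian prior.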