

Title: Multi-series Time-aware Sequence Partitioning for Disease Progression Modeling.
Electronic healthcare records (EHRs) are comprehensive longitudinal collections of patient data that play a critical role in modeling disease progression to facilitate clinical decision-making. Based on EHRs, in this work, we focus on sepsis – a broad syndrome that can develop from nearly all types of infections (e.g., influenza, pneumonia). The symptoms of sepsis, such as elevated heart rate, fever, and shortness of breath, are vague and common to other illnesses, making the modeling of its progression extremely challenging. Motivated by the recent success of a novel subsequence clustering approach, Toeplitz Inverse Covariance-based Clustering (TICC), we model sepsis progression as a subsequence partitioning problem and propose a Multi-series Time-aware TICC (MT-TICC), which incorporates the multi-series nature and irregular time intervals of EHRs. The effectiveness of MT-TICC is first validated via a case study using a real-world hand gesture dataset with ground-truth labels. We then apply it to sepsis progression modeling using EHRs. The results suggest that MT-TICC can significantly outperform competitive baseline models, including TICC. More importantly, it unveils interpretable patterns that shed some light on sepsis progression.
Award ID(s):
1916417
PAR ID:
10324504
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI)
Page Range / eLocation ID:
3581-3587
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Modeling patient disease progression using Electronic Health Records (EHRs) is critical to assist clinical decision making. Long Short-Term Memory (LSTM) is an effective model for handling sequential data such as EHRs, but it encounters two major limitations when applied to EHRs: it is unable to interpret the prediction results, and it ignores the irregular time intervals between consecutive events. To tackle these limitations, we propose an attention-based time-aware LSTM network (ATTAIN) to improve the interpretability of LSTM and to identify the critical previous events for the current diagnosis by modeling the inherent time irregularity. We validate ATTAIN on modeling the progression of an extremely challenging disease, septic shock, using real-world EHRs. Our results demonstrate that the proposed framework outperforms state-of-the-art models such as RETAIN and T-LSTM. In addition, the generated interpretable time-aware attention weights shed some light on the progression behaviors of septic shock.
  2. Deep neural network models, especially Long Short-Term Memory (LSTM), have shown great success in analyzing Electronic Health Records (EHRs) due to their ability to capture temporal dependencies in time series data. When applying deep learning models to EHRs, we are generally confronted with two major challenges: a high rate of missingness and time irregularity. Motivated by the original PACIFIER framework, which utilized matrix decomposition for data imputation, we applied and further extended it by including three components: forecasting future events, a time-aware mechanism, and a subgroup basis approach. We evaluated the proposed framework with real-world EHRs, which consist of 52,919 visits and 4,224,567 events, on the task of early prediction of septic shock. We compared our work against multiple baselines, including the original PACIFIER, using both LSTM and Time-aware LSTM (T-LSTM). Experimental results showed that our proposed framework significantly outperformed all competitive baseline approaches. More importantly, the interpretable latent patterns extracted from subgroups could shed some light on the progression of septic shock patients for clinicians.
  3. Clustering large financial time series data enables pattern extraction that facilitates risk management. The knowledge gathered from unsupervised learning is useful for improving portfolio optimization and making stock trading recommendations. Most methods available in the literature for clustering financial time series are based on exploiting linear relationships between time series. However, prices of different assets (stocks) may have non-linear relationships, which may be quantified using information-based measures such as mutual information (MI). To estimate the empirical mutual information between time series of stock returns, we employ a novel kernel density estimator (KDE) based jackknife mutual information estimation (JMI), and compare it with the widely used binning method. We then propose an average distance gradient change algorithm and an algorithm based on the average silhouette criterion that use pairwise and groupwise MI of high-frequency financial stock returns. Through numerical studies, we provide insights into the impact of the clustering on asset allocation and risk management based on the nonlinear information structure of the US stock market.
  4. Overly restrictive eligibility criteria for clinical trials may limit the generalizability of the trial results to their target real-world patient populations. We developed a novel machine learning approach using large collections of real-world data (RWD) to better inform clinical trial eligibility criteria design. We extracted patients’ clinical events from electronic health records (EHRs), which include demographics, diagnoses, and drugs, and assumed that certain compositions of these clinical events within an individual’s EHRs can determine the subphenotypes: homogeneous clusters of patients, where patients within each subgroup share similar clinical characteristics. We introduced an outcome-guided probabilistic model to identify those subphenotypes, such that the patients within the same subgroup not only share similar clinical characteristics but are also at similar risk levels of encountering severe adverse events (SAEs). We evaluated our algorithm on two previously conducted clinical trials with EHRs from the OneFlorida+ Clinical Research Consortium. Our model can clearly identify the patient subgroups who are more likely to suffer or not suffer from SAEs as subphenotypes in a transparent and interpretable way. Our approach identified a set of clinical topics and derived novel patient representations based on them. Each clinical topic represents a certain clinical event composition pattern learned from the patient EHRs. Tested on both trials, patient subgroup (#SAE=0) and patient subgroup (#SAE>0) can be well separated by k-means clustering using the inferred topics. The inferred topics characterized as likely to align with the patient subgroup (#SAE>0) revealed meaningful combinations of clinical features and can provide data-driven recommendations for refining the exclusion criteria of clinical trials. The proposed supervised topic modeling approach can infer the clinical topics from the subphenotypes with or without SAEs. The potential rules for describing the patient subgroups with SAEs can be further derived to inform the design of clinical trial eligibility criteria.
  5. Modeling corrosion growth for complex systems such as an oil refinery is a major challenge, since the corrosion process of oil and gas pipelines is inherently stochastic and depends on many factors, including exposure to environmental conditions, operating conditions, and electrochemical reactions. Moreover, the number of sensors is usually limited, and sensor data are incomplete and scattered, which hinders the ability to capture corrosion growth behaviors. Therefore, this paper proposes a Multi-sensor Corrosion Growth Model with Latent Variables to predict the corrosion growth process in oil refinery piping. The proposed model is a combination of a hierarchical clustering algorithm and the vector autoregression (VAR) model. The clustering algorithm aims to find the hidden (i.e., latent) data clusters of the measured time series data; the time series from the same cluster are then included in the VAR model to predict the corrosion depth from multiple sensors. The model can capture the relationship between sensor time series data and identify latent variables. A real case study of an oil refinery system, in which in-line inspection (ILI) data were collected, was used to validate the model. Regarding corrosion growth prediction, the paper compared the prediction accuracy of the VAR model with three forms of the power law model, which is widely accepted for predicting time-dependent corrosion depth: power function (PF), PF with initiation time of corrosion (PFIT), and PF with initiation time of corrosion and covariates (PFCOV). The results showed that the VAR model has the lowest prediction error based on the mean absolute percentage error (MAPE) evaluation for test data. Finally, the proposed model is believed to be useful for dealing with a complex system that has a variety of corrosion growth behaviors, such as an oil refinery system, and it can also be applied in other real-time applications.
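
The mean absolute percentage error (MAPE) mentioned in item 5 as the evaluation metric can be illustrated with a minimal sketch. The function name `mape` and the sample depth values are assumptions made for illustration only; they are not taken from the paper, and the paper's exact evaluation setup may differ.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Hypothetical helper for illustration; assumes two equal-length,
    non-empty sequences and no zero values in `actual`.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            # MAPE is undefined when a true value is zero.
            raise ValueError("actual values must be non-zero")
        total += abs((a - p) / a)
    return 100.0 * total / len(actual)


# Made-up corrosion depths (not from the paper), for demonstration only.
measured_depths = [2.0, 4.0]
predicted_depths = [1.0, 5.0]
error_percent = mape(measured_depths, predicted_depths)
```

A lower MAPE means the predictions are closer to the measured values in relative terms, which is the sense in which the abstract reports that the VAR model has the lowest prediction error.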