skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: tspDB: Time Series Predict DB
A major bottleneck of the current Machine Learning (ML) workflow is the time consuming, error prone engineering required to get data from a datastore or a database (DB) to the point an ML algorithm can be applied to it. This is further exacerbated since ML algorithms are now trained on large volumes of data, yet we need predictions in real-time, especially in a variety of time-series applications such as finance and real-time control systems. Hence, we explore the feasibility of directly integrating prediction functionality on top of a data store or DB. Such a system ideally: (i) provides an intuitive prediction query interface which alleviates the unwieldy data engineering; (ii) provides state-of-the-art statistical accuracy while ensuring incremental model update, low model training time and low latency for making predictions. As the main contribution we explicitly instantiate a proof-of-concept, tspDB which directly integrates with PostgreSQL. We rigorously test tspDB’s statistical and computational performance against the state-of-the-art time series algorithms, including a Long-Short-Term-Memory (LSTM) neural network and DeepAR (industry standard deep learning library by Amazon). Statistically, on standard time series benchmarks, tspDB outperforms LSTM and DeepAR with 1.1-1.3x higher relative accuracy. Computationally, tspDB is 59-62x and 94-95x faster compared to LSTM and DeepAR in terms of median ML model training time and prediction query latency, respectively. Further, compared to PostgreSQL’s bulk insert time and its SELECT query latency, tspDB is slower only by 1.3x and 2.6x respectively. That is, tspDB is a real-time prediction system in that its model training / prediction query time is similar to just inserting, reading data from a DB. As an algorithmic contribution, we introduce an incremental multivariate matrix factorization based time series method, which tspDB is built off. We show this method also allows one to produce reliable prediction intervals by accurately estimating the time-varying variance of a time series, thereby addressing an important problem in time series analysis.  more » « less
Award ID(s):
1634259 1523546
PAR ID:
10313410
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of Machine Learning Research
Volume:
133
ISSN:
2640-3498
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We present a novel multi-level representation of time series called OM3 that facilitates efficient interactive progressive visualization of large data stored in a database and supports various interactions such as resizing, panning, zooming, and visual query. Based on our proposed line-segment aggregation, this representation can produce error-free line visualizations that preserve the shape of a time series in windows of arbitrary sizes. To reduce the interaction latency, we develop an incremental tree-based query strategy to support progressive visualizations, allowing a finer control on the accuracy-time tradeoff. We quantitatively compare OM3 with state-of-the-art methods, including a method implemented on a leading time-series database InfluxDB, in two settings with databases residing either in the local area network or on the cloud. Results show that OM^3 maintains a low latency within 300~ms on the web browser and a high data reduction ratio regardless of the data size (ranging from millions to billions of records), achieving around 1,000 times faster than the state-of-the-art methods on the largest dataset experimented with. 
    more » « less
  2. Machine and deep learning-based algorithms are the emerging approaches in addressing prediction problems in time series. These techniques have been shown to produce more accurate results than conventional regression-based modeling. It has been reported that artificial Recurrent Neural Networks (RNN) with memory, such as Long Short-Term Memory (LSTM), are superior compared to Autoregressive Integrated Moving Average (ARIMA) with a large margin. The LSTM-based models incorporate additional “gates” for the purpose of memorizing longer sequences of input data. The major question is that whether the gates incorporated in the LSTM architecture already offers a good prediction and whether additional training of data would be necessary to further improve the prediction. Bidirectional LSTMs (BiLSTMs) enable additional training by traversing the input data twice (i.e., 1) left-to-right, and 2) right-to-left). The research question of interest is then whether BiLSTM, with additional training capability, outperforms regular unidirectional LSTM. This paper reports a behavioral analysis and comparison of BiLSTM and LSTM models. The objective is to explore to what extend additional layers of training of data would be beneficial to tune the involved parameters. The results show that additional training of data and thus BiLSTM-based modeling offers better predictions than regular LSTM-based models. More specifically, it was observed that BiLSTM models provide better predictions compared to ARIMA and LSTM models. It was also observed that BiLSTM models reach the equilibrium much slower than LSTM-based models. 
    more » « less
  3. Abstract Machine learning (ML) is emerging as a powerful tool to predict the properties of materials, including glasses. Informing ML models with knowledge of how glass composition affects short-range atomic structure has the potential to enhance the ability of composition-property models to extrapolate accurately outside of their training sets. Here, we introduce an approach wherein statistical mechanics informs a ML model that can predict the non-linear composition-structure relations in oxide glasses. This combined model offers an improved prediction compared to models relying solely on statistical physics or machine learning individually. Specifically, we show that the combined model accurately both interpolates and extrapolates the structure of Na2O–SiO2glasses. Importantly, the model is able to extrapolate predictions outside its training set, which is evidenced by the fact that it is able to predict the structure of a glass series that was kept fully hidden from the model during its training. 
    more » « less
  4. We propose a machine learning (ML) non-Markovian closure modelling framework for accurate predictions of statistical responses of turbulent dynamical systems subjected to external forcings. One of the difficulties in this statistical closure problem is the lack of training data, which is a configuration that is not desirable in supervised learning with neural network models. In this study with the 40-dimensional Lorenz-96 model, the shortage of data is due to the stationarity of the statistics beyond the decorrelation time. Thus, the only informative content in the training data is from the short-time transient statistics. We adopt a unified closure framework on various truncation regimes, including and excluding the detailed dynamical equations for the variances. The closure framework employs a Long-Short-Term-Memory architecture to represent the higher-order unresolved statistical feedbacks with a choice of ansatz that accounts for the intrinsic instability yet produces stable long-time predictions. We found that this unified agnostic ML approach performs well under various truncation scenarios. Numerically, it is shown that the ML closure model can accurately predict the long-time statistical responses subjected to various time-dependent external forces that have larger maximum forcing amplitudes and are not in the training dataset. This article is part of the theme issue ‘Data-driven prediction in dynamical systems’. 
    more » « less
  5. With the increasing impact of climate change and relative sea level rise, low-lying coastal communities face growing risks from recurrent nuisance flooding and storm tides. Thus, timely and reliable predictions of coastal water levels are critical to resilience in vulnerable coastal areas. Over the past decade, there has been increasing interest in utilizing machine learning (ML) based models for emulation and prediction of coastal water levels. However, flood advisory systems still rely on running computationally demanding hydrodynamic models. To alleviate the computational burden, these physics-based models are either run at small scales with high resolution or at large scales with low resolution. While ML-based models are very fast, they face challenges in terms of ensuring reliability and ability to capture any surge levels. In this paper, we develop a deep neural network for spatiotemporal prediction of water levels in coastal areas of the Chesapeake Bay in the U.S. Our model relies on data from numerical weather prediction models as the atmospheric input and astronomical tide levels, while its outputs are time series of predicted water levels at several tide gauge locations across the Chesapeake Bay. We utilized a CNN-LSTM setting as the architecture of the model. The CNN part extracts the features from a sequence of gridded wind fields and fuses its output to several independent LSTM units. The LSTM units concatenate the atmospheric features with respective astronomical tide levels and produce water level time series. The novel contribution of the present work is in spatiotemporality and in prioritization of the physical relationships in the model to maintain a high analogy to hydrodynamic modeling, either in the network architecture or in the selection of predictors and predictands. The results show that this setting yields a strong performance in predicting coastal water levels that cause flooding from minor to major levels. We also show that the model stands up successfully to the rigorous comparison with a high-fidelity ADCIRC model, yielding mean RMSE and correlation coefficient of 14.3 cm and 0.94, respectively, in two extreme cases, versus 12.30 cm and 0.96 for the ADCIRC model. The results highlight the practical feasibility of employing fast yet inexpensive data-driven models for resilient coastal management. 
    more » « less