Title: Introducing Mplots: scaling time series recurrence plots to massive datasets
Abstract

Time series similarity matrices (informally, recurrence plots or dot-plots) are useful tools for time series data mining. They can be used to guide data exploration, and various useful features can be derived from them and then fed into downstream analytics. However, time series similarity matrices suffer from very poor scalability, taxing both time and memory requirements. In this work, we introduce novel ideas that increase, by several orders of magnitude, the size of the largest time series similarity matrices that can be examined. The first idea is a novel algorithm that computes the matrices in a way that removes the dependency on the subsequence length. This algorithm is so fast that it allows us to address datasets where memory limitations begin to dominate. Our second novel contribution is a multiscale algorithm that computes an approximation of the matrix appropriate for the limitations of the user’s memory and screen resolution, then performs a local, just-in-time recomputation of any region that the user wishes to zoom in on. Given that this largely removes time and space barriers, human visual attention then becomes the bottleneck. We further introduce algorithms that search massive matrices with quadrillions of cells and then prioritize regions for later examination by either humans or algorithms. We demonstrate the utility of our ideas for data exploration, segmentation, and classification in domains as diverse as astronomy, bioinformatics, entomology, and wildlife monitoring.
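
As a point of reference for the scalability problem the abstract describes, a brute-force similarity matrix might be sketched as follows. This is not the Mplots algorithm. The function names `znorm` and `similarity_matrix` are hypothetical, and the use of z-normalized Euclidean distance is an assumption, since the abstract does not name the distance measure.

```python
import numpy as np

def znorm(x):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    sd = x.std()
    return (x - x.mean()) / sd if sd > 0 else np.zeros_like(x)

def similarity_matrix(a, b, m):
    """Brute-force distance matrix between every length-m subsequence of a and of b.

    Assumed distance: z-normalized Euclidean (not stated in the abstract).
    Returns an array of shape (len(a) - m + 1, len(b) - m + 1).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    subs_a = np.array([znorm(a[i:i + m]) for i in range(len(a) - m + 1)])
    subs_b = np.array([znorm(b[j:j + m]) for j in range(len(b) - m + 1)])
    # Pairwise differences between all subsequence pairs, then the Euclidean norm.
    diff = subs_a[:, None, :] - subs_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Self-join of a short sine wave: periodic structure appears as diagonal stripes.
t = np.sin(np.linspace(0, 8 * np.pi, 60))
S = similarity_matrix(t, t, 10)
```

For two series of length n, this sketch costs on the order of n² · m operations and stores on the order of n² cells, which illustrates both the dependency on the subsequence length m and the memory pressure that the abstract refers to.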

 
NSF-PAR ID:
10525166
Publisher / Repository:
Springer Science + Business Media
Journal Name:
Journal of Big Data
Volume:
11
Issue:
1
ISSN:
2196-1115
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Editor in Chief: Johannes Fürnkranz (Ed.)
    Time series data remains a perennially important datatype considered in data mining. In the last decade there has been an increasing realization that time series data can be best understood by reasoning about time series subsequences on the basis of their similarity to other subsequences: the two most familiar such time series concepts being motifs and discords. Time series motifs refer to two particularly close subsequences, whereas time series discords indicate subsequences that are far from their nearest neighbors. However, we argue that it can sometimes be useful to simultaneously reason about a subsequence’s closeness to certain data and its distance to other data. In this work we introduce a novel primitive called the Contrast Profile that allows us to efficiently compute such a definition in a principled way. As we will show, the Contrast Profile has many downstream uses, including anomaly detection, data exploration, and preprocessing unstructured data for classification. We demonstrate the utility of the Contrast Profile by showing how it allows end-to-end classification in datasets with tens of billions of datapoints, and how it can be used to explore datasets and reveal subtle patterns that might otherwise escape our attention. Moreover, we demonstrate the generality of the Contrast Profile by presenting detailed case studies in domains as diverse as seismology, animal behavior, and cardiology. 
  2. We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. Generally, computing a similarity matrix for $n$ data points requires $\Omega(n^2)$ similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise, e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular Nyström method to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in sublinear time in the size of the matrix, producing a rank-$s$ approximation with just $O(ns)$ similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We demonstrate high accuracy of the approximated similarity matrices in tasks of document classification, sentence similarity, and cross-document coreference. 
  3. Given a long time series, the distance profile of a query time series gives the distance between the query and every possible subsequence of the long time series. MASS (Mueen’s Algorithm for Similarity Search) is an algorithm that efficiently computes the distance profile under z-normalized Euclidean distance (Mueen et al. in The fastest similarity search algorithm for time series subsequences under Euclidean distance. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html, 2017). MASS is recognized as a useful tool in many data mining works. However, complete documentation of the increasingly efficient versions of the algorithm does not exist. In this paper, we formalize the notion of a distance profile, describe four versions of the MASS algorithm, show several extensions of distance profiles under various operating conditions, describe how MASS improves the performance of existing data mining algorithms, and finally, show the utility of MASS in domains including seismology, robotics and power grids. 
  4. Abstract

    Change‐point detection studies the problem of detecting the changes in the underlying distribution of the data stream as soon as possible after the change happens. Modern large‐scale, high‐dimensional, and complex streaming data call for computationally (memory) efficient sequential change‐point detection algorithms that are also statistically powerful. This gives rise to a computation versus statistical power trade‐off, an aspect less emphasized in the past in classic literature. This tutorial takes this new perspective and reviews several sequential change‐point detection procedures, ranging from classic sequential change‐point detection algorithms to more recent non‐parametric procedures that consider computation, memory efficiency, and model robustness in the algorithm design. Our survey also contains classic performance analysis, which provides useful techniques for analyzing new procedures.

    This article is categorized under:

    Statistical Models > Time Series Models

    Algorithms and Computational Methods > Algorithms

    Data: Types and Structure > Time Series, Stochastic Processes, and Functional Data

     
  5. Many applications generate and/or consume multi-variate temporal data, and experts often lack the means to adequately and systematically search for and interpret multi-variate observations. In this article, we first observe that multi-variate time series often carry localized multi-variate temporal features that are robust against noise. We then argue that these multi-variate temporal features can be extracted by simultaneously considering, at multiple scales, temporal characteristics of the time series along with external knowledge, including variate relationships that are known a priori. Relying on these observations, we develop data models and algorithms to detect robust multi-variate temporal (RMT) features that can be indexed for efficient and accurate retrieval and can be used for supporting data exploration and analysis tasks. Experiments confirm that the proposed RMT algorithm is highly effective and efficient in identifying robust multi-scale temporal features of multi-variate time series.

     
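
The distance profile described in item 3 above can be illustrated with a minimal sketch that uses an FFT-based sliding dot product, in the spirit of MASS. It does not reproduce any of the four MASS versions that the paper documents. The function name `distance_profile` is hypothetical, and the sketch assumes that neither the query nor any window of the series is constant (a zero standard deviation would cause a division by zero).

```python
import numpy as np

def distance_profile(query, series):
    """Z-normalized Euclidean distance from `query` to every window of `series`.

    Assumes no window (and not the query) has zero standard deviation.
    Returns an array of length len(series) - len(query) + 1.
    """
    q = np.asarray(query, dtype=float)
    t = np.asarray(series, dtype=float)
    m, n = len(q), len(t)

    # Sliding dot product via FFT: convolve the series with the reversed query.
    fft_len = 1 << (n + m - 1).bit_length()
    conv = np.fft.irfft(np.fft.rfft(t, fft_len) * np.fft.rfft(q[::-1], fft_len), fft_len)
    qt = conv[m - 1:n]  # qt[i] = dot(q, t[i:i+m])

    # Rolling mean and (population) standard deviation of every window.
    csum = np.cumsum(np.insert(t, 0, 0.0))
    csum2 = np.cumsum(np.insert(t ** 2, 0, 0.0))
    mu_t = (csum[m:] - csum[:-m]) / m
    var_t = (csum2[m:] - csum2[:-m]) / m - mu_t ** 2
    sigma_t = np.sqrt(np.maximum(var_t, 0.0))

    mu_q, sigma_q = q.mean(), q.std()
    # Squared distance from the Pearson correlation of each window with the query.
    corr = (qt - m * mu_q * mu_t) / (m * sigma_q * sigma_t)
    return np.sqrt(np.maximum(2.0 * m * (1.0 - corr), 0.0))

# Usage: a query cut from the series itself has its best match at its own offset.
rng = np.random.default_rng(0)
series = rng.standard_normal(100)
profile = distance_profile(series[20:28], series)
best = int(np.argmin(profile))
```

Computing the dot products with an FFT costs on the order of n log n operations per query, independent of the query length, whereas comparing the query against every window directly costs on the order of n · m.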