skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: MERLIN++: parameter-free discovery of time series anomalies
The burgeoning age of IoT has reinforced the need for robust time series anomaly detection. While there are hundreds of anomaly detection methods in the literature, one definition, time series discords, has emerged as a competitive and popular choice for practitioners. Time series discords are subsequences of a time series that are maximally far away from their nearest neighbors. Perhaps the most attractive feature of discords is their simplicity. Unlike many of the parameter-laden methods proposed, discords require only a single parameter to be set by the user: the subsequence length. We believe that the utility of discords is reduced by sensitivity to even this single user choice. The obvious solution to this problem, computing discords of all lengths then selecting the best anomalies (under some measure), appears at first glance to be computationally untenable. However, in this work we discuss MERLIN, a recently introduced algorithm that can efficiently and exactly find discords of all lengths in massive time series archives. By exploiting computational redundancies, MERLIN is two orders of magnitude faster than comparable algorithms. Moreover, we show that by exploiting a little-known indexing technique called Orchard’s algorithm, we can create a new algorithm called MERLIN++, which is an order of magnitude faster than MERLIN, yet produces identical results. We demonstrate the utility of our ideas on a large and diverse set of experiments and show that MERLIN++ can discover subtle anomalies that defy existing algorithms or even careful human inspection. We further compare to five state-of-the-art rival methods, on the largest benchmark dataset for this task, and show that MERLIN++ is superior in terms of accuracy and speed.  more » « less
Award ID(s):
2103976
PAR ID:
10548866
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Data Mining and Knowledge Discovery
Date Published:
Journal Name:
Data Mining and Knowledge Discovery
Volume:
37
Issue:
2
ISSN:
1384-5810
Page Range / eLocation ID:
670 to 709
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Time series anomaly detection remains a perennially important research topic. If anything, it is a task that has become increasingly important in the burgeoning age of IoT. While there are hundreds of anomaly detection methods in the literature, one definition, time series discords, has emerged as a competitive and popular choice for practitioners. Time series discords are subsequences of a time series that are maximally far away from their nearest neighbors. Perhaps the most attractive feature of discords is their simplicity. Unlike many parameter laden methods, discords require only a single parameter to be set by the user: the subsequence length. In this work we argue that the utility of discords is reduced by sensitivity to this single user choice. The obvious solution to this problem, computing discords of all lengths then selecting the best anomalies (under some measure), seems to be computationally untenable. However, in this work we introduce MERLIN, an algorithm that can efficiently and exactly find discords of all lengths in massive time series archives. 
    more » « less
  2. We present an innovative approach to auto-annotate Expert Defined Linguistic Features (EDLFs) as subsequences in audio time series to improve audio deepfake discernment. In our prior work, these linguistic features – namely pitch, pause, breath, consonant release bursts, and overall audio quality, labeled by experts on the entire audio signal – have been shown to improve detection of audio deepfakes with AI algorithms. We now expand our approach to pilot a way to auto annotate subsequences in the time series that correspond to each EDLF. We developed an ensemble of discords, i.e. anomalies in time series, detected using matrix profiles across multiple discord lengths to identify multiple types of EDLFs. Working closely with linguistic experts, we evaluated where discords overlapped with EDLFs in the audio signal data. Our ensemble method to detect discords across multiple discord lengths achieves much higher accuracy than using individual discord lengths to detect EDLFs. With this approach and domain validation we establish the feasibility of using time series subsequences to capture EDLFs to supplement annotation by domain experts, for improved audio deepfake detection. 
    more » « less
  3. Editor in Chief: Johannes Fürnkranz (Ed.)
    Time series data remains a perennially important datatype considered in data mining. In the last decade there has been an increasing realization that time series data can be best understood by reasoning about time series subsequences on the basis of their similarity to other subsequences: the two most familiar such time series concepts being motifs and discords. Time series motifs refer to two particularly close subsequences, whereas time series discords indicate subsequences that are far from their nearest neighbors. However, we argue that it can sometimes be useful to simultaneously reason about a subsequence’s closeness to certain data and its distance to other data. In this work we introduce a novel primitive called the Contrast Profile that allows us to efficiently compute such a definition in a principled way. As we will show, the Contrast Profile has many downstream uses, including anomaly detection, data exploration, and preprocessing unstructured data for classification. We demonstrate the utility of the Contrast Profile by showing how it allows end-to-end classification in datasets with tens of billions of datapoints, and how it can be used to explore datasets and reveal subtle patterns that might otherwise escape our attention. Moreover, we demonstrate the generality of the Contrast Profile by presenting detailed case studies in domains as diverse as seismology, animal behavior, and cardiology. 
    more » « less
  4. In this paper, we address the problem of detecting and learning anomalies in high-dimensional data-streams in real-time. Following a data-driven approach, we propose an online and multivariate anomaly detection method that is suitable for the timely and accurate detection of anomalies. We propose our method for both semi-supervised and supervised settings. By combining the semi-supervised and supervised algorithms, we present a self-supervised online learning algorithm in which the semi-supervised algorithm trains the supervised algorithm to improve its detection performance over time. The methods are comprehensively analyzed in terms of computational complexity, asymptotic optimality, and false alarm rate. The performances of the proposed algorithms are also evaluated using real-world cybersecurity datasets, that show a significant improvement over the state-of-the-art results. 
    more » « less
  5. A fuzzy clustering algorithm with the ability to learn unsupervised can be used to detect objects of interest in semi-structured data. An online application of a fuzzy clustering algorithm with merging was implemented in both software and hardware to test anomaly detection in atomic force microscopy (AFM). The requisite components of the algorithm were all estimated, measured, and verified to meet real time constraints for incoming data. After clusters have been formed, representing the background of an image, any new cluster is an abnormality to the surface which is of interest to the user. This real-time detection of anomalies is important for identifying regions of interest for faster and higher resolution scanning. Results show this application is successfully capable of detecting anomalies in AFM topographic images. The approach taken in this paper is generic and can be applied to other applications with a continuous data stream. 
    more » « less