
Title: Mining Multivariate Discrete Event Sequences for Knowledge Discovery and Anomaly Detection
Modern physical systems deploy large numbers of sensors to record, at different timestamps, the status of different system components via measurements such as temperature, pressure, and speed, as well as each component's categorical state. Depending on the measurement values, there are two kinds of sequences: continuous and discrete. For continuous sequences, there is a host of state-of-the-art algorithms for anomaly detection based on time-series analysis, but there is a lack of effective methodologies tailored specifically to discrete event sequences. This paper proposes an analytics framework for discrete event sequences for knowledge discovery and anomaly detection. During the training phase, the framework extracts pairwise relationships among discrete event sequences using a neural machine translation model, viewing each discrete event sequence as a "natural language". The relationship between two sequences is quantified by how well one discrete event sequence is "translated" into the other. These pairwise relationships are aggregated into a multivariate relationship graph that clusters the structural knowledge of the underlying system and essentially discovers the hidden relationships among discrete sequences. This graph quantifies system behavior during normal operation. During testing, if one or more pairwise relationships are violated, an anomaly is detected. The proposed framework is evaluated on two real-world datasets: a proprietary dataset collected from a physical plant, where it is shown to be effective in extracting pairwise sensor relationships for knowledge discovery and anomaly detection, and a public hard disk drive dataset, where its ability to predict upcoming disk failures is illustrated.
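The train-then-test flow described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the neural machine translation model is replaced here by a simple per-symbol lookup table, and the names `fit_symbol_map`, `translation_score`, `build_relationship_graph`, and `violated_edges`, the thresholds `min_score` and `tolerance`, and the toy sensor data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def fit_symbol_map(src, dst):
    """Stand-in for a translation model (assumption): for each source symbol,
    remember the target symbol that most often co-occurs at the same time step."""
    counts = defaultdict(Counter)
    for s, d in zip(src, dst):
        counts[s][d] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def translation_score(mapping, src, dst):
    """Fraction of target symbols reproduced by 'translating' the source."""
    hits = sum(mapping.get(s) == d for s, d in zip(src, dst))
    return hits / len(dst)


def build_relationship_graph(train, min_score=0.9):
    """Keep a directed edge (a, b) when a translates well into b on normal data."""
    graph = {}
    for a, b in permutations(train, 2):
        mapping = fit_symbol_map(train[a], train[b])
        score = translation_score(mapping, train[a], train[b])
        if score >= min_score:
            graph[(a, b)] = (mapping, score)
    return graph


def violated_edges(graph, test, tolerance=0.2):
    """Edges whose translation score on test data drops well below training."""
    return [
        (a, b)
        for (a, b), (mapping, train_score) in graph.items()
        if translation_score(mapping, test[a], test[b]) < train_score - tolerance
    ]


# Toy data (hypothetical): valve and pump move together, fan is unrelated.
train = {
    "valve": list("OOCCOOCCOOCC"),
    "pump": list("RRSSRRSSRRSS"),
    "fan": list("HLHLLHHLHLLH"),
}
graph = build_relationship_graph(train)

# In the test window the pump stays in state R while the valve closes.
test = {
    "valve": list("OOCCOOCC"),
    "pump": list("RRSSRRRR"),
    "fan": list("HLHLLHHL"),
}
broken = violated_edges(graph, test)
```

In this toy setting only the valve and pump pair enters the graph, and both of its directed edges are reported as violated once the pump stops following the valve. A real system would replace the lookup table with a learned sequence-to-sequence model and a translation-quality metric.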
Journal Name:
Proceedings of the 50th IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
Page Range / eLocation ID:
552 to 563
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background

    Protein–protein interaction (PPI) is vital for life processes, disease treatment, and drug discovery. The computational prediction of PPI is relatively inexpensive and efficient when compared to traditional wet-lab experiments. Given a new protein, one may wish to find whether the protein has any PPI relationship with other existing proteins. Current computational PPI prediction methods usually compare the new protein to existing proteins one by one in a pairwise manner. This is time consuming.


    In this work, we propose a more efficient model, called deep hash learning protein-and-protein interaction (DHL-PPI), to predict all-against-all PPI relationships in a database of proteins. First, DHL-PPI encodes a protein sequence into a binary hash code based on deep features extracted from the protein sequences using deep learning techniques. This encoding scheme enables us to turn the PPI discrimination problem into a much simpler searching problem. The binary hash code for a protein sequence can be regarded as a number. Thus, in the pre-screening stage of DHL-PPI, the string matching problem of comparing a protein sequence against a database with M proteins can be transformed into a much simpler problem: finding a number inside a sorted array of length M. This pre-screening process narrows down the search to a much smaller set of candidate proteins for further confirmation. As a final step, DHL-PPI uses the Hamming distance to verify the final PPI relationship.


    The experimental results confirmed that DHL-PPI is feasible and effective. Using a dataset with strictly negative PPI examples of four species, DHL-PPI is shown to be superior or competitive when compared to the other state-of-the-art methods in terms of precision, recall or F1 score. Furthermore, in the prediction stage, the proposed DHL-PPI reduced the time complexity from $O(M^2)$ to $O(M \log M)$ for performing an all-against-all PPI prediction for a database with M proteins. With the proposed approach, a protein database can be preprocessed and stored for later search using the proposed encoding scheme. This can provide a more efficient way to cope with the rapidly increasing volume of protein datasets.
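The pre-screening and verification steps described in this abstract can be sketched as a binary search over sorted integer codes followed by a Hamming-distance check. This is only an illustration under stated assumptions: the hash codes below are hand-written 8-bit values rather than outputs of a deep model, and the `window` and `max_distance` parameters, along with the function names, are hypothetical choices not taken from the abstract.

```python
from bisect import bisect_left


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two integer hash codes differ."""
    return bin(a ^ b).count("1")


def build_index(codes):
    """Sort proteins by hash code once; returns parallel lists (keys, names)."""
    pairs = sorted((code, name) for name, code in codes.items())
    return [c for c, _ in pairs], [n for _, n in pairs]


def prescreen(keys, names, query, window=2):
    """Binary-search the query code, then take nearby entries as candidates.
    The neighbourhood window is an assumption made for this sketch."""
    pos = bisect_left(keys, query)
    lo, hi = max(0, pos - window), min(len(keys), pos + window)
    return list(zip(keys[lo:hi], names[lo:hi]))


def verify(candidates, query, max_distance=1):
    """Keep candidates whose code is within max_distance bits of the query."""
    return [name for code, name in candidates if hamming(code, query) <= max_distance]


# Hypothetical 8-bit hash codes for five proteins.
codes = {
    "P1": 0b10110010,
    "P2": 0b10110011,
    "P3": 0b01001100,
    "P4": 0b11110000,
    "P5": 0b10110110,
}
keys, names = build_index(codes)
query = 0b10110010
candidates = prescreen(keys, names, query)
matches = verify(candidates, query)
```

Sorting costs O(M log M) once, and each lookup is then a logarithmic search, which is consistent with the complexity stated above. Note that numeric adjacency is not the same as Hamming proximity, so a fixed window like this one can miss close codes (here P5, which is one bit away from the query); how the actual method selects candidates is not specified in the abstract.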

  2. Lateral movement is a key stage of system compromise used by advanced persistent threats. Detecting it is no simple task. When network host logs are abstracted into discrete temporal graphs, the problem can be reframed as anomalous edge detection in an evolving network. Research in modern deep graph learning techniques has produced many creative and complicated models for this task. However, as is the case in many machine learning fields, the generality of models is of paramount importance for accuracy and scalability during training and inference. In this article, we propose a formalized approach to this problem with a framework we call Euler. It consists of a model-agnostic graph neural network stacked upon a model-agnostic sequence encoding layer such as a recurrent neural network. Models built according to the Euler framework can easily distribute their graph convolutional layers across multiple machines for large performance improvements. Additionally, we demonstrate that Euler-based models are as good as, or better than, every state-of-the-art approach to anomalous link detection and prediction that we tested. As anomaly-based intrusion detection systems, our models efficiently identified anomalous connections between entities with high precision and outperformed all other unsupervised techniques for anomalous lateral movement detection. Additionally, we show that as a piece of a larger anomaly detection pipeline, Euler models perform well enough for use in real-world systems. With more advanced, yet still lightweight, alerting mechanisms ingesting the embeddings produced by Euler models, precision is boosted from 0.243 to 0.986 on real-world network traffic.
  3. Abstract

    Deep generative learning can be used not only for generating new data with statistical characteristics derived from input data but also for anomaly detection, by separating nominal and anomalous instances based on their reconstruction quality. In this paper, we explore the performance of three unsupervised deep generative models—variational autoencoders (VAEs) with Gaussian, Bernoulli, and Boltzmann priors—in detecting anomalies in multivariate time series of commercial-flight operations. We created two VAE models with discrete latent variables (DVAEs), one with a factorized Bernoulli prior and one with a restricted Boltzmann machine (RBM) with novel positive-phase architecture as prior, because of the demand for discrete-variable models in machine-learning applications and because the integration of quantum devices based on two-level quantum systems requires such models. To the best of our knowledge, our work is the first that applies DVAE models to anomaly-detection tasks in the aerospace field. The DVAE with RBM prior, using a relatively simple—and classically or quantum-mechanically enhanceable—sampling technique for the evolution of the RBM’s negative phase, performed better in detecting anomalies than the Bernoulli DVAE and on par with the Gaussian model, which has a continuous latent space. The transfer of a model to an unseen dataset with the same anomaly but without re-tuning of hyperparameters or re-training noticeably impaired anomaly-detection performance, but performance could be improved by post-training on the new dataset. The RBM model was robust to change of anomaly type and phase of flight during which the anomaly occurred. Our studies demonstrate the competitiveness of a discrete deep generative model with its Gaussian counterpart on anomaly-detection problems. Moreover, the DVAE model with RBM prior can be easily integrated with quantum sampling by outsourcing its generative process to measurements of quantum states obtained from a quantum annealer or gate-model device.
  4.
    In successful enterprise attacks, adversaries often need to gain access to additional machines beyond their initial point of compromise, a set of internal movements known as lateral movement. We present Hopper, a system for detecting lateral movement based on commonly available enterprise logs. Hopper constructs a graph of login activity among internal machines and then identifies suspicious sequences of logins that correspond to lateral movement. To understand the larger context of each login, Hopper employs an inference algorithm to identify the broader path(s) of movement that each login belongs to and the causal user responsible for performing a path's logins. Hopper then leverages this path inference algorithm, in conjunction with a set of detection rules and a new anomaly scoring algorithm, to surface the login paths most likely to reflect lateral movement. On a 15-month enterprise dataset consisting of over 780 million internal logins, Hopper achieves a 94.5% detection rate across over 300 realistic attack scenarios, including one red team attack, while generating an average of < 9 alerts per day. In contrast, to detect the same number of attacks, prior state-of-the-art systems would need to generate nearly 8× as many false positives. 
  5.
    The marine-based West Antarctic Ice Sheet (WAIS) is currently retreating due to shifting wind-driven oceanic currents that transport warm waters toward the ice margin, resulting in ice shelf thinning and accelerated mass loss of the WAIS. Previous results from geologic drilling on Antarctica’s continental margins show significant variability in marine-based ice sheet extent during the late Neogene and Quaternary. Numerical models indicate a fundamental role for oceanic heat in controlling this variability over at least the past 20 My. Although evidence for past ice sheet variability has been collected in marginal settings, sedimentologic sequences from the outer continental shelf are required to evaluate the extent of past ice sheet variability and the associated oceanic forcings and feedbacks. International Ocean Discovery Program Expedition 374 drilled a latitudinal and depth transect of five drill sites from the outer continental shelf to rise in the eastern Ross Sea to resolve the relationship between climatic and oceanic change and WAIS evolution through the Neogene and Quaternary. This location was selected because numerical ice sheet models indicate that this sector of Antarctica is highly sensitive to changes in ocean heat flux. The expedition was designed for optimal data-model integration and will enable an improved understanding of the sensitivity of Antarctic Ice Sheet (AIS) mass balance during warmer-than-present climates (e.g., the Pleistocene “super interglacials,” the mid-Pliocene, and the late early to middle Miocene). 
The principal goals of Expedition 374 were to
• Evaluate the contribution of West Antarctica to far-field ice volume and sea level estimates;
• Reconstruct ice-proximal atmospheric and oceanic temperatures to identify past polar amplification and assess its forcings and feedbacks;
• Assess the role of oceanic forcing (e.g., sea level and temperature) on AIS stability/instability;
• Identify the sensitivity of the AIS to Earth’s orbital configuration under a variety of climate boundary conditions; and
• Reconstruct eastern Ross Sea paleobathymetry to examine relationships between seafloor geometry, ice sheet stability/instability, and global climate.
To achieve these objectives, we will
• Use data and models to reconcile intervals of maximum Neogene and Quaternary Antarctic ice advance with far-field records of eustatic sea level change;
• Reconstruct past changes in oceanic and atmospheric temperatures using a multiproxy approach;
• Reconstruct Neogene and Quaternary sea ice margin fluctuations in datable marine continental slope and rise records and correlate these records to existing inner continental shelf records;
• Examine relationships among WAIS stability/instability, Earth’s orbital configuration, oceanic temperature and circulation, and atmospheric pCO2; and
• Constrain the timing of Ross Sea continental shelf overdeepening and assess its impact on Neogene and Quaternary ice dynamics.
Expedition 374 was carried out from January to March 2018, departing from Lyttelton, New Zealand. We recovered 1292.70 m of high-quality cores from five sites spanning the early Miocene to late Quaternary. Three sites were cored on the continental shelf (Sites U1521, U1522, and U1523). At Site U1521, we cored a 650 m thick sequence of interbedded diamictite, mudstone, and diatomite, penetrating the Ross Sea seismic Unconformity RSU4.
The depositional reconstructions of past glacial and open-marine conditions at this site will provide unprecedented insight into environmental change on the Antarctic continental shelf during the early and middle Miocene. At Site U1522, we cored a discontinuous upper Miocene to Pleistocene sequence of glacial and glaciomarine strata from the outer shelf, with the primary objective to penetrate and date seismic Unconformity RSU3, which is interpreted to represent the first major continental shelf–wide expansion and coalescing of marine-based ice streams from both East and West Antarctica. At Site U1523, we cored a sediment drift located beneath the westerly flowing Antarctic Slope Current (ASC). Cores from this site will provide a record of the changing vigor of the ASC through time. Such a reconstruction will enable testing of the hypothesis that changes in the vigor of the ASC represent a key control on regulating heat flux onto the continental shelf, resulting in the ASC playing a fundamental role in ice sheet mass balance. We also cored two sites on the continental slope and rise. At Site U1524, we cored a Plio–Pleistocene sedimentary sequence on the continental rise on the levee of the Hillary Canyon, which is one of the largest conduits of Antarctic Bottom Water delivery from the Antarctic continental shelf into the abyssal ocean. Drilling at Site U1524 was intended to penetrate into middle Miocene and older strata but was initially interrupted by drifting sea ice that forced us to abandon coring in Hole U1524A at 399.5 m drilling depth below seafloor (DSF). We moved to a nearby alternate site on the continental slope (U1525) to core a single hole with a record complementary to the upper part of the section recovered at Site U1524. We returned to Site U1524 3 days later, after the sea ice cleared. We then cored Hole U1524C with the rotary core barrel with the intention of reaching the target depth of 1000 m DSF. 
However, we were forced to terminate Hole U1524C at 441.9 m DSF due to a mechanical failure with the vessel that resulted in termination of all drilling operations and a return to Lyttelton 16 days earlier than scheduled. The loss of 39% of our operational days significantly impacted our ability to achieve all Expedition 374 objectives as originally planned. In particular, we were not able to obtain the deeper time record of the middle Miocene on the continental rise or abyssal sequences that would have provided a continuous and contemporaneous archive to the high-quality (but discontinuous) record from Site U1521 on the continental shelf. The mechanical failure also meant we could not recover sediment cores from proposed Site RSCR-19A, which was targeted to obtain a high-fidelity, continuous record of upper Neogene and Quaternary pelagic/hemipelagic sedimentation. Despite our failure to recover a shelf-to-rise transect for the Miocene, a continental shelf-to-rise transect for the Pliocene to Pleistocene interval is possible through comparison of the high-quality records from Site U1522 with those from Site U1525 and legacy cores from the Antarctic Geological Drilling Project (ANDRILL). 