skip to main content


Title: Calibrated Nonparametric Scan Statistics for Anomalous Pattern Detection in Graphs
We propose a new approach, the calibrated nonparametric scan statistic (CNSS), for more accurate detection of anomalous patterns in large-scale, real-world graphs. Scan statistics identify connected subgraphs that are interesting or unexpected through maximization of a likelihood ratio statistic; in particular, nonparametric scan statistics (NPSSs) identify subgraphs with a higher than expected proportion of individually significant nodes. However, we show that recently proposed NPSS methods are miscalibrated, failing to account for the maximization of the statistic over the multiplicity of subgraphs. This results in both reduced detection power for subtle signals, and low precision of the detected subgraph even for stronger signals. Thus we develop a new statistical approach to recalibrate NPSSs, correctly adjusting for multiple hypothesis testing and taking the underlying graph structure into account. While the recalibration, based on randomization testing, is computationally expensive, we propose both an efficient (approximate) algorithm and new, closed-form lower bounds (on the expected maximum proportion of significant nodes for subgraphs of a given size, under the null hypothesis of no anomalous patterns). These advances, along with the integration of recent core-tree decomposition methods, enable CNSS to scale to large real-world graphs, with substantial improvement in the accuracy of detected subgraphs. Extensive experiments on both semi-synthetic and real-world datasets are demonstrated to validate the effectiveness of our proposed methods, in comparison with state-of-the-art counterparts.  more » « less
Award ID(s):
1954409
NSF-PAR ID:
10355236
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Volume:
36
Issue:
4
ISSN:
2159-5399
Page Range / eLocation ID:
4201 to 4209
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We generalize the spatial and subset scan statistics from the single to the multiple subset case. The two main approaches to defining the log-likelihood ratio statistic in the single subset case—the population-based and expectation-based scan statistics—are considered, leading to risk partitioning and multiple cluster detection scan statistics, respectively. We show that, for distributions in a separable exponential family, the risk partitioning scan statistic can be expressed as a scaled f-divergence of the normalized count and baseline vectors, and the multiple cluster detection scan statistic as a sum of scaled Bregman divergences. In either case, however, maximization of the scan statistic by exhaustive search over all partitionings of the data requires exponential time. To make this optimization computationally feasible, we prove sufficient conditions under which the optimal partitioning is guaranteed to be consecutive. This Consecutive Partitions Property generalizes the linear-time subset scanning property from two partitions (the detected subset and the remaining data elements) to the multiple partition case. While the number of consecutive partitionings of n elements into t partitions scales as O(n^(t−1)), making it computationally expensive for large t, we present a dynamic programming approach which identifies the optimal consecutive partitioning in O(n^2 t) time, thus allowing for the exact and efficient solution of large-scale risk partitioning and multiple cluster detection problems. Finally, we demonstrate the detection performance and practical utility of partition scan statistics using simulated and real-world data. Supplementary materials for this article are available online. 
    more » « less
  2. Leveraging protein-protein interaction networks to identify groups of proteins and their common functionality is an important problem in bioinformatics. Systems-level analysis of protein-protein interactions is made possible through network science and modeling of high-throughput data. From these analyses, small protein complexes are traditionally represented graphically as complete graphs or dense clusters of nodes. However, there are certain graph theoretic properties that have not been extensively studied in PPI networks, especially as they pertain to cluster discovery, such as planarity. Planarity of graphs have been used to reflect the physical constraints of real-world systems outside of bioinformatics, in areas such as mapping and imaging. Here, we investigate the planarity property in network models of protein complexes. We hypothesize that complexes represented as PPI subgraphs will tend to be planar, reflecting the actual physical interface and limits of components in the complex. When testing the planarity of known complex subgraphs in S. cerevisiae and selected mammalian PPIs, we find that a majority of validated complexes possess this planar property. We discuss the biological motivation of planar versus nonplanar subgraphs, observing that planar subgraphs tend to have longer protein components. Functional classification of planar versus nonplanar complex subgraphs reveals differences in annotation of these groups relating to cellular component organization, structural molecule activity, catalytic activity, and nucleic acid binding. These results provide a new quantitative and biologically motivated measure of real protein complexes in the network model, important for the development of future complex-finding algorithms in PPIs. Accounting for this property paves the way to new means for discovering new protein complexes and uncovering the functionality of unknown or novel proteins. s 
    more » « less
  3. Containerized microservices have been widely deployed in industry. Meanwhile, security issues also arise. Many security enhancement mechanisms for containerized microservices require predefined rules and policies. However, it is challenging when it comes to thousands of microservices and a massive amount of real-time unstructured data. Hence, automatic policy generation becomes indispensable. In this paper, we focus on the automatic solution for the security problem: irregular traffic detection for RPCs. We propose Informer, which is a two-phase machine learning framework to track the traffic of each RPC and report anomalous points automatically. Firstly, we identify RPC chain patterns by density-based clustering techniques and build a graph for each critical pattern. Next, we solve the irregular RPC traffic detection problem as a prediction problem for time-series of attributed graphs by leveraging spatial-temporal graph convolution networks. Since the framework builds multiple models and makes individual predictions for each RPC chain pattern, it can be efficiently updated upon legitimate changes in any of the graphs. In evaluations, we applied Informer to a dataset containing more than 7 billion lines of raw RPC logs sampled from an large Kubernetes system for two weeks. We provide two case studies of detected real-world threats. As a result, our framework found fine-grained RPC chain patterns and accurately captured the anomalies in a dynamic and complicated microservice production scenario, which demonstrates the effectiveness of Informer. 
    more » « less
  4. This paper addresses detecting anomalous patterns in images, time-series, and tensor data when the location and scale of the pattern and the pattern itself is unknown a priori. The multiscale scan statistic convolves the proposed pattern with the image at various scales and returns the maximum of the resulting tensor. Scale corrected multiscale scan statistics apply different standardizations at each scale, and the limiting distribution under the null hypothesis---that the data is only noise---is known for smooth patterns. We consider the problem of simultaneously learning and detecting the anomalous pattern from a dictionary of smooth patterns and a database of many tensors. To this end, we show that the multiscale scan statistic is a subexponential random variable, and prove a chaining lemma for standardized suprema, which may be of independent interest. Then by averaging the statistics over the database of tensors we can learn the pattern and obtain Bernstein-type error bounds. We will also provide a construction of an epsilon-net of the location and scale parameters, providing a computationally tractable approximation with similar error bounds. 
    more » « less
  5. Many network/graph structures are continuously monitored by various sensors that are placed at a subset of nodes and edges. The multidimensional data collected from these sensors over time create large-scale graph data in which the data points are highly dependent. Monitoring large-scale attributed networks with thousands of nodes and heterogeneous sensor data to detect anomalies and unusual events is a complex and computationally expensive process. This paper introduces a new generic approach inspired by state-space models for network anomaly detection that can utilize the information from the network topology, the node attributes (sensor data), and the anomaly propagation sets in an integrated manner to analyze the entire network all at once. This article presents how heterogeneous network sensor data can be analyzed to locate the sources of anomalies as well as the anomalous regions in a network, which can be impacted by one or multiple anomalies at any time instance. Experimental results demonstrate the superior performance of our proposed framework in detecting anomalies in attributed graphs. Summary of Contribution: With the increasing availability of large-scale network sensors and rapid advances in artificial intelligence methods, fundamentally new analytical tools are needed that can integrate data collected from sensors across the networks for decision making while taking into account the stochastic and topological dependencies between nodes, sensors, and anomalies. This paper develops a framework to intelligently and efficiently analyze complex and highly dependent data collected from disparate sensors across large-scale network/graph structures to detect anomalies and abnormal behavior in real time. Unlike general purpose (often black-box) machine learning models, this paper proposes a unique framework for network/graph structures that incorporates the complexities of networks and interdependencies between network entities and sensors. Because of the multidisciplinary nature of the paper that involves optimization, machine learning, and system monitoring and control, it can help researchers in both operations research and computer science domains to develop new network-specific computing tools and machine learning frameworks to efficiently manage large-scale network data. 
    more » « less