skip to main content

This content will become publicly available on June 30, 2023

Title: Calibrated Nonparametric Scan Statistics for Anomalous Pattern Detection in Graphs
We propose a new approach, the calibrated nonparametric scan statistic (CNSS), for more accurate detection of anomalous patterns in large-scale, real-world graphs. Scan statistics identify connected subgraphs that are interesting or unexpected through maximization of a likelihood ratio statistic; in particular, nonparametric scan statistics (NPSSs) identify subgraphs with a higher than expected proportion of individually significant nodes. However, we show that recently proposed NPSS methods are miscalibrated, failing to account for the maximization of the statistic over the multiplicity of subgraphs. This results in both reduced detection power for subtle signals, and low precision of the detected subgraph even for stronger signals. Thus we develop a new statistical approach to recalibrate NPSSs, correctly adjusting for multiple hypothesis testing and taking the underlying graph structure into account. While the recalibration, based on randomization testing, is computationally expensive, we propose both an efficient (approximate) algorithm and new, closed-form lower bounds (on the expected maximum proportion of significant nodes for subgraphs of a given size, under the null hypothesis of no anomalous patterns). These advances, along with the integration of recent core-tree decomposition methods, enable CNSS to scale to large real-world graphs, with substantial improvement in the accuracy of detected subgraphs. Extensive experiments more » on both semi-synthetic and real-world datasets are demonstrated to validate the effectiveness of our proposed methods, in comparison with state-of-the-art counterparts. « less
; ;
Award ID(s):
Publication Date:
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Page Range or eLocation-ID:
4201 to 4209
Sponsoring Org:
National Science Foundation
More Like this
  1. We generalize the spatial and subset scan statistics from the single to the multiple subset case. The two main approaches to defining the log-likelihood ratio statistic in the single subset case—the population-based and expectation-based scan statistics—are considered, leading to risk partitioning and multiple cluster detection scan statistics, respectively. We show that, for distributions in a separable exponential family, the risk partitioning scan statistic can be expressed as a scaled f-divergence of the normalized count and baseline vectors, and the multiple cluster detection scan statistic as a sum of scaled Bregman divergences. In either case, however, maximization of the scan statistic by exhaustive search over all partitionings of the data requires exponential time. To make this optimization computationally feasible, we prove sufficient conditions under which the optimal partitioning is guaranteed to be consecutive. This Consecutive Partitions Property generalizes the linear-time subset scanning property from two partitions (the detected subset and the remaining data elements) to the multiple partition case. While the number of consecutive partitionings of n elements into t partitions scales as O(n^(t−1)), making it computationally expensive for large t, we present a dynamic programming approach which identifies the optimal consecutive partitioning in O(n^2 t) time, thus allowing for themore »exact and efficient solution of large-scale risk partitioning and multiple cluster detection problems. Finally, we demonstrate the detection performance and practical utility of partition scan statistics using simulated and real-world data. Supplementary materials for this article are available online.« less
  2. This paper addresses detecting anomalous patterns in images, time-series, and tensor data when the location and scale of the pattern and the pattern itself is unknown a priori. The multiscale scan statistic convolves the proposed pattern with the image at various scales and returns the maximum of the resulting tensor. Scale corrected multiscale scan statistics apply different standardizations at each scale, and the limiting distribution under the null hypothesis---that the data is only noise---is known for smooth patterns. We consider the problem of simultaneously learning and detecting the anomalous pattern from a dictionary of smooth patterns and a database of many tensors. To this end, we show that the multiscale scan statistic is a subexponential random variable, and prove a chaining lemma for standardized suprema, which may be of independent interest. Then by averaging the statistics over the database of tensors we can learn the pattern and obtain Bernstein-type error bounds. We will also provide a construction of an epsilon-net of the location and scale parameters, providing a computationally tractable approximation with similar error bounds.
  3. Many network/graph structures are continuously monitored by various sensors that are placed at a subset of nodes and edges. The multidimensional data collected from these sensors over time create large-scale graph data in which the data points are highly dependent. Monitoring large-scale attributed networks with thousands of nodes and heterogeneous sensor data to detect anomalies and unusual events is a complex and computationally expensive process. This paper introduces a new generic approach inspired by state-space models for network anomaly detection that can utilize the information from the network topology, the node attributes (sensor data), and the anomaly propagation sets in an integrated manner to analyze the entire network all at once. This article presents how heterogeneous network sensor data can be analyzed to locate the sources of anomalies as well as the anomalous regions in a network, which can be impacted by one or multiple anomalies at any time instance. Experimental results demonstrate the superior performance of our proposed framework in detecting anomalies in attributed graphs. Summary of Contribution: With the increasing availability of large-scale network sensors and rapid advances in artificial intelligence methods, fundamentally new analytical tools are needed that can integrate data collected from sensors across the networksmore »for decision making while taking into account the stochastic and topological dependencies between nodes, sensors, and anomalies. This paper develops a framework to intelligently and efficiently analyze complex and highly dependent data collected from disparate sensors across large-scale network/graph structures to detect anomalies and abnormal behavior in real time. Unlike general purpose (often black-box) machine learning models, this paper proposes a unique framework for network/graph structures that incorporates the complexities of networks and interdependencies between network entities and sensors. Because of the multidisciplinary nature of the paper that involves optimization, machine learning, and system monitoring and control, it can help researchers in both operations research and computer science domains to develop new network-specific computing tools and machine learning frameworks to efficiently manage large-scale network data.« less
  4. Summary A nonparametric framework for changepoint detection, based on scan statistics utilizing graphs that represent similarities among observations, is gaining attention owing to its flexibility and good performance for high-dimensional and non-Euclidean data sequences. However, this graph-based framework faces challenges when there are repeated observations in the sequence, which is often the case for discrete data such as network data. In this article we extend the graph-based framework to solve this problem by averaging or taking the union of all possible optimal graphs resulting from repeated observations. We consider both the single-changepoint alternative and the changed-interval alternative, and derive analytical formulas to control the Type I error for the new methods, making them readily applicable to large datasets. The extended methods are illustrated on an application in detecting changes in a sequence of dynamic networks over time. All proposed methods are implemented in an $\texttt{R}$ package $\texttt{gSeg}$ available on CRAN.
  5. Leveraging protein-protein interaction networks to identify groups of proteins and their common functionality is an important problem in bioinformatics. Systems-level analysis of protein-protein interactions is made possible through network science and modeling of high-throughput data. From these analyses, small protein complexes are traditionally represented graphically as complete graphs or dense clusters of nodes. However, there are certain graph theoretic properties that have not been extensively studied in PPI networks, especially as they pertain to cluster discovery, such as planarity. Planarity of graphs have been used to reflect the physical constraints of real-world systems outside of bioinformatics, in areas such as mapping and imaging. Here, we investigate the planarity property in network models of protein complexes. We hypothesize that complexes represented as PPI subgraphs will tend to be planar, reflecting the actual physical interface and limits of components in the complex. When testing the planarity of known complex subgraphs in S. cerevisiae and selected mammalian PPIs, we find that a majority of validated complexes possess this planar property. We discuss the biological motivation of planar versus nonplanar subgraphs, observing that planar subgraphs tend to have longer protein components. Functional classification of planar versus nonplanar complex subgraphs reveals differences in annotation ofmore »these groups relating to cellular component organization, structural molecule activity, catalytic activity, and nucleic acid binding. These results provide a new quantitative and biologically motivated measure of real protein complexes in the network model, important for the development of future complex-finding algorithms in PPIs. Accounting for this property paves the way to new means for discovering new protein complexes and uncovering the functionality of unknown or novel proteins. s« less