Graph scan statistics have become popular for event detection in networks. This methodology involves finding connected subgraphs that maximize a certain anomaly function, but maximizing these functions is computationally hard in general. We develop a novel approach for graph scan statistics with connectivity constraints. Our algorithm Approx-MultilinearScan relies on an algebraic technique called multilinear detection, and it improves over prior methods for large networks. We also develop a Pregel-based parallel version of this algorithm in Giraph, MultilinearScanGiraph, that allows us to solve instances with over 40 million edges, which is more than one order of magnitude larger than existing methods.
more »
« less
Graph scan statistics with uncertainty
Scan statistics is one of the most popular approaches for anomaly detection in spatial and network data. In practice, there are numerous sources of uncertainty in the observed data. However, most prior works have overlooked such uncertainty, which can affect the accuracy and inferences of such meth- ods. In this paper, we develop the first systematic approach to incorporating uncertainty in scan statistics. We study two formulations for robust scan statistics, one based on the sam- ple average approximation and the other using a max-min objective. We show that uncertainty significantly increases the computational complexity of these problems. Rigorous algorithms and efficient heuristics for both formulations are developed with justification of theoretical bounds. We evaluate our proposed methods on synthetic and real datasets, and we observe that our methods give significant improvement in the detection power as well as optimization objective, relative to a baseline.
more »
« less
- Award ID(s):
- 1633028
- PAR ID:
- 10067757
- Date Published:
- Journal Name:
- Proc. Association for the Advancement of Articial Intelligence (AAAI),
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We generalize the spatial and subset scan statistics from the single to the multiple subset case. The two main approaches to defining the log-likelihood ratio statistic in the single subset case—the population-based and expectation-based scan statistics—are considered, leading to risk partitioning and multiple cluster detection scan statistics, respectively. We show that, for distributions in a separable exponential family, the risk partitioning scan statistic can be expressed as a scaled f-divergence of the normalized count and baseline vectors, and the multiple cluster detection scan statistic as a sum of scaled Bregman divergences. In either case, however, maximization of the scan statistic by exhaustive search over all partitionings of the data requires exponential time. To make this optimization computationally feasible, we prove sufficient conditions under which the optimal partitioning is guaranteed to be consecutive. This Consecutive Partitions Property generalizes the linear-time subset scanning property from two partitions (the detected subset and the remaining data elements) to the multiple partition case. While the number of consecutive partitionings of n elements into t partitions scales as O(n^(t−1)), making it computationally expensive for large t, we present a dynamic programming approach which identifies the optimal consecutive partitioning in O(n^2 t) time, thus allowing for the exact and efficient solution of large-scale risk partitioning and multiple cluster detection problems. Finally, we demonstrate the detection performance and practical utility of partition scan statistics using simulated and real-world data. Supplementary materials for this article are available online.more » « less
-
https://youtu.be/79Py8KU4_k0 (Ed.)We consider statistical methods that invoke a min-max distributionally robust formulation to extract good out-of-sample performance in data-driven optimization and learning problems. Acknowledging the distributional uncertainty in learning from limited samples, the min-max formulations introduce an adversarial inner player to explore unseen covariate data. The resulting distributionally robust optimization (DRO) formulations, which include Wasserstein DRO formulations (our main focus), are specified using optimal transportation phenomena. Upon describing how these infinite-dimensional min-max problems can be approached via a finite-dimensional dual reformulation, this tutorial moves into its main component, namely, explaining a generic recipe for optimally selecting the size of the adversary’s budget. This is achieved by studying the limit behavior of an optimal transport projection formulation arising from an inquiry on the smallest confidence region that includes the unknown population risk minimizer. Incidentally, this systematic prescription coincides with those in specific examples in high-dimensional statistics and results in error bounds that are free from the curse of dimensions. Equipped with this prescription, we present a central limit theorem for the DRO estimator and provide a recipe for constructing compatible confidence regions that are useful for uncertainty quantification. The rest of the tutorial is devoted to insights into the nature of the optimizers selected by the min-max formulations and additional applications of optimal transport projections.more » « less
-
We propose a new approach, the calibrated nonparametric scan statistic (CNSS), for more accurate detection of anomalous patterns in large-scale, real-world graphs. Scan statistics identify connected subgraphs that are interesting or unexpected through maximization of a likelihood ratio statistic; in particular, nonparametric scan statistics (NPSSs) identify subgraphs with a higher than expected proportion of individually significant nodes. However, we show that recently proposed NPSS methods are miscalibrated, failing to account for the maximization of the statistic over the multiplicity of subgraphs. This results in both reduced detection power for subtle signals, and low precision of the detected subgraph even for stronger signals. Thus we develop a new statistical approach to recalibrate NPSSs, correctly adjusting for multiple hypothesis testing and taking the underlying graph structure into account. While the recalibration, based on randomization testing, is computationally expensive, we propose both an efficient (approximate) algorithm and new, closed-form lower bounds (on the expected maximum proportion of significant nodes for subgraphs of a given size, under the null hypothesis of no anomalous patterns). These advances, along with the integration of recent core-tree decomposition methods, enable CNSS to scale to large real-world graphs, with substantial improvement in the accuracy of detected subgraphs. Extensive experiments on both semi-synthetic and real-world datasets are demonstrated to validate the effectiveness of our proposed methods, in comparison with state-of-the-art counterparts.more » « less
-
Summary A nonparametric framework for changepoint detection, based on scan statistics utilizing graphs that represent similarities among observations, is gaining attention owing to its flexibility and good performance for high-dimensional and non-Euclidean data sequences. However, this graph-based framework faces challenges when there are repeated observations in the sequence, which is often the case for discrete data such as network data. In this article we extend the graph-based framework to solve this problem by averaging or taking the union of all possible optimal graphs resulting from repeated observations. We consider both the single-changepoint alternative and the changed-interval alternative, and derive analytical formulas to control the Type I error for the new methods, making them readily applicable to large datasets. The extended methods are illustrated on an application in detecting changes in a sequence of dynamic networks over time. All proposed methods are implemented in an $$\texttt{R}$$ package $$\texttt{gSeg}$$ available on CRAN.more » « less
An official website of the United States government

