Fast graph scan statistics optimization using algebraic fingerprints
Graph scan statistics have become popular for event detection in networks. This methodology involves finding connected subgraphs that maximize a certain anomaly function, but maximizing these functions is computationally hard in general. We develop a novel approach for graph scan statistics with connectivity constraints. Our algorithm Approx-MultilinearScan relies on an algebraic technique called multilinear detection, and it improves over prior methods for large networks. We also develop a Pregel-based parallel version of this algorithm in Giraph, MultilinearScanGiraph, that allows us to solve instances with over 40 million edges, which is more than one order of magnitude larger than existing methods.
Authors:
; ;
Award ID(s):
Publication Date:
NSF-PAR ID:
10067734
Journal Name:
2017 IEEE International Conference on Big Data (Big Data)
Page Range or eLocation-ID:
905 to 910
2. Summary A nonparametric framework for changepoint detection, based on scan statistics utilizing graphs that represent similarities among observations, is gaining attention owing to its flexibility and good performance for high-dimensional and non-Euclidean data sequences. However, this graph-based framework faces challenges when there are repeated observations in the sequence, which is often the case for discrete data such as network data. In this article we extend the graph-based framework to solve this problem by averaging or taking the union of all possible optimal graphs resulting from repeated observations. We consider both the single-changepoint alternative and the changed-interval alternative, and derive analytical formulas to control the Type I error for the new methods, making them readily applicable to large datasets. The extended methods are illustrated on an application in detecting changes in a sequence of dynamic networks over time. All proposed methods are implemented in an $\texttt{R}$ package $\texttt{gSeg}$ available on CRAN.