

Title: Fast Subset Scan for Spatial Pattern Detection
Summary

We propose a new ‘fast subset scan’ approach for accurate and computationally efficient event detection in massive data sets. We treat event detection as a search over subsets of data records, finding the subset which maximizes some score function. We prove that many commonly used functions (e.g. Kulldorff’s spatial scan statistic and extensions) satisfy the ‘linear time subset scanning’ property, enabling exact and efficient optimization over subsets. In the spatial setting, we demonstrate that proximity-constrained subset scans substantially improve the timeliness and accuracy of event detection, detecting emerging outbreaks of disease 2 days faster than existing methods.
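The ‘linear time subset scanning’ property can be illustrated with a small sketch (function names are hypothetical; this assumes the expectation-based Poisson scan statistic, one of the functions the paper shows satisfies the property). Records are sorted by their count-to-baseline ratio, and the optimum over all 2^N subsets is then guaranteed to be one of the N subsets formed by the top-k highest-priority records:

```python
import math

def ltss_max(counts, baselines):
    """Maximize the expectation-based Poisson scan statistic over all
    subsets of records via linear-time subset scanning: sort records by
    count/baseline ratio, then score only the N top-k subsets."""
    def score(C, B):
        # F(S) = C log(C/B) + B - C for C > B, else 0
        return C * math.log(C / B) + B - C if C > B else 0.0

    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i], reverse=True)
    best, best_subset = 0.0, []
    C = B = 0.0
    for k, i in enumerate(order, 1):
        C += counts[i]
        B += baselines[i]
        s = score(C, B)
        if s > best:
            best, best_subset = s, order[:k]
    return best, sorted(best_subset)
```

Only N subsets are scored instead of 2^N, which is what makes exact optimization over subsets tractable.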

NSF-PAR ID:
10401321
Author(s) / Creator(s):
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume:
74
Issue:
2
ISSN:
1369-7412
Page Range / eLocation ID:
p. 337-360
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We generalize the spatial and subset scan statistics from the single to the multiple subset case. The two main approaches to defining the log-likelihood ratio statistic in the single subset case—the population-based and expectation-based scan statistics—are considered, leading to risk partitioning and multiple cluster detection scan statistics, respectively. We show that, for distributions in a separable exponential family, the risk partitioning scan statistic can be expressed as a scaled f-divergence of the normalized count and baseline vectors, and the multiple cluster detection scan statistic as a sum of scaled Bregman divergences. In either case, however, maximization of the scan statistic by exhaustive search over all partitionings of the data requires exponential time. To make this optimization computationally feasible, we prove sufficient conditions under which the optimal partitioning is guaranteed to be consecutive. This Consecutive Partitions Property generalizes the linear-time subset scanning property from two partitions (the detected subset and the remaining data elements) to the multiple partition case. While the number of consecutive partitionings of n elements into t partitions scales as O(n^(t−1)), making it computationally expensive for large t, we present a dynamic programming approach which identifies the optimal consecutive partitioning in O(n^2 t) time, thus allowing for the exact and efficient solution of large-scale risk partitioning and multiple cluster detection problems. Finally, we demonstrate the detection performance and practical utility of partition scan statistics using simulated and real-world data. Supplementary materials for this article are available online. 
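The O(n^2 t) dynamic program can be sketched as follows (interface and names are illustrative; this assumes a score that is additive over blocks, whereas the paper's partition statistics involve shared normalizations that this sketch omits). `opt[p][j]` is the best score of splitting the first j elements into p consecutive blocks:

```python
def best_consecutive_partitioning(n, t, part_score):
    """Find the consecutive partitioning of elements 0..n-1 into t
    contiguous blocks maximizing the sum of per-block scores, in
    O(n^2 t) time. part_score(i, j) scores the block i..j-1."""
    NEG = float("-inf")
    opt = [[NEG] * (n + 1) for _ in range(t + 1)]
    cut = [[0] * (n + 1) for _ in range(t + 1)]
    opt[0][0] = 0.0
    for p in range(1, t + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):  # last block is i..j-1
                if opt[p - 1][i] == NEG:
                    continue
                s = opt[p - 1][i] + part_score(i, j)
                if s > opt[p][j]:
                    opt[p][j], cut[p][j] = s, i
    # recover the right boundary of each block
    bounds, j = [], n
    for p in range(t, 0, -1):
        bounds.append(j)
        j = cut[p][j]
    return opt[t][n], list(reversed(bounds))
```

The n·t table with an O(n) inner minimization gives the O(n^2 t) bound, versus O(n^(t-1)) for exhaustive enumeration of consecutive partitionings.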
  2. Abstract

    Modern scanning microscopes can image materials with up to sub-atomic spatial and sub-picosecond time resolutions, but these capabilities come with large volumes of data, which can be difficult to store and analyze. We report the Fast Autonomous Scanning Toolkit (FAST) that addresses this challenge by combining a neural network, route optimization, and efficient hardware controls to enable a self-driving experiment that actively identifies and measures a sparse but representative data subset in lieu of the full dataset. FAST requires no prior information about the sample, is computationally efficient, and uses generic hardware controls with minimal experiment-specific wrapping. We test FAST in simulations and a dark-field X-ray microscopy experiment of a WSe2 film. Our studies show that a FAST scan of <25% is sufficient to accurately image and analyze the sample. FAST is easy to adapt for any scanning microscope; its broad adoption will empower general multi-level studies of materials evolution with respect to time, temperature, or other parameters.

  3. Abstract

    The exodus of flying animals from their roosting locations is often visible as expanding ring-shaped patterns in weather radar data. The NEXRAD network, for example, archives more than 25 years of data across 143 contiguous US radar stations, providing opportunities to study roosting locations and times and the ecosystems of birds and bats. However, access to this information is limited by the cost of manually annotating millions of radar scans. We develop and deploy an AI-assisted system to annotate roosts in radar data. We build datasets with roost annotations to support the training and evaluation of automated detection models. Roosts are detected, tracked, and incorporated into our developed web-based interface for human screening to produce research-grade annotations. We deploy the system to collect swallow and martin roost information from 12 radar stations around the Great Lakes spanning 21 years. After verifying the practical value of the system, we propose to improve the detector by incorporating both spatial and temporal channels from volumetric radar scans. The deployment on Great Lakes radar scans allows accelerated annotation of 15 628 roost signatures in 612 786 radar scans with 183.6 human screening hours, or 1.08 s per radar scan. We estimate that the deployed system reduces human annotation time by ~7×. The temporal detector model improves the average precision at an intersection-over-union threshold of 0.5 (AP at IoU = 0.5) by 8% over the previous model (48%→56%), further reducing human screening time by 2.3× in its pilot deployment. These data contain critical information about phenology and population trends of swallows and martins, aerial insectivore species experiencing acute declines, and have enabled novel research. We present error analyses, lay the groundwork for continent-scale historical investigation about these species, and provide a starting point for automating the detection of other family-specific phenomena in radar data, such as bat roosts and mayfly hatches.

  4.
    We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements [n]={1,...,n} with corresponding data values x1,...,xn. We observe the values for a "sample" set A \subset [n] and wish to estimate some statistic of the values for a "target" set B \subset [n], where B could be the entire set. Crucially, we assume that the sets A and B are drawn according to some known distribution P over pairs of subsets of [n]. A given estimation algorithm is evaluated based on its "worst-case, expected error", where the expectation is with respect to the distribution P from which the sample A and target sets B are drawn, and the worst-case is with respect to the data values x1,...,xn. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values, where the weights are functions of the distribution P and the sample and target sets A, B, and show that the worst-case expected error achieved by this algorithm is at most a multiplicative pi/2 factor worse than the optimal of such algorithms. The algorithm and proof leverage a surprising connection to the Grothendieck problem. We also extend these results to the linear regression setting where each datapoint is not a scalar but a labeled vector (xi,yi). This framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process via the distribution P, is a significant departure from the typical statistical estimation framework and introduces a uniform analysis for the many natural settings where membership in a sample may be correlated with data values, such as when individuals are recruited into a sample through their social networks as in "snowball/chain" sampling or when samples have chronological structure as in "selective prediction".
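    The "worst-case, expected error" criterion can be made concrete with a tiny brute-force sketch (all names hypothetical; data values are restricted to {-1, 1} so the maximum of the convex quadratic error can be found by enumerating hypercube vertices, which sidesteps the Grothendieck machinery the paper uses for the general case):

```python
import itertools

def worst_case_expected_error(n, P, weights):
    """Evaluate a linear estimator: P is a list of (prob, A, B) triples
    over subset pairs; weights(A, B) returns per-element weights w_i,
    and the mean over B is estimated as sum_{i in A} w_i * x_i.
    Worst-case over x in {-1,1}^n, expected over P, squared error."""
    best = 0.0
    for x in itertools.product([-1.0, 1.0], repeat=n):
        err = 0.0
        for prob, A, B in P:
            w = weights(A, B)
            est = sum(w[i] * x[i] for i in A)
            target = sum(x[i] for i in B) / len(B)
            err += prob * (est - target) ** 2
        best = max(best, err)
    return best
```

    Even in a two-element toy instance, shrinking the sample weights below the naive choice lowers the worst-case expected error, hinting at why the optimal weights depend nontrivially on P.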
  5. Abstract

    GIS analyses use moving window methods and hotspot detection to identify point patterns within a given area. Such methods can detect clusters of point events such as crime or disease incidences. Yet, these methods do not account for connections between entities, and thus, areas with relatively sparse event concentrations but high network connectivity may go undetected. We develop two scan methods (i.e., moving window or focal processes), EdgeScan and NDScan, for detecting local spatial-social connections. These methods capture edges and network density, respectively, for each node in a given focal area. We apply methods to a social network of Mafia members in New York City in the 1960s and to a 2019 spatial network of home-to-restaurant visits in Atlanta, Georgia. These methods successfully capture focal areas where Mafia members are highly connected and where restaurant visitors are highly local; these results differ from those derived using traditional spatial hotspot analysis using the Getis–Ord Gi* statistic. Finally, we describe how these methods can be adapted to weighted, directed, and bipartite networks and suggest future improvements.
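    A minimal sketch of the focal idea (interface and names hypothetical; circular Euclidean windows assumed, whereas the paper's methods extend to richer network types): for each node, collect the nodes inside its window and count the edges among them (EdgeScan), or divide by the number of possible edges to get local network density (NDScan):

```python
from itertools import combinations

def edge_scan(nodes, coords, edges, radius):
    """For each node, count edges among nodes within `radius` of it
    (EdgeScan) and the local network density (NDScan)."""
    edge_set = {frozenset(e) for e in edges}
    results = {}
    for v in nodes:
        vx, vy = coords[v]
        focal = [u for u in nodes
                 if (coords[u][0] - vx) ** 2 + (coords[u][1] - vy) ** 2
                 <= radius ** 2]
        k = sum(1 for a, b in combinations(focal, 2)
                if frozenset((a, b)) in edge_set)
        possible = len(focal) * (len(focal) - 1) // 2
        results[v] = (k, k / possible if possible else 0.0)
    return results
```

    A node in a sparse but tightly connected neighborhood scores high here even when a pure point-density hotspot test would miss it, which is the gap these methods are meant to fill.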
