Given a collection of geo-distributed points, we aim to detect statistically significant clusters of varying shapes and densities. Spatial clustering has been widely used many important societal applications, including public health and safety, transportation, environment, etc. The problem is challenging because many application domains have low-tolerance to false positives (e.g., falsely claiming a crime cluster in a community can have serious negative impacts on the residents) and clusters often have irregular shapes. In related work, the spatial scan statistic is a popular technique that can detect significant clusters but it requires clusters to have certain predefined shapes (e.g., circles, rings). In contrast, density-based methods (e.g., DBSCAN) can find clusters of arbitrary shape efficiently but do not consider statistical significance, making them susceptible to spurious patterns. To address these limitations, we first propose a modeling of statistical significance in DBSCAN based clustering. Then, we propose a baseline Monte Carlo method to estimate the significance of clusters and a Dual-Convergence algorithm to accelerate the computation. Experiment results show that significant DBSCAN is very effective in removing chance patterns and the Dual-Convergence algorithm can greatly reduce execution time.
more »
« less
A Unified Framework for Robust and Efficient Hotspot Detection in Smart Cities
Given N geo-located point instances (e.g., crime or disease cases) in a spatial domain, we aim to detect sub-regions (i.e., hotspots) that have a higher probability density of generating such instances than the others. Hotspot detection has been widely used in a variety of important urban applications, including public safety, public health, urban planning, equity, etc. The problem is challenging because its societal applications often have low-tolerance for false positives, and require significance testing which is computationally intensive. In related work, the spatial scan statistic introduced a likelihood ratio based framework for hotspot evaluation and significance testing. However, it fails to consider the effect of spatial nondeterminism, causing many missing detections. Our previous work introduced a nondeterministic normalization based scan statistic to mitigate this issue. However, its robustness against false positives is not stably controlled. To address these limitations, we propose a unified framework which can improve the completeness of results without incurring more false positives. We also propose a reduction algorithm to improve the computational efficiency. Experiment results confirm that the unified framework can greatly improve the recall of hotspot detection without increasing the number of false positives, and the reduction algorithm can greatly reduce execution time.
more »
« less
- NSF-PAR ID:
- 10170787
- Date Published:
- Journal Name:
- ACM/IMS Transactions on Data Science
- ISSN:
- 2691-1922
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Spatial hotspot discovery aims at discovering regions with statistically significant concentration of activities. It has shown great value in many important societal applications such as transportation engineering, public health, and public safety. This paper formulates the problem of Linear Hotspot Detection on All Simple Paths (LHDA) which identifies hotspots from the complete set of simple paths enumerated from a given spatial network. LHDA overcomes the limitations of existing methods which miss hotspots that naturally occur along linear simple paths on a road network. To address the computational challenges, we propose a novel algorithm named bidirectional fragment-multi-graph traversal (ASP_FMGT) and two path reduction approaches ASP_NR and ASP_HD. Experimental analyses show that ASP_FMGT has substantially improved performance over state-of-the-art approach (ASP_Base) while keeping the solution complete and correct. Moreover, a case study on real-world datasets showed that ASP_FMGT outperforms existing approaches.more » « less
-
Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single “hotspot” of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed-up the proposed approach. Experiment results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.more » « less
-
Summary We propose a new ‘fast subset scan’ approach for accurate and computationally efficient event detection in massive data sets. We treat event detection as a search over subsets of data records, finding the subset which maximizes some score function. We prove that many commonly used functions (e.g. Kulldorff’s spatial scan statistic and extensions) satisfy the ‘linear time subset scanning’ property, enabling exact and efficient optimization over subsets. In the spatial setting, we demonstrate that proximity-constrained subset scans substantially improve the timeliness and accuracy of event detection, detecting emerging outbreaks of disease 2 days faster than existing methods.
-
Brain development in adolescence is synthetically influenced by various factors such as age, education, and socioeconomic conditions. To identify an independent effect from a variable of interest (e.g., socioeconomic conditions), statistical models such as General Linear Model (GLM) are typically adopted to account for covariates (e.g., age and gender). However, statistical models may be vulnerable with insufficient sample size and outliers, and multiple tests for a whole brain analysis lead to inevitable false-positives without sufficient sensitivity. Hence, it is necessary to develop a unified framework for multiple tests that robustly fits the observation and increases sensitivity. We therefore propose a unified flexible neural network that optimizes on the contribution from the main variable of interest as introduced in original GLM, which leads to improved statistical outcomes. The results on group analysis with fractional anisotropy (FA) from Diffusion Tensor Images from Adolescent Brain Cognitive Development (ABCD) study demonstrate that the proposed method provides much more selective and meaningful detection of ROIs related to socioeconomic status over conventional methods.more » « less