Title: Statistically-Robust Clustering Techniques for Mapping Spatial Hotspots: A Survey
Mapping spatial hotspots, i.e., regions with significantly higher rates of generating cases of certain events (e.g., disease or crime cases), is an important task in diverse societal domains, including public health, public safety, transportation, agriculture, and environmental science. The clustering techniques required by these domains differ from traditional clustering methods because of the high economic and social costs of spurious results (e.g., false alarms of crime clusters). As a result, statistical rigor is needed to explicitly control the rate of spurious detections. To address this challenge, techniques for statistically-robust clustering (e.g., scan statistics) have been extensively studied by the data mining and statistics communities. In this survey, we present an up-to-date and detailed review of the models and algorithms developed by this field. We first present a general taxonomy for statistically-robust clustering, covering the key steps of data and statistical modeling, region enumeration and maximization, and significance testing. We further discuss different paradigms and methods within each of the key steps. Finally, we highlight research gaps and potential future directions, which may serve as a stepping stone for generating new ideas in this growing field and beyond.
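The three steps named in the taxonomy (statistical modeling, region enumeration and maximization, significance testing) can be illustrated with a minimal Python sketch. It is not a method taken from the survey: it assumes a uniform null distribution on the unit square, a hypothetical fixed grid of square cells as the candidate regions, and the raw point count as the test statistic, whereas real scan statistics typically use a likelihood ratio over many more candidate regions. The names `max_cell_count`, `monte_carlo_p_value`, `cell`, and `trials` are invented for illustration.

```python
import random


def max_cell_count(points, cell=0.2):
    """Region enumeration and maximization (simplified assumption):
    candidate regions are the cells of a regular grid over the unit square,
    and the test statistic is the largest point count in any cell."""
    n = round(1 / cell)
    counts = {}
    for x, y in points:
        i = min(int(x / cell), n - 1)  # clamp points on the upper border
        j = min(int(y / cell), n - 1)
        counts[(i, j)] = counts.get((i, j), 0) + 1
    return max(counts.values()) if counts else 0


def monte_carlo_p_value(points, trials=199, cell=0.2, seed=0):
    """Significance testing: compare the observed statistic with the same
    statistic on `trials` datasets drawn from the null model (here, the
    same number of points placed uniformly at random)."""
    rng = random.Random(seed)
    observed = max_cell_count(points, cell)
    exceed = 0
    for _ in range(trials):
        sim = [(rng.random(), rng.random()) for _ in range(len(points))]
        if max_cell_count(sim, cell) >= observed:
            exceed += 1
    # The observed dataset counts as one of the (trials + 1) replicates.
    return observed, (exceed + 1) / (trials + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    background = [(rng.random(), rng.random()) for _ in range(100)]
    # Hypothetical hotspot: 40 extra points near the centre of the square.
    hotspot = [(0.45 + 0.1 * rng.random(), 0.45 + 0.1 * rng.random())
               for _ in range(40)]
    observed, p_value = monte_carlo_p_value(background + hotspot)
    print("max cell count:", observed, "p-value:", p_value)
```

Because the p-value is computed on the maximum over all candidate regions, the test accounts for the fact that many regions were examined, which is how this family of methods keeps the rate of spurious detections at the chosen significance level.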
Award ID(s):
1916252 2105133 2126474 1901099 1737633
NSF-PAR ID:
10326099
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
ACM Computing Surveys
Volume:
55
Issue:
2
ISSN:
0360-0300
Page Range / eLocation ID:
1 to 38
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Cluster detection is important and widely used in a variety of applications, including public health, public safety, and transportation. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As clustering is a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single “hotspot” of over-density, and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed up the proposed approach. Experimental results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.
  2. Given a collection of geo-distributed points, we aim to detect statistically significant clusters of varying shapes and densities. Spatial clustering has been widely used in many important societal applications, including public health and safety, transportation, environment, etc. The problem is challenging because many application domains have low tolerance for false positives (e.g., falsely claiming a crime cluster in a community can have serious negative impacts on the residents) and clusters often have irregular shapes. In related work, the spatial scan statistic is a popular technique that can detect significant clusters, but it requires clusters to have certain predefined shapes (e.g., circles, rings). In contrast, density-based methods (e.g., DBSCAN) can find clusters of arbitrary shape efficiently but do not consider statistical significance, making them susceptible to spurious patterns. To address these limitations, we first propose a modeling of statistical significance in DBSCAN-based clustering. Then, we propose a baseline Monte Carlo method to estimate the significance of clusters and a Dual-Convergence algorithm to accelerate the computation. Experimental results show that significant DBSCAN is very effective in removing chance patterns and the Dual-Convergence algorithm can greatly reduce execution time.
  3. Residential burglary is a social problem in every major urban area. As such, progress has been made in developing quantitative, informative and applicable models for this type of crime: (1) the Deterministic-time-step (DTS) model [Short, D’Orsogna, Pasour, Tita, Brantingham, Bertozzi & Chayes (2008) Math. Models Methods Appl. Sci. 18, 1249–1267], a pioneering agent-based statistical model of residential burglary criminal behaviour, with deterministic time steps assumed for arrivals of events, in which the residential burglary aggregate pattern formation is quantitatively studied for the first time; (2) the SSRB model (agent-based stochastic-statistical model of residential burglary crime) [Wang, Zhang, Bertozzi & Short (2019) Active Particles, Vol. 2, Springer Nature Switzerland AG, in press], in which the stochastic component of the model is theoretically analysed by introducing a Poisson clock, with time steps turned into exponentially distributed random variables. To incorporate independence of agents, this work considers five types of Poisson clocks. Poisson clocks (I), (II) and (III) govern independent agent actions of burglary behaviour, and Poisson clocks (IV) and (V) govern interactions of agents with the environment. All the Poisson clocks are independent. The time increments are independent and exponentially distributed, which is more suitable for modelling individual actions of agents. Applying the method of merging and splitting of Poisson processes, the independent Poisson clocks can be treated as one, making the analysis and simulation similar to the SSRB model. A Martingale formula is derived, which consists of a deterministic and a stochastic component. A scaling property of the Martingale formulation with varying burglar population is found, which provides a theory for the finite size effects. The theory is supported by quantitative numerical simulations using the pattern-formation quantifying statistics. The results presented here will be transformative for both the application and the analysis of agent-based models for residential burglary or in other domains.
  4. Differential Privacy (DP) formalizes privacy in mathematical terms and provides a robust concept for privacy protection. DIfferentially Private Data Synthesis (DIPS) techniques produce and release synthetic individual-level data in the DP framework. One key challenge in developing DIPS methods is the preservation of the statistical utility of synthetic data, especially in high-dimensional settings. We propose a new DIPS approach, STatistical Election to Partition Sequentially (STEPS), that partitions data by attributes in order of their importance ranks, determined by either a practical or a statistical importance measure. STEPS aims to achieve better original information preservation for the attributes with higher importance ranks and thus produce more useful synthetic data overall. We present an algorithm to implement the STEPS procedure and employ the privacy budget composability to ensure the overall privacy cost is controlled at the pre-specified value. We apply the STEPS procedure to both simulated data and the 2000–2012 Current Population Survey youth voter data. The results suggest STEPS can better preserve the population-level information and the original information for some analyses compared to PrivBayes, a modified Uniform histogram approach, and the flat Laplace sanitizer.
  5. Summary

    We present new models and methods for the posterior drift problem where the regression function in the target domain is modelled as a linear adjustment, on an appropriate scale, of that in the source domain, and study the theoretical properties of our proposed estimators in the binary classification problem. The core idea of our model inherits the simplicity and the usefulness of generalized linear models and accelerated failure time models from the classical statistics literature. Our approach is shown to be flexible and applicable in a variety of statistical settings, and can be adopted for transfer learning problems in various domains including epidemiology, genetics and biomedicine. As concrete applications, we illustrate the power of our approach (i) through mortality prediction for British Asians by borrowing strength from similar data from the larger pool of British Caucasians, using the UK Biobank data, and (ii) in overcoming a spurious correlation present in the source domain of the Waterbirds dataset.
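Entry 3 above relies on the merging and splitting of Poisson processes: independent Poisson clocks with rates r_i behave like a single clock with rate equal to the sum of the r_i, and each tick of the merged clock belongs to clock i with probability r_i divided by that sum. The following minimal Python sketch illustrates this standard fact only; it is not the authors' burglary model, and the function name `simulate_merged_clocks` and the example rates are hypothetical.

```python
import random


def simulate_merged_clocks(rates, t_end, seed=0):
    """Simulate independent Poisson clocks as one merged clock.

    rates: dict mapping a clock label to its rate (events per unit time).
    Returns a list of (time, label) events with time <= t_end.
    """
    rng = random.Random(seed)
    labels = list(rates)
    weights = [rates[k] for k in labels]
    total = sum(weights)  # rate of the merged clock
    t, events = 0.0, []
    while True:
        # Merging: the waiting time to the next tick of any clock is
        # exponentially distributed with the summed rate.
        t += rng.expovariate(total)
        if t > t_end:
            return events
        # Splitting: assign the tick to a clock in proportion to its rate.
        label = rng.choices(labels, weights=weights)[0]
        events.append((t, label))


if __name__ == "__main__":
    # Hypothetical rates for five clocks labelled as in the abstract.
    example_rates = {"I": 1.0, "II": 2.0, "III": 3.0, "IV": 0.5, "V": 0.5}
    events = simulate_merged_clocks(example_rates, t_end=10.0)
    print(len(events), "events; first:", events[:3])
```

Simulating one merged clock avoids tracking a separate timer per clock, which is why the merged formulation keeps the simulation close to a single-clock model.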

     