Title: Significant DBSCAN+: Statistically Robust Density-based Clustering
Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single “hotspot” of over-density, and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density-connected. As we will show, the incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed up the proposed approach. Experimental results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from hundreds or thousands of spurious detections to none or just a few across 100 datasets), and the acceleration methods improve efficiency for both clustered and non-clustered data.
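The abstract describes the method only at a high level. As a rough illustration of the core idea (multi-scale DBSCAN filtered by a Monte Carlo significance test), here is a minimal sketch; the function names, the max-cluster-size test statistic, the uniform-in-bounding-box null, and the per-scale alpha are illustrative assumptions rather than the paper's exact formulation, and the paper's acceleration methods are not reproduced.

import numpy as np
from sklearn.cluster import DBSCAN

def max_cluster_size(points, eps, min_samples):
    # Test statistic: size of the largest DBSCAN cluster (noise label -1 excluded).
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    core = labels[labels >= 0]
    return np.bincount(core).max() if core.size else 0

def significant_clusters(points, eps_grid, min_samples=5, n_trials=99,
                         alpha=0.01, rng=None):
    # Keep clusters, at each scale, larger than the (1 - alpha) quantile of the
    # null statistic. Alpha is applied per scale here with no correction across
    # scales; the paper's formulation handles this more carefully.
    rng = np.random.default_rng(rng)
    lo, hi = points.min(axis=0), points.max(axis=0)
    found = []
    for eps in eps_grid:
        # Null model: same number of points, uniform in the data's bounding box.
        null = [max_cluster_size(rng.uniform(lo, hi, size=points.shape),
                                 eps, min_samples) for _ in range(n_trials)]
        threshold = np.quantile(null, 1 - alpha)
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        for c in set(labels) - {-1}:
            if (labels == c).sum() > threshold:
                found.append((eps, int(c), np.where(labels == c)[0]))
    return found

Testing the largest cluster against the null's quantile is the simplest way to control spurious detections under complete spatial randomness; it conveys the idea, not the paper's exact statistic.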
Award ID(s):
1916252 2105133 2126474 1901099
NSF-PAR ID:
10326095
Author(s) / Creator(s):
Date Published:
Journal Name:
ACM Transactions on Intelligent Systems and Technology
Volume:
12
Issue:
5
ISSN:
2157-6904
Page Range / eLocation ID:
1 to 26
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Given a collection of geo-distributed points, we aim to detect statistically significant clusters of varying shapes and densities. Spatial clustering has been widely used in many important societal applications, including public health and safety, transportation, and the environment. The problem is challenging because many application domains have low tolerance for false positives (e.g., falsely claiming a crime cluster in a community can have serious negative impacts on its residents) and clusters often have irregular shapes. In related work, the spatial scan statistic is a popular technique that can detect significant clusters, but it requires clusters to have certain predefined shapes (e.g., circles, rings). In contrast, density-based methods (e.g., DBSCAN) can find clusters of arbitrary shape efficiently but do not consider statistical significance, making them susceptible to spurious patterns. To address these limitations, we first propose a model of statistical significance in DBSCAN-based clustering. Then, we propose a baseline Monte Carlo method to estimate the significance of clusters and a Dual-Convergence algorithm to accelerate the computation. Experimental results show that Significant DBSCAN is very effective at removing chance patterns and that the Dual-Convergence algorithm can greatly reduce execution time.
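As a concrete illustration of the baseline Monte Carlo step described above, here is a minimal sketch; the estimator is the standard one, but the callback interface and names are hypothetical, and the paper's Dual-Convergence acceleration is not reproduced.

import numpy as np

def monte_carlo_p_value(observed_stat, simulate_null_stat, n_sim=999, rng=None):
    # Estimate the p-value of an observed cluster statistic against a null of
    # complete spatial randomness. `simulate_null_stat(rng)` (hypothetical
    # interface) must generate one random dataset and return the same statistic.
    rng = np.random.default_rng(rng)
    exceed = sum(simulate_null_stat(rng) >= observed_stat for _ in range(n_sim))
    # Standard Monte Carlo estimator: (1 + #exceedances) / (1 + #simulations).
    return (1 + exceed) / (1 + n_sim)

A detected cluster would be reported only when this p-value falls below the chosen significance level; the Dual-Convergence algorithm accelerates exactly this simulation loop.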
  2. Mapping of spatial hotspots, i.e., regions with significantly higher rates of generating cases of certain events (e.g., disease or crime cases), is an important task in diverse societal domains, including public health, public safety, transportation, agriculture, environmental science, and so on. Clustering techniques required by these domains differ from traditional clustering methods due to the high economic and social costs of spurious results (e.g., false alarms of crime clusters). As a result, statistical rigor is needed explicitly to control the rate of spurious detections. To address this challenge, techniques for statistically-robust clustering (e.g., scan statistics) have been extensively studied by the data mining and statistics communities. In this survey, we present an up-to-date and detailed review of the models and algorithms developed in this field. We first present a general taxonomy for statistically-robust clustering, covering the key steps of data and statistical modeling, region enumeration and maximization, and significance testing. We further discuss different paradigms and methods within each of the key steps. Finally, we highlight research gaps and potential future directions, which may serve as a stepping stone toward generating new ideas in this growing field and beyond.
  3. Abstract

    Recent advances in hail trajectory modeling regularly produce datasets containing millions of hail trajectories. Because hail growth within a storm cannot be entirely separated from the structure of the trajectories producing it, a method is needed to condense the multidimensional trajectory information into a discrete number of features analyzable by humans. This article presents a three-dimensional trajectory clustering technique designed to group trajectories that have similar updraft-relative structures and orientations. The new technique is an application of a two-dimensional method common in the data mining field. Hail trajectories (or “parent” trajectories) are partitioned into segments before they are clustered using a modified version of the density-based spatial clustering of applications with noise (DBSCAN) method. Parent trajectories with segments that are members of at least two common clusters are then grouped into parent trajectory clusters before output. This multistep method (a code sketch follows this item) has several advantages. Hail trajectories with structural similarities along only portions of their length, e.g., sourced from different locations around the updraft before converging to a common pathway, can still be grouped. However, the physical information inherent in the full length of the trajectory is retained, unlike methods that cluster trajectory segments alone. The conversion of trajectories to an updraft-relative space also allows trajectories separated in time to be clustered. Once the final output trajectory clusters are identified, a method for calculating a representative trajectory for each cluster is proposed. Cluster distributions of hailstone and environmental characteristics at each time step in the representative trajectory can also be calculated.

    Significance Statement

    To understand how a storm produces large hail, we need to understand the paths that hailstones take within a storm as they grow. We can simulate these paths using computer models. However, the millions of hailstones in a simulated storm create millions of paths, which are hard to analyze. This article describes a machine learning method that groups hailstone paths together based on how similar their three-dimensional structures look. It will let hail scientists analyze hailstone pathways in storms more easily, and therefore better understand how hail growth happens.

     
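A hedged sketch of the segment-and-merge grouping described in the abstract above, assuming simple midpoint-plus-direction segment features and standard DBSCAN in place of the article's modified, updraft-relative version; all parameter values are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

def segment_features(traj, seg_len=5):
    # Split one (n, 3) trajectory into segments; each segment's feature is its
    # midpoint concatenated with its unit direction vector. Assumes the
    # trajectory is long enough to yield at least one segment.
    feats = []
    for i in range(0, len(traj) - seg_len, seg_len):
        seg = traj[i:i + seg_len]
        d = seg[-1] - seg[0]
        feats.append(np.concatenate([seg.mean(axis=0),
                                     d / (np.linalg.norm(d) + 1e-12)]))
    return np.array(feats)

def group_parents(trajs, eps=0.5, min_samples=10, seg_len=5, n_common=2):
    # Cluster all segments from all parent trajectories together.
    feats, owner = [], []
    for p, traj in enumerate(trajs):
        f = segment_features(traj, seg_len)
        feats.append(f)
        owner.extend([p] * len(f))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.vstack(feats))
    owner = np.array(owner)
    member = {p: set(labels[(owner == p) & (labels >= 0)]) for p in range(len(trajs))}
    # Union-find: merge parents sharing at least n_common segment clusters.
    root = list(range(len(trajs)))
    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]
            a = root[a]
        return a
    for a in range(len(trajs)):
        for b in range(a + 1, len(trajs)):
            if len(member[a] & member[b]) >= n_common:
                root[find(a)] = find(b)
    return [find(p) for p in range(len(trajs))]  # parent-cluster label per trajectory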
  4. Traffic shockwaves reflect the formation and spreading of traffic fluctuations on roads. Existing methods mainly detect shockwaves and their propagation by estimating traffic density and flow, which is unreliable when traffic data are only partially or locally collected. This paper proposes a four-step data-driven approach that integrates machine learning with traffic-flow features to detect shockwaves and estimate their propagation speeds using only partial vehicle trajectory data. Specifically, we first denoise the speed data derived from the trajectory data with the Fast Fourier Transform (FFT) to mitigate the effect of spontaneous random speed fluctuations. Next, we identify the turning points of the trajectory curves, where a vehicle runs into a shockwave and its speed exhibits a high standard deviation within a short interval. Then, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, combined with traffic-flow features, is adopted to split the turning points into clusters, each corresponding to a shockwave with constant speed. Last, the one-norm distance regression method is used to estimate the propagation speed of each detected shockwave. The proposed framework was applied to field data collected from the I-80 and US-101 freeways by the Next Generation Simulation (NGSIM) program. The results show that this four-step data-driven method can efficiently detect shockwaves and estimate their propagation speeds without estimating nearby traffic densities and flows. It performs well on both homogeneous and nonhomogeneous road segments with trajectory data collected from all or part of the traffic flow.
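A hedged sketch of the four-step pipeline above; all thresholds are illustrative assumptions, and a least-squares slope stands in for the paper's one-norm distance regression.

import numpy as np
from sklearn.cluster import DBSCAN

def fft_denoise(speed, keep_frac=0.05):
    # Step 1: low-pass the speed series by zeroing high-frequency FFT bins.
    coeffs = np.fft.rfft(speed)
    cutoff = max(1, int(len(coeffs) * keep_frac))
    coeffs[cutoff:] = 0.0
    return np.fft.irfft(coeffs, n=len(speed))

def turning_points(t, x, speed, window=10, std_thresh=2.0):
    # Step 2: flag samples whose local speed variability spikes, i.e., where a
    # vehicle's speed shows a high standard deviation within a short interval.
    std = np.array([speed[max(0, i - window):i + window].std()
                    for i in range(len(speed))])
    idx = np.where(std > std_thresh)[0]
    return np.column_stack([t[idx], x[idx]])  # (time, position) candidates

def shockwave_speeds(points, eps=5.0, min_samples=5):
    # Step 3: one DBSCAN cluster per shockwave (constant propagation speed).
    # Step 4: slope of position vs. time per cluster; least squares is used
    # here in place of the paper's one-norm regression.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    return {int(c): np.polyfit(points[labels == c, 0], points[labels == c, 1], 1)[0]
            for c in set(labels) - {-1}}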
  5.
    SUMMARY Horizontal slowness vector measurements using array techniques have been used to analyse many Earth phenomena, from lower-mantle heterogeneity to meteorological event location. While providing observations essential for studying much of the Earth, slowness vector analysis is limited by the necessary and subjective visual inspection of observations. Furthermore, it is challenging to determine the uncertainties caused by limitations of array processing, such as array geometry, local structure and noise, and their effect on slowness vector measurements. To address these issues, we present a method to automatically identify seismic arrivals and measure their slowness vector properties with uncertainty bounds. We do this by bootstrap sampling the waveforms, thereby also creating random subarrays, and then using linear beamforming to measure the coherent power at a range of slowness vectors. For each bootstrap sample, we take the top N peaks from each power distribution as the slowness vectors of possible arrivals. The slowness vectors of all bootstrap samples are gathered, and the clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is used to identify arrivals as clusters of slowness vectors. The mean of the slowness vectors in each cluster gives the slowness vector measurement for that arrival, and their spread gives the uncertainty estimate. We tuned the parameters of DBSCAN using a data set of 2489 SKS and SKKS observations at frequency bands from 0.1 to 1 Hz. We then present examples at higher frequencies (0.5–2.0 Hz) than the tuning data set, identifying PKP precursors, and at lower frequencies (0.04–0.06 Hz), identifying multipathing in surface waves. While we use a linear beamforming process, this method can be implemented with any beamforming process, such as cross-correlation beamforming or phase-weighted stacking. The method allows much larger data sets to be analysed without visual inspection of the data. Phenomena such as multipathing, reflections or scattering can be identified automatically in body or surface waves and their properties analysed with uncertainties.
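A hedged sketch of the bootstrap-then-cluster procedure above, using a simple delay-and-sum beamformer; the grid, DBSCAN parameters, and top-N peak picking (true local maxima would be preferable) are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.cluster import DBSCAN

def beam_power(wf, coords, sx, sy, dt):
    # Delay-and-sum beam power for one slowness vector (sx, sy) in s/km.
    # wf: (n_sta, n_samp) waveforms; coords: (n_sta, 2) station offsets in km.
    shifts = np.rint((coords[:, 0] * sx + coords[:, 1] * sy) / dt).astype(int)
    # np.roll wraps around the trace ends; adequate for a sketch only.
    beam = np.mean([np.roll(w, -s) for w, s in zip(wf, shifts)], axis=0)
    return float(np.sum(beam ** 2))

def bootstrap_arrivals(wf, coords, s_grid, dt, n_boot=50, top_n=3,
                       eps=0.05, min_samples=10, rng=None):
    rng = np.random.default_rng(rng)
    n_sta = wf.shape[0]
    peaks = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_sta, n_sta)  # resample stations -> random subarray
        power = np.array([[beam_power(wf[idx], coords[idx], sx, sy, dt)
                           for sx in s_grid] for sy in s_grid])
        top = np.argsort(power.ravel())[-top_n:]  # top-N power values
        iy, ix = np.unravel_index(top, power.shape)
        peaks.extend(zip(s_grid[ix], s_grid[iy]))
    peaks = np.array(peaks)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(peaks)
    # Cluster mean = slowness vector of an arrival; spread = uncertainty estimate.
    return {int(c): (peaks[labels == c].mean(axis=0), peaks[labels == c].std(axis=0))
            for c in set(labels) - {-1}}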