skip to main content


Title: Significant DBSCAN+: Statistically Robust Density-based Clustering
Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single “hotspot” of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed-up the proposed approach. Experiment results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.  more » « less
Award ID(s):
1916252 2105133 2126474 1901099
NSF-PAR ID:
10326095
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
ACM Transactions on Intelligent Systems and Technology
Volume:
12
Issue:
5
ISSN:
2157-6904
Page Range / eLocation ID:
1 to 26
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Given a collection of geo-distributed points, we aim to detect statistically significant clusters of varying shapes and densities. Spatial clustering has been widely used many important societal applications, including public health and safety, transportation, environment, etc. The problem is challenging because many application domains have low-tolerance to false positives (e.g., falsely claiming a crime cluster in a community can have serious negative impacts on the residents) and clusters often have irregular shapes. In related work, the spatial scan statistic is a popular technique that can detect significant clusters but it requires clusters to have certain predefined shapes (e.g., circles, rings). In contrast, density-based methods (e.g., DBSCAN) can find clusters of arbitrary shape efficiently but do not consider statistical significance, making them susceptible to spurious patterns. To address these limitations, we first propose a modeling of statistical significance in DBSCAN based clustering. Then, we propose a baseline Monte Carlo method to estimate the significance of clusters and a Dual-Convergence algorithm to accelerate the computation. Experiment results show that significant DBSCAN is very effective in removing chance patterns and the Dual-Convergence algorithm can greatly reduce execution time. 
    more » « less
  2. Mapping of spatial hotspots, i.e., regions with significantly higher rates of generating cases of certain events (e.g., disease or crime cases), is an important task in diverse societal domains, including public health, public safety, transportation, agriculture, environmental science, and so on. Clustering techniques required by these domains differ from traditional clustering methods due to the high economic and social costs of spurious results (e.g., false alarms of crime clusters). As a result, statistical rigor is needed explicitly to control the rate of spurious detections. To address this challenge, techniques for statistically-robust clustering (e.g., scan statistics) have been extensively studied by the data mining and statistics communities. In this survey, we present an up-to-date and detailed review of the models and algorithms developed by this field. We first present a general taxonomy for statistically-robust clustering, covering key steps of data and statistical modeling, region enumeration and maximization, and significance testing. We further discuss different paradigms and methods within each of the key steps. Finally, we highlight research gaps and potential future directions, which may serve as a stepping stone in generating new ideas and thoughts in this growing field and beyond. 
    more » « less
  3. Abstract

    Recent advances in hail trajectory modeling regularly produce datasets containing millions of hail trajectories. Because hail growth within a storm cannot be entirely separated from the structure of the trajectories producing it, a method to condense the multidimensionality of the trajectory information into a discrete number of features analyzable by humans is necessary. This article presents a three-dimensional trajectory clustering technique that is designed to group trajectories that have similar updraft-relative structures and orientations. The new technique is an application of a two-dimensional method common in the data mining field. Hail trajectories (or “parent” trajectories) are partitioned into segments before they are clustered using a modified version of the density-based spatial applications with noise (DBSCAN) method. Parent trajectories with segments that are members of at least two common clusters are then grouped into parent trajectory clusters before output. This multistep method has several advantages. Hail trajectories with structural similarities along only portions of their length, e.g., sourced from different locations around the updraft before converging to a common pathway, can still be grouped. However, the physical information inherent in the full length of the trajectory is retained, unlike methods that cluster trajectory segments alone. The conversion of trajectories to an updraft-relative space also allows trajectories separated in time to be clustered. Once the final output trajectory clusters are identified, a method for calculating a representative trajectory for each cluster is proposed. Cluster distributions of hailstone and environmental characteristics at each time step in the representative trajectory can also be calculated.

    Significance Statement

    To understand how a storm produces large hail, we need to understand the paths that hailstones take in a storm when growing. We can simulate these paths using computer models. However, the millions of hailstones in a simulated storm create millions of paths, which is hard to analyze. This article describes a machine learning method that groups together hailstone paths based on how similar their three-dimensional structures look. It will let hail scientists analyze hailstone pathways in storms more easily, and therefore better understand how hail growth happens.

     
    more » « less
  4. Abstract

    In scientific data analysis, clusters identified computationally often substantiate existing hypotheses or motivate new ones. Yet the combinatorial nature of the clustering result, which is a partition rather than a set of parameters or a function, blurs notions of mean, and variance. This intrinsic difficulty hinders the development of methods to improve clustering by aggregation or to assess the uncertainty of clusters generated. We overcome that barrier by aligning clusters via optimal transport. Equipped with this technique, we propose a new algorithm to enhance clustering by any baseline method using bootstrap samples. Cluster alignment enables us to quantify variation in the clustering result at the levels of both overall partitions and individual clusters. Set relationships between clusters such as one‐to‐one match, split, and merge can be revealed. A covering point set for each cluster, a concept kin to the confidence interval, is proposed. The tools we have developed here will help address the crucial question of whether any cluster is an intrinsic or spurious pattern. Experimental results on both simulated and real data sets are provided. The corresponding R package OTclust is available on CRAN.

     
    more » « less
  5. Synthesis of device-quality GeSn materials with higher Sn compositions is hindered by various factors, such as Sn segregation, clustering, and short-range ordering effects. In the present work, the impact of the clustering of Sn atoms in a GeSn semiconductor alloy was studied by density functional theory using SG15 pseudopotentials in a Synopsys QuantumATK tool, where the thermodynamic stability, effective band structure, indirect and direct bandgaps, and density of states (DOS) were computed to highlight the difference between a cluster-free random GeSn alloy and a GeSn alloy with Sn–Sn clusters. A 54-atom bulk Ge1–xSnx (x = 3.71%–27.77%) supercell was constructed with cluster-free and a first nearest neighbor Sn–Sn clustered GeSn alloy at each composition for this work. Computation using the generalized gradient approximation exchange-correlation functional showed that the thermodynamic stability of GeSn was reduced due to the clustering of Sn, which increased the formation energy of the GeSn alloys by increasing the Hartree potential energy and exchange-correlation energy. Moreover, with the effective band structure of the GeSn material at a Sn composition of ∼22%, both direct (Eg,Γ) and indirect (Eg,L) bandgaps decreased by a large margin of 40.76 and 120.17 meV, respectively, due to Sn–Sn clustering. On the other hand, Eg,Γ and Eg,L decrease is limited to 0.5 and 12.8 meV, respectively, for Sn composition of ∼5.6%. Similar impacts were observed on DOS, in an independent computation without deducing from the electronic band structure, where the width of the forbidden band reduces due to the clustering of Sn atoms in GeSn. Moreover, using the energy bandgaps of GeSn computed with the assumption of it being a random alloy having well-dispersed Sn atoms needs revision by incorporating clustering to align with the experimentally determined bandgap. This necessitates incorporating the effect of Sn atoms clustered together at varying distributions based on experimental characterization techniques such as atom probe tomography or extended x-ray absorption fine structure to substantiate the energy bandgap of the GeSn alloy at a particular composition with precision. Hence, considering the effect of Sn clusters during material characterization, beginning with the accurate energy bandgap characterization of GeSn would help in mitigating the effect of process variations on the performance characteristics of GeSn-based group IV electronic and photonic devices such as varying leakage currents in transistors and photodiodes as well as the deviation from the targeted wavelength of operation in lasers and photodetectors.

     
    more » « less