skip to main content

Title: DBSpan: Density-Based Clustering Using a Spanner, With an Application to Persistence Diagrams
Since its introduction in the mid-1990s, DBSCAN has become one of the most widely used clustering algorithms. However, one of the steps in DBSCAN is to perform a range query, a task that is difficult in many spaces, including the space of persistence diagrams. In this paper, we introduce a spanner into the DBSCAN algorithm to facilitate range queries in such spaces. We provide a proof-of-concept implementation, and study time and clustering performance for two data sets of persistence diagrams.  more » « less
Award ID(s):
1664858 1854336 2046730
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Applications of Topological Data Analysis to Data Science, Artificial Intelligence, and Machine Learning (TDA at SDM)
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Persistence diagrams have been widely used to quantify the underlying features of filtered topological spaces in data visualization. In many applications, computing distances between diagrams is essential; however, computing these distances has been challenging due to the computational cost. In this paper, we propose a persistence diagram hashing framework that learns a binary code representation of persistence diagrams, which allows for fast computation of distances. This framework is built upon a generative adversarial network (GAN) with a diagram distance loss function to steer the learning process. Instead of using standard representations, we hash diagrams into binary codes, which have natural advantages in large-scale tasks. The training of this model is domain-oblivious in that it can be computed purely from synthetic, randomly created diagrams. As a consequence, our proposed method is directly applicable to various datasets without the need for retraining the model. These binary codes, when compared using fast Hamming distance, better maintain topological similarity properties between datasets than other vectorized representations. To evaluate this method, we apply our framework to the problem of diagram clustering and we compare the quality and performance of our approach to the state-of-the-art. In addition, we show the scalability of our approach on a dataset with 10k persistence diagrams, which is not possible with current techniques. Moreover, our experimental results demonstrate that our method is significantly faster with the potential of less memory usage, while retaining comparable or better quality comparisons. 
    more » « less
  2. Goaoc, Xavier and (Ed.)
    The space of persistence diagrams under bottleneck distance is known to have infinite doubling dimension. Because many metric search algorithms and data structures have bounds that depend on the dimension of the search space, the high-dimensionality makes it difficult to analyze and compare asymptotic running times of metric search algorithms on this space. We introduce the notion of nearly-doubling metrics, those that are Gromov-Hausdorff close to metric spaces of bounded doubling dimension and prove that bounded $k$-point persistence diagrams are nearly-doubling. This allows us to prove that in some ways, persistence diagrams can be expected to behave like a doubling metric space. We prove our results in great generality, studying a large class of quotient metrics (of which the persistence plane is just one example). We also prove bounds on the dimension of the $k$-point bottleneck space over such metrics. The notion of being nearly-doubling in this Gromov-Hausdorff sense is likely of more general interest. Some algorithms that have a dependence on the dimension can be analyzed in terms of the dimension of the nearby metric rather than that of the metric itself. We give a specific example of this phenomenon by analyzing an algorithm to compute metric nets, a useful operation on persistence diagrams. 
    more » « less
  3. Abstract

    This paper presents a new clustering algorithm for space–time data based on the concepts of topological data analysis and, in particular, persistent homology. Employing persistent homology—a flexible mathematical tool from algebraic topology used to extract topological information from data—in unsupervised learning is an uncommon and novel approach. A notable aspect of this methodology consists in analyzing data at multiple resolutions, which allows for distinguishing true features from noise based on the extent of their persistence. We evaluate the performance of our algorithm on synthetic data and compare it to other well‐known clustering algorithms such asK‐means, hierarchical clustering, and DBSCAN (density‐based spatial clustering of applications with noise). We illustrate its application in the context of a case study of water quality in the Chesapeake Bay.

    more » « less
  4. An emerging method for data analysis is called Topological Data Analysis (TDA). TDA is based in the mathematical field of topology and examines the properties of spaces under continuous deformation. One of the key tools used for TDA is called persistent homology which considers the connectivity of points in a d-dimensional point cloud at different spatial resolutions to identify topological properties (holes, loops, and voids) in the space. Persistent homology then classifies the topological features by their persistence through the range of spatial connectivity. Unfortunately the memory and run-time complexity of computing persistent homology is exponential and current tools can only process a few thousand points in R3. Fortunately, the use of data reduction techniques enables persistent homology to be applied to much larger point clouds. Techniques to reduce the data range from random sampling of points to clustering the data and using the cluster centroids as the reduced data. While several data reduction approaches appear to preserve the large topological features present in the original point cloud, no systematic study comparing the efficacy of different data clustering techniques in preserving the persistent homology results has been performed. This paper explores the question of topology preserving data reductions and describes formally when and how topological features can be mischaracterized or lost by data reduction techniques. The paper also performs an experimental assessment of data reduction techniques and resilient effects on the persistent homology. In particular, data reduction by random selection is compared to cluster centroids extracted from different data clustering algorithms. 
    more » « less
  5. null (Ed.)
    SUMMARY Horizontal slowness vector measurements using array techniques have been used to analyse many Earth phenomena from lower mantle heterogeneity to meteorological event location. While providing observations essential for studying much of the Earth, slowness vector analysis is limited by the necessary and subjective visual inspection of observations. Furthermore, it is challenging to determine the uncertainties caused by limitations of array processing such as array geometry, local structure, noise and their effect on slowness vector measurements. To address these issues, we present a method to automatically identify seismic arrivals and measure their slowness vector properties with uncertainty bounds. We do this by bootstrap sampling waveforms, therefore also creating random sub arrays, then use linear beamforming to measure the coherent power at a range of slowness vectors. For each bootstrap sample, we take the top N peaks from each power distribution as the slowness vectors of possible arrivals. The slowness vectors of all bootstrap samples are gathered and the clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is used to identify arrivals as clusters of slowness vectors. The mean of slowness vectors in each cluster gives the slowness vector measurement for that arrival and the distribution of slowness vectors in each cluster gives the uncertainty estimate. We tuned the parameters of DBSCAN using a data set of 2489 SKS and SKKS observations at a range of frequency bands from 0.1 to 1 Hz. We then present examples at higher frequencies (0.5–2.0 Hz) than the tuning data set, identifying PKP precursors, and lower frequency by identifying multipathing in surface waves (0.04–0.06 Hz). While we use a linear beamforming process, this method can be implemented with any beamforming process such as cross correlation beamforming or phase weighted stacking. This method allows for much larger data sets to be analysed without visual inspection of data. Phenomena such as multipathing, reflections or scattering can be identified automatically in body or surface waves and their properties analysed with uncertainties. 
    more » « less