skip to main content


Title: DBSpan: Density-Based Clustering Using a Spanner, With an Application to Persistence Diagrams
Since its introduction in the mid-1990s, DBSCAN has become one of the most widely used clustering algorithms. However, one of the steps in DBSCAN is to perform a range query, a task that is difficult in many spaces, including the space of persistence diagrams. In this paper, we introduce a spanner into the DBSCAN algorithm to facilitate range queries in such spaces. We provide a proof-of-concept implementation, and study time and clustering performance for two data sets of persistence diagrams.  more » « less
Award ID(s):
1664858 1854336 2046730
NSF-PAR ID:
10351417
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Applications of Topological Data Analysis to Data Science, Artificial Intelligence, and Machine Learning (TDA at SDM)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Persistence diagrams have been widely used to quantify the underlying features of filtered topological spaces in data visualization. In many applications, computing distances between diagrams is essential; however, computing these distances has been challenging due to the computational cost. In this paper, we propose a persistence diagram hashing framework that learns a binary code representation of persistence diagrams, which allows for fast computation of distances. This framework is built upon a generative adversarial network (GAN) with a diagram distance loss function to steer the learning process. Instead of using standard representations, we hash diagrams into binary codes, which have natural advantages in large-scale tasks. The training of this model is domain-oblivious in that it can be computed purely from synthetic, randomly created diagrams. As a consequence, our proposed method is directly applicable to various datasets without the need for retraining the model. These binary codes, when compared using fast Hamming distance, better maintain topological similarity properties between datasets than other vectorized representations. To evaluate this method, we apply our framework to the problem of diagram clustering and we compare the quality and performance of our approach to the state-of-the-art. In addition, we show the scalability of our approach on a dataset with 10k persistence diagrams, which is not possible with current techniques. Moreover, our experimental results demonstrate that our method is significantly faster with the potential of less memory usage, while retaining comparable or better quality comparisons. 
    more » « less
  2. Goaoc, Xavier and (Ed.)
    The space of persistence diagrams under bottleneck distance is known to have infinite doubling dimension. Because many metric search algorithms and data structures have bounds that depend on the dimension of the search space, the high-dimensionality makes it difficult to analyze and compare asymptotic running times of metric search algorithms on this space. We introduce the notion of nearly-doubling metrics, those that are Gromov-Hausdorff close to metric spaces of bounded doubling dimension and prove that bounded $k$-point persistence diagrams are nearly-doubling. This allows us to prove that in some ways, persistence diagrams can be expected to behave like a doubling metric space. We prove our results in great generality, studying a large class of quotient metrics (of which the persistence plane is just one example). We also prove bounds on the dimension of the $k$-point bottleneck space over such metrics. The notion of being nearly-doubling in this Gromov-Hausdorff sense is likely of more general interest. Some algorithms that have a dependence on the dimension can be analyzed in terms of the dimension of the nearby metric rather than that of the metric itself. We give a specific example of this phenomenon by analyzing an algorithm to compute metric nets, a useful operation on persistence diagrams. 
    more » « less
  3. null (Ed.)
    Learning task-specific representations of persistence diagrams is an important problem in topological data analysis and machine learning. However, current state of the art methods are restricted in terms of their expressivity as they are focused on Euclidean representations. Persistence diagrams often contain features of infinite persistence (i.e., essential features) and Euclidean spaces shrink their importance relative to non-essential features because they cannot assign infinite distance to finite points. To deal with this issue, we propose a method to learn representations of persistence diagrams on hyperbolic spaces, more specifically on the Poincare ball. By representing features of infinite persistence infinitesimally close to the boundary of the ball, their distance to non-essential features approaches infinity, thereby their relative importance is preserved. This is achieved without utilizing extremely high values for the learnable parameters, thus the representation can be fed into downstream optimization methods and trained efficiently in an end-to-end fashion. We present experimental results on graph and image classification tasks and show that the performance of our method is on par with or exceeds the performance of other state of the art methods. 
    more » « less
  4. Abstract

    This paper presents a new clustering algorithm for space–time data based on the concepts of topological data analysis and, in particular, persistent homology. Employing persistent homology—a flexible mathematical tool from algebraic topology used to extract topological information from data—in unsupervised learning is an uncommon and novel approach. A notable aspect of this methodology consists in analyzing data at multiple resolutions, which allows for distinguishing true features from noise based on the extent of their persistence. We evaluate the performance of our algorithm on synthetic data and compare it to other well‐known clustering algorithms such asK‐means, hierarchical clustering, and DBSCAN (density‐based spatial clustering of applications with noise). We illustrate its application in the context of a case study of water quality in the Chesapeake Bay.

     
    more » « less
  5. Topological data analysis encompasses a broad set of techniques that investigate the shape of data. One of the predominant tools in topological data analysis is persistent homology, which is used to create topological summaries of data called persistence diagrams. Persistent homology offers a novel method for signal analysis. Herein, we aid interpretation of the sublevel set persistence diagrams of signals by 1) showing the effect of frequency and instantaneous amplitude on the persistence diagrams for a family of deterministic signals, and 2) providing a general equation for the probability density of persistence diagrams of random signals via a pushforward measure. We also provide a topologically-motivated, efficiently computable statistical descriptor analogous to the power spectral density for signals based on a generalized Bayesian framework for persistence diagrams. This Bayesian descriptor is shown to be competitive with power spectral densities and continuous wavelet transforms at distinguishing signals with different dynamics in a classification problem with autoregressive signals.

     
    more » « less