Clustering is a fundamental task in machine learning. One of the most successful and broadly used algorithms is DBSCAN, a density-based clustering algorithm. DBSCAN requires ϵ-nearest neighbor graphs of the input dataset, which are computed with range-search algorithms and spatial data structures like KD-trees. Despite many efforts to design scalable implementations for DBSCAN, existing work is limited to low-dimensional datasets, as constructing ϵ-nearest neighbor graphs can be expensive in high-dimensions. This article introduces a modified DBSCAN, usingk-nearest neighbor (kNN) graphs to improve efficiency. We outline conditions forkNN-DBSCAN to match DBSCAN’s results and present a parallel implementation using OpenMP and MPI for shared and distributed memory systems. Testing on datasets up to 32 dimensions, we achieve remarkable scalability. Our implementation clusters one billion 3D points in under one second on 28K cores at TACC’s Frontera system. In a larger run, we cluster 65 billion points in 20 dimensions in under 40 seconds using 114,688 cores. Our method is up to 37× faster than state-of-the-art parallel DBSCAN on a 20-dimensional dataset with 4 million points. Code is available athttps://github.com/ut-padas/knndbscan.
more »
« less
DBSpan: Density-Based Clustering Using a Spanner, With an Application to Persistence Diagrams
Since its introduction in the mid-1990s, DBSCAN has become one of the most widely used clustering algorithms. However, one of the steps in DBSCAN is to perform a range query, a task that is difficult in many spaces, including the space of persistence diagrams. In this paper, we introduce a spanner into the DBSCAN algorithm to facilitate range queries in such spaces. We provide a proof-of-concept implementation, and study time and clustering performance for two data sets of persistence diagrams.
more »
« less
- PAR ID:
- 10351417
- Date Published:
- Journal Name:
- Applications of Topological Data Analysis to Data Science, Artificial Intelligence, and Machine Learning (TDA at SDM)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Persistence diagrams have been widely used to quantify the underlying features of filtered topological spaces in data visualization. In many applications, computing distances between diagrams is essential; however, computing these distances has been challenging due to the computational cost. In this paper, we propose a persistence diagram hashing framework that learns a binary code representation of persistence diagrams, which allows for fast computation of distances. This framework is built upon a generative adversarial network (GAN) with a diagram distance loss function to steer the learning process. Instead of using standard representations, we hash diagrams into binary codes, which have natural advantages in large-scale tasks. The training of this model is domain-oblivious in that it can be computed purely from synthetic, randomly created diagrams. As a consequence, our proposed method is directly applicable to various datasets without the need for retraining the model. These binary codes, when compared using fast Hamming distance, better maintain topological similarity properties between datasets than other vectorized representations. To evaluate this method, we apply our framework to the problem of diagram clustering and we compare the quality and performance of our approach to the state-of-the-art. In addition, we show the scalability of our approach on a dataset with 10k persistence diagrams, which is not possible with current techniques. Moreover, our experimental results demonstrate that our method is significantly faster with the potential of less memory usage, while retaining comparable or better quality comparisons.more » « less
-
Goaoc, Xavier and (Ed.)The space of persistence diagrams under bottleneck distance is known to have infinite doubling dimension. Because many metric search algorithms and data structures have bounds that depend on the dimension of the search space, the high-dimensionality makes it difficult to analyze and compare asymptotic running times of metric search algorithms on this space. We introduce the notion of nearly-doubling metrics, those that are Gromov-Hausdorff close to metric spaces of bounded doubling dimension and prove that bounded $$k$$-point persistence diagrams are nearly-doubling. This allows us to prove that in some ways, persistence diagrams can be expected to behave like a doubling metric space. We prove our results in great generality, studying a large class of quotient metrics (of which the persistence plane is just one example). We also prove bounds on the dimension of the $$k$$-point bottleneck space over such metrics. The notion of being nearly-doubling in this Gromov-Hausdorff sense is likely of more general interest. Some algorithms that have a dependence on the dimension can be analyzed in terms of the dimension of the nearby metric rather than that of the metric itself. We give a specific example of this phenomenon by analyzing an algorithm to compute metric nets, a useful operation on persistence diagrams.more » « less
-
null (Ed.)Learning task-specific representations of persistence diagrams is an important problem in topological data analysis and machine learning. However, current state of the art methods are restricted in terms of their expressivity as they are focused on Euclidean representations. Persistence diagrams often contain features of infinite persistence (i.e., essential features) and Euclidean spaces shrink their importance relative to non-essential features because they cannot assign infinite distance to finite points. To deal with this issue, we propose a method to learn representations of persistence diagrams on hyperbolic spaces, more specifically on the Poincare ball. By representing features of infinite persistence infinitesimally close to the boundary of the ball, their distance to non-essential features approaches infinity, thereby their relative importance is preserved. This is achieved without utilizing extremely high values for the learnable parameters, thus the representation can be fed into downstream optimization methods and trained efficiently in an end-to-end fashion. We present experimental results on graph and image classification tasks and show that the performance of our method is on par with or exceeds the performance of other state of the art methods.more » « less
-
Abstract Topological data analysis (TDA) methods can be useful for classification and clustering tasks in many different fields as they can provide two dimensional persistence diagrams that summarize important information about the shape of potentially complex and high dimensional data sets. The space of persistence diagrams can be endowed with various metrics, which admit a statistical structure and allow to use these summaries for machine learning algorithms, e.g. the Wasserstein distance. However, computing the distance between two persistence diagrams involves finding an optimal way to match the points of the two diagrams and may not always be an easy task for classical computers. Recently, quantum algorithms have shown the potential to speedup the process of obtaining the persistence information displayed on persistence diagrams by estimating the spectra of persistent Dirac operators. So, in this work we explore the potential of quantum computers to estimate the distance between persistence diagrams as the next step in the design of a fully quantum framework for TDA. In particular we propose variational quantum algorithms for the Wasserstein distance as well as the distance. Our implementation is a weighted version of the quantum approximate optimization Algorithm that relies on control clauses to encode the constraints of the optimization problem.more » « less
An official website of the United States government

