<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search</title></titleStmt>
			<publicationStmt>
				<publisher>Society for Industrial and Applied Mathematics</publisher>
				<date>01/01/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10621431</idno>
					<idno type="doi">10.1137/1.9781611978759.1</idno>
					
					<author>Shangdi Yu</author><author>Joshua Engels</author><author>Yihao Huang</author><author>Julian Shun</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
<abstract><ab><![CDATA[In this paper, we study variants of density peaks clustering, a popular type of density-based clustering algorithm for points that has been shown to work well in practice. Our goal is to cluster large high-dimensional datasets, which are prevalent in practice. Prior solutions are either sequential and cannot scale to large data, or are specialized for low-dimensional data. This paper unifies the different variants of density peaks clustering into a single framework, PECANN (Parallel Efficient Clustering with Approximate Nearest Neighbors), by abstracting out several key steps common to this class of algorithms. One such key step is to find nearest neighbors that satisfy a predicate function, and one of the main contributions of this paper is an efficient way to do this predicate search using graph-based approximate nearest neighbor search (ANNS). To provide ample parallelism, we propose a doubling search technique that enables points to find an approximate nearest neighbor satisfying the predicate in a small number of rounds. Our technique can be applied to many existing graph-based ANNS algorithms, which can all be plugged into PECANN. We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading. Compared to the state-of-the-art FASTDP algorithm for high-dimensional density peaks clustering, which is sequential, our best algorithm is 45-734x faster while achieving competitive ARI scores. Compared to the state-of-the-art parallel DPC-based algorithm, which is optimized for low dimensions, PECANN is two orders of magnitude faster. As far as we know, we are the first to evaluate DPC variants on large high-dimensional real-world image and text embedding datasets.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Clustering is the task of grouping similar objects into clusters and is a fundamental task in data analysis and unsupervised machine learning <ref type="bibr">[48,</ref><ref type="bibr">1,</ref><ref type="bibr">8]</ref>. For example, clustering algorithms can be used to identify different types of tissues in medical imaging <ref type="bibr">[99]</ref>, analyze social networks <ref type="bibr">[68]</ref>, and identify weather regimes in climatology <ref type="bibr">[18]</ref>. They are also widely used as a data processing subroutine in other machine learning tasks <ref type="bibr">[20,</ref><ref type="bibr">97,</ref><ref type="bibr">60,</ref><ref type="bibr">66]</ref>. One popular type of clustering is density-based clustering, where clusters are defined as dense regions of points in space. Recently, density-based clustering algorithms have received a lot of attention <ref type="bibr">[32,</ref><ref type="bibr">2,</ref><ref type="bibr">6,</ref><ref type="bibr">50,</ref><ref type="bibr">77,</ref><ref type="bibr">92,</ref><ref type="bibr">43,</ref><ref type="bibr">42,</ref><ref type="bibr">82]</ref> because they can discover clusters of arbitrary shapes and detect outliers (unlike popular algorithms such as k-means, which can only detect spherical clusters).</p><p>Density peaks clustering (DPC) <ref type="bibr">[77]</ref> is a popular density-based clustering technique for spatial data (i.e., point sets) that has proven very effective at clustering challenging datasets with non-spherical clusters. Due to DPC's success, many DPC variants have been proposed in the literature (e.g., <ref type="bibr">[33,</ref><ref type="bibr">15,</ref><ref type="bibr">84,</ref><ref type="bibr">101,</ref><ref type="bibr">88,</ref><ref type="bibr">98,</ref><ref type="bibr">102,</ref><ref type="bibr">27,</ref><ref type="bibr">87,</ref><ref type="bibr">44,</ref><ref type="bibr">35]</ref>). 
However, existing DPC variants are sequential and/or tailored to low-dimensional data, and so cannot scale to the large, high-dimensional datasets that are common in practice.</p><p>This paper addresses this gap by proposing a novel framework called PECANN: Parallel Efficient Clustering with Approximate Nearest Neighbors. PECANN contains implementations for a variety of different DPC density techniques that both scale to large datasets (via efficient parallel implementations) and run on high-dimensional data (via approximate nearest neighbor search). Designing a unifying framework for DPC variants is non-trivial, as DPC variants can differ significantly. Developing a modular and extensible framework that can seamlessly incorporate various DPC variants and allow for easy comparison and experimentation requires careful abstraction and encapsulation of the key algorithmic components. Furthermore, extending DPC to high dimensions is challenging, as there are no efficient parallel solutions for constrained nearest neighbor search in high dimensions, which is needed for DPC. Before going into more details on our contributions, we review the main steps of DPC variants and discuss existing bottlenecks.</p><p>The three key steps of DPC variants are as follows:</p><p>(1) Compute the density of each point x.</p><p>(2) Construct a tree by connecting each point x to its closest neighbor with higher density than x.</p><p>(3) Remove edges in the tree according to a pruning heuristic. Each resulting connected component is a separate cluster.</p><p>Step (1) is computed differently based on the variant, but all variants use a function that depends on either the k-nearest neighbors of x or the points within a given distance from x. Efficient implementations of this step rely on nearest neighbor queries or range queries. In low dimensions, these queries can be answered efficiently using spatial trees, such as kd-trees. 
However, kd-trees are inefficient in high dimensions due to the curse of dimensionality <ref type="bibr">[96]</ref>. Step (2) again requires finding nearest neighbors, but with the constraint that only neighbors with higher density are considered. Step (3) can easily be computed using any connected components algorithm. Steps (1) and (2) form the bottleneck of the computation, and take quadratic work in the worst case, while Step (3) can be done in (near) linear work. Note that different clusterings can be generated by reusing the tree from Step (2) and simply re-running Step (3) using different pruning heuristics. The tree from Step (2) can be viewed as a cluster hierarchy (or dendrogram) that contains clusterings at different resolutions.</p><p>Existing papers on DPC variants mainly focus on their own proposed variant, and as far as we know, there is no unified framework for implementing and comparing DPC variants and evaluating them on the same datasets. Furthermore, most DPC papers focus on clustering low-dimensional data, but many datasets in practice are high-dimensional (d &gt; 100). The PECANN framework unifies a broad class of DPC variants by abstracting out these three steps and providing efficient parallel implementations for different variants of each step. For Step (1), we leverage graph-based approximate nearest neighbor search (ANNS) algorithms, which are fast and accurate in high dimensions <ref type="bibr">[65,</ref><ref type="bibr">91]</ref>. For Step (2), we adapt graph-based ANNS algorithms to find higher density neighbors by iteratively doubling the number of nearest neighbors returned until finding one that has higher density. Our doubling search guarantees that the algorithm finishes in a logarithmic number of rounds, making it highly parallel. 
For Steps <ref type="bibr">(1)</ref> and <ref type="bibr">(2)</ref>, PECANN supports the following graph-based ANNS algorithms: VAMANA <ref type="bibr">[52]</ref>, PYNNDESCENT <ref type="bibr">[67]</ref>, and HCNNG <ref type="bibr">[70]</ref>. For Step (3), we use a concurrent union-find algorithm <ref type="bibr">[51]</ref> to achieve high parallelism. Prior work <ref type="bibr">[84]</ref> has explored using graph-based ANNS for high-dimensional clustering, but their algorithm is not parallel and they only consider one DPC variant and one underlying ANNS algorithm. In addition, we provide theoretical work and span bounds of PECANN that depend on the complexity of the underlying ANNS algorithm. PECANN is implemented in C++, using the ParlayLib <ref type="bibr">[9]</ref> and ParlayANN <ref type="bibr">[65]</ref> libraries, and also has Python bindings.</p><p>We use PECANN to implement five DPC variants and evaluate them on a variety of synthetic and real-world data sets with up to 1.28 million points and up to 1024 dimensions. We find that using a density function that is the inverse of the distance to the k th nearest neighbor, combined with the VAMANA algorithm for ANNS, gives the best overall performance. 
On a 30-core machine with two-way hyper-threading, this best algorithm in PECANN achieves 37.7-854.3x speedup over a parallel brute force approach, and 45-734x speedup over FASTDP <ref type="bibr">[84]</ref>, the state-of-the-art DPC-based algorithm for high dimensions, while achieving similar accuracy in terms of ARI score. FASTDP is sequential, but even if we assume that it achieves a perfect speedup of 60x, PECANN still achieves a speedup of 0.76-12.24x. Compared to the state-of-the-art parallel density peaks clustering algorithm by Huang et al. <ref type="bibr">[45]</ref>, which is optimized for low dimensions, our best algorithm achieves a 320x speedup while achieving a higher ARI score on the MNIST dataset (their algorithm failed on larger datasets).</p><figure type="table"><head>Table 2.1: Notation</head><table><row><cell>P</cell><cell>input set of points</cell></row><row><cell>n, d</cell><cell>size and dimensionality of P</cell></row><row><cell>x i</cell><cell>i th point in P</cell></row><row><cell>G</cell><cell>a similarity search index</cell></row><row><cell>D(x i , x j )</cell><cell>distance (dissimilarity) between x i and x j</cell></row><row><cell>&#961; i , &#955; i</cell><cell>density and dependent point of x i</cell></row><row><cell>&#948; i</cell><cell>dependent distance of x i (i.e., D(x i , &#955; i ))</cell></row><row><cell>k</cell><cell>the number of neighbors used for computing densities</cell></row><row><cell>N i</cell><cell>(approximate) k-nearest neighbors of x i</cell></row><row><cell>W c , S c</cell><cell>the work and span of constructing G</cell></row><row><cell>W nn , S nn</cell><cell>the work and span of finding nearest neighbors using G</cell></row></table></figure><p>Our contributions are summarized below. 1. We introduce the PECANN framework that unifies existing k-nearest neighbor-based DPC variants and supports parallel implementations of them that scale to large high-dimensional datasets. We provide fast parallel implementations for five DPC variants. 2. We extend graph-based ANNS algorithms with a parallel doubling-search method for finding higher density neighbors. 3. We perform comprehensive experiments on a 30-core machine with two-way hyper-threading showing that PECANN outperforms the state-of-the-art DPC-based algorithm for high dimensions by 45-734x. 
As far as we know, we are the first to compare different variants of DPC on large high-dimensional real-world image and text embedding datasets.</p><p>Our code and the full version of the paper are available at <ref type="url">https://github.com/yushangdi/PECANN-DPC</ref>.</p></div>
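To make the three-step structure described above concrete, the following is a minimal brute-force sketch of a DPC-style pipeline in Python (illustrative only: PECANN itself is a parallel C++ framework, and all names here are ours, using the inverse-distance-to-k-th-neighbor density and a &#948; min edge cut as the pruning heuristic):

```python
import math

def dpc_bruteforce(points, k, delta_min):
    """Toy DPC pipeline: (1) k-NN densities, (2) dependency tree, (3) edge cut."""
    n = len(points)
    # Step 1: density of x_i = inverse of the distance to its k-th nearest neighbor.
    rho = []
    for i in range(n):
        ds = sorted(math.dist(points[i], points[j]) for j in range(n) if j != i)
        rho.append(1.0 / ds[k - 1])
    # Step 2: connect each point to the closest point of strictly higher density.
    parent = list(range(n))
    delta = [math.inf] * n      # dependent distances; inf -> no dependent point
    for i in range(n):
        cands = [(math.dist(points[i], points[j]), j)
                 for j in range(n) if rho[j] > rho[i]]
        if cands:
            delta[i], parent[i] = min(cands)
    # Step 3: cut edges longer than delta_min; each resulting component is one
    # cluster, labeled by the root reached by following uncut edges upward.
    def root(i):
        while delta[i] <= delta_min:
            i = parent[i]
        return i
    return [root(i) for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = dpc_bruteforce(pts, k=2, delta_min=3.0)
```

On this toy input, the two well-separated groups end up in two clusters, rooted at their locally densest points.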
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries</head><p>2.1 Definitions and Notation. A summary of the notation is provided in Table <ref type="table">2</ref>.1. Let P = {x 1 , . . . , x n } represent a set of n points in d-dimensional coordinate space to be clustered. We use x i to represent the i th point in P . Let G be a search index that supports searching for the exact or approximate nearest neighbors of a query point. Let D(x i , x j ) denote the distance (dissimilarity) between points x i and x j , where a larger distance value means the points are less similar. D can be any distance measure the search index G supports. Let the neighbors (N i ) of a point x i be either its exact or approximate k-nearest neighbors. Let &#961; i be the density of point x i , representing how dense the local region around x i is. A larger &#961; i value indicates a denser local region. For example, in the original DPC algorithm <ref type="bibr">[77]</ref>, the density of a point x is the number of points within a given radius of x, and in the SD-DP (sparse dual of density peaks) algorithm <ref type="bibr">[33]</ref>, the density of a point is the inverse of its distance to its k th nearest neighbor. In this paper, we consider densities that can be computed from the k-nearest neighbors of x.</p><p>DEFINITION 2.1. The dependent point (&#955; i ) of x i is the point in P closest to x i among all points with density higher than &#961; i (i.e., it is the closest point with higher density than x i ). The dependent distance (&#948; i ) of x i is D(x i , &#955; i ), i.e., the distance to its dependent point (or &#8734; if it does not have one).</p><p>Definition 2.1 defines the dependent point to be the closest point with higher density, which is expensive to compute in high dimensions. For high-dimensional data, we relax the constraint to allow reporting an approximate nearest neighbor with higher density (i.e., considering just the points with higher density, choose approximately the closest one). 
Roughly speaking, an approximate nearest neighbor of a point x is one whose distance from x is not too far from the distance of the true nearest neighbor from x. In our experiments, we use the Euclidean distance function, one of the most commonly used distance functions for clustering.</p><p>Points that are outliers and do not belong to any cluster are classified as noise points. A noise point is in its own singleton cluster. For example, some algorithms require a density cutoff parameter &#961; min , and points that have &#961; i &lt; &#961; min are considered noise points. A cluster center is a point whose density is a local maximum within a cluster. Each cluster center corresponds to a separate cluster. One way to pick cluster centers is using a parameter &#948; min , where a point x i is considered a cluster center if &#948; i &gt; &#948; min .</p><p>We use the work-span model <ref type="bibr">[49,</ref><ref type="bibr">21]</ref>, a standard model of computation for analyzing shared-memory parallel algorithms. The work W of an algorithm is the total number of operations executed by the algorithm, and the span S is the length of the longest sequential dependence of the algorithm (it is also the parallel time complexity when there are an infinite number of processors). We can bound the expected running time of an algorithm on P processors by W/P + O(S) using a randomized work-stealing scheduler <ref type="bibr">[10]</ref>.</p></div>
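As a small illustration of Definition 2.1 and the &#961; min /&#948; min rules above, the snippet below computes dependent points, dependent distances, and the noise and center sets by brute force (the coordinates and densities are made up for the example, and the helper names are ours, not PECANN's):

```python
import math

# Toy data: two tight pairs of points plus one far-away outlier.
points = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (40.0, 40.0)]
rho    = [3.0, 2.0, 3.0, 2.0, 0.1]   # assumed densities for the example

def dependent(i):
    """(delta_i, lambda_i): distance and index of the closest denser point."""
    cands = [(math.dist(points[i], points[j]), j)
             for j in range(len(points)) if rho[j] > rho[i]]
    return min(cands) if cands else (math.inf, None)

delta, lam = zip(*(dependent(i) for i in range(len(points))))

rho_min, delta_min = 0.5, 3.0
noise   = [i for i in range(len(points)) if rho[i] < rho_min]
centers = [i for i in range(len(points)) if i not in noise
           and delta[i] >= delta_min]
```

Here the outlier is classified as noise, and the densest point of each pair (with dependent distance &#8734;) becomes a cluster center.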
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Relevant Techniques</head><p>Graph-based Approximate Nearest Neighbor Search. We use approximate nearest neighbor search (ANNS) algorithms in PECANN. Graph-based ANNS algorithms can find approximate nearest neighbors in high dimensions efficiently and accurately compared to alternatives such as locality-sensitive hashing, inverted indices, and tree-based indices <ref type="bibr">[91,</ref><ref type="bibr">65,</ref><ref type="bibr">95]</ref>. These algorithms first construct a graph index on the input points, and later answer nearest neighbor queries by traversing the graph using a greedy search. Some popular methods include Vamana <ref type="bibr">[52]</ref>, HNSW <ref type="bibr">[64]</ref>, HCNNG <ref type="bibr">[70]</ref>, and PyNNDescent <ref type="bibr">[67]</ref>. Manohar et al. <ref type="bibr">[65]</ref> provide parallel implementations for constructing these indices, as well as a sequential implementation for running a single query. Multiple queries can be processed in parallel. We describe more graph-based ANNS methods in Section 7. Graph-based indices usually support any distance measure, while some indices <ref type="bibr">[52,</ref><ref type="bibr">67]</ref> use heuristics that assume the triangle inequality holds.</p><p>ANNS on a Graph Index. We use the function G.FIND-KNN(x, k) to perform an ANNS on a graph G for the point x, which returns the approximate k-nearest neighbors of x. Most graph-based ANNS methods use a variant of a greedy (beam) search (Algorithm 2.1) to answer a k-nearest neighbor query <ref type="bibr">[65]</ref>.</p><p>Algorithm 2.1 Greedy Beam Search, modified from <ref type="bibr">[52]</ref>. Input: Query point x, starting point set S, graph index G, beam width L, dissimilarity measure D, and integer k.
1: V &#8592; &#8709; &#9655; processed (visited) points
2: L &#8592; S &#9655; points in the beam
3: while L \ V &#8800; &#8709; do
4: p * &#8592; argmin (q&#8712;L\V) D(x, q)
5: L &#8592; L &#8746; G.E out (p * )
6: V &#8592; V &#8746; {p * }
7: if |L| &gt; L then L &#8592; the L points in L closest to x
8: return the k points in L closest to x, and V
For a query point x, the algorithm maintains a beam L with size at most L (the width of the beam) as a set of candidates for the nearest neighbors of x. Let G.E out (x) be the vertices incident to the edges going out from x in G. We call these the out-neighbors of x. In each step, the algorithm pops the closest unvisited vertex to x from L (Line 4), and processes it by adding all of its out-neighbors to the beam (Line 5). The set V maintains all points that have been processed (Line 6). If |L| exceeds L, only the L closest points to x are kept (Line 7). The algorithm stops when all vertices in the beam have been visited, as no new vertices can be explored (Line 3). The algorithm returns the k closest points to x from L and the visited point set V (Line 8).</p><p>In some cases, it is possible that the algorithm traverses fewer than k points for a query, and thus returns fewer than k points. To solve this problem, options include using a brute force search or repeating the search from other starting points.</p><p>Parallel Primitives. PAR-FILTER(A, f ) takes as input a sequence of elements A and a predicate f , and returns all elements a &#8712; A such that f (a) is true. PAR-ARGMIN(A, f ) takes as input a sequence of elements A and a function f : A &#8594; R, and returns the element a &#8712; A that has the minimum f (a). PAR-SUM(A) takes as input a sequence of numbers A, and returns the sum of the numbers in A. PAR-FILTER, PAR-ARGMIN, and PAR-SUM all take O(n) work and O(log n) span. PAR-SELECT(A, k) takes as input a sequence of elements A and an integer 0 &lt; k &#8804; |A|, and returns the k th largest element in A. It takes O(n) work and O(log n log log n) span <ref type="bibr">[49]</ref>. We use the implementations of these primitives from ParlayLib <ref type="bibr">[9]</ref>.</p><p>A union-find data structure maintains the set membership of elements and allows the sets to merge. Initially, each element is in its own set. A UNION(a, b) operation merges the sets containing a and b into the same set. A FIND(a) operation returns the membership of element a. We use a concurrent union-find data structure <ref type="bibr">[51]</ref>, which supports operations in parallel. Performing m unions on a set of n elements takes O(m(log(n/m + 1) + &#945;(n, n))) work and O(log n) span (&#945; denotes the inverse Ackermann function).</p><p>Algorithm 3.1 PECANN Framework. Input: Point set P , integer k &gt; 0, distance measure D, F density , F noise , F center .
1: G = BUILDINDEX(P )
2: parfor i &#8712; 1 . . . n do
3: N i &#8592; G.FIND-KNN(x i , k) &#9655; find k-nearest neighbors
4: parfor i &#8712; 1 . . . n do
5: &#961; i &#8592; F density (x i , N i ) &#9655; compute densities
6: &#955; &#8592; COMPUTEDEPPTS(G, P , &#961;, N , D)
7: P noise &#8592; F noise (P , &#961;, &#955;, N ) &#9655; compute noise points
8: P center &#8592; F center (P \ P noise , &#961;, &#955;, N ) &#9655; compute center points
9: Initialize a union-find data structure U F with size n = |P |
10: parfor x i &#8712; P \ (P noise &#8746; P center ) do
11: U F .UNION(i, &#955; i )
12: parfor i &#8712; 1 . . . n do
13: c i &#8592; U F .FIND(i) &#9655; assign cluster IDs</p><p>Algorithm 3.2 Parallel Dependent Point Computation (COMPUTEDEPPTS). Input: Index G, point set P , densities &#961;, k-nearest neighbors N , distance measure D.
1: function DPBRUTEFORCE(x i , N candidates )
2: N higher &#8592; PAR-FILTER(N candidates , x j : &#961; j &gt; &#961; i )
3: if N higher = &#8709; then
4: return &#8709;
5: return PAR-ARGMIN(N higher , x j : D(x i , x j ))
6: function COMPUTEDEPPTS(G, P , &#961;, N , D)
7: parfor x i &#8712; P do
8: &#955; i &#8592; DPBRUTEFORCE(x i , N i )
9: P unfinished &#8592; PAR-FILTER(P , x i : &#955; i = &#8709;), excluding the maximum-density point
10: k dep &#8592; L d
11: while |P unfinished | &gt; threshold do
12: parfor x i &#8712; P unfinished do
13: N candidates &#8592; G.FIND-KNN(x i , k dep )
14: &#955; i &#8592; DPBRUTEFORCE(x i , N candidates )
15: k dep &#8592; 2 &#183; k dep
16: P unfinished &#8592; PAR-FILTER(P unfinished , x i : &#955; i = &#8709;)
17: parfor x i &#8712; P unfinished do
18: &#955; i &#8592; DPBRUTEFORCE(x i , P )
19: return &#955;</p></div>
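The greedy beam search of Algorithm 2.1 can be sketched as follows (a simplified sequential sketch: real implementations keep the beam sorted, and the graph, coordinates, and function names here are our own toy stand-ins):

```python
import math

def beam_search(query, start, out_nbrs, coords, width, k):
    """Greedy beam search sketch in the spirit of Algorithm 2.1."""
    dist = lambda v: math.dist(coords[v], query)
    beam = set(start)        # L: current candidates for the nearest neighbors
    visited = set()          # V: processed vertices
    while beam - visited:
        p = min(beam - visited, key=dist)      # closest unvisited candidate
        beam |= set(out_nbrs[p])               # expand p's out-neighbors
        visited.add(p)
        beam = set(sorted(beam, key=dist)[:width])  # keep the `width` closest
    return sorted(beam, key=dist)[:k], visited

# A small directed graph on points along a line; the search starts at node 0
# and greedily walks toward the query near node 4.
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
out_nbrs = {0: [1, 2], 1: [0, 3], 2: [3], 3: [4], 4: [5, 3], 5: [4]}
knn, visited = beam_search((4.1, 0.0), [0], out_nbrs, coords, width=2, k=1)
```

The loop terminates because each iteration marks one new vertex as visited, and a larger beam width trades extra work for a more accurate result.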
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">PECANN Framework</head><p>We present the PECANN framework in Algorithm 3.1. To make our description of the framework more concrete, we will give an example of instantiating the framework in this section. Section 4 will provide more examples and Section 5 will provide the work and span analysis of PECANN.</p><p>The input to PECANN is a point set P , a positive integer k, a distance measure D, and three functions F density , F noise , and F center that indicate how the density, noise points, and center points are computed, respectively. In the pseudocode, &#961; is an array of densities of all points in P and N is an array containing k-nearest neighbors for all points. &#955; is an array containing dependent points. c is an array containing the cluster IDs of all points and c i is the cluster ID of x i . The framework has the following six steps.</p><p>1. Construct Index. On Line 1, we construct an index G, which can be any index that supports k-nearest neighbor search. For example, it can be a kd-tree, which is suitable for low-dimensional exact k-nearest neighbor search <ref type="bibr">[34]</ref>, or a graph-based index for ANNS on high-dimensional data <ref type="bibr">[65,</ref><ref type="bibr">52,</ref><ref type="bibr">64,</ref><ref type="bibr">70,</ref><ref type="bibr">67]</ref>. It can also be an empty data structure, which would lead to doing brute force searches to find the exact k-nearest neighbors. An example of a graph index corresponding to a point set is shown in Figure <ref type="figure">3</ref>.1.</p></div>
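For instance, the "empty data structure" option above amounts to a brute-force scan. A minimal sketch of such a fallback index exposing a FIND-KNN-style interface (the class and method names are ours, not PECANN's API):

```python
import math

class BruteForceIndex:
    """Fallback 'index' with no structure at all: exact k-NN by scanning
    every point. Mirrors the G.FIND-KNN(x_i, k) interface of the framework."""
    def __init__(self, points):
        self.points = points

    def find_knn(self, i, k):
        x = self.points[i]
        order = sorted((j for j in range(len(self.points)) if j != i),
                       key=lambda j: math.dist(self.points[j], x))
        return order[:k]

pts = [(0, 0), (1, 0), (0, 1), (9, 9)]
g = BruteForceIndex(pts)
```

Any structure with this interface (kd-tree, Vamana, HCNNG, ...) can be swapped in without changing the rest of the pipeline.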
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Compute k-nearest Neighbors</head><p>On Lines 2-3, we compute the k-nearest neighbors of all points in parallel, using the index G. If we run the greedy search (Algorithm 2.1) on the example in Figure <ref type="figure">3</ref>.1 with k = 1, L = 1, and S containing only the query point, we would find that the nearest neighbors of a, b, c, d, e, and f are c, c, b, f , d, and d, respectively (here we assume that the graph index returns exact nearest neighbors).</p><p>3. Compute Densities. On Lines 4-5, we compute the density for each point in parallel using F density . An example density function is &#961; i = 1/D(x i , x j ), where x j is the furthest neighbor from x i in N i <ref type="bibr">[33]</ref>. For this density function, the ranking of the densities of the points in Figure <ref type="figure">3</ref>.1 from high to low (breaking ties by node ID) is b, c, a, d, f, e.</p><p>4. Compute Dependent Points. On Line 6, we compute the dependent point of all points in parallel. The dependent points in our example are shown in Figure <ref type="figure">3</ref>.2. We explain the details of how we compute the dependent points in Subsection 3.1. As mentioned in Section 1, the resulting tree from this step is a hierarchy of clusters (dendrogram), which can be returned if desired. To compute a specific clustering, the following two steps are needed.</p><p>5. Compute Noise and Center Points. On Lines 7-8, we compute the noise and center points using the input functions F noise and F center . An example of F noise is PAR-FILTER(P , x i : &#961; i &lt; &#961; min ), where &#961; min is a user-defined parameter: points whose densities are below &#961; min are classified as noise points. An example of F center is PAR-FILTER(P , x i : D(x i , &#955; i ) &#8805; &#948; min ), where &#948; min is a user-defined parameter. 
Non-noise points whose distance from their dependent point is at least &#948; min are classified as center points. In our example (Figure <ref type="figure">3</ref>.3), if we let &#961; min = 1/&#8730;2, then e is a noise point. If we let &#948; min = 2.5, then b and d are center points.</p><p>6. Compute Clusters. On Lines 9-13, we compute the clusters using a concurrent union-find data structure <ref type="bibr">[51]</ref>. In parallel, for all points that are not noise points or center points, we merge them into the same cluster as their dependent point. This ensures that points (except noise points and center points) are in the same cluster as their dependent point.</p><p>Figure 3.2: Each point has an outgoing edge to its dependent point.</p><p>Figure 3.3: Clustering result with e as a noise point (white circle), and b and d as center points (blue diamonds). The dashed edges are ignored during the union step (Line 11 of Algorithm 3.1). The two blue circles are the two clusters found.</p></div>
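Steps 5-6 (noise/center selection followed by union-find clustering) can be sketched as below, with indices 0-5 standing in for points a-f; the densities and dependent distances are made-up stand-ins chosen so that, as in the running example, e is noise and b and d are centers:

```python
class UnionFind:
    """Sequential union-find sketch (PECANN uses a concurrent version)."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# lam[i] = dependent point, delta[i] = dependent distance, rho[i] = density.
lam   = [1, None, 1, None, 3, 3]       # b (1) and d (3) have no dependent point
delta = [1.0, float("inf"), 1.5, float("inf"), 1.0, 1.2]
rho   = [1.0, 2.0, 1.8, 1.5, 0.5, 1.1]
rho_min, delta_min = 0.7, 2.5

noise   = {i for i in range(6) if rho[i] < rho_min}
centers = {i for i in range(6) if i not in noise and delta[i] >= delta_min}
uf = UnionFind(6)
for i in range(6):
    if i not in noise and i not in centers:
        uf.union(i, lam[i])            # merge with the dependent point
clusters = [uf.find(i) if i not in noise else i for i in range(6)]
```

Each noise point stays in its own singleton cluster, and every other point ends up in the component of some center.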
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Dependent Point Computation</head><p>Our parallel algorithm for computing the dependent points (Algorithm 3.2) takes as input the index G, the point set P , the array of densities &#961;, the array of (approximate) k-nearest neighbors N , and the distance measure D.</p><p>DPBRUTEFORCE is a helper function (Lines 1-5) that searches for the nearest neighbor of x i with density higher than &#961; i among N candidates using brute force. It returns &#8709; if no points in N candidates have a higher density than &#961; i .</p><p>On Lines 7-8, we first search within the k-nearest neighbors of each point to find its dependent point. This optimization is also used in several other works <ref type="bibr">[33,</ref><ref type="bibr">88,</ref><ref type="bibr">15]</ref>. On Line 9, we obtain the set of points P unfinished that have not found their dependent points. Line 10 initializes k dep to L d .</p><p>L d and threshold are parameters used for our performance optimizations. We defer a discussion of these parameters to Subsection 3.2, and ignore their effect here by setting L d to be 2k and threshold to be 0 (this causes Lines 17-18 to have no effect, since P unfinished will be empty at that point).</p><p>The while-loop on Line 11 terminates when all points have found their dependent point. On Lines 12-14, we compute the dependent point for points in P unfinished . If the index is designed for approximate k-nearest neighbor search, we guarantee that the dependent point has a higher density, but it might not be the closest among points with higher densities. Note that on Line 12, we can skip the point with maximum density, since we know that it does not have a dependent point. On Lines 13-14, for each point, we find k dep neighbors of x i on each round, and if any of the neighbors have a higher density than x i , we can return the closest such neighbor as the dependent point. 
We then double k dep for the next round (Line 15). A similar doubling optimization is used in <ref type="bibr">[15]</ref>, but with a cover tree; furthermore, their algorithm is sequential. On Line 16, we compute the set of points P unfinished that have not found their dependent point.</p><p>Example. On the dataset from Figure <ref type="figure">3</ref>.1, points a, c, e, and f would find their dependent point among their k-nearest neighbors (k = 1) on Lines 7-8, because their nearest neighbor has higher density than themselves. b is the point with maximum density and is skipped. For the remaining point d, on the first round we have k dep = 2, and so N candidates = {e, f }. This does not contain any point with a higher density than d, so we double k dep to 4 and try again. On the second round, k dep = 4, and so N candidates = {b, c, e, f }, which contains d's dependent point c.</p></div>
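The doubling search of Algorithm 3.2 can be sketched sequentially as follows (a brute-force k-NN function stands in for a graph index, names are ours, densities are assumed distinct, and the threshold/L d optimizations are omitted):

```python
import math

def compute_dependent_points(points, rho, knn, find_knn, k):
    """Doubling search for dependent points, in the spirit of Algorithm 3.2.
    find_knn(i, k') returns the k' nearest neighbors of point i (any index)."""
    n = len(points)
    lam = [None] * n

    def closest_higher(i, cands):
        higher = [(math.dist(points[i], points[j]), j)
                  for j in cands if rho[j] > rho[i]]
        return min(higher)[1] if higher else None

    top = max(range(n), key=lambda i: rho[i])  # densest point: no dependent point
    for i in range(n):                         # PECANN runs this loop in parallel
        if i == top:
            continue
        lam[i] = closest_higher(i, knn[i])     # first look inside the known k-NN
        k_dep = 2 * k
        while lam[i] is None:                  # double until a denser point appears
            lam[i] = closest_higher(i, find_knn(i, min(k_dep, n - 1)))
            k_dep *= 2
    return lam

pts = [(0, 0), (1, 0), (2, 0), (3, 0)]
rho = [4.0, 1.0, 2.0, 3.0]

def bf_knn(i, kk):   # brute-force stand-in for a graph index's FIND-KNN
    return sorted((j for j in range(len(pts)) if j != i),
                  key=lambda j: math.dist(pts[j], pts[i]))[:kk]

knn = [bf_knn(i, 1) for i in range(len(pts))]
lam = compute_dependent_points(pts, rho, knn, bf_knn, k=1)
```

Since k dep is capped at n - 1 and every non-maximum point has some denser point, each search finishes after at most a logarithmic number of doublings.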
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Performance Optimizations</head><p>Dependent Point Finding. We now explain the two integer parameters L d and threshold. The while-loop on Line 11 checks if |P unfinished | &gt; threshold, and when that is no longer true, we do a brute force k-nearest neighbor computation for the remaining points in P unfinished on Lines 17-18. This optimization is useful because for the points with relatively high density, it can be challenging for the index to find a dependent point (as most neighbors have lower density than them), and for these last few points it is faster to just do a brute force search than to continue doubling k dep . Furthermore, when few points remain, there is less parallelism available when calling FIND-KNN, each call of which is sequential, compared to the brute force search, which is highly parallel. In our experiments, we set threshold = 300, which we found to work well.</p><p>L d is a tunable parameter that is &gt; k (Line 10) and indicates the initial number of nearest neighbors to search for to find a dependent point (Line 13). A larger value of L d leads to fewer iterations. However, points that require fewer than L d nearest neighbors to find a dependent point will do some extra work (as they search for more nearest neighbors than necessary). On the other hand, points that require at least L d nearest neighbors to find a dependent point will do less work overall (they do not need to waste work on the initial rounds where they would not find a dependent point anyway).</p><p>Vamana Graph Construction. Vamana <ref type="bibr">[52,</ref><ref type="bibr">65]</ref> is one of the graph-based indices that we use for ANNS. Its parallel construction algorithm <ref type="bibr">[65]</ref> builds the graph by running greedy search (from Algorithm 2.1) on each point x i (in batches), and then adds edges from x i to points visited during the search (V). 
It requires a degree bound parameter R, such that in the constructed graph each vertex has at most R out-neighbors. If adding edges between x i and V causes a vertex's degree to exceed R, a pruning procedure is called to iteratively select at most R out-neighbors. The pruning algorithm also has a parameter &#945; &#8805; 1 that controls how aggressive the pruning is; a higher &#945; corresponds to more aggressive pruning, which can lead to fewer than R neighbors being selected. Intuitively, this heuristic prunes the long edge of a triangle, with a slack of &#945;. The details of the pruning algorithm can be found in <ref type="bibr">[52]</ref>.</p><p>The original Vamana graph construction algorithm <ref type="bibr">[52,</ref><ref type="bibr">65]</ref> starts the greedy search from a single point, which is the medoid of P . Starting from a single point can make the algorithm require a high degree bound and beam width to achieve good results on clustered data, because a search can become trapped within the cluster that contains the medoid. Instead of using a large degree bound and beam width, which degrades performance, we use an optimization where we randomly sample a set of starting points for the Vamana graph construction algorithm instead of starting from the medoid alone. This heuristic is also explored in <ref type="bibr">[59]</ref>.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Usage of PECANN</head><p>PECANN allows users to plug in functions that can be combined to obtain new clustering algorithms. In this section, we describe several functions and provide their work and span bounds.</p></div>
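The triangle-pruning idea can be illustrated with a RobustPrune-style rule in the spirit of <ref type="bibr">[52]</ref> (this is our own simplified formulation, not PECANN's exact procedure): candidates are scanned closest-first, and a candidate c is dropped when an already-kept neighbor q covers it within slack &#945;, i.e., the long edge p-c of the triangle p-q-c is cut:

```python
import math

def alpha_prune(p, candidates, coords, R, alpha):
    """Keep at most R out-neighbors of p, dropping candidate c whenever
    some already-kept q satisfies alpha * D(q, c) <= D(p, c)."""
    d = lambda a, b: math.dist(coords[a], coords[b])
    kept = []
    for c in sorted(candidates, key=lambda c: d(p, c)):  # closest first
        if all(alpha * d(q, c) > d(p, c) for q in kept):
            kept.append(c)        # no kept neighbor covers c, so keep the edge
            if len(kept) == R:    # degree bound reached
                break
    return kept

# Points on a line: node 1 covers node 2 (the edge 0-2 is a long triangle
# edge), but the far-away node 3 survives under slack alpha = 1.2.
coords = [(0, 0), (1, 0), (2, 0), (10, 0)]
kept = alpha_prune(0, [1, 2, 3], coords, R=3, alpha=1.2)
```

With a smaller R the scan simply stops earlier, so only the closest surviving candidates are retained.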
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Indices</head><p>Here we describe several approaches for building indices for k-nearest neighbor search. Let the work and span of constructing G be W c and S c , respectively.</p><p>Brute Force. The brute force approach does not use an index at all. When searching for the exact k-nearest neighbors of x i , it uses a PAR-SELECT to find the k th smallest distance to x i , and a PAR-FILTER to filter for the points with smaller distances to x i . In this case, W c and S c are O(1), while W nn and S nn are O(n) and O(log n log log n), respectively.</p><p>Tree Indices. Another option is to use a tree index, such as a kd-tree or a cover tree <ref type="bibr">[15]</ref>. For a parallel kd-tree, W c = O(n log n) and S c = O(log n log log n) <ref type="bibr">[94]</ref>. A parallel cover tree can be constructed in O(n log n) expected work and O(log 3 n log log n) span with high probability <ref type="bibr">[40]</ref>. A k-nearest neighbor search in a kd-tree takes O(n) work and O(log n) span. A k-nearest neighbor search in a cover tree takes O(c 7 (k + c 3 ) log k log &#8710;) expected work and span <ref type="bibr">[40,</ref><ref type="bibr">30,</ref><ref type="bibr">29]</ref>, where c is the expansion constant of P and &#8710; is the aspect ratio of P . However, these tree indices usually suffer from the curse of dimensionality and do not perform well on high-dimensional datasets.</p><p>Graph Indices. Graph-based ANNS algorithms have been shown to be efficient and accurate in finding approximate nearest neighbors in high dimensions <ref type="bibr">[91,</ref><ref type="bibr">65,</ref><ref type="bibr">95]</ref>. Our framework includes three parallel graph indices from the ParlayANN library <ref type="bibr">[65]</ref>: Vamana <ref type="bibr">[52]</ref>, HCNNG <ref type="bibr">[70]</ref>, and PyNNDescent <ref type="bibr">[67]</ref>. Similar to Vamana, HCNNG also uses the parameter &#945; to prune edges. 
HCNNG and PYNNDESCENT also accept a num_repeats argument, which specifies how many times they independently repeat the construction process before merging the results together.</p><p>When the number of returned neighbors is less than k, we use the brute force method to find the exact k-nearest neighbors. While these graph indices have been shown to work well in practice, only a few works theoretically analyze their performance <ref type="bibr">[71,</ref><ref type="bibr">74,</ref><ref type="bibr">83,</ref><ref type="bibr">56,</ref><ref type="bibr">47]</ref>. Indyk and Xu <ref type="bibr">[47]</ref> show that Vamana construction takes W_c = O(n^3) work. In practice, we find that the work is usually much lower. Using the batch insertion method <ref type="bibr">[65]</ref>, which inserts points in batches of doubling size, Vamana construction takes S_c = O(n^2 log n) span.<ref type="foot">1</ref></p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Density, Center, and Noise Functions</head><p>Here, we describe a subset of the density, center, and noise functions (F_density, F_center, and F_noise) that we implement in PECANN. We describe other functions we implement in the full paper. kth Density Function. The density of x_i is &#961;_i = 1 / D(x_i, x_j), where x_j is the furthest neighbor from x_i in N_i, i.e., D(x_i, x_j) is the distance to the exact or approximate kth nearest neighbor of x_i <ref type="bibr">[33,</ref><ref type="bibr">15]</ref>. Each density computation takes O(k) work and O(log k) span to find the furthest neighbor in N_i.</p><p>The density can also be normalized <ref type="bibr">[44]</ref>. The normalized density (normalized) is &#961;'_i = &#961;_i / ((1/k) &#8721;_{x_j &#8712; N_i} &#961;_j). Intuitively, this function normalizes a point's density by the average of the densities of its neighbors. Each normalization takes an extra O(k) work and O(log k) span. Threshold Center Function. Recall from Section 2 that &#948;_i = D(x_i, &#955;_i) is the dependent distance of x_i. 
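A minimal sketch of the kth and normalized density functions, assuming the k-nearest neighbor distances and indices have already been computed (array names are illustrative, and the normalization shown is one reading of the scheme in [44]):

```python
import numpy as np

def kth_density(knn_dists):
    """kth density: inverse of the distance to the (exact or approximate)
    k-th nearest neighbor. `knn_dists` has shape (n, k); the row maximum
    is the distance to the furthest neighbor in N_i."""
    return 1.0 / np.max(knn_dists, axis=1)

def normalized_density(rho, knn_idx):
    """Normalize each density by the mean density of the point's k
    nearest neighbors. `knn_idx` has shape (n, k)."""
    return rho / np.mean(rho[knn_idx], axis=1)
```

Both functions are O(k) work per point, matching the bounds above.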
F_center obtains the center points by selecting the points whose distance to their dependent point is greater than &#948;_min, a user-defined parameter. This can be implemented with a PAR-FILTER, whose work and span are O(n) and O(log n), respectively. This method is used in <ref type="bibr">[4,</ref><ref type="bibr">101,</ref><ref type="bibr">5]</ref>. Product Center Function. This method takes as input n_c, a user-defined parameter that specifies how many clusters to output. We compute the product &#948;_i &#215; &#961;_i for all points x_i. The n_c points with the largest products are the center points. This function can be implemented with a PAR-SELECT to find the n_c-th largest product t, and then a PAR-FILTER to keep the points with product at least t. The work and span are O(n) and O(log n log log n), respectively. This method is used in <ref type="bibr">[53,</ref><ref type="bibr">77,</ref><ref type="bibr">44,</ref><ref type="bibr">61]</ref>. Noise Function. We implement a noise function F_noise, which returns the points x_i with density &#961;_i &lt; &#961;_min. These points are then ignored in the remainder of the algorithm. This can be implemented using a parallel filter with O(n) work and O(log n) span. This noise function is used by <ref type="bibr">[4,</ref><ref type="bibr">77,</ref><ref type="bibr">5]</ref>.</p></div>
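The product center function can be sketched with the same select-then-filter pattern (`np.partition` standing in for PAR-SELECT; the function name is ours):

```python
import numpy as np

def product_centers(rho, delta, n_c):
    """Product center function: the n_c points with the largest
    delta_i * rho_i products become cluster centers.

    A selection finds the n_c-th largest product t, then a filter keeps
    points with product >= t; ties at t are broken by keeping the top n_c.
    """
    prod = rho * delta
    t = np.partition(prod, len(prod) - n_c)[len(prod) - n_c]  # n_c-th largest
    centers = np.flatnonzero(prod >= t)                       # filter step
    return centers[np.argsort(-prod[centers])][:n_c]          # resolve ties
```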
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Analysis of PECANN</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Work and Span Analysis</head><p>The work and span of PECANN (Algorithm 3.1) depend on the specific index construction algorithm and the functions F_density, F_noise, and F_center. Here, we choose the functions that give the best performance in our experiments (the kth density, product center, and default noise functions).</p><p>We first analyze the work and span of computing dependent points as shown in Algorithm 3.2. The following theorem gives the overall work and span.</p><p>THEOREM 5.1. The work and span of PECANN using the kth density, product center, and default noise functions are O(W_c + n &#183; W_nn) and O(S_c + S_nn + log^2 n), respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Approximation Analysis</head><p>In this section, we give a brief analysis of the approximation guarantees of PECANN. Proofs and more detailed analyses can be found in the full version of our paper. Our analysis of the density approximation is based on the kth density function described above. Our analysis of the approximate dependent point computation is based on the threshold center function described above. Density Estimation. Assuming some guarantee on the approximate k-nearest neighbor search, we can show that the density peaks of the exact algorithm that do not conflict with other points will remain density peaks. A conflict occurs when the density ranges of two points overlap; the density range of a point bounds the approximate density value of the point.</p><p>Table <ref type="table">6</ref>.1: Our datasets, along with their sizes (n), their dimensionality (d), and the number of ground truth clusters.
Name | n | d | Description | # Clusters
gaussian | 10^5 to 10^8 | 128 | Standard benchmark | 10 to 10000
MNIST | 70,000 | 784 | Raw images | 10
ImageNet | 1,281,167 | 1024 | Image embeddings | 1000
birds | 84,635 | 1024 | Image embeddings | 525
reddit | 420,464 | 1024 | Text embeddings | 50
arxiv | 732,723 | 1024 | Text embeddings | 180</p><p>LEMMA 5.2. Consider the threshold center function, which obtains the center points by selecting the points whose distance to their dependent point is greater than &#948;_min. If the density interval of a point does not conflict with any other interval and it is a true density peak, then it is still a density peak in PECANN given the same threshold &#948;_min.</p><p>Note that there may be additional density peaks returned by the approximate algorithm, but the true density peaks in the exact algorithm are guaranteed to still be density peaks. Dependent Point Estimation. Now we analyze the approximate dependent point found by Algorithm 3.2. 
The following lemma guarantees that the approximate dependent points returned by our algorithm are not too much further than the true dependent points. Let d_j be the distance to the true j-th nearest neighbor from query point q. As far as we know, other approximate DPC methods <ref type="bibr">[3,</ref><ref type="bibr">4,</ref><ref type="bibr">38]</ref> do not provide approximation bounds on approximate dependent point search. LEMMA 5.3. Suppose we find the approximate dependent point among the &#946;k approximate nearest neighbors, for &#946; &#8805; 1. The approximate dependent point is at most c^2 &#183; d_{&#946;k} / d_k times further from the exact dependent point given the same densities, for some constant c &#8805; 1.</p><p>In Algorithm 3.2, we use &#946; = 2 for Lemma 5.3, since we double the number of nearest neighbors to find until we have found a dependent point.</p></div>
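The doubling search behind this analysis can be sketched as follows; `knn_of(i, k)` is a hypothetical stand-in for an (approximate) k-nearest-neighbor query against the graph index, and the sequential scan stands in for the parallel predicate search:

```python
import numpy as np

def dependent_point(i, rho, knn_of, X, k0=16):
    """Doubling search (beta = 2) for the dependent point of x_i: the
    closest point with strictly higher density.

    `knn_of(i, k)` returns indices of (approximate) k nearest neighbors
    of x_i, closest first. We double the request size until a denser
    neighbor appears, falling back to a full scan only in the worst case.
    """
    n = len(X)
    k = k0
    while True:
        for j in knn_of(i, min(k, n)):
            if rho[j] > rho[i]:              # first denser neighbor found
                return j, float(np.linalg.norm(X[i] - X[j]))
        if k >= n:                           # scanned everything: a peak
            return -1, float("inf")
        k *= 2                               # beta = 2: double and retry
```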
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Experimental Setup</head><p>Computational Environment. We use c2-standard-60 instances on the Google Cloud Platform. These are 30-core machines with two-way hyper-threading and Intel 3.1 GHz Cascade Lake processors that can reach a max turbo clock speed of 3.8 GHz. The instances have two non-uniform memory access (NUMA) nodes, each with 15 cores. Except for the experiments studying scalability with respect to the number of threads, we use all 60 hyper-threads for our experiments. Datasets. We use a variety of real-world and artificial datasets, summarized in Table <ref type="table">6</ref>.1. &#8226; gaussian: for a synthetic dataset of dimension d = 128 with size n and c clusters, we first sample c centers x_i uniformly from [0, 1]^d, and then sample n/c points from a Gaussian centered at each x_i with variance 0.05. &#8226; MNIST <ref type="bibr">[23]</ref> is a standard dataset that consists of 28 &#215; 28 grayscale images of digits between 0 and 9. The i-th cluster corresponds to all occurrences of digit i. &#8226; ImageNet <ref type="bibr">[22]</ref> is a standard image classification benchmark with more than one million images, each of size 224 &#215; 224 &#215; 3. The images are from 1000 classes of everyday objects. Unlike for MNIST, we do not cluster the raw ImageNet images, but instead first pass each image through ConvNet <ref type="bibr">[62]</ref> to get an embedding. Each ground truth cluster contains the embeddings corresponding to a single image class from the original ImageNet dataset.</p><p>&#8226; birds <ref type="bibr">[36]</ref> is a dataset that contains images of 525 species of birds. The images have the same dimensions as ImageNet, and we pass them through the same ConvNet model to obtain an embedding dataset. The ground truth clusters are the 525 species of birds. This dataset is out of distribution for the original ConvNet model. 
&#8226; reddit and arxiv are text embedding datasets studied in the recent Massive Text Embedding Benchmark (MTEB) work <ref type="bibr">[69]</ref>. We restrict our attention to embeddings from the best model on the current MTEB leaderboard, GTE-large <ref type="bibr">[58]</ref>, and to the two largest datasets from MTEB: reddit, where the goal is to cluster embeddings corresponding to post titles into subreddits, and arxiv, where the goal is to cluster embeddings corresponding to paper titles into topic categories. Algorithms. We implement our algorithms using the ParlayLib <ref type="bibr">[9]</ref> and ParlayANN <ref type="bibr">[65]</ref> libraries. We use C++ for all implementations, and the gcc compiler with the -O3 flag to compile the code. We also provide Python bindings for PECANN. We evaluate the following algorithms.</p><p>&#8226; PECANN: Our framework described in Section 3 with the different density functions described in Section 4. Unless specified otherwise, we use the kth density function without normalization with k = 16, the VAMANA graph index with &#945; = 1.1, the product center function with n_c set to the number of ground truth clusters, and the default noise function. In Table <ref type="table">6</ref>.2, we give the rest of the default parameters that we used for each dataset. &#8226; FASTDP <ref type="bibr">[84]</ref>: A single-threaded approximate DPC algorithm that also uses graph-based ANNS to estimate densities. &#8226; k-MEANS: The FAISS <ref type="bibr">[54]</ref> implementation of k-means, which is extremely efficient and is parallelized using parallel k-nearest neighbor search. The k-means algorithm takes in k, the number of clusters; niter, the number of iterations; and nredo, the number of times to retry and choose the best clustering. Unless specified otherwise, the number of clusters used in k-means is the number of clusters in the ground truth clustering. 
&#8226; BRUTEFORCE: An instantiation of PECANN, where we use a naive parallel brute force approach for every step. This method takes O(n^2) work. It also first searches within the k-nearest neighbors to find the dependent point. We refer to the result of BRUTEFORCE as the "exact DPC" result. &#8226; DBSCAN: A density-based clustering algorithm for low-dimensional data <ref type="bibr">[32,</ref><ref type="bibr">80]</ref>. We use the implementation in the Intel Extension for Scikit-learn <ref type="bibr">[72]</ref> for high-dimensional datasets, which is implemented in C++ and parallelized with parallel nearest neighbor search. We also tried Wang et al.'s <ref type="bibr">[93]</ref> parallel implementation, which is optimized for low-dimensional data, and found it slower than Scikit-learn on high-dimensional data. DBSCAN has two parameters, &#1013; and min_pts: &#1013; defines the maximum distance between two points for them to be considered neighbors, and min_pts specifies the minimum number of points required to form a dense region (core point), which triggers the formation of a cluster. We also tried a parallel exact DPC algorithm that uses a priority search kd-tree-based dependent point finding algorithm designed for low dimensions <ref type="bibr">[45]</ref>. We changed the first step of <ref type="bibr">[45]</ref> from a range search to a k-nearest neighbor search to match our framework. On MNIST, their algorithm takes 280s on our 30-core machine, which is 320 times slower than PECANN. This method is prohibitively slow because kd-trees suffer from the curse of dimensionality, where performance in high dimensions degrades to no better than a linear search <ref type="bibr">[96]</ref>. We thus do not compare further against this method. Evaluation. We evaluate clustering quality using the Adjusted Rand Index (ARI) <ref type="bibr">[46]</ref>, homogeneity, and completeness <ref type="bibr">[78]</ref>. 
Consider our clustering C and the ground-truth or exact clustering T. Intuitively, ARI evaluates how similar C and T are. Homogeneity measures whether each cluster in C contains members from the same class in T. Completeness measures whether all members of a given class in T are in the same cluster in C.</p><p>Let n_ij be the number of objects in ground truth cluster i and cluster j generated by the algorithm, n_i* = &#8721;_j n_ij, n_*j = &#8721;_i n_ij, and n = &#8721;_i n_i*. Writing C(x, 2) = x(x-1)/2, the ARI is computed as ARI = [&#8721;_ij C(n_ij, 2) - E] / [(1/2)(a + b) - E], where a = &#8721;_i C(n_i*, 2), b = &#8721;_j C(n_*j, 2), and E = ab / C(n, 2). The ARI score is 1 for a perfect match, and its expected value is 0 for random assignments.</p><p>The formulas for homogeneity and completeness of clusters are defined as follows: homogeneity = 1 - H(C|T)/H(C); completeness = 1 - H(T|C)/H(T), where H(&#183;|&#183;) denotes conditional entropy and H(&#183;) denotes entropy.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Scalability</head><p>Figure <ref type="figure">6</ref>.1 shows the parallel scalability of PECANN on our larger datasets. PECANN achieves an average of 14.36x self-relative speedup on one NUMA node with 30 hyper-threads and an average of 16.57x self-relative speedup on two NUMA nodes with 60 hyper-threads.</p><p>We also study the runtime of PECANN as we increase the size of the synthetic gaussian dataset and vary the number of clusters between 10 and 10,000. We use a linear fit on the logarithm of the runtime and log n to obtain the slopes, which reflect the exponent in the growth of runtime with respect to data size. We find that the slope ranges from 1.12-1.2 depending on the number of output clusters, so experimentally the runtime grows approximately as O(n^1.2) for this dataset. This shows that PECANN has good scalability with respect to n.</p></div>
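The ARI defined in the experimental setup can be computed directly from the contingency table n_ij; a numpy sketch (the helper name is ours):

```python
import numpy as np

def ari(labels_true, labels_pred):
    """Adjusted Rand Index from the contingency table n_ij:
    ARI = (sum_ij C(n_ij,2) - E) / (0.5*(a + b) - E), where
    a = sum_i C(n_i*,2), b = sum_j C(n_*j,2), E = a*b / C(n,2)."""
    t = np.unique(labels_true, return_inverse=True)[1]
    p = np.unique(labels_pred, return_inverse=True)[1]
    nij = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(nij, (t, p), 1)                 # build the contingency table
    comb2 = lambda x: x * (x - 1) / 2.0       # C(x, 2), elementwise
    a, b = comb2(nij.sum(1)).sum(), comb2(nij.sum(0)).sum()
    expected = a * b / comb2(len(t))          # chance-adjustment term E
    return (comb2(nij).sum() - expected) / (0.5 * (a + b) - expected)
```

ARI is invariant to relabeling, so a perfect clustering scores 1 even when the cluster ids are permuted.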
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Runtime Decomposition</head><p>We present the runtime decomposition of PECANN on each dataset, with all density methods and all values of k, in the full paper. The bottlenecks are the index construction time and the k-nearest neighbor search time when computing densities. When k is larger, the k-nearest neighbor search time for density computation is longer, as expected. Computing clusters with union-find is fast because this step has low work, as discussed in Section 4. The dependent point computation time is much shorter than the density computation because the dependent points of some points can be obtained from the k-nearest neighbors (Lines 7-8 in Algorithm 3.2), so we do not need to run nearest neighbor searches for these points. Additionally, even when the dependent point is not among the k-nearest neighbors, our doubling technique finds a dependent point in the first few rounds for most points, thereby usually avoiding an expensive exhaustive search.</p></div>
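The cluster-computation step mentioned above can be sketched sequentially: each point follows its dependent-point edge until reaching a labeled center, with path compression (a stand-in for the parallel union-find; it assumes every chain of dependent edges ends at a center, and the function name is ours):

```python
def assign_clusters(dependent, centers):
    """Assign each point to a center by following dependent-point edges.

    `dependent[i]` is the dependent point of i (-1 for density peaks);
    `centers` lists the chosen cluster centers. Path compression makes
    each point's chain a single hop after the first traversal.
    """
    n = len(dependent)
    label = [-1] * n
    for c in centers:
        label[c] = c                    # centers label themselves
    def find(i):
        path = []
        while label[i] == -1:           # walk up to a labeled ancestor
            path.append(i)
            i = dependent[i]
        for j in path:                  # path compression
            label[j] = label[i]
        return label[i]
    for i in range(n):
        find(i)
    return label
```

Because each edge is traversed a constant number of times after compression, the total work is low, which is why this step is not a bottleneck.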
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Comparison of Different Density Functions, Values of k, and Graph Indices</head><p>In Figure <ref type="figure">6</ref>.2, we show the runtime vs. ARI of different density functions and values of k. We see that the kth density function is the most robust and achieves the highest ARI score on most datasets. We also observe that using k = 16 provides a good trade-off between quality and time. exp-sum, sum, and sum-exp are other density functions in PECANN, which are combinations of the distances to the k-nearest neighbors; we describe them in our full paper.</p><p>We can easily swap different graph indices into our framework and compare the results. In Figure <ref type="figure">6</ref>.3, we show a Pareto frontier of clustering quality vs. runtime on ImageNet for each of the following graph indices: VAMANA <ref type="bibr">[52]</ref>, PYNNDESCENT <ref type="bibr">[67]</ref>, and HCNNG <ref type="bibr">[70]</ref>. The Pareto frontier comprises points that are non-dominated, meaning that no point on the frontier can be improved in quality without worsening time, and vice versa. In other words, the curve we plot represents the optimal trade-off in the parameter space between clustering time and quality.</p><p>To create the Pareto frontier, we do a grid search for each method over different choices of the maximum degree R and the beam sizes for construction, k-nearest neighbor search, and dependent point finding. We choose all combinations of these four parameters from [8, 16, 32, 64, 128, 256]^4. We set the density method to be kth without normalization and k = 16. We set &#945; = 1.1 for VAMANA and PYNNDESCENT. 
HCNNG and PYNNDESCENT additionally accept a num_repeats argument, which specifies how many times we independently repeat the construction process before merging the results together; we set this parameter to 3. We see that all graph indices are able to achieve similar maximum ARI with respect to the ground truth: VAMANA, HCNNG, and PYNNDESCENT achieve maximum ARIs of 0.709, 0.715, and 0.713, respectively. HCNNG attains this maximum slightly faster than the other two indices, but when compared to the exact DPC result, HCNNG has a smaller maximum ARI, which means its clustering deviates more from the exact solution. Indeed, HCNNG has a maximum ARI compared to exact DPC of 0.918, while PYNNDESCENT and VAMANA attain a maximum ARI of 0.995 compared to exact DPC.</p><p>We also find that among the four Vamana hyperparameters, the maximum degree of the graph and the construction beam size have both the largest contribution to the ARI and the largest impact on the clustering time. More details can be found in the full version of our paper. To obtain the Pareto frontiers, we use the same parameter values as in the last experiment, except that for the smaller datasets with n &lt; 250,000 we use a smaller range [8, 16, 32, 64]^4 for the parameter search space. We see that PECANN can achieve results very close to the exact DPC clustering. On all datasets except arxiv, PECANN achieves at least 0.995 ARI with respect to exact DPC, and on arxiv, PECANN achieves 0.989 ARI with respect to exact DPC.</p></div>
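The Pareto frontiers used throughout these comparisons can be extracted from a set of (time, quality) runs with a standard sweep (a generic sketch, not the paper's evaluation harness):

```python
def pareto_frontier(points):
    """Non-dominated (time, quality) pairs, where lower time and higher
    quality are better.

    Sweep runs in order of increasing time; a run joins the frontier only
    if it strictly improves on the best quality seen so far, i.e., no
    other run is at least as fast with equal-or-better quality.
    """
    frontier, best_q = [], float("-inf")
    for t, q in sorted(points):
        if q > best_q:                 # strictly beats every faster run
            frontier.append((t, q))
            best_q = q
    return frontier
```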
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6">Comparison of Different Methods</head><p>In Figure <ref type="figure">6</ref>.5, we plot the Pareto frontier of clustering quality (ARI with respect to the ground truth clustering) vs. runtime for different methods on the larger datasets. To obtain the Pareto frontiers, we use the same parameters for VAMANA as in the previous experiment. For k-MEANS, we use nredo &#8712; [1, 2, 3, 4] and niter &#8712; [1, 2, 3, ..., 9, 10, 15, 20, 25, ..., 40, 45], for all combinations where niter &#215; nredo &lt; 100. For FASTDP, we use window_size &#8712; [20, 40, 80, 160, 320] for all datasets (controlling query quality) and max_iterations &#8712; [1, 2, 4, 8, 16, 32, 64] (controlling graph construction quality). For DBSCAN, we use different parameters for each dataset, based on guidelines from <ref type="bibr">[79,</ref><ref type="bibr">80,</ref><ref type="bibr">75]</ref>: <ref type="bibr">[79]</ref> suggests setting min_pts to 2d - 1; for high-dimensional datasets, <ref type="bibr">[80]</ref> suggests that increasing min_pts may improve results; and &#1013; is chosen based on the distribution of the min_pts-nearest neighbor distances <ref type="bibr">[75]</ref>. The parameters can be found in our full paper.</p><p>We observe that DBSCAN has lower quality and higher runtime than all other baselines. As the original authors of DBSCAN state, it is difficult to use DBSCAN for high-dimensional data <ref type="bibr">[80]</ref>.</p><p>We observe that the sequential FASTDP is slower than PECANN on all datasets. 
In terms of accuracy, PECANN has better maximum ARI on birds and arxiv, while FASTDP has better maximum ARI on reddit (although, as we discuss below, reddit is not well suited to DPC).</p><p>Compared with k-MEANS, PECANN obtains better quality and is faster on ImageNet and birds, where the number of ground truth clusters is large, and performs about as well as k-MEANS on arxiv. However, PECANN has worse quality for a given time limit on reddit and MNIST. Although PECANN has worse quality than k-MEANS on two datasets when k-MEANS uses the correct number of clusters, k-MEANS's quality is sensitive to the number of clusters. As shown in Subsection 6.7, k-MEANS can have lower quality than PECANN on these two datasets when k is not the number of ground truth clusters.</p><p>We summarize the best ground truth ARI, and the corresponding parallel running time, that these methods, as well as BRUTEFORCE, achieve in Table <ref type="table">6</ref>.3. Compared to density-based methods, PECANN achieves 37.7-854.3x speedup over BRUTEFORCE and 45-734x speedup over FASTDP, while achieving comparable ARI. PECANN also achieves up to 0.7 higher ARI than DBSCAN, and is up to orders-of-magnitude faster.</p><p>For more intuition on the runtime differences between PECANN and k-MEANS, note that the work of each iteration of k-MEANS is linear in the number of clusters multiplied by n, so k-MEANS is fast on datasets like MNIST with a small number of ground truth clusters, while it is slower on datasets like birds and ImageNet that have many clusters.</p><p>As for the quality difference between PECANN and k-MEANS, PECANN gets better maximum accuracy on ImageNet and birds, which may be because the ground truth clusters in these datasets form shapes that our density-based method can find but that the geometrically constrained k-means cannot. On the other hand, for reddit, PECANN has lower quality than k-MEANS. 
Since we still obtain cluster quality very close to the exact DPC on reddit (see Figure <ref type="figure">6</ref>.4), this dataset is a case where the density-based DPC method is worse than the simpler k-means heuristic.</p><p>Figure 6.5 (caption): Pareto frontier of ARI with respect to ground truth vs. runtime. Up and to the left is better. PECANN is the best method on ImageNet and birds, has similar performance to the best method (k-MEANS) on arxiv, and has worse quality than the best method (k-MEANS) on MNIST and reddit. FASTDP is sequential. The x-axes on arxiv, ImageNet, and MNIST are in log-scale.</p><p>Table 6.3: The maximum ARI score with respect to the ground truth achieved by different clustering algorithms across different datasets, and their corresponding parallel running time.
Algorithm | Dataset | Time (s) | Maximum ARI
PECANN | arxiv | 11.65 | 0.07
FASTDP | arxiv | 8557.89 | 0.06
BRUTEFORCE | arxiv | 9953.15 | 0.07
KMEANS | arxiv | 2.41 | 0.07
DBSCAN | arxiv | 451.99 | 0.03
PECANN | birds | 0.86 | 0.65
FASTDP | birds | 128.71 | 0.63
BRUTEFORCE | birds | 66.04 | 0.66
KMEANS | birds | 28.66 | 0.65
DBSCAN | birds | 6.79 | 0.30
PECANN | ImageNet | 101.58 | 0.71
FASTDP | ImageNet | 7655.91 | 0.71
BRUTEFORCE | ImageNet | 31979.98 | 0.71
KMEANS | ImageNet | 188.17 | 0.65
DBSCAN | ImageNet | 1481.39 | 0.42
PECANN | MNIST | 0.87 | 0.37
FASTDP | MNIST | 39.36 | 0.37
BRUTEFORCE | MNIST | 32.80 | 0.34
KMEANS | MNIST | 0.22 | 0.40
DBSCAN | MNIST | 3.59 | 0.18
PECANN | reddit | 14.90 | 0.12
FASTDP | reddit | 7621.71 | 0.14
BRUTEFORCE | reddit | 2888.20 | 0.10
KMEANS | reddit | 5.36 | 0.42
DBSCAN | reddit | 148.48 | 0.05</p><p>In Figure 6.6, we show a Pareto frontier of the completeness and homogeneity scores (with respect to ground truth) of PECANN and k-MEANS on different datasets with varying numbers of clusters. We generate this Pareto frontier using the same experiment setup as earlier, except that we now record homogeneity and completeness instead of ARI as we vary the number of clusters given to each method. 
Thus, points along the Pareto frontier in Figure <ref type="figure">6</ref>.6 are optimal trade-offs between homogeneity and completeness as we vary the cluster granularity. We see that PECANN strictly dominates k-MEANS on birds and MNIST, and k-MEANS is better on arxiv and reddit. On ImageNet, PECANN achieves higher completeness and k-MEANS achieves higher homogeneity.</p><p>In the full paper, we also study the ARI of PECANN using VAMANA and of k-MEANS when we pass a number of clusters to the algorithm different from the ground truth. When the number of clusters used is larger than the ground truth, the quality of k-MEANS decays quickly, while the quality of PECANN is more robust.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Related Work</head><p>Variants of DPC. The original DPC algorithm <ref type="bibr">[77]</ref> uses a range search to compute the density of a point x, where the density is defined as the number of points in a ball of fixed radius centered at x. In contrast, while PECANN supports any density metric, our paper focuses specifically on k-nearest-neighbor-based DPC variants, which do not require a range search. These methods are less sensitive to noise and outliers <ref type="bibr">[33]</ref> and are more efficient to compute in high dimensions. Some of these methods (e.g., <ref type="bibr">[33,</ref><ref type="bibr">101,</ref><ref type="bibr">88,</ref><ref type="bibr">98]</ref>) also have a refinement step after obtaining the initial DPC clustering. For these methods, PECANN can be used to efficiently obtain the initial DPC clustering before the refinement step.</p><p>Floros et al. <ref type="bibr">[33]</ref> and Chen et al. <ref type="bibr">[15]</ref> use the inverse of the distance to the kth nearest neighbor as the density measure. 
Sieranoja and Fr&#228;nti <ref type="bibr">[84]</ref> propose FASTDP, which uses the inverse of the average distance to all k-nearest neighbors as the density measure, and finds the k-nearest neighbors by constructing an approximate k-nearest neighbor graph. d'Errico et al. <ref type="bibr">[28]</ref> propose a variant of DPC for high-dimensional data that combines DPC with a non-parametric density estimator called PAk, but their algorithm is sequential. There are many other variants of DPC <ref type="bibr">[27,</ref><ref type="bibr">87,</ref><ref type="bibr">44,</ref><ref type="bibr">35,</ref><ref type="bibr">44,</ref><ref type="bibr">102,</ref><ref type="bibr">88,</ref><ref type="bibr">57,</ref><ref type="bibr">13,</ref><ref type="bibr">24,</ref><ref type="bibr">107,</ref><ref type="bibr">100,</ref><ref type="bibr">57,</ref><ref type="bibr">98,</ref><ref type="bibr">26,</ref><ref type="bibr">101]</ref>, as well as algorithms that perform dimensionality reduction on the dataset before running DPC <ref type="bibr">[26,</ref><ref type="bibr">13]</ref>. Parallel, Approximate, and Dynamic DPC. Zhang et al. <ref type="bibr">[104]</ref> propose an approximate DPC algorithm for MapReduce using locality-sensitive hashing. Amagata and Hara <ref type="bibr">[4]</ref> propose a partially parallel exact DPC algorithm and two parallel grid-based approximate DPC algorithms. They also propose parallel static and dynamic DPC algorithms for data in Euclidean space <ref type="bibr">[5,</ref><ref type="bibr">3]</ref>. Huang et al. <ref type="bibr">[45]</ref> propose a parallel exact DPC algorithm based on priority kd-trees and show that their algorithm outperforms previous tree-index approaches <ref type="bibr">[4,</ref><ref type="bibr">76]</ref>. Lu et al. <ref type="bibr">[63]</ref> propose speeding up DPC using space-filling curves. 
Unlike PECANN, these algorithms <ref type="bibr">[4,</ref><ref type="bibr">5,</ref><ref type="bibr">45,</ref><ref type="bibr">7]</ref> are only efficient on low-dimensional datasets and must be used with Euclidean distance.</p><p>Amagata <ref type="bibr">[3]</ref> proposes an approximate dynamic DPC algorithm for metric data, but it is sequential and only tested on datasets with up to 115 dimensions. In comparison, PECANN is parallel, and we experiment on datasets with up to 1024 dimensions. There are also dynamic algorithms for k-nearest-neighbor-based DPC variants <ref type="bibr">[81,</ref><ref type="bibr">25]</ref>. Density-based Clustering Algorithms. DPC falls under the broad category of density-based clustering algorithms, which have the advantage of being able to detect clusters of arbitrary shapes. Some density-based clustering algorithms define the density of a point based on the number of points in its vicinity <ref type="bibr">[32,</ref><ref type="bibr">2,</ref><ref type="bibr">6,</ref><ref type="bibr">50,</ref><ref type="bibr">77,</ref><ref type="bibr">33,</ref><ref type="bibr">16]</ref>. Others use a grid-based definition, which first quantizes the space into cells and then clusters the cells <ref type="bibr">[92,</ref><ref type="bibr">43,</ref><ref type="bibr">42,</ref><ref type="bibr">82]</ref>. Still others use a probabilistic density function <ref type="bibr">[92,</ref><ref type="bibr">55,</ref><ref type="bibr">86]</ref>. One popular density-based clustering algorithm is DBSCAN <ref type="bibr">[32]</ref>, which has many derivatives as well <ref type="bibr">[6,</ref><ref type="bibr">89,</ref><ref type="bibr">39,</ref><ref type="bibr">11,</ref><ref type="bibr">31,</ref><ref type="bibr">12,</ref><ref type="bibr">17]</ref>. However, the original authors of DBSCAN state that it is difficult to use for high-dimensional data <ref type="bibr">[80]</ref>. Graph-based Approximate Nearest Neighbor Search (ANNS). 
Graph-based ANNS methods have been shown to be effective in practice <ref type="bibr">[91,</ref><ref type="bibr">65,</ref><ref type="bibr">95]</ref>. Existing graph-based indices include the Hierarchical Navigable Small World Graph (HNSW) <ref type="bibr">[64]</ref>, DiskANN (also called Vamana) <ref type="bibr">[52]</ref>, HCNNG <ref type="bibr">[70]</ref>, PyNNDescent <ref type="bibr">[67]</ref>, &#964;-MNG <ref type="bibr">[73]</ref>, and many others (e.g., <ref type="bibr">[106,</ref><ref type="bibr">14,</ref><ref type="bibr">19]</ref>). Please see <ref type="bibr">[65]</ref> and <ref type="bibr">[91]</ref> for comprehensive overviews of these methods and their comparisons with non-graph-based methods, such as locality-sensitive hashing, inverted indices, and tree-based indices. There are also works that explore the theoretical aspects of graph-based ANNS <ref type="bibr">[71,</ref><ref type="bibr">74,</ref><ref type="bibr">83,</ref><ref type="bibr">56,</ref><ref type="bibr">47]</ref>.</p><p>The dependent point search in DPC can also be viewed as a filtered search, where the points' labels are their densities, and we filter for points with densities larger than the query point's density. Various graph-based similarity search algorithms have recently been adapted to support filtering <ref type="bibr">[105,</ref><ref type="bibr">90,</ref><ref type="bibr">37,</ref><ref type="bibr">41]</ref>. Gollapudi et al. <ref type="bibr">[37]</ref> propose the Filtered DiskANN algorithm, which supports filtered ANNS queries, where the nearest neighbors returned must match the query's labels. Gupta et al. <ref type="bibr">[41]</ref> developed the CAPS index for filtered ANNS via space partitions, which supports conjunctive constraints while DiskANN does not. Both DiskANN <ref type="bibr">[85]</ref> and CAPS can be made dynamic. However, these solutions use categorical labels, and a point can have multiple labels. 
Using this approach for dependent point finding would require quadratic memory just to specify the labels (the i-th least dense point would need i - 1 labels, one for each density value smaller than its own), which is prohibitive. Indeed, we tried running the Filtered DiskANN code on our datasets, but it ran out of space on our machine. VBASE <ref type="bibr">[103]</ref> also supports filtered search by first searching for k-nearest neighbors and then filtering. However, it does not handle the case where no returned neighbors satisfy the criteria.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Conclusion</head><p>We present the PECANN framework for density peaks clustering (DPC) variants in high dimensions. We adapt graph-based approximate nearest neighbor search methods to support (filtered) proximity searches in DPC variants. PECANN is highly parallel and scales to large datasets. We show several DPC variants that can be implemented in PECANN, and evaluate them on large datasets. PECANN achieves significant improvements in runtime and clustering quality over the state of the art.</p></div>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>The batch insertion method in<ref type="bibr">[65]</ref> sets a batch size upper bound of 0.02n, which does not affect the bounds, as there will only be a constant number (&lt; 50) more batches after the upper bound is reached.</p></note>
		</body>
		</text>
</TEI>
