<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Framework for Parallelizing Hierarchical Clustering Methods</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10130909</idno>
					<idno type="doi"></idno>
					<title level='j'>European Conference on Machine Learning</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Thomas Lavastida</author>
					<author>Silvio Lattanzi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Hierarchical clustering is a fundamental tool in data mining, machine learning and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on the single-linkage algorithm. So, as datasets increase in size every day, there is a pressing need to scale other popular methods. We introduce efficient distributed algorithms for bisecting k-means, k-median, and k-center as well as centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then we use this notion to design new scalable distributed methods with strong worst case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Thanks to its ability in explaining nested structures in real world data, hierarchical clustering is a fundamental tool in any machine learning or data mining library. In recent years the method has received a lot of attention <ref type="bibr">[23,</ref><ref type="bibr">24,</ref><ref type="bibr">15,</ref><ref type="bibr">35,</ref><ref type="bibr">7,</ref><ref type="bibr">10,</ref><ref type="bibr">34,</ref><ref type="bibr">37,</ref><ref type="bibr">5]</ref>. But despite these efforts, almost all proposed hierarchical clustering techniques are sequential methods that are difficult to apply on large data sets.</p><p>The input to the hierarchical clustering problem is a set of points and a function specifying either their pairwise similarity or their dissimilarity. The output of the problem is a rooted tree representing a hierarchical structure of the input data, also known as a dendrogram. The input points are the leaves of this tree and subtrees induced by non-leaf nodes represent clusters. These clusters should also become more refined when the root of the corresponding subtree is at a lower level in the tree. Hierarchical clustering is useful because the number of clusters does not need to be specified in advance and because the hierarchical structure yields a taxonomy that allows for interesting interpretations of the data set. For an overview of hierarchical clustering methods refer to <ref type="bibr">[31,</ref><ref type="bibr">27,</ref><ref type="bibr">18]</ref>. Several algorithms have emerged as popular approaches for hierarchical clustering. Different techniques are used depending on the context because each method has its own advantages and disadvantages. There are various classes of data sets where each method outperforms the others. For example, the centroidlinkage algorithm has been used for biological data such as genes <ref type="bibr">[12]</ref>, whereas, an alternative method, bisecting k-means, is popular for document comparison <ref type="bibr">[36]</ref>. The most commonly used methods can be categorized into two families: agglomerative algorithms and divisive algorithms.</p><p>Divisive algorithms are top-down. They partition the data starting from a single cluster and then refine the clusters iteratively layer by layer. The most commonly used techniques to refine clusters are k-means, k-median, or k-center clustering with k = 2. These divisive algorithms are known as bisecting k-means (respectfully, median, center) algorithms <ref type="bibr">[22]</ref>. Agglomerative algorithms are based on a bottom up approach (see <ref type="bibr">[17]</ref> for details). In an agglomerative algorithm, all points begin as their own cluster. Clusters are then merged through some merging strategy. The choice of merging strategy determines the algorithm. Common strategies include single-linkage, average-linkage and centroid-linkage.</p><p>Most of these algorithms are inherently sequential; they possess a large number of serial dependencies and do not lend themselves to efficient parallelization. 
For example, in centroid-linkage one cannot simultaneously perform many merges, because the selection of which clusters to merge may depend strongly on prior merges (and their resulting centroids).</p><p>Recently, there has been interest in making hierarchical clustering scalable <ref type="bibr">[24,</ref><ref type="bibr">23,</ref><ref type="bibr">15,</ref><ref type="bibr">35,</ref><ref type="bibr">5,</ref><ref type="bibr">33,</ref><ref type="bibr">30,</ref><ref type="bibr">38]</ref>. Nevertheless, most prior work has focused on scaling the single-linkage algorithm; efficient MapReduce and Spark algorithms are known for this problem <ref type="bibr">[24,</ref><ref type="bibr">23,</ref><ref type="bibr">5,</ref><ref type="bibr">38]</ref>. This includes the result of <ref type="bibr">[38]</ref>, which gives strong theoretical guarantees and practical performance for scaling single-linkage clustering. This is unsurprising, because single-linkage can be reduced to computing a minimum spanning tree <ref type="bibr">[14]</ref>, and there has been a line of work on efficiently computing minimum spanning trees in parallel and distributed settings <ref type="bibr">[2,</ref><ref type="bibr">26,</ref><ref type="bibr">28,</ref><ref type="bibr">1,</ref><ref type="bibr">32]</ref>. Unfortunately this approach does not extend to other hierarchical clustering techniques. In contrast, to the best of our knowledge, no efficient distributed algorithm is known for centroid-linkage or divisive clustering methods. Thus, scaling methods such as centroid-linkage and bisecting k-means are open problems and the main focus of this paper. Our Contribution: In this paper we introduce fast scalable hierarchical clustering algorithms. The main results of the paper are the following: A Theoretical Framework: This paper develops a theoretical framework for scaling hierarchical clustering methods. We introduce a notion of closeness between two hierarchical clustering algorithms. Intuitively, two algorithms are close if they make provably similar decisions. This ensures that our scalable algorithms produce solutions similar to those of their sequential counterparts. Using this framework, the paper formalizes the central question of scaling existing methods. Provably Scalable Algorithms: We introduce fast scalable algorithms for centroid-linkage and the bisecting k-means, k-median and k-center algorithms. These new algorithms are the main contribution of the paper. The algorithms are proved to be close to their sequential counterparts and efficient in parallel and distributed models. These are the first scalable algorithms for divisive k-clustering as well as centroid-linkage. Empirical Results: We empirically study the algorithms on three datasets to show that they are efficient. The empirical results demonstrate that the distributed algorithms are closer to their sequential counterparts than the theory suggests, producing clusterings remarkably similar to those of the sequential methods.</p><p>The algorithms can be used for most data sets. The scalable bisecting k-clustering algorithms apply to data belonging to any metric space. For centroid-linkage, we assume that the input data points belong to some Euclidean space, so that computing the centroid of a finite set of points is well defined. 
In this case, our techniques generalize to any distance function between points in Euclidean space for which a family of Locality Sensitive Hash (LSH) functions is known, such as distances induced by an ℓp-norm for p ∈ (0, 2] <ref type="bibr">[11]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries</head><p>In this section we formally define the hierarchical clustering problem, describe popular sequential approaches, and provide other necessary background information. Problem Input: The input is a set S of n data points. The distance between points specifies their dissimilarity. Let d(u, v) &#8805; 0 denote the distance between two points u, v &#8712; S. It is assumed that d is a metric.</p><p>Problem Output: The output is a rooted tree where all of the input points are at the leaves. Internal nodes represent clusters; the leaves of the subtree rooted at a node correspond to the points in that specific cluster. Computational Model: We analyze our algorithms in the massively parallel model of computation <ref type="bibr">[26,</ref><ref type="bibr">20]</ref> <ref type="foot">foot_0</ref> . Let N be the input size. In this model we have m = O(N 1-&#948; ) machines with local memory of size &#213;(N &#948; ), for constant &#948; &gt; 0. The &#213; suppresses logarithmic factors. Notice that the total amount of memory used is near linear. The computation occurs in rounds and during each round each machine runs a sequential polynomial time algorithm on the data assigned to it. No communication between machines is allowed during a round. Between rounds, machines are allowed to communicate as long as each machine receives no more communication than its memory and no computation occurs. Ideally, in this model one would like to design algorithms using a number of rounds that is no more than logarithmic in the input size. k-Clustering Methods: We recall the definitions of k-{center,median,means} clusterings. Let C = {c 1 , c 2 , . . . , c k } be k distinct points of S called centers. For x &#8712; S let d(x, C) = min c&#8712;C d(x, c) We say that these centers solve the k-{center,median,means} problem if they optimize the following objectives, respectively: k-center:</p><p>The choice of centers induces a clustering of S in the following natural way.</p><p>)}, that is we map each point to its closest center and take the clustering that results. In general it is NP-hard to find the optimal set of centers for each of these objectives, but efficient O(1)-approximations are known <ref type="bibr">[19,</ref><ref type="bibr">9,</ref><ref type="bibr">25]</ref>. Classic Divisive Methods: We can now describe the classical divisive kclustering algorithms. The pseudocode for this class of methods is given in Algorithm 1. As stated before, these methods begin at the root of the cluster tree corresponding to the entire set S and recursively partition the set until we reach the leaves of the tree. Note that at each node of the tree, we use an optimal 2-clustering of the current set of points to determine the two child subtrees of the current node. Classic Agglomerative Methods: Recall that agglomerative methods start with each data point as a singleton cluster and iteratively merge clusters to build the tree. The choice of clusters to merge is determined by considering some cost on pairs of clusters and then choosing the pair that minimizes the cost. For example, in average-linkage the cost is the average distance between the clusters, and for centroid-linkage the cost is the distance between the clusters' centroids. In this paper we focus on the centroid-linkage method, and have provided pseudocode for this method in Algorithm 2. Notation: We present some additional notation and a few technical assumptions. 
Let X be a finite set of points and x a point in X. We define the ball of radius R around x, denoted B(x, R), as the set of points at distance at most R from x. Let Δ(X) = max_{x,y∈X} d(x, y) be the maximum distance between points of X. When X is a subset of Euclidean space, let μ(X) = (1/|X|) Σ_{x∈X} x be the centroid of X. Finally, WLOG we assume that all points and pairwise distances are distinct <ref type="foot">5</ref> and that the ratio between the maximum and minimum distance between two points is polynomial in n.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">A Framework for Parallelizing Hierarchical Clustering Algorithms</head><p>We now introduce our theoretical framework that we use to design and analyze scalable hierachical clustering algorithms. Notice that both divisive and agglomerative methods use some cost function on pairs of clusters to guide the decisions of the algorithm. More precisely, in divisive algorithms the current set of points S is partitioned into S 1 , and S 2 according to some cost c(S 1 , S 2 ). Similarly, agglomerative algorithms merge clusters S 1 and S 2 by considering some cost c(S 1 , S 2 ). So in both settings the main step consists of determining the two sets S and S 2 using different cost functions. As an example, observe that c(S 1 , S 2 ) is the 2-clustering cost of the sets S 1 and S 2 in the divisive method above and that c(S 1 , S 2 ) = d(&#181;(S 1 ), &#181;(S 2 )) in centroid linkage.</p><p>The insistence on choosing S 1 , S 2 to minimize the cost S 1 , S 2 leads to the large number of serial dependencies that make parallelization of these methods difficult. Thus, the main idea behind this paper is to obtain more scalable algorithms by relaxing this decision making process to make near optimal decisions. This is formalized in the following definitions.</p><p>Definition 1 (&#945;-close sets). Let c be the cost function on pairs of sets and let S 1 , S 2 be the two sets that minimize c(S 1 , S 2 ). Then we say that two sets</p><p>Definition 2 (&#945;-close algorithm). We say that a hierarchical clustering algorithm is &#945;-close to the optimal algorithm for cost function c if at any step of the algorithm the sets selected by the algorithm are &#945;-close for cost c, for &#945; &#8805; 1. <ref type="foot">6</ref>A necessary condition for efficiently parallelizing an algorithm is that it must not have long chains of dependencies. Now we formalize the concept of chains of dependencies.</p><p>Definition 3 (Chain of dependency). We say that a hierarchical clustering algorithm has a chain of dependencies of length &#8467;, if every decision made by the algorithm depends on a chain of at most &#8467; previous decisions.</p><p>We now define the main problem tackled in this paper.</p><p>Problem 1 Is it possible to design hierarchical clustering algorithms that have chain of dependencies of length at most poly(log n) and that are &#945;-close, for small &#945;, for the k-means, the k-center, the k-median and centroid linkage cost functions?</p><p>It is not immediately obvious that allowing our algorithms to be &#945;-close will admit algorithms with small chains of dependencies. In sections 4.1 and 4.2 we answer this question affirmatively for divisive k-clustering methods and centroid linkage <ref type="foot">7</ref> . Then in section 4.3 we show how to efficiently implement these algorithms in the massively parallel model of computation. Finally, we give experimental results in section 5, demonstrating that our algorithms perform close to the sequential algorithms in practice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Fast Parallel Algorithms for Clustering</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Distributed Divisive k-Clustering</head><p>We now present an O(1)-close algorithm with dependency chains of length O(log(n)) under the assumption that the ratio of the maximum to the minimum distance between points is polynomial in n.</p><p>As discussed in Sections 2 and 3, the main drawback of Algorithm 1 is that its longest chains of dependencies an be linear in the size of the input 8 . We modify this algorithm to overcome this limitation while remaining O(1)-close with respect to the clustering cost objective. In order to accomplish this we maintain the following invariant. Each time we split S into S 1 and S 2 , each set either contains a constant factor fewer points than S or the maximum distance between any two points has been decreased by a constant factor compared to the maximum distance in S. This property will ensure that the algorithm has a chain of dependency of logarithmic depth. We present the pseudocode for the new algorithm in Algorithms 3 and 4. The goal of this subsection is to show the following theorem guaranteeing that Algorithm 3 is provably close to standard divisive k-clustering algorithms, while having a small chain of serial dependencies. The main difference between Algorithm 1 and Algorithm 3 is the reassignment step. This step's purpose is to ensure that the invariant holds at any point during the algorithm as shown in the following lemma. Intuitively, if both S 1 and S 2 are contained within a small ball around their cluster centers, then the invariant is maintained. However, if this is not the case, then there are many points "near the border" of the two clusters, so we move these around to maintain the invariant.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 1. After the execution of Reassign</head><p>Proof. Let S 1 , S 2 be the resulting clusters and v 1 , v 2 be their centers. Consider the sets B i = B(v i , &#8710;(S)/8) &#8745; S, for i = 1, 2. If S 1 &#8838; B 1 and S 2 &#8838; B 2 , then both 8 Consider an example where the optimal 2-clustering separates only 1 point at a time.</p><p>1 Reassign(S1, S2, &#8710;) 2 Let v1, v2 be the centers of S1, S2, respectively Lemma 2. Let p be the norm of the clustering objective desired (i.e. p = 2 for k-means, p = &#8734; for k-center or p = 2 for k-median). The clustering produced in each iteration is a constant approximation to any desired k-clustering objective with any constant norm p &#8712; (0, 2] or p = &#8734;.</p><p>The proof of Lemma 2 is deferred to the full version of this paper. Combining Lemma 1 and Lemma 2 we obtain Theorem 1 as a corollary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Distributed Centroid-Linkage</head><p>As discussed in Sections 2 and 3, Algorithm 2 has a linear chain of dependencies. In this subsection we show how to modify step 4 of Algorithm 2 to overcome this difficulty.</p><p>The main intuition is to change Algorithm 2 to merge any pair of clusters A, B whose centroids are within distance &#945;&#948;, where &#948; is the current smallest distance between cluster centroids and &#945; &#8805; 1 is a small constant. Our algorithm will find a collection of disjoint pairs of clusters which meet this condition. The algorithm then merges all such pairs and updates the minimum distance before repeating this procedure. This procedure is described in Algorithm 5.</p><p>1 CloseCentroidClustering(S, &#945;) 2 Let T be an empty tree 3 For each x &#8712; S add a leaf node corresponding to {x} to T 4 Let C be the current set of clusters while |C| &gt; 1 do There are two issues that arise when bounding the algorithm's worst-case guarantees. First, it is not clear how to efficiently implement lines 5-10 in the distributed setting. We will address this issue in Section 4.3, where we describe the distributed implementations. Intuitively, we apply the popular locality-sensitive-hashing (LSH) technique, allowing us to perform these steps efficiently in the distributed setting.</p><p>The second issues is that it is difficult to bound the chain of dependencies for this algorithm since we cannot guarantee that the minimum distance &#948; increases by a constant factor after each iteration of the while loop. <ref type="foot">10</ref> Nevertheless, we find that this formulation of the algorithm works well empirically despite not having a formal bound on the round complexity. See Section 5 for these results.</p><p>To understand why this algorithm might have small round complexity in practice, we developed an algorithm with strong guarantees on its running time. See the full version of this paper for a proper description of this algorithm. For intuition, the algorithm maintains the following two invariants. First, if two clusters X, Y are merged during the algorithm, then the distance between their centroids is O(log 2 (n)&#948;), where &#948; is the current minimum distance between any two clusters. Second, at the end of the merging step the minimum distance between the centroids of the resulting clusters is at least (1 + &#491;)&#948;, for some constant &#491; &gt; 0.<ref type="foot">foot_6</ref> These two invariants taken together imply an O(log 2 (n))-close algorithm for centroid linkage with O(poly(log n)) length dependency chains when the ratio of the maximum to the minimum distance in S is bounded by a polynomial in n.</p><p>To achieve these invariants our new algorithm carefully merges nodes in two stages. A first where the algorithm recursively merges subsets of points that are close at the beginning of the stage. Then a second where the algorithm merges the leftover points from different merges of the first stage. With this two stage approach, we can formally bound the dependency chains and the closeness of the resulting algorithm. The precise description and analysis of this algorithm is involved and is presented in the full version of this paper. The following theorem characterizes the main theoretical result of this section. Theorem 2. 
There exists an algorithm that is O(log^2(n))-close to the sequential centroid-linkage algorithm and has chains of dependencies of length poly(log n).</p><p>The main trade-off involved in this result is that, in order to ensure a fast running time, our algorithm must be willing to make some merges that are an O(log^2(n))-factor worse than the best possible merge available at that time. This is due to the worst-case analysis of our algorithm. In practice, we find that the closeness of a variation of Algorithm 5 is much smaller than Theorem 2 would suggest while maintaining a fast running time. See Section 5.</p></div>
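<div xmlns="http://www.tei-c.org/ns/1.0"><p>For concreteness, the following is a minimal sequential Python sketch of the merging rule of Algorithm 5; the greedy selection of disjoint pairs and the data layout are illustrative assumptions (Section 4.3 discusses how these steps are implemented in the distributed setting via LSH).</p><p>
import numpy as np

def close_centroid_clustering(points, alpha):
    # Sketch of Algorithm 5: rather than merging only the single closest
    # pair, merge a maximal collection of disjoint cluster pairs whose
    # centroid distance is within an alpha factor of the current minimum.
    clusters = [((i,), np.asarray(p, float), 1) for i, p in enumerate(points)]
    while len(clusters) > 1:
        dists = {}
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists[(i, j)] = np.linalg.norm(clusters[i][1] - clusters[j][1])
        delta = min(dists.values())      # current minimum centroid distance
        used, merges = set(), []
        for (i, j), d in sorted(dists.items(), key=lambda kv: kv[1]):
            if alpha * delta >= d and i not in used and j not in used:
                merges.append((i, j))
                used.update((i, j))
        nxt = [clusters[k] for k in range(len(clusters)) if k not in used]
        for i, j in merges:              # weighted centroid of each merge
            (ti, ci, ni), (tj, cj, nj) = clusters[i], clusters[j]
            nxt.append(((ti, tj), (ni * ci + nj * cj) / (ni + nj), ni + nj))
        clusters = nxt
    return clusters[0][0]                # nested tuples encode the tree T
</p></div>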
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">From Bounded Length Dependency Chains to Parallel Algorithms</head><p>We now discuss how to adapt our algorithms to run on distributed systems. In particular we show that every iteration between consequent recursive calls of our algorithms can be implemented using a small number of rounds in the massively parallel model of computation and so we obtain that both algorithms can be simulated in a polylogarithmic number of rounds.</p><p>Parallelizing Divisive k-Clustering: We start by observing that there are previously known procedures <ref type="bibr">[13,</ref><ref type="bibr">8,</ref><ref type="bibr">4,</ref><ref type="bibr">6]</ref> to compute approximate k-clusterings in the massively parallel model of computation using only a constant number of rounds. Here we use these procedures as a black-box.</p><p>Next, the reassignment operation can be performed within a constant number of parallel rounds. Elements can be distributed across machines and the centers v 1 and v 2 can be sent to every machine. In a single round, every element computes the distance to v 1 and v 2 and in the next round the size of B 1 , B 2 and U are computed. Finally given the sizes of B 1 , B 2 and U the reassignment can be computed in a single parallel round.</p><p>So steps 4, 5 and 6 of Algorithm 3 can be implemented in parallel using a constant number of parallel rounds. Furthermore, we established that the algorithm has at most logarithmic chain of dependencies. Thus we obtain the following theorem: Parallelizing Centroid-Linkage: Parallelizing our variant of centroid-linkage is more complicated. As stated before, the main challenge is to find an efficient way to implement lines 5-10 of Algorithm 5. The solution to this problem is the use of Locality Sensitive Hashing (LSH). For simplicity of exposition we focus on the Euclidian distances and we use the sketch from <ref type="bibr">[11]</ref> for norm p &#8712; (0, 2], nevertheless we note that our techniques can be easily extended to any LSH-able distance function. We refer the interested reader to the full version of this paper for complete technical details. The following Theorem is restated from <ref type="bibr">[11]</ref>.</p><p>Theorem 4. Fix a domain S of points a parameter &#948; &gt; 0 and constant parameter &#491; &gt; 0. There exists a class of hash functions H = {h : S &#8594; U } and constants p 1 , p 2 &gt; 0 with p 1 &gt; p 2 such that for any two points u and</p><p>Intuitively the LSH procedure allows us to group together points that are near each other. Using the previous theorem and the classic amplification technique for LSH presented in <ref type="bibr">[21]</ref>, it is possible to show the following theorem.</p><p>Theorem 5. Fix a domain S of points, a parameter &#948; &gt; 0 and a small constant &#491; &gt; 0. Let S &#8242; be the set of points where there exists another point within distance &#948; and S &#8242;&#8242; be the set of points where there exists another point within distance (1 + &#491;)&#948;. With probability 1 -n -2 for each points v &#8712; S &#8242; there is a O(1) round distributed procedure that can identify another point u such that d(u, v) &#8804; (1 + &#491;)&#948;. Furthermore the same procedure identifies for some v &#8712; S &#8242;&#8242; another point u such that d(u, v) &#8804; (1 + &#491;)&#948;.</p><p>Using these tools, we now describe a parallelizable variant of Algorithm 5. See Algorithm 6 for the pseudocode. 
Intuitively, we apply LSH to the centroids of the current set of clusters in order to identify candidate merges that are α-close for centroid-linkage.</p><p>We note that LSH can also be applied to the theoretically efficient algorithm alluded to in Theorem 2. This yields a theoretically efficient distributed algorithm that is close to centroid-linkage. The following theorem characterizes this result; its proof is technical and deferred to the full version of this paper.</p><p>Theorem 6. There exists an algorithm that runs in O(log^2(n) log log(n)) distributed rounds and is O(log^2(n))-close to the sequential centroid-linkage algorithm.</p><p>(Algorithm 6, FastCentroid(S, α): as Algorithm 5, but in each iteration of the while loop the candidate pairs to merge are found by hashing the cluster centroids with LSH; a sketch of this bucketing is given below.)</p></div>
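<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following is a minimal Python sketch of the LSH bucketing used to find candidate merges, assuming the Gaussian-projection (p-stable) hash family of <ref type="bibr">[11]</ref> with the standard amplification of <ref type="bibr">[21]</ref>; the parameters k, L, eps, the bucket width and the verification step are illustrative choices rather than the paper's exact procedure.</p><p>
import numpy as np

def lsh_candidate_pairs(cents, delta, k=4, L=8, eps=0.5, seed=0):
    # Group centroids into hash buckets at scale ~delta: concatenate k
    # Gaussian-projection hashes per table and repeat over L tables, so
    # that points within delta collide with good probability in some table.
    rng = np.random.default_rng(seed)
    cents = np.asarray(cents, float)
    pairs = set()
    for _ in range(L):
        A = rng.normal(size=(k, cents.shape[1]))   # projection directions
        b = rng.uniform(0.0, delta, size=k)        # random offsets
        keys = np.floor((cents @ A.T + b) / delta).astype(int)
        buckets = {}
        for idx, key in enumerate(map(tuple, keys)):
            buckets.setdefault(key, []).append(idx)
        for bucket in buckets.values():            # verify true distances
            for i in bucket:
                for j in bucket:
                    if j > i and (1.0 + eps) * delta >= np.linalg.norm(cents[i] - cents[j]):
                        pairs.add((i, j))
    return pairs
</p></div>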
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experimental Results</head><p>In this section we empirically evaluate the algorithms in this paper. The algorithms will be referred to as Div-k-clust. (for the k-means algorithm) and CentroidLink (for the centroid algorithm). The sequential baseline algorithms are kBase and cBase. These are evaluated on three datasets from the UCI machine learning repository commonly used for clustering experimentation: Shuttle, Covertype, and Skin <ref type="bibr">[29]</ref>. Parameters. Both of our algorithms are parameterized with an adjustable parameter. This is c in the divisive algorithm and &#945; in the centroid algorithm. Both parameters were set to 4 in the experiments if not specified. Evaluation Criteria. The algorithms are evaluated on their efficiency as well as the quality of the solution compared to the sequential algorithms. The closeness is a measure of quality; the number of rounds measures the efficiency. We also examine the effect of varying the parameter on the efficiency of the algorithm. Quality Evaluation: Here we examine the closeness of the algorithm to their sequential counterparts. CentroidLink For the CentroidLink algorithm the parameter &#945; specifies the closeness. Recall that the sequential algorithm merges the pair of clusters whose centroids are closest to form a subtree; whereas, the distributed algorithm merges all pairs with distance at most an &#945; factor greater than the smallest distance. The experimenter can freely choose how close the parallel algorithm will adhere to the sequential one with a tradeoff in the number of rounds. We are interested in the closeness of the algorithm's decisions compared to that of the sequential algorithm. We will show this by presenting the ratio of the distance between the pair the algorithm merges compared to the actual distance of the closest pair of nodes.</p><p>Div-k-clust. Recall that Div-k-clust. differs from kBase by having an extra step in which some points are reassigned before the recursion. This step can potentially cause Div-k-clust. to deviate from kBase by placing points in different subtrees than kBase would. The closeness should be a measure of the cost of this difference. We measure closeness as the ratio of the k-means cost before and after the reassignment.</p><p>On average, the closeness ratio of the algorithms are small constants for each data set. Tables 1(b) and 1(a) have a more detailed breakdown of the results. There, we break down the data on closeness by noting the size of the subtree the moment the algorithm makes the decision which might differ from the sequential algorithm. As there are many different sizes for subtrees, we have grouped the subtrees which are close to each other in size and averaged them, for example, subtrees of size 0-1000 are averaged together in the first row of the table. The dashes, '-', in the table indicate that there were no resultant subtrees of the corresponding size range. Note that the ratios are small in general for both algorithms. Efficiency Evaluation: Figure <ref type="figure">2</ref> plots the number of rounds used by each algorithm on each dataset. Data points are subsampled and averaged over five trials. We compare our algorithms against the baseline sequential algorithms. However, in theory, the centroid baseline is very sequential; the ith merge must depend on all i -1 previous merges. Therefore, it has a round complexity of &#8486;(n). 
For a more reasonable comparison, we instead plot the function 2 ln(n), as we expect our algorithms to scale logarithmically; the sequential algorithm is much worse than this. Both Div-k-clust. and kBase perform poorly on the Skin dataset. One explanation is that this 2-class dataset mostly contains data points of one class. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0"><p>This model is widely used to capture the class of algorithms that scale in frameworks such as Spark and MapReduce.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1"><p>We can remove this assumption by adding a small perturbation to every point.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2"><p>Note that the guarantees is on each single choice made by the algorithm but not on all the choices together.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3"><p>In prior work, Yaroslavtsev and Vadapalli<ref type="bibr">[38]</ref> give an algorithm for single-linkage clustering with constant-dimensional Euclidean input that fits within our framework.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_4"><p>By the generalized triangle inequality this is true for p = 1, 2 and it is true for p = &#8734;. So this is true for the cost of k-center, k-means and k-median.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_5"><p>It is possible to construct worst-cases instances where the minimum distance &#948; can decrease between iterations of the while loop.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_6"><p>In order to guarantee this second invariant,our algorithm must be allowed to make merges at distance O(log 2 (n)&#948;).</p></note>
		</body>
		</text>
</TEI>
