Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
Given a graph with positive and negative edge labels, the correlation clustering problem aims to cluster the nodes so to minimize the total number of between-cluster positive and within-cluster negative edges. This problem has many applications in data mining, particularly in unsupervised learning. Inspired by the prevalence of large graphs and constantly changing data in modern applications, we study correlation clustering in dynamic, parallel (MPC), and local computation (LCA) settings. We design an approach that improves state-of-the-art runtime complexities in all these settings. In particular, we provide the first fully dynamic algorithm that runs in an expected amortized constant time, without any dependence on the graph size. Moreover, our algorithm essentially matches the approximation guarantee of the celebrated PIVOT algorithm.more » « lessFree, publicly-accessible full text available June 30, 2025
-
Motivated by applications to classification problems on metric data, we study Weighted Metric Clustering problem: given a metric d over n points and a k x k symmetric matrix A with non-negative entries, the goal is to find a k-partition of these points into clusters C1,...,Ck, while minimizing the sum of A[i,j] * d(u,v) over all pairs of clusters Ci and Cj and all pairs of points u from Ci and v from Cj. Specific choices of A lead to Weighted Metric Clustering capturing well-studied graph partitioning problems in metric spaces, such as Min-Uncut, Min-k-Sum, Min-k-Cut, and more.Our main result is that Weighted Metric Clustering admits a polynomial-time approximation scheme (PTAS). Our algorithm handles all the above problems using the Sherali-Adams linear programming relaxation. This subsumes several prior works, unifies many of the techniques for various metric clustering objectives, and yields a PTAS for several new problems, including metric clustering on manifolds and a new family of hierarchical clustering objectives. Our experiments on the hierarchical clustering objective show that it better captures the ground-truth structural information compared to the popular Dasgupta's objective.
Free, publicly-accessible full text available March 25, 2025 -
We prove a new generalization of the higher-order Cheeger inequality for partitioning with buffers. Consider a graph G=(V,E). The buffered expansion of a set S⊆V with a buffer B⊆V∖S is the edge expansion of S after removing all the edges from set S to its buffer B. An ε-buffered k-partitioning is a partitioning of a graph into disjoint components Pi and buffers Bi, in which the size of buffer Bi for Pi is small relative to the size of Pi: |Bi|≤ε|Pi|. The buffered expansion of a buffered partition is the maximum of buffered expansions of the k sets Pi with buffers Bi. Let hk,εG be the buffered expansion of the optimal ε-buffered k-partitioning, then for every δ>0, hk,εG≤Oδ(1)⋅(logkε)⋅λ⌊(1+δ)k⌋, where λ⌊(1+δ)k⌋ is the ⌊(1+δ)k⌋-th smallest eigenvalue of the normalized Laplacian of G. Our inequality is constructive and avoids the ``square-root loss'' that is present in the standard Cheeger inequalities (even for k=2). We also provide a complementary lower bound, and a novel generalization to the setting with arbitrary vertex weights and edge costs. Moreover our result implies and generalizes the standard higher-order Cheeger inequalities and another recent Cheeger-type inequality by Kwok, Lau, and Lee (2017) involving robust vertex expansion.more » « lessFree, publicly-accessible full text available January 1, 2025
-
We show that a simple single-pass semi-streaming variant of the Pivot algorithm for Correlation Clustering gives a (3 + epsilon)-approximation using O(n/epsilon) words of memory. This is a slight improvement over the recent results of Cambus, Kuhn, Lindy, Pai, and Uitto, who gave a (3 + epsilon)-approximation using O(n log n) words of memory, and Behnezhad, Charikar, Ma, and Tan, who gave a 5-approximation using O(n) words of memory. One of the main contributions of this paper is that both the algorithm and its analysis are very simple, and also the algorithm is easy to implement.more » « lessFree, publicly-accessible full text available December 10, 2024
-
We show that the RandomCoordinateCut algorithm gives the optimal competitive ratio for explainable k-medians in l1. The problem of explainable k-medians was introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian in 2020. Several groups of authors independently proposed a simple polynomial-time randomized algorithm for the problem and showed that this algorithm is O(log k loglog k) competitive. We provide a tight analysis of the algorithm and prove that its competitive ratio is upper bounded by 2ln k +2. This bound matches the Omega(log k) lower bound by Dasgupta et al (2020).more » « lessFree, publicly-accessible full text available December 10, 2024
-
We study the natural problem of Triplet Reconstruction (also known as Rooted Triplets Consistency or Triplet Clustering), originally motivated by applications in computational biology and relational databases (Aho, Sagiv, Szymanski, and Ullman, 1981) [2]: given n datapoints, we want to embed them onto the n leaves of a rooted binary tree (also known as a hierarchical clustering, or ultrametric embedding) such that a given set of m triplet constraints is satisfied. A triplet constraint i j · k for points i, j, k indicates that 'i, j are more closely related to each other than to k,' (in terms of distances d(i, j) ≤ d(i, k) and d(i, j) ≤ d(j, k)) and we say that a tree satisfies the triplet i j · k if the distance in the tree between i, j is smaller than the distance between i, k (or j, k). Among all possible trees with n leaves, can we efficiently find one that satisfies a large fraction of the m given triplets? Aho et al. (1981) [2] studied the decision version and gave an elegant polynomial-time algorithm that determines whether or not there exists a tree that satisfies all of the m constraints. Moreover, it is straightforward to see that a random binary tree achieves a constant 13-approximation, since there are only 3 distinct triplets i j|k, i k| j, j k · i (each will be satisfied w.p. 13). Unfortunately, despite more than four decades of research by various communities, there is no better approximation algorithm for this basic Triplet Reconstruction problem.Our main theorem-which captures Triplet Reconstruction as a special case-is a general hardness of approximation result about Constraint Satisfaction Problems (CSPs) over infinite domains (CSPs where instead of boolean values {0,1} or a fixed-size domain, the variables can be mapped to any of the n leaves of a tree). Specifically, we prove that assuming the Unique Games Conjecture [57], Triplet Reconstruction and more generally, every Constraint Satisfaction Problem (CSP) over hierarchies is approximation resistant, i.e., there is no polynomial-time algorithm that does asymptotically better than a biased random assignment.Our result settles the approximability not only for Triplet Reconstruction, but for many interesting problems that have been studied by various scientific communities such as the popular Quartet Reconstruction and Subtree/Supertree Aggregation Problems. More broadly, our result significantly extends the list of approximation resistant predicates by pointing to a large new family of hard problems over hierarchies. Our main theorem is a generalization of Guruswami, Håstad, Manokaran, Raghavendra, and Charikar (2011) [36], who showed that ordering CSPs (CSPs over permutations of n elements, e.g., Max Acyclic Subgraph, Betweenness, Non-Betweenness) are approximation resistant. The main challenge in our analyses stems from the fact that trees have topology (in contrast to permutations and ordering CSPs) and it is the tree topology that determines whether a given constraint on the variables is satisfied or not. As a byproduct, we also present some of the first CSPs where their approximation resistance is proved against biased random assignments, instead of uniformly random assignments.more » « lessFree, publicly-accessible full text available November 6, 2024
-
Gørtz, Inge Li ; Farach-Colton, Martin ; Puglisi, Simon J ; Herman, Grzegorz (Ed.)We consider variants of the classic Multiway Cut problem. Multiway Cut asks to partition a graph G into k parts so as to separate k given terminals. Recently, Chandrasekaran and Wang (ESA 2021) introduced l_p-norm Multiway Cut, a generalization of the problem, in which the goal is to minimize the l_p norm of the edge boundaries of k parts. We provide an O(log^{1/2} n log^{1/2 + 1/p} k) approximation algorithm for this problem, improving upon the approximation guarantee of O(log^{3/2} n log^{1/2} k) due to Chandrasekaran and Wang. We also introduce and study Norm Multiway Cut, a further generalization of Multiway Cut. We assume that we are given access to an oracle, which answers certain queries about the norm. We present an O(log^{1/2} n log^{7/2} k) approximation algorithm with a weaker oracle and an O(log^{1/2} n log^{5/2} k) approximation algorithm with a stronger oracle. Additionally, we show that without any oracle access, there is no n^{1/4-ε} approximation algorithm for every ε > 0 assuming the Hypergraph Dense-vs-Random Conjecture.more » « less
-
We provide a new bi-criteria O(log2k) competitive algorithm for explainable k-means clustering. Explainable k-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy to interpret and understand (threshold) decision tree or diagram. The cost of the explainable k-means clustering equals to the sum of costs of its clusters; and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non bi-criteria algorithm for explainable clustering O(k) competitive, and this bound is tight. Our randomized bi-criteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ∈(0,1) is a parameter of the algorithm). The cost of this clustering is at most O(1/δ⋅log2k) times the cost of the optimal unconstrained k-means clustering. We show that this bound is almost optimal.more » « less
-
null (Ed.)In the Correlation Clustering problem, we are given a complete weighted graph G with its edges labeled as “similar" and “dissimilar" by a noisy binary classifier. For a clustering C of graph G, a similar edge is in disagreement with C, if its endpoints belong to distinct clusters; and a dissimilar edge is in disagreement with C if its endpoints belong to the same cluster. The disagreements vector, Agree, is a vector indexed by the vertices of G such that the v-th coordinate Disagre equals the weight of all disagreeing edges incident on v. The goal is to produce a clustering that minimizes the ℓp norm of the disagreements vector for p≥1. We study the ℓ_p objective in Correlation Clustering under the following assumption: Every similar edge has weight in [αw,w] and every dissimilar edge has weight at least αw (where α≤1 and w>0 is a scaling parameter). We give an O((1/α)^{1/2−1/2p}⋅log(1/α)) approximation algorithm for this problem. Furthermore, we show an almost matching convex programming integrality gap.more » « less
-
null (Ed.)We consider the problem of explainable k-medians and k-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). In this problem, our goal is to find a threshold decision tree that partitions data into k clusters and minimizes the k-medians or k-means objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits data based on a single feature into two groups. We propose a new algorithm for this problem which is O(log k) competitive with k-medians with ℓ1 norm and O(k) competitive with k-means. This is an improvement over the previous guarantees of O(k) and O(k^2) by Dasgupta et al (2020). We also provide a new algorithm which is O(log^{3}{2}k) competitive for k-medians with ℓ2 norm. Our first algorithm is near-optimal: Dasgupta et al (2020) showed a lower bound of Ω(log k) for k-medians; in this work, we prove a lower bound of Ω(k) for k-means. We also provide a lower bound of Ω(log k) for k-medians with ℓ2 norm.more » « less