skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Approximation Algorithms for Stochastic Clustering
We consider stochastic settings for clustering, and develop provably-good approximation algorithms for a number of these notions. These algorithms yield better approximation ratios compared to the usual deterministic clustering setting. Additionally, they offer a number of advantages including providing fairer clustering and clustering which has better long-term behavior for each user. In particular, they ensure that every user is guaranteed to get good service (on average). We also complement some of these with impossibility results.  more » « less
Award ID(s):
1749864
PAR ID:
10111466
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Conference on Neural Information Processing Systems (NeurIPS)
Page Range / eLocation ID:
6041 - 6050
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Bansal, Nikhil and (Ed.)
    his paper presents universal algorithms for clustering problems, including the widely studied k-median, k-means, and k-center objectives. The input is a metric space containing all potential client locations. The algorithm must select k cluster centers such that they are a good solution for any subset of clients that actually realize. Specifically, we aim for low regret, defined as the maximum over all subsets of the difference between the cost of the algorithm’s solution and that of an optimal solution. A universal algorithm’s solution sol for a clustering problem is said to be an (α, β)-approximation if for all subsets of clients C', it satisfies sol(C') ≤ α ⋅ opt(C') + β ⋅ mr, where opt(C') is the cost of the optimal solution for clients C' and mr is the minimum regret achievable by any solution. Our main results are universal algorithms for the standard clustering objectives of k-median, k-means, and k-center that achieve (O(1), O(1))-approximations. These results are obtained via a novel framework for universal algorithms using linear programming (LP) relaxations. These results generalize to other 𝓁_p-objectives and the setting where some subset of the clients are fixed. We also give hardness results showing that (α, β)-approximation is NP-hard if α or β is at most a certain constant, even for the widely studied special case of Euclidean metric spaces. This shows that in some sense, (O(1), O(1))-approximation is the strongest type of guarantee obtainable for universal clustering. 
    more » « less
  2. The Nyström method is a matrix approximation technique that has shown great promise in speeding up spectral clustering. However, when the input matrix is sparse, we show that the traditional Nyström method requires a prohibitively large number of samples to obtain a good approximation. We propose a novel sampling approach to select the landmark points used to compute the Nyström approximation. We show that the proposed sampling approach obeys the same error bound as in Bouneffouf and Birol (2015). To control sample complexity, we propose a selective densification step based on breadth-first traversal. We show that the proposed densification does not change the optimal clustering. Results on real world datasets show that by combining the proposed sampling and densification schemes, we can obtain better accuracy compared to other techniques used for the Nyström method while using significantly fewer samples. 
    more » « less
  3. In this paper, we study k-means++ and k-means++ parallel, the two most popular algorithms for the classic k-means clustering problem. We provide novel analyses and show improved approximation and bi-criteria approximation guarantees for k-means++ and k-means++ parallel. Our results give a better theoretical justification for why these algorithms perform extremely well in practice. We also propose a new variant of k-means++ parallel algorithm (Exponential Race k-means++) that has the same approximation guarantees as k-means++. 
    more » « less
  4. Clustering plays a crucial role in computer science, facilitating data analysis and problem-solving across numerous fields. By partitioning large datasets into meaningful groups, clustering reveals hidden structures and relationships within the data, aiding tasks such as unsupervised learning, classification, anomaly detection, and recommendation systems. Particularly in relational databases, where data is distributed across multiple tables, efficient clustering is essential yet challenging due to the computational complexity of joining tables. This paper addresses this challenge by introducing efficient algorithms for k-median and k-means clustering on relational data without the need for pre-computing the join query results. For the relational k-median clustering, we propose the first efficient relative approximation algorithm. For the relational k-means clustering, our algorithm significantly improves both the approximation factor and the running time of the known relational k-means clustering algorithms, which suffer either from large constant approximation factors, or expensive running time. Given a join query q and a database instance D of O(N) tuples, for both k-median and k-means clustering on the results of q on D, we propose randomized (1+ε)γ-approximation algorithms that run in roughly O(k2Nfhw)+T_γ(k2) time, where ε ∈ (0,1) is a constant parameter decided by the user, \fhw is the fractional hyper-tree width of Q, while γ and T_γ(x) represent the approximation factor and the running time, respectively, of a traditional clustering algorithm in the standard computational setting over x points. 
    more » « less
  5. This paper addresses the user-centric clustering and pilot assignment problems in cell-free networks, recognizing the need to solve both problems simultaneously. The motivation of this research stems from the absence of benchmarks, general formulations, and the reliance on subjectively designed objective functions and heuristic algorithms prevalent in existing literature. To tackle these challenges, we formulate stochastic non-linear binary integer programs for both the user-centric clustering and pilot assignment problems. We specifically design the pilot assignment formulation to incorporate user-centric clusters when evaluating the desirability of pilot assignments, resulting in improved efficiency. To solve the problems, the proposed methodology employs sample average approximation coupled with surrogate optimization for the user-centric clustering problem and the genetic algorithm for the pilot assignment problem. Numerical experiments demonstrate that the optimized solutions outperform baseline solutions, leading to significant gains in spectral efficiency. 
    more » « less