Clustering plays a crucial role in computer science, facilitating data analysis and problem-solving across numerous fields. By partitioning large datasets into meaningful groups, clustering reveals hidden structures and relationships within the data, aiding tasks such as unsupervised learning, classification, anomaly detection, and recommendation systems. Particularly in relational databases, where data is distributed across multiple tables, efficient clustering is essential yet challenging due to the computational complexity of joining tables. This paper addresses this challenge by introducing efficient algorithms for k-median and k-means clustering on relational data without the need for pre-computing the join query results. For relational k-median clustering, we propose the first efficient relative approximation algorithm. For relational k-means clustering, our algorithm significantly improves both the approximation factor and the running time of the known relational k-means clustering algorithms, which suffer from either large constant approximation factors or expensive running times. Given a join query q and a database instance D of O(N) tuples, for both k-median and k-means clustering on the results of q on D, we propose randomized (1+ε)γ-approximation algorithms that run in roughly O(k^2 N^fhw) + T_γ(k^2) time, where ε ∈ (0,1) is a constant parameter decided by the user, fhw is the fractional hypertree width of q, and γ and T_γ(x) represent the approximation factor and the running time, respectively, of a traditional clustering algorithm in the standard computational setting over x points.
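For orientation, the two objectives named above can be written down on a fully materialized join. This is a minimal sketch of the naive baseline that the abstract says its algorithms avoid, not the paper's method; the two-table schema R(a, b), S(b, c), the helper names `natural_join` and `clustering_cost`, and the sample tuples are all assumptions made for illustration.

```python
import math


def natural_join(r, s):
    """Naively materialize R(a, b) JOIN S(b, c) on the shared attribute b."""
    return [(a, b, c) for (a, b) in r for (b2, c) in s if b == b2]


def clustering_cost(points, centers, power):
    """Sum over points of (distance to the nearest center) ** power.

    power=1 is the k-median objective, power=2 is the k-means objective.
    """
    return sum(min(math.dist(p, c) for c in centers) ** power for p in points)


# Hypothetical toy instance: two relations sharing attribute b.
R = [(0.0, 1.0), (4.0, 2.0)]
S = [(1.0, 0.0), (1.0, 3.0), (2.0, 0.0)]

results = natural_join(R, S)      # three joined tuples (a, b, c)
centers = [(0.0, 1.0, 0.0)]       # a single candidate center (k = 1)

kmedian = clustering_cost(results, centers, power=1)
kmeans = clustering_cost(results, centers, power=2)
```

The join output can be much larger than the input relations, which is why evaluating these sums on materialized results is the expensive step.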
                    This content will become publicly available on November 4, 2025
                            
                            k-Clustering with Comparison and Distance Oracles
                        
                    
    
In this paper, we address clustering problems in scenarios where accurate direct access to the full dataset is impractical or impossible. Instead, we leverage oracle-based methods, which are particularly valuable in real-world applications where the data may be noisy or where access is restricted by privacy concerns or sheer volume. We utilize two oracles, the quadruplet oracle and the distance oracle. The quadruplet oracle is a weaker oracle that only approximately compares the distances of two pairs of vertices. In practice, such oracles can be implemented using crowdsourcing, trained classifiers, or other predictive models. The distance oracle, on the other hand, returns the exact distance between two vertices, so it is a stronger and more expensive oracle to implement. We consider two noise models for the quadruplet oracle. In the adversarial noise model, if two pairs have similar distances, the response is chosen by an adversary. In the probabilistic noise model, the pair with the smaller distance is returned with a constant probability. We consider a set V of n vertices in a metric space that supports the quadruplet and the distance oracle. For each of the k-center, k-median, and k-means clustering problems on V, we design constant approximation algorithms that perform roughly O(nk) calls to the quadruplet oracle and O(k^2) calls to the distance oracle in both noise models. When the dataset has low intrinsic dimension, we significantly improve the approximation factors of our algorithms by performing a few additional calls to the distance oracle. We also show that for k-median and k-means clustering no algorithm can return a sublinear approximation using only the quadruplet oracle. Finally, we give constant approximation algorithms for estimating the clustering cost induced by any set of k vertices, performing roughly O(nk) calls to the quadruplet oracle and O(k^2) calls to the distance oracle.
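The two noise models described above can be made concrete with a toy simulation. This is a sketch under stated assumptions, not the paper's construction: the class name `QuadrupletOracle`, the method `first_is_closer`, the similarity threshold `mu`, and the success probability `p` are hypothetical, and the adversary is modeled in the simplest way (it always flips the answer when the two distances are within a factor of 1 + mu).

```python
import math
import random


class QuadrupletOracle:
    """Toy quadruplet oracle over points in the plane (hypothetical interface)."""

    def __init__(self, points, noise="exact", mu=0.5, p=0.9, seed=0):
        self.points = points
        self.noise = noise      # "exact", "adversarial", or "probabilistic"
        self.mu = mu            # assumed similarity threshold for the adversary
        self.p = p              # assumed probability of a correct answer
        self.rng = random.Random(seed)
        self.calls = 0          # number of oracle queries issued so far

    def _dist(self, u, v):
        return math.dist(self.points[u], self.points[v])

    def first_is_closer(self, a, b, c, d):
        """Report whether d(a, b) <= d(c, d), subject to the noise model."""
        self.calls += 1
        d1, d2 = self._dist(a, b), self._dist(c, d)
        truth = d1 <= d2
        if self.noise == "adversarial":
            lo, hi = min(d1, d2), max(d1, d2)
            if hi <= (1 + self.mu) * lo:
                return not truth    # similar distances: adversary flips the answer
            return truth
        if self.noise == "probabilistic":
            # Correct answer with probability p, wrong answer otherwise.
            return truth if self.rng.random() < self.p else not truth
        return truth


# Hypothetical vertices on a line, indexed 0..3.
pts = [(0.0, 0.0), (1.0, 0.0), (1.2, 0.0), (10.0, 0.0)]
oracle = QuadrupletOracle(pts, noise="adversarial", mu=0.5)
near_answer = oracle.first_is_closer(0, 1, 0, 2)   # distances 1.0 vs 1.2: similar
far_answer = oracle.first_is_closer(0, 1, 0, 3)    # distances 1.0 vs 10.0: well separated
```

Tracking `calls` mirrors how the abstract measures cost, in the number of oracle queries rather than running time.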
- PAR ID: 10616188
- Publisher / Repository: Association for Computing Machinery
- Date Published:
- Journal Name: Proceedings of the ACM on Management of Data
- Volume: 2
- Issue: 5
- ISSN: 2836-6573
- Page Range / eLocation ID: 1 to 26
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- 
Data summarization is a powerful approach to deal with large-scale data analytics, which has wide applications in web search, recommendation systems, approximate query processing, etc. It computes a small, compact summary that preserves vital properties of the original data. In this paper, we study the data summarization problem of conjunctive query results, i.e., computing a k-size subset of a conjunctive query output, for any given k>0, that optimizes a certain objective. More specifically, we are interested in two commonly studied objectives: cohesion, which measures the maximum distance between a query result tuple and its closest tuple in the summary (k-center clustering); and diversity, which measures the pairwise distances between the summary items. A simple approach that computes the entire query output and then applies existing algorithms on top of these materialized tuples suffers from high computational complexity because the query output can be large, e.g., for a relational database of N tuples, the number of result tuples can be N^{O(1)}. We propose O(1)-approximation algorithms that compute well-representative summaries of size k in time O(N · k^{O(1)}), or even O(N + k^{O(1)}) in some cases, without computing all result tuples. We also propose the first efficient (2+ε)-approximation algorithm for the k-center clustering problem over relational data. Our main idea is to formulate a few oracles that enable us to access specific query result tuples with certain properties, to show how these oracles can be implemented efficiently, and to compute desired summaries with few invocations of these oracles.
- 
Given a data set of size n in d'-dimensional Euclidean space, the k-means problem asks for a set of k points (called centers) such that the sum of the ℓ_2^2-distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers from high additive error. The additive error has also been seen to be an issue in implementations of differentially private k-means clustering algorithms in both the central and local settings. In this work, we introduce a new locally private k-means clustering algorithm that achieves near-optimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c>sqrt(2), our algorithm achieves O(k^(1 + O(1/(2c^2-1))) * sqrt(d' n) * log d' * poly log n) additive error with an O(c^2) multiplicative approximation factor.
- 
Bansal, Nikhil and (Ed.) This paper presents universal algorithms for clustering problems, including the widely studied k-median, k-means, and k-center objectives. The input is a metric space containing all potential client locations. The algorithm must select k cluster centers such that they are a good solution for any subset of clients that actually realize. Specifically, we aim for low regret, defined as the maximum over all subsets of the difference between the cost of the algorithm’s solution and that of an optimal solution. A universal algorithm’s solution sol for a clustering problem is said to be an (α, β)-approximation if for all subsets of clients C', it satisfies sol(C') ≤ α ⋅ opt(C') + β ⋅ mr, where opt(C') is the cost of the optimal solution for clients C' and mr is the minimum regret achievable by any solution. Our main results are universal algorithms for the standard clustering objectives of k-median, k-means, and k-center that achieve (O(1), O(1))-approximations. These results are obtained via a novel framework for universal algorithms using linear programming (LP) relaxations. These results generalize to other ℓ_p-objectives and to the setting where some subset of the clients is fixed. We also give hardness results showing that (α, β)-approximation is NP-hard if α or β is at most a certain constant, even for the widely studied special case of Euclidean metric spaces. This shows that in some sense, (O(1), O(1))-approximation is the strongest type of guarantee obtainable for universal clustering.
- 
We present the first near-linear-time algorithm that computes a (1+ε)-approximation of the diameter of a weighted unit-disk graph of n vertices. Our algorithm requires O(n log^2 n) time for any constant ε>0, so we considerably improve upon the near-O(n^{3/2})-time algorithm of Gao and Zhang (2005). Using similar ideas we develop (1+ε)-approximate distance oracles of O(1) query time with a likewise improvement in the preprocessing time, specifically from near O(n^{3/2}) to O(n log^3 n). We also obtain similar new results for a number of related problems in the weighted unit-disk graph metric such as the radius and the bichromatic closest pair. As a further application we employ our distance oracle, along with additional ideas, to solve the (1+ε)-approximate all-pairs bounded-leg shortest paths (apBLSP) problem for a set of n planar points. Our data structure requires O(n^2 log n) space, O(log log n) query time, and nearly O(n^{2.579}) preprocessing time for any constant ε>0, and is the first that breaks the near-cubic preprocessing time bound given by Roditty and Segal (2011).
 An official website of the United States government
