skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Universal Algorithms for Clustering Problems
This article presentsuniversalalgorithms for clustering problems, including the widely studiedk-median,k-means, andk-center objectives. The input is a metric space containing allpotentialclient locations. The algorithm must selectkcluster centers such that they are a good solution foranysubset of clients that actually realize. Specifically, we aim for lowregret, defined as the maximum over all subsets of the difference between the cost of the algorithm’s solution and that of an optimal solution. A universal algorithm’s solutionSolfor a clustering problem is said to be an α , β-approximation if for all subsets of clientsC, it satisfiessol(C) ≤ α ċopt(C′) + β ċmr, whereopt(C′ is the cost of the optimal solution for clients (C′) andmris the minimum regret achievable by any solution. Our main results are universal algorithms for the standard clustering objectives ofk-median,k-means, andk-center that achieve (O(1),O(1))-approximations. These results are obtained via a novel framework for universal algorithms using linear programming (LP) relaxations. These results generalize to other ℓp-objectives and the setting where some subset of the clients arefixed. We also give hardness results showing that (α, β)-approximation is NP-hard if α or β is at most a certain constant, even for the widely studied special case of Euclidean metric spaces. This shows that in some sense, (O(1),O(1))-approximation is the strongest type of guarantee obtainable for universal clustering.  more » « less
Award ID(s):
1750140
PAR ID:
10520521
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
ACM Transactions on Algorithms
Volume:
19
Issue:
2
ISSN:
1549-6325
Page Range / eLocation ID:
1 to 46
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Bansal, Nikhil and (Ed.)
    his paper presents universal algorithms for clustering problems, including the widely studied k-median, k-means, and k-center objectives. The input is a metric space containing all potential client locations. The algorithm must select k cluster centers such that they are a good solution for any subset of clients that actually realize. Specifically, we aim for low regret, defined as the maximum over all subsets of the difference between the cost of the algorithm’s solution and that of an optimal solution. A universal algorithm’s solution sol for a clustering problem is said to be an (α, β)-approximation if for all subsets of clients C', it satisfies sol(C') ≤ α ⋅ opt(C') + β ⋅ mr, where opt(C') is the cost of the optimal solution for clients C' and mr is the minimum regret achievable by any solution. Our main results are universal algorithms for the standard clustering objectives of k-median, k-means, and k-center that achieve (O(1), O(1))-approximations. These results are obtained via a novel framework for universal algorithms using linear programming (LP) relaxations. These results generalize to other 𝓁_p-objectives and the setting where some subset of the clients are fixed. We also give hardness results showing that (α, β)-approximation is NP-hard if α or β is at most a certain constant, even for the widely studied special case of Euclidean metric spaces. This shows that in some sense, (O(1), O(1))-approximation is the strongest type of guarantee obtainable for universal clustering. 
    more » « less
  2. We consider the problem of clustering in the learning-augmented setting. We are given a data set in $$d$$-dimensional Euclidean space, and a label for each data point given by a predictor indicating what subsets of points should be clustered together. This setting captures situations where we have access to some auxiliary information about the data set relevant for our clustering objective, for instance the labels output by a neural network. Following prior work, we assume that there are at most an $$\alpha \in (0,c)$$ for some $c<1$ fraction of false positives and false negatives in each predicted cluster, in the absence of which the labels would attain the optimal clustering cost $$\mathrm{OPT}$$. For a dataset of size $$m$$, we propose a deterministic $$k$$-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized state-of-the-art algorithm while preserving the $$O( d m \log m)$$ runtime. Furthermore, our algorithm works even when the predictions are not very accurate, i.e., our cost bound holds for $$\alpha$$ up to $1/2$, an improvement from $$\alpha$$ being at most $1/7$ in previous work. For the $$k$$-medians problem we again improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter $$\alpha$$ to get a cost of $$(1+O(\alpha))\mathrm{OPT}$$, while requiring essentially just $$O(md \log^3 m/\alpha)$$ runtime. 
    more » « less
  3. In this paper we introduce and formally study the problem of $$k$$-clustering with faulty centers. Specifically, we study the faulty versions of $$k$$-center, $$k$$-median, and $$k$$-means clustering, where centers have some probability of not existing, as opposed to prior work where clients had some probability of not existing. For all three problems we provide fixed parameter tractable algorithms, in the parameters $$k$$, $$d$$, and $$\eps$$, that $$(1+\eps)$$-approximate the minimum expected cost solutions for points in $$d$$ dimensional Euclidean space. For Faulty $$k$$-center we additionally provide a 5-approximation for general metrics. Significantly, all of our algorithms have only a linear dependence on $$n$$. 
    more » « less
  4. Abstract We present results on the nature of extreme ejective feedback episodes and the physical conditions of a population of massive (M*∼ 1011M), compact starburst galaxies atz= 0.4–0.7. We use data from Keck/NIRSPEC, SDSS, Gemini/GMOS, MMT, and Magellan/MagE to measure rest-frame optical and near-IR spectra of 14 starburst galaxies with extremely high star formation rate surface densities (mean ΣSFR∼ 2000Myr−1kpc−2) and powerful galactic outflows (maximum speedsv98∼ 1000–3000 km s−1). Our unique data set includes an ensemble of both emission ([Oii]λλ3726,3729, Hβ, [Oiii]λλ4959,5007, Hα, [Nii]λλ6549,6585, and [Sii]λλ6716,6731) and absorption (Mgiiλλ2796,2803, and Feiiλ2586) lines that allow us to investigate the kinematics of the cool gas phase (T∼ 104K) in the outflows. Employing a suite of line ratio diagnostic diagrams, we find that the central starbursts are characterized by high electron densities (medianne∼ 530 cm−3), and high metallicity (solar or supersolar). We show that the outflows are most likely driven by stellar feedback emerging from the extreme central starburst, rather than by an AGN. We also present multiple intriguing observational signatures suggesting that these galaxies may have substantial Lyman continuum (LyC) photon leakage, including weak [Sii]nebular emission lines. Our results imply that these galaxies may be captured in a short-lived phase of extreme star formation and feedback where much of their gas is violently blown out by powerful outflows that open up channels for LyC photons to escape. 
    more » « less
  5. We present JWST/NIRSpec integral field data of the quasar PJ308-21 atz = 6.2342. As shown by previous ALMA and HST imaging, the quasar has two companion sources, interacting with the quasar host galaxy. The high-resolution G395H/290LP NIRSpec spectrum covers the 2.87 − 5.27 μm wavelength range and shows the rest-frame optical emission of the quasar with exquisite quality (signal-to-noise ratio ∼100 − 400 per spectral element). Based on the Hβline from the broad line region, we obtain an estimate of the black hole massMBH, Hβ ∼ 2.7 × 109 M. This value is within a factor ≲1.5 of the Hα-based black hole mass from the same spectrum (MBH, Hα ∼ 1.93 × 109 M) and is consistent with a previous estimate relying on the Mg IIλ2799 line (MBH, MgII ∼ 2.65 × 109 M). All theseMBHestimates are within the ∼0.5 dex intrinsic scatter of the adopted mass calibrations. The high Eddington ratio of PJ308-21λEdd, Hβ ∼ 0.67 (λEdd, Hα ∼ 0.96) is in line with the overall quasar population atz ≳ 6. The relative strengths of the [O III], Fe II, and Hβlines are consistent with the empirical “Eigenvector 1” correlations as observed for low redshift quasars. We find evidence for blueshifted [O III]λ5007 emission with a velocity offset Δv[O III] = −1922 ± 39 km s−1from the systemic velocity and a full width at half maximum (FWHM)FWHM([O III]) = 2776−74+75km s−1. This may be the signature of outflowing gas from the nuclear region, despite the true values of Δv[O III]andFWHM([O III]) likely being more uncertain due to the blending with Hβand Fe IIlines. Our study demonstrates the unique capabilities of NIRSpec in capturing quasar spectra at cosmic dawn and studying their properties in unprecedented detail. 
    more » « less