Title: Label consistency in overfitted generalized k-means
We provide theoretical guarantees for label consistency in generalized k-means problems, with an emphasis on the overfitted case where the number of clusters used by the algorithm exceeds that of the ground truth. We give conditions under which the estimated labels are close to a refinement of the true cluster labels, and we consider both exact and approximate recovery of the labels. Our results hold for any constant-factor approximation to the k-means problem. They are also model-free, relying only on bounds on the maximum or average distance of the data points to the true cluster centers; these centers are themselves loosely defined and can be taken to be any set of points for which the aforementioned distances can be controlled. We illustrate the usefulness of the results with applications to some manifold clustering problems.
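The central notion above, that an overfitted labeling should be a refinement of the true one, means every estimated cluster lies entirely inside a single true cluster. A minimal sketch of that check (the function name and example labelings are hypothetical, not from the paper):

```python
def is_refinement(est_labels, true_labels):
    """Check whether the estimated labeling refines the true one:
    every estimated cluster must be contained in one true cluster."""
    parent = {}
    for e, t in zip(est_labels, true_labels):
        if e in parent and parent[e] != t:
            return False  # estimated cluster e straddles two true clusters
        parent[e] = t
    return True

# Overfitted example: 3 estimated clusters splitting 2 true clusters.
true_lab = [0, 0, 0, 1, 1]
est_lab = [0, 0, 2, 1, 1]          # true cluster 0 split into {0, 2}
print(is_refinement(est_lab, true_lab))            # True
print(is_refinement([0, 1, 0, 1, 1], true_lab))    # False
```

With more estimated clusters than true ones, splitting a true cluster is harmless under this criterion; only merging points from different true clusters breaks it.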
Award ID(s):
1945667
NSF-PAR ID:
10336065
Author(s) / Creator(s):
;
Editor(s):
Ranzato, M.; Beygelzimer, A.; Dauphin, Y.; Liang, P.S.; Vaughan, J. W.
Date Published:
Journal Name:
Advances in neural information processing systems
Volume:
34
ISSN:
1049-5258
Page Range / eLocation ID:
7965-7977
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    This paper presents a novel accelerated exact k-means, called "Ball k-means," which uses a ball to describe each cluster and focuses on reducing point-centroid distance computations. It exactly finds the neighbor clusters of each cluster, so distances are computed only between a point and the centroids of its neighbor clusters rather than all centroids. Moreover, each cluster is divided into a "stable area" and an "active area," and the latter is further divided into exact "annular areas." The assignment of points in the stable area does not change, while points in each annular area are adjusted only among a few neighbor clusters. No upper or lower bounds are maintained in the whole process. In comparison with the current state-of-the-art accelerated exact bounded methods, the Yinyang and Exponion algorithms, as well as other top-of-the-line tree-based and bounded methods, Ball k-means attains higher performance and performs fewer distance calculations, especially for large-k problems. Its faster speed, lack of extra parameters, and simpler design make Ball k-means an all-around replacement for naive k-means.
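The pruning idea can be sketched as follows: if a cluster's points all lie within a ball of radius r_i around its centroid, only clusters whose centroid lies within 2·r_i of centroid i can possibly attract one of its points. This is a simplified sketch of the neighbor test, not the paper's exact criterion; the function name and example values are hypothetical:

```python
from math import dist

def ball_neighbors(centroids, radii):
    """For each cluster i (ball radius r_i = max point-centroid distance),
    list the clusters whose centroid is within 2 * r_i; all other clusters
    can be skipped when reassigning points of cluster i."""
    neighbors = []
    for i, (ci, ri) in enumerate(zip(centroids, radii)):
        neighbors.append([j for j, cj in enumerate(centroids)
                          if j != i and dist(ci, cj) < 2 * ri])
    return neighbors

centroids = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
radii = [0.6, 0.6, 0.5]
print(ball_neighbors(centroids, radii))  # [[1], [0], []]
```

The isolated third cluster has no neighbors, so its points need no distance computations at all during reassignment, which is where the large-k savings come from.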
  2. In this paper we introduce and formally study the problem of $k$-clustering with faulty centers. Specifically, we study the faulty versions of $k$-center, $k$-median, and $k$-means clustering, where centers have some probability of not existing, as opposed to prior work where clients had some probability of not existing. For all three problems we provide fixed-parameter tractable algorithms, in the parameters $k$, $d$, and $\varepsilon$, that $(1+\varepsilon)$-approximate the minimum expected cost solutions for points in $d$-dimensional Euclidean space. For Faulty $k$-center we additionally provide a 5-approximation for general metrics. Significantly, all of our algorithms have only a linear dependence on $n$.
  3. Bae, Sang Won ; Park, Heejin (Ed.)
    In this paper we introduce and formally study the problem of k-clustering with faulty centers. Specifically, we study the faulty versions of k-center, k-median, and k-means clustering, where centers have some probability of not existing, as opposed to prior work where clients had some probability of not existing. For all three problems we provide fixed-parameter tractable algorithms, in the parameters k, d, and ε, that (1+ε)-approximate the minimum expected cost solutions for points in d-dimensional Euclidean space. For Faulty k-center we additionally provide a 5-approximation for general metrics. Significantly, all of our algorithms have a small dependence on n. Specifically, our Faulty k-center algorithms have only linear dependence on n, while for our algorithms for Faulty k-median and Faulty k-means the dependence is still only n^(1 + o(1)).
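The expected-cost objective in these faulty-center abstracts can be illustrated for a single point: if each center independently fails with probability q, the point pays its p-th-power distance to the nearest surviving center. A minimal 1-D sketch, with the hypothetical convention (an assumption, not from the papers) that the all-centers-fail event contributes zero cost:

```python
def expected_point_cost(point, centers, q, p=2):
    """Expected cost of one point when each center independently fails
    with probability q. The point connects to its nearest surviving center;
    the i-th closest center serves it only if all closer centers failed
    (probability q**i) and it survived (probability 1 - q).
    ASSUMPTION: the all-fail event (probability q**k) contributes 0."""
    dists = sorted(abs(point - c) for c in centers)  # 1-D for simplicity
    cost, prob_all_closer_failed = 0.0, 1.0
    for d in dists:
        cost += prob_all_closer_failed * (1 - q) * d ** p
        prob_all_closer_failed *= q
    return cost

# Point at 0, centers at 1 and 3, each failing with probability 1/2, p = 1:
# cost = 0.5 * 1 + 0.25 * 3 = 1.25
print(expected_point_cost(0.0, [1.0, 3.0], q=0.5, p=1))
```

With q = 0 this reduces to the ordinary (non-faulty) clustering cost, which is a quick sanity check on the recursion over failure events.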
  4.
    Wind turbine wakes are responsible for power losses and added fatigue loads of wind turbines. The ability to accurately predict wind-turbine wakes for different atmospheric conditions and turbine settings with low computational requirements is crucial for optimizing wind-farm layout and for improving wind-turbine controls, aiming to increase annual energy production (AEP) and reduce the levelized cost of energy (LCOE) for wind power plants. In this work, wake measurements collected with a scanning Doppler wind LiDAR over broad ranges of atmospheric static stability and incoming wind speed are processed through K-means clustering. For computational feasibility, the cluster analysis is performed on a low-dimensional embedding of the collected data, obtained through proper orthogonal decomposition (POD). After data compression, we perform K-means on the POD modes to identify cluster centers and their corresponding members in the LiDAR data. The different cluster centers allow us to visualize wake variability over ranges of atmospheric, wind, and turbine parameters. The results show that accurate mapping of the wake variability can be achieved with K-means clustering, which represents an initial step toward developing data-driven wake models for accurate, low-computational-cost simulations of wind farms.
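The POD-then-cluster pipeline can be sketched generically: compute POD modes via an SVD of the mean-subtracted snapshot matrix, project each snapshot onto the leading modes, and run plain Lloyd's K-means in that low-dimensional space. This is a minimal sketch on synthetic stand-in data (the random snapshots, the choice of r = 3 modes, and the deterministic initialization are all assumptions for illustration, not details of the study):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for LiDAR wake snapshots: rows are snapshots,
# drawn from two well-separated regimes (base values 0 and 5).
snapshots = np.vstack([rng.normal(0.0, 0.1, (20, 50)) + base
                       for base in (0.0, 5.0)])

# POD via SVD of the mean-subtracted data: Vt rows are the POD modes;
# project each snapshot onto the leading r modes for compression.
r = 3
mean = snapshots.mean(axis=0)
U, S, Vt = np.linalg.svd(snapshots - mean, full_matrices=False)
coords = (snapshots - mean) @ Vt[:r].T  # low-dimensional embedding

# Plain Lloyd's K-means on the POD coordinates
# (deterministic init: one seed point from each end of the data).
k = 2
centers = coords[[0, -1]]
for _ in range(50):
    labels = np.argmin(((coords[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.stack([coords[labels == j].mean(axis=0) for j in range(k)])

print(labels)
```

Clustering in the r-dimensional POD space instead of the full 50-dimensional snapshot space is what makes the analysis computationally feasible for large measurement campaigns.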
  5. Bojańczyk, Mikołaj ; Merelli, Emanuela ; Woodruff, David P (Ed.)
    Given n points in 𝓁_p^d, we consider the problem of partitioning points into k clusters with associated centers. The cost of a clustering is the sum of p-th powers of distances of points to their cluster centers. For p ∈ [1,2], we design sketches of size poly(log(nd),k,1/ε) such that the cost of the optimal clustering can be estimated to within factor 1+ε, despite the fact that the compressed representation does not contain enough information to recover the cluster centers or the partition into clusters. This leads to a streaming algorithm for estimating the clustering cost with space poly(log(nd),k,1/ε). We also obtain a distributed memory algorithm, where the n points are arbitrarily partitioned amongst m machines, each of which sends information to a central party who then computes an approximation of the clustering cost. Prior to this work, no such streaming or distributed-memory algorithm was known with sublinear dependence on d for p ∈ [1,2). 
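The quantity being estimated by those sketches, the sum of p-th powers of point-to-nearest-center distances, can be written out directly. A minimal sketch of the exact objective (the streaming/sketching machinery itself is not reproduced here, and the example points are hypothetical):

```python
from math import dist

def clustering_cost(points, centers, p):
    """Sum over points of (distance to nearest center)**p:
    p = 1 gives the k-median objective, p = 2 gives k-means."""
    return sum(min(dist(x, c) for c in centers) ** p for x in points)

points = [(0, 0), (0, 1), (4, 0), (4, 1)]
centers = [(0, 0.5), (4, 0.5)]
print(clustering_cost(points, centers, p=2))  # 4 * 0.5**2 = 1.0
print(clustering_cost(points, centers, p=1))  # 4 * 0.5    = 2.0
```

The point of the paper's sketches is that this value can be (1+ε)-approximated from poly(log(nd), k, 1/ε) space, even though that much space cannot store the centers or the partition itself.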