We consider k-means clustering in an online setting where each new data point is assigned to its closest cluster center and incurs a loss equal to the squared distance to that center, after which the algorithm is allowed to update its centers. The goal over a data stream X is to achieve a total loss that is not too much larger than the best possible loss using k fixed centers in hindsight. Ours is the first algorithm to achieve polynomial space and time complexity in the online setting. We note that our results have implications to the related streaming setting, where one final clustering is outputted, and the no-substitution setting, where center selections are permanent. We show a general reduction between the no-substitution cost of a blackbox algorithm and its online cost. Finally, we translate our algorithm to the no-substitution setting and streaming settings, and it competes with and can outperform existing work in the areas.
more »
« less
Online k-means clustering on arbitrary data streams
We consider k-means clustering in an online setting where each new data point is assigned to its closest cluster center and incurs a loss equal to the squared distance to that center, after which the algorithm is allowed to update its centers. The goal over a data stream X is to achieve a total loss that is not too much larger than L(X, OPT), the best possible loss using k fixed centers in hindsight. We give the first algorithm to achieve polynomial space and time complexity in the online setting.
more »
« less
- Award ID(s):
- 2211386
- PAR ID:
- 10466919
- Publisher / Repository:
- Proceedings of Machine Learning Research
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.F.; Lin, H. (Ed.)In this paper, we provide an efficient approximation algorithm for finding the most likelihood configuration (MAP) of size k for Determinantal Point Processes (DPP) in the online setting where the data points arrive in an arbitrary order and the algorithm cannot discard the selected elements from its local memory. Given a tolerance additive error eta, our online algorithm achieves a k^O(k) multiplicative approximation guarantee along with an additive error eta, using a memory footprint independent of the size of the data stream. We note that the exponential dependence on k in the approximation factor is unavoidable even in the offline setting. Our result readily implies a streaming algorithm with an improved memory bound compared to existing results.more » « less
-
In many applications, multiple parties have private data regarding the same set of users but on disjoint sets of attributes, and a server wants to leverage the data to train a model. To enable model learning while protecting the privacy of the data subjects, we need vertical federated learning (VFL) techniques, where the data parties share only information for training the model, instead of the private data. However, it is challenging to ensure that the shared information maintains privacy while learning accurate models. To the best of our knowledge, the algorithm proposed in this paper is the first practical solution for differentially private vertical federatedk-means clustering, where the server can obtain a set of global centers with a provable differential privacy guarantee. Our algorithm assumes an untrusted central server that aggregates differentially private local centers and membership encodings from local data parties. It builds a weighted grid as the synopsis of the global dataset based on the received information. Final centers are generated by running anyk-means algorithm on the weighted grid. Our approach for grid weight estimation uses a novel, light-weight, and differentially private set intersection cardinality estimation algorithm based on the Flajolet-Martin sketch. To improve the estimation accuracy in the setting with more than two data parties, we further propose a refined version of the weights estimation algorithm and a parameter tuning strategy to reduce the finalk-means loss to be close to that in the central private setting. We provide theoretical utility analysis and experimental evaluation results for the cluster centers computed by our algorithm and show that our approach performs better both theoretically and empirically than the two baselines based on existing techniquesmore » « less
-
Given a data set of size n in d'-dimensional Euclidean space, the k-means problem asks for a set of k points (called centers) such that the sum of the l_2^2-distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers high additive error. The additive error has also been seen to be an issue in implementations of differentially private k-means clustering algorithms in both the central and local settings. In this work, we introduce a new locally private k-means clustering algorithm that achieves near-optimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c>sqrt(2), our algorithm achieves O(k^(1 + O(1/(2c^2-1))) * sqrt(d' n) * log d' * poly log n) additive error with an O(c^2) multiplicative approximation factor.more » « less
-
Electricity bill constitutes a significant portion of operational costs for large scale data centers. Empowering data centers with on-site storages can reduce the electricity bill by shaping the energy procurement from deregulated electricity markets with real-time price fluctuations. This work focuses on designing energy procurement and storage management strategies to minimize the electricity bill of storage-assisted data centers. Designing such strategies is challenging since the net energy demand of the data center and electricity market prices are not known in advance, and the underlying problem is coupled over time due to evolution of the storage level. Using competitive ratio as the performance measure, we propose an online algorithm that determines the energy procurement and storage management strategies using a threshold based policy. Our algorithm achieves the optimal competitive ratio of as a function of the price fluctuation ratio. We validate the algorithm using data traces from electricity markets and data-center energy demands. The results show that our algorithm achieves close to the offline optimal performance and outperforms existing alternatives.%more » « less
An official website of the United States government

