Title: Randomized Dimensionality Reduction for Facility Location and Single-Linkage Clustering
Random dimensionality reduction is a versatile tool for speeding up algorithms for high-dimensional problems. We study its application to two clustering problems: the facility location problem, and the single-linkage hierarchical clustering problem, which is equivalent to computing the minimum spanning tree. We show that if we project the input point set X onto a random d = O(d_X)-dimensional subspace (where d_X is the doubling dimension of X), then the optimum facility location cost in the projected space approximates the original cost up to a constant factor. We show an analogous statement for the minimum spanning tree, but with the dimension d having an extra log log n term and the approximation factor being arbitrarily close to 1. Furthermore, we extend these results to approximating solutions instead of just their costs. Lastly, we provide experimental results to validate the quality of solutions and the speedup due to the dimensionality reduction. Unlike several previous papers studying this approach in the context of k-means and k-medians, our dimension bound does not depend on the number of clusters but only on the intrinsic dimensionality of X.
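As a rough illustration of the experiment described in the abstract (not the authors' code), the sketch below applies a Gaussian random projection to a point set and compares the minimum spanning tree cost, which determines the single-linkage dendrogram, before and after projection. NumPy/SciPy, the synthetic data, and the target dimension are placeholder assumptions.

```python
# Minimal sketch: compare MST (single-linkage) cost before and after
# a Gaussian random projection to d dimensions. Placeholder data and d.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_cost(points):
    """Total edge weight of the Euclidean minimum spanning tree."""
    dists = squareform(pdist(points))          # pairwise Euclidean distances
    return minimum_spanning_tree(dists).sum()  # MST <=> single-linkage dendrogram

def random_project(points, d, rng):
    """Project onto a random d-dimensional subspace via a scaled Gaussian map."""
    D = points.shape[1]
    G = rng.normal(size=(D, d)) / np.sqrt(d)
    return points @ G

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))   # toy high-dimensional point set (assumption)
d = 20                            # placeholder target dimension
ratio = mst_cost(random_project(X, d, rng)) / mst_cost(X)
print(f"MST cost ratio (projected / original): {ratio:.3f}")
```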
Award ID(s):
2022448 1740751
PAR ID:
10276628
Journal Name:
International Conference on Machine Learning
Page Range / eLocation ID:
7948-7957
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Numerical solutions of stochastic problems involving random processes X(t), which constitute infinite families of random variables, require representing these processes by finite dimensional (FD) models X_d(t), i.e., deterministic functions of time depending on a finite number d of random variables. Most available FD models match the mean, correlation, and other global properties of X(t). They provide useful information for a broad range of problems, but cannot be used to estimate extremes or other sample properties of X(t). We develop FD models X_d(t) for processes X(t) with continuous samples and establish conditions under which these models converge weakly to X(t) in the space of continuous functions as d → ∞. These theoretical results are illustrated by numerical examples which show that, under the conditions established in this study, samples and extremes of X(t) can be approximated by samples and extremes of X_d(t) and that the discrepancy between samples and extremes of these processes decreases with d.
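The abstract does not specify the FD construction used; one common example, shown in the hedged sketch below, is a truncated spectral (Karhunen-Loève-type) expansion in which X_d(t) is a deterministic function of t driven by d random variables. The kernel and truncation level are illustrative assumptions, not the paper's model.

```python
# Illustrative FD model only: X_d(t) = sum_{k=1}^{d} sqrt(lambda_k) Z_k phi_k(t),
# a deterministic function of t driven by d random variables Z_1..Z_d.
import numpy as np

def fd_gaussian_process(t, d, corr_length=0.2, rng=None):
    """Sample an FD approximation of a stationary Gaussian process on grid t."""
    if rng is None:
        rng = np.random.default_rng()
    # Squared-exponential covariance matrix (placeholder kernel choice).
    C = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2 * corr_length ** 2))
    eigvals, eigvecs = np.linalg.eigh(C)
    idx = np.argsort(eigvals)[::-1][:d]        # keep the d largest modes
    Z = rng.standard_normal(d)                 # the d random variables
    return eigvecs[:, idx] @ (np.sqrt(np.maximum(eigvals[idx], 0)) * Z)

t = np.linspace(0.0, 1.0, 200)
sample = fd_gaussian_process(t, d=10)          # one sample path of X_d(t)
print(sample.shape, sample.max())              # extremes read off the sample path
```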
  2. Conditions under which samples of continuous stochastic processes X(t) on bounded time intervals [0, τ] can be represented by samples of finite dimensional (FD) processes X_d(t) are augmented such that samples of Slepian models S_{d,a}(t) of X_d(t) can be used as surrogates for samples of Slepian models S_a(t) of X(t). FD processes are deterministic functions of time and d < ∞ random variables. The discrepancy between target and FD samples is quantified by the metric of the space C[0, τ] of continuous functions. The numerical illustrations, which include Gaussian/non-Gaussian FD processes and solutions of linear/nonlinear random vibration problems, are consistent with the theoretical findings in the paper.
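As a small illustration of the discrepancy measure mentioned in this abstract, the sketch below evaluates the metric of C[0, τ], i.e., the sup norm, between two sample paths on a common time grid. The paths themselves are synthetic stand-ins, not the paper's examples.

```python
# Sketch of the C[0, tau] discrepancy: the sup norm of the difference between
# a target sample path and its FD surrogate on a shared time grid.
import numpy as np

def sup_norm_discrepancy(x_target, x_fd):
    """C[0, tau] distance between two sample paths on the same grid."""
    return np.max(np.abs(x_target - x_fd))

t = np.linspace(0.0, 1.0, 500)
x_target = np.sin(2 * np.pi * t)                      # stand-in for a sample of X(t)
x_fd = np.sin(2 * np.pi * t) + 0.01 * np.cos(10 * t)  # stand-in for an FD surrogate
print(f"sup-norm discrepancy: {sup_norm_discrepancy(x_target, x_fd):.4f}")
```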
  3. Consider an instance of Euclidean k-means or k-medians clustering. We show that the cost of the optimal solution is preserved up to a factor of (1+ε) under a projection onto a random O(log(k/ε)/ε²)-dimensional subspace. Further, the cost of every clustering is preserved within (1+ε). More generally, our result applies to any dimension reduction map satisfying a mild sub-Gaussian-tail condition. Our bound on the dimension is nearly optimal. Additionally, our result applies to Euclidean k-clustering with the distances raised to the p-th power for any constant p. For k-means, our result resolves an open problem posed by Cohen, Elder, Musco, Musco, and Persu (STOC 2015); for k-medians, it answers a question raised by Kannan.
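A hedged sketch of the statement above: project the data onto a random subspace of dimension proportional to log(k/ε)/ε² and compare the k-means cost before and after. Using scikit-learn's KMeans as the solver, and the particular data and constants, are illustrative assumptions rather than the paper's setup.

```python
# Sketch: k-means cost under a Gaussian random projection to O(log(k/eps)/eps^2)
# dimensions (constants omitted). Toy data; not the authors' experiments.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 200))                   # toy high-dimensional data
k, eps = 10, 0.5
d = int(np.ceil(np.log(k / eps) / eps**2))         # target dimension up to constants
G = rng.normal(size=(X.shape[1], d)) / np.sqrt(d)  # Gaussian JL-style map
X_proj = X @ G

cost_orig = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
cost_proj = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_proj).inertia_
print(f"k-means cost ratio (projected / original): {cost_proj / cost_orig:.3f}")
```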
  4. This paper studies differentially private stochastic convex optimization (DP-SCO) in the presence of heavy-tailed gradients, where only a kth-moment bound on sample Lipschitz constants is assumed, instead of a uniform bound. The authors propose a reduction-based approach that achieves the first near-optimal error rates (up to logarithmic factors) in this setting. Specifically, under (ε, δ)-approximate differential privacy, they achieve an error bound of G_2/√n + G_k · (√d/(nε))^(1−1/k), up to a mild polylogarithmic factor in 1/δ, where G_2 and G_k are the 2nd and kth moment bounds on sample Lipschitz constants. This nearly matches the lower bound established by Lowy and Razaviyayn (2023). Beyond the basic result, the authors introduce a suite of private algorithms that further improve performance under additional assumptions: an optimal algorithm under a known Lipschitz constant, a near-linear time algorithm for smooth functions, and an optimal linear-time algorithm for smooth generalized linear models.
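As a generic illustration only (this is not the authors' reduction-based algorithm), the sketch below shows the clipping-plus-Gaussian-noise gradient step commonly used in private optimization when gradients may be heavy-tailed. The clip level and noise scale are placeholders, not privacy-calibrated values.

```python
# Generic DP-style step: per-sample gradient clipping followed by Gaussian noise.
# Placeholder clip/noise/learning-rate values; not the paper's algorithm.
import numpy as np

def dp_clipped_step(w, per_sample_grads, clip, noise_scale, lr, rng):
    """One clipped, noised gradient step on parameters w."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    noisy_mean = clipped.mean(axis=0) + rng.normal(
        scale=noise_scale * clip / len(per_sample_grads), size=w.shape)
    return w - lr * noisy_mean

rng = np.random.default_rng(2)
w = np.zeros(5)
grads = rng.standard_t(df=3, size=(64, 5))     # heavy-tailed per-sample gradients
w = dp_clipped_step(w, grads, clip=1.0, noise_scale=1.0, lr=0.1, rng=rng)
print(w)
```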
  5. Optimal transport (OT) methods seek a transformation map (or plan) between two probability measures, such that the transformation has the minimum transportation cost. Such a minimum transport cost, with a certain power transform, is called the Wasserstein distance. Recently, OT methods have drawn great attention in statistics, machine learning, and computer science, especially in deep generative neural networks. Despite its broad applications, the estimation of high-dimensional Wasserstein distances is a well-known challenging problem owing to the curse of dimensionality. There are some cutting-edge projection-based techniques that tackle high-dimensional OT problems. Three major approaches of such techniques are introduced: the slicing approach, the iterative projection approach, and the projection robust OT approach. Open challenges are discussed at the end of the review. This article is categorized under: Statistical and Graphical Methods of Data Analysis > Dimension Reduction; Statistical Learning and Exploratory Methods of the Data Sciences > Manifold Learning.
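As an illustration of the slicing approach mentioned above, the sketch below estimates a sliced 1-Wasserstein distance by averaging one-dimensional Wasserstein distances over random projection directions. The sample sizes and number of slices are arbitrary choices, and the implementation is a plain NumPy sketch rather than a reference one.

```python
# Sketch of the slicing approach: average 1-D W1 distances over random directions.
import numpy as np

def sliced_wasserstein(X, Y, n_slices=100, rng=None):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance between samples."""
    if rng is None:
        rng = np.random.default_rng()
    dim = X.shape[1]
    total = 0.0
    for _ in range(n_slices):
        theta = rng.normal(size=dim)
        theta /= np.linalg.norm(theta)         # random unit direction
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean(np.abs(px - py))      # 1-D W1 via sorted projections
    return total / n_slices

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 50))
Y = rng.normal(loc=0.5, size=(500, 50))
print(f"sliced W1 estimate: {sliced_wasserstein(X, Y, rng=rng):.3f}")
```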