skip to main content

Title: Randomized Dimensionality Reduction for Facility Location and Single-Linkage Clustering
Random dimensionality reduction is a versatile tool for speeding up algorithms for high-dimensional problems. We study its application to two clustering problems: the facility location problem, and the single-linkage hierarchical clustering problem, which is equivalent to computing the minimum spanning tree. We show that if we project the input pointset 𝑋 onto a random 𝑑=𝑂(𝑑𝑋)-dimensional subspace (where 𝑑𝑋 is the doubling dimension of 𝑋), then the optimum facility location cost in the projected space approximates the original cost up to a constant factor. We show an analogous statement for minimum spanning tree, but with the dimension 𝑑 having an extra loglog𝑛 term and the approximation factor being arbitrarily close to 1. Furthermore, we extend these results to approximating solutions instead of just their costs. Lastly, we provide experimental results to validate the quality of solutions and the speedup due to the dimensionality reduction. Unlike several previous papers studying this approach in the context of π‘˜-means and π‘˜-medians, our dimension bound does not depend on the number of clusters but only on the intrinsic dimensionality of 𝑋.  more » « less
Award ID(s):
2022448 1740751
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
International Conference on Machine Learning
Page Range / eLocation ID:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Numerical solutions of stochastic problems involving random processes 𝑋(𝑑), which constitutes infinite families of random variables, require to represent these processes by finite dimensional (FD) models 𝑋𝑑 (𝑑), i.e., deterministic functions of time depending on finite numbers 𝑑 of random variables. Most available FD models match the mean, correlation, and other global properties of 𝑋(𝑑). They provide useful information to a broad range of problems, but cannot be used to estimate extremes or other sample properties of 𝑋(𝑑). We develop FD models 𝑋𝑑 (𝑑) for processes 𝑋(𝑑) with continuous samples and establish conditions under which these models converge weakly to 𝑋(𝑑) in the space of continuous functions as 𝑑 β†’ ∞. These theoretical results are illustrated by numerical examples which show that, under the conditions established in this study, samples and extremes of 𝑋(𝑑) can be approximated by samples and extremes of 𝑋𝑑 (𝑑) and that the discrepancy between samples and extremes of these processes decreases with 𝑑. 
    more » « less
  2. Consider an instance of Euclidean k-means or k-medians clustering. We show that the cost of the optimal solution is preserved up to a factor of (1+Ξ΅) under a projection onto a random O(log(k /Ξ΅) / Ξ΅2)-dimensional subspace. Further, the cost of every clustering is preserved within (1+Ξ΅). More generally, our result applies to any dimension reduction map satisfying a mild sub-Gaussian-tail condition. Our bound on the dimension is nearly optimal. Additionally, our result applies to Euclidean k-clustering with the distances raised to the p-th power for any constant p. For k-means, our result resolves an open problem posed by Cohen, Elder, Musco, Musco, and Persu (STOC 2015); for k-medians, it answers a question raised by Kannan. 
    more » « less
  3. Conditions under which samples of continuous stochastic processes 𝑋(𝑑) on bounded time intervals [0, 𝜏] can be represented by samples of finite dimensional (FD) processes 𝑋𝑑 (𝑑) are augmented such that samples of Slepian models 𝑆𝑑,π‘Ž(𝑑) of 𝑋𝑑 (𝑑) can be used as surrogates for samples of Slepian models π‘†π‘Ž(𝑑) of 𝑋(𝑑). FD processes are deterministic functions of time and 𝑑 < ∞ random variables. The discrepancy between target and FD samples is quantified by the metric of the space 𝐢[0, 𝜏] of continuous functions. The numerical illustrations, which include Gaussian/non-Gaussian FD processes and solutions of linear/nonlinear random vibration problems, are consistent with the theoretical findings in the paper. 
    more » « less
  4. In this work, we formulate and solve a new type of approximate nearest neighbor search (ANNS) problems called ANNS after linear transformation (ALT). In ANNS-ALT, we search for the vector (in a dataset) that, after being linearly transformed by a user-specified query matrix, is closest to a query vector. It is a very general mother problem in the sense that a wide range of baby ANNS problems that have important applications in databases and machine learning can be reduced to and solved as ANNS-ALT, or its dual that we call ANNS-ALTD. We propose a novel and computationally efficient solution, called ONe Index for All Kernels (ONIAK), to ANNS-ALT and all its baby problems when the data dimension 𝑑 is not too large (say 𝑑 ≀ 200). In ONIAK, a universal index is built, once and for all, for answering all future ANNS-ALT queries that can have distinct query matrices. We show by experiments that, when 𝑑 is not too large, ONIAK has better query performance than linear scan on the mother problem (of ANNS-ALT), and has query performances comparable to those of the state-of-the-art solutions on the baby problems. However, the algorithmic technique behind this universal index approach suffers from a so-called dimension blowup problem that can make the indexing time prohibitively long for a large dataset. We propose a novel algorithmic technique, called fast GOE quadratic form (FGoeQF), that completely solves the (prohibitively long indexing time) fallout of the dimension blowup problem. We also propose a Johnson-Lindenstrauss transform (JLT) based ANNS- ALT (and ANNS-ALTD) solution that significantly outperforms any competitor when 𝑑 is large. 
    more » « less
  5. We present a novel framework for learning cost-efficient latent representations in problems with highdimensional state spaces through nonlinear dimension reduction. By enriching linear state approximations with low-order polynomial terms we account for key nonlinear interactions existing in the data thereby reducing the problem’s intrinsic dimensionality. Two methods are introduced for learning the representation of such low-dimensional, polynomial manifolds for embedding the data. The manifold parametrization coefficients can be obtained by regression via either a proper orthogonal decomposition or an alternating minimization based approach. Our numerical results focus on the one-dimensional Korteweg-de Vries equation where accounting for nonlinear correlations in the data was found to lower the representation error by up to two orders of magnitude compared to linear dimension reduction techniques. 
    more » « less