skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 5:00 PM ET until 11:00 PM ET on Friday, June 21 due to maintenance. We apologize for the inconvenience.


Title: Randomized Dimensionality Reduction for Facility Location and Single-Linkage Clustering
Random dimensionality reduction is a versatile tool for speeding up algorithms for high-dimensional problems. We study its application to two clustering problems: the facility location problem, and the single-linkage hierarchical clustering problem, which is equivalent to computing the minimum spanning tree. We show that if we project the input pointset 𝑋 onto a random 𝑑=𝑂(𝑑𝑋)-dimensional subspace (where 𝑑𝑋 is the doubling dimension of 𝑋), then the optimum facility location cost in the projected space approximates the original cost up to a constant factor. We show an analogous statement for minimum spanning tree, but with the dimension 𝑑 having an extra loglog𝑛 term and the approximation factor being arbitrarily close to 1. Furthermore, we extend these results to approximating solutions instead of just their costs. Lastly, we provide experimental results to validate the quality of solutions and the speedup due to the dimensionality reduction. Unlike several previous papers studying this approach in the context of 𝑘-means and 𝑘-medians, our dimension bound does not depend on the number of clusters but only on the intrinsic dimensionality of 𝑋.  more » « less
Award ID(s):
2022448 1740751
NSF-PAR ID:
10276628
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
International Conference on Machine Learning
Page Range / eLocation ID:
7948-7957
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Numerical solutions of stochastic problems involving random processes 𝑋(𝑡), which constitutes infinite families of random variables, require to represent these processes by finite dimensional (FD) models 𝑋𝑑 (𝑡), i.e., deterministic functions of time depending on finite numbers 𝑑 of random variables. Most available FD models match the mean, correlation, and other global properties of 𝑋(𝑡). They provide useful information to a broad range of problems, but cannot be used to estimate extremes or other sample properties of 𝑋(𝑡). We develop FD models 𝑋𝑑 (𝑡) for processes 𝑋(𝑡) with continuous samples and establish conditions under which these models converge weakly to 𝑋(𝑡) in the space of continuous functions as 𝑑 → ∞. These theoretical results are illustrated by numerical examples which show that, under the conditions established in this study, samples and extremes of 𝑋(𝑡) can be approximated by samples and extremes of 𝑋𝑑 (𝑡) and that the discrepancy between samples and extremes of these processes decreases with 𝑑. 
    more » « less
  2. Consider an instance of Euclidean k-means or k-medians clustering. We show that the cost of the optimal solution is preserved up to a factor of (1+ε) under a projection onto a random O(log(k /ε) / ε2)-dimensional subspace. Further, the cost of every clustering is preserved within (1+ε). More generally, our result applies to any dimension reduction map satisfying a mild sub-Gaussian-tail condition. Our bound on the dimension is nearly optimal. Additionally, our result applies to Euclidean k-clustering with the distances raised to the p-th power for any constant p. For k-means, our result resolves an open problem posed by Cohen, Elder, Musco, Musco, and Persu (STOC 2015); for k-medians, it answers a question raised by Kannan. 
    more » « less
  3. A conditional sampling oracle for a probability distribution D returns samples from the conditional distribution of D restricted to a specified subset of the domain. A recent line of work (Chakraborty et al. 2013 and Cannone et al. 2014) has shown that having access to such a conditional sampling oracle requires only polylogarithmic or even constant number of samples to solve distribution testing problems like identity and uniformity. This significantly improves over the standard sampling model where polynomially many samples are necessary. Inspired by these results, we introduce a computational model based on conditional sampling to develop sublinear algorithms with exponentially faster runtimes compared to standard sublinear algorithms. We focus on geometric optimization problems over points in high dimensional Euclidean space. Access to these points is provided via a conditional sampling oracle that takes as input a succinct representation of a subset of the domain and outputs a uniformly random point in that subset. We study two well studied problems: k-means clustering and estimating the weight of the minimum spanning tree. In contrast to prior algorithms for the classic model, our algorithms have time, space and sample complexity that is polynomial in the dimension and polylogarithmic in the number of points. Finally, we comment on the applicability of the model and compare with existing ones like streaming, parallel and distributed computational models. 
    more » « less
  4. Conditions under which samples of continuous stochastic processes 𝑋(𝑡) on bounded time intervals [0, 𝜏] can be represented by samples of finite dimensional (FD) processes 𝑋𝑑 (𝑡) are augmented such that samples of Slepian models 𝑆𝑑,𝑎(𝑡) of 𝑋𝑑 (𝑡) can be used as surrogates for samples of Slepian models 𝑆𝑎(𝑡) of 𝑋(𝑡). FD processes are deterministic functions of time and 𝑑 < ∞ random variables. The discrepancy between target and FD samples is quantified by the metric of the space 𝐶[0, 𝜏] of continuous functions. The numerical illustrations, which include Gaussian/non-Gaussian FD processes and solutions of linear/nonlinear random vibration problems, are consistent with the theoretical findings in the paper. 
    more » « less
  5. Given a set of facilities and clients, and costs to open facilities, the classic facility location problem seeks to open a set of facilities and assign each client to one open facility to minimize the cost of opening the chosen facilities and the total distance of the clients to their assigned open facilities. Such an objective may induce an unequal cost over certain socioeconomic groups of clients (i.e., total distance traveled by clients in such a group). This is important when planning the location of socially relevant facilities such as emergency rooms. In this work, we consider a fair version of the problem where we are given 𝑟 clients groups that partition the set of clients, and the distance of a given group is defined as the average distance of clients in the group to their respective open facilities. The objective is to minimize the Minkowski 𝑝-norm of vector of group distances, to penalize high access costs to open facilities across 𝑟 groups of clients. This generalizes classic facility location (𝑝 = 1) and the minimization of the maximum group distance (𝑝 = ∞). However, in practice, fairness criteria may not be explicit or even known to a decision maker, and it is often unclear how to select a specific "𝑝" to model the cost of unfairness. To get around this, we study the notion of solution portfolios where for a fixed problem instance, we seek a small portfolio of solutions such that for any Minkowski norm 𝑝, one of these solutions is an 𝑂(1)-approximation. Using the geometric relationship between various 𝑝-norms, we show the existence of a portfolio of cardinality 𝑂(log 𝑟), and a lower bound of (\sqrt{log r}). There may not be common structure across different solutions in this portfolio, which can make planning difficult if the notion of fairness changes over time or if the budget to open facilities is disbursed over time. For example, small changes in 𝑝 could lead to a completely different set of open facilities in the portfolio. Inspired by this, we introduce the notion of refinement, which is a family of solutions for each 𝑝-norm satisfying a combinatorial property. This property requires that (1) the set of facilities open for a higher 𝑝-norm must be a subset of the facilities open for a lower 𝑝-norm, and (2) all clients assigned to an open facility for a lower 𝑝-norm must be assigned to the same open facility for any higher 𝑝-norm. A refinement is 𝛼-approximate if the solution for each 𝑝-norm problem is an 𝛼-approximation for it. We show that it is sufficient to consider only 𝑂(log 𝑟) norms instead of all 𝑝-norms, 𝑝 ∈ [1, ∞] to construct refinements. A natural greedy algorithm for the problem gives a poly(𝑟)-approximate refinement, which we improve to poly(r^1/\sqrt{log 𝑟})-approximate using a recursive algorithm. We improve this ratio to 𝑂(log 𝑟) for the special case of tree metric for uniform facility open cost. Our recursive algorithm extends to other settings, including to a hierarchical facility location problem that models facility location problems at several levels, such as public works departments and schools. A full version of this paper can be found at https://arxiv.org/abs/2211.14873. 
    more » « less