Numerical solutions of stochastic problems involving random processes 𝑋(𝑡), which constitute infinite families of random variables, require representing these processes by finite dimensional (FD) models 𝑋𝑑(𝑡), i.e., deterministic functions of time depending on a finite number 𝑑 of random variables. Most available FD models match the mean, correlation, and other global properties of 𝑋(𝑡). They provide useful information for a broad range of problems, but cannot be used to estimate extremes or other sample properties of 𝑋(𝑡). We develop FD models 𝑋𝑑(𝑡) for processes 𝑋(𝑡) with continuous samples and establish conditions under which these models converge weakly to 𝑋(𝑡) in the space of continuous functions as 𝑑 → ∞. These theoretical results are illustrated by numerical examples which show that, under the conditions established in this study, samples and extremes of 𝑋(𝑡) can be approximated by samples and extremes of 𝑋𝑑(𝑡), and that the discrepancy between samples and extremes of these processes decreases with 𝑑.
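The abstract above does not spell out a particular FD construction. One classic example (not necessarily the model developed in the paper) is the truncated spectral representation of a stationary Gaussian process; the sketch below uses an assumed spectral density 𝑠(𝜈) = exp(−𝜈²) and checks that the sup-norm discrepancy between truncation levels, i.e., the metric of the space of continuous functions, shrinks as 𝑑 grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stationary Gaussian process on [0, tau]; the truncated spectral
# representation is a classic FD model:
#   X_d(t) = sum_{k=1}^{d} sigma_k * (A_k cos(nu_k t) + B_k sin(nu_k t))
tau, nu_max, d_max = 10.0, 5.0, 200
t = np.linspace(0.0, tau, 1000)
dnu = nu_max / d_max
nu = (np.arange(d_max) + 0.5) * dnu      # frequency grid
s = np.exp(-nu ** 2)                     # spectral density (an assumption)
sigma = np.sqrt(2.0 * s * dnu)           # modal standard deviations

# Shared random variables, so X_d for smaller d is a genuine truncation
A = rng.standard_normal(d_max)
B = rng.standard_normal(d_max)

def X_d(d):
    """Sample path of the d-term FD model on the grid t."""
    phase = np.outer(nu[:d], t)
    return sigma[:d] @ (A[:d, None] * np.cos(phase) + B[:d, None] * np.sin(phase))

# Sup-norm discrepancy to the finest model decreases with d
err_small = np.max(np.abs(X_d(20) - X_d(d_max)))
err_large = np.max(np.abs(X_d(100) - X_d(d_max)))
```

With a rapidly decaying spectral density, the discarded high-frequency terms carry little variance, so the 100-term model tracks the 200-term model far more closely than the 20-term model does.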
Randomized Dimensionality Reduction for Facility Location and Single-Linkage Clustering
Random dimensionality reduction is a versatile tool for speeding up algorithms for high-dimensional problems. We study its application to two clustering problems: the facility location problem, and the single-linkage hierarchical clustering problem, which is equivalent to computing the minimum spanning tree. We show that if we project the input point set 𝑋 onto a random 𝑑 = 𝑂(𝑑𝑋)-dimensional subspace (where 𝑑𝑋 is the doubling dimension of 𝑋), then the optimum facility location cost in the projected space approximates the original cost up to a constant factor. We show an analogous statement for the minimum spanning tree, but with the dimension 𝑑 having an extra log log 𝑛 term and the approximation factor being arbitrarily close to 1. Furthermore, we extend these results to approximating solutions instead of just their costs. Lastly, we provide experimental results to validate the quality of solutions and the speedup due to the dimensionality reduction. Unlike several previous papers studying this approach in the context of 𝑘-means and 𝑘-medians, our dimension bound does not depend on the number of clusters but only on the intrinsic dimensionality of 𝑋.
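As a rough illustration of the minimum-spanning-tree statement (not the paper's construction or parameters), the sketch below projects a point set of low intrinsic dimension through a random Gaussian map and compares MST weights before and after; the point set, dimensions, and scaling are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def mst_weight(points):
    """Total weight of the Euclidean minimum spanning tree (Prim's, O(n^2))."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()    # cheapest edge from the tree to each vertex
    total = 0.0
    for _ in range(n - 1):
        best[in_tree] = np.inf
        j = int(np.argmin(best))
        total += best[j]
        in_tree[j] = True
        best = np.minimum(best, dist[j])
    return total

# Points of intrinsic dimension 5 embedded in a high ambient dimension
n, ambient, target = 200, 1000, 40
low = rng.standard_normal((n, 5))
X = np.hstack([low, np.zeros((n, ambient - 5))])

# Random Gaussian projection (a Johnson-Lindenstrauss-style map)
G = rng.standard_normal((ambient, target)) / np.sqrt(target)
ratio = mst_weight(X @ G) / mst_weight(X)
```

Because the target dimension depends on the intrinsic (here 5-dimensional) structure rather than the ambient 1000 dimensions, the MST weight ratio stays close to 1.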
NSF-PAR ID: 10276628
Journal Name: International Conference on Machine Learning
Page Range / eLocation ID: 7948–7957
Sponsoring Org: National Science Foundation
More Like this


Consider an instance of Euclidean k-means or k-medians clustering. We show that the cost of the optimal solution is preserved up to a factor of (1+ε) under a projection onto a random O(log(k/ε)/ε²)-dimensional subspace. Further, the cost of every clustering is preserved within (1+ε). More generally, our result applies to any dimension reduction map satisfying a mild sub-Gaussian-tail condition. Our bound on the dimension is nearly optimal. Additionally, our result applies to Euclidean k-clustering with the distances raised to the p-th power for any constant p. For k-means, our result resolves an open problem posed by Cohen, Elder, Musco, Musco, and Persu (STOC 2015); for k-medians, it answers a question raised by Kannan.
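A quick numerical sanity check of the flavor of this statement, with arbitrary assumed data and a fixed clustering (not the paper's proof, map, or parameters): the k-means cost of a fixed partition is compared before and after a random Gaussian projection.

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans_cost(points, labels):
    """Sum of squared distances of points to their cluster centroids."""
    cost = 0.0
    for c in np.unique(labels):
        cluster = points[labels == c]
        cost += np.sum((cluster - cluster.mean(axis=0)) ** 2)
    return cost

n, ambient, k, target = 300, 500, 4, 60
X = rng.standard_normal((n, ambient))
labels = rng.integers(0, k, size=n)      # an arbitrary fixed clustering

# Random Gaussian projection, scaled to preserve squared norms in expectation
G = rng.standard_normal((ambient, target)) / np.sqrt(target)
ratio = kmeans_cost(X @ G, labels) / kmeans_cost(X, labels)
```

The cost sums many squared distances, so it concentrates even more tightly than any individual distance, and the ratio lands near 1 for a modest target dimension.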

A conditional sampling oracle for a probability distribution D returns samples from the conditional distribution of D restricted to a specified subset of the domain. A recent line of work (Chakraborty et al. 2013 and Canonne et al. 2014) has shown that access to such a conditional sampling oracle requires only a polylogarithmic or even constant number of samples to solve distribution testing problems like identity and uniformity. This significantly improves over the standard sampling model, where polynomially many samples are necessary. Inspired by these results, we introduce a computational model based on conditional sampling to develop sublinear algorithms with exponentially faster runtimes compared to standard sublinear algorithms. We focus on geometric optimization problems over points in high dimensional Euclidean space. Access to these points is provided via a conditional sampling oracle that takes as input a succinct representation of a subset of the domain and outputs a uniformly random point in that subset. We study two well-studied problems: k-means clustering and estimating the weight of the minimum spanning tree. In contrast to prior algorithms for the classic model, our algorithms have time, space and sample complexity that is polynomial in the dimension and polylogarithmic in the number of points. Finally, we comment on the applicability of the model and compare with existing ones like streaming, parallel and distributed computational models.
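The oracle interface can be simulated by brute force for illustration; a genuine sublinear algorithm would of course not scan the whole point set. The point set and predicate below are assumptions, not the paper's succinct subset representation.

```python
import random

random.seed(0)

class ConditionalSampler:
    """Uniform conditional-sampling oracle over a finite point set.

    Given a subset of the domain (here described by a predicate),
    returns a uniformly random point of the set lying in that subset.
    This toy simulation scans the set; it only models the interface.
    """
    def __init__(self, points):
        self.points = list(points)

    def sample(self, predicate=lambda p: True):
        subset = [p for p in self.points if predicate(p)]
        if not subset:
            raise ValueError("conditioning set is empty")
        return random.choice(subset)

# Grid points in the plane; condition on the halfspace x >= 5
points = [(x, y) for x in range(10) for y in range(10)]
oracle = ConditionalSampler(points)
draws = [oracle.sample(lambda p: p[0] >= 5) for _ in range(200)]
```

Every draw lands in the conditioning set, and the draws are uniform over it, which is exactly the access pattern the model grants to an algorithm.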

Conditions under which samples of continuous stochastic processes 𝑋(𝑡) on bounded time intervals [0, 𝜏] can be represented by samples of finite dimensional (FD) processes 𝑋𝑑(𝑡) are augmented such that samples of Slepian models 𝑆𝑑,𝑎(𝑡) of 𝑋𝑑(𝑡) can be used as surrogates for samples of Slepian models 𝑆𝑎(𝑡) of 𝑋(𝑡). FD processes are deterministic functions of time and 𝑑 < ∞ random variables. The discrepancy between target and FD samples is quantified by the metric of the space 𝐶[0, 𝜏] of continuous functions. The numerical illustrations, which include Gaussian/non-Gaussian FD processes and solutions of linear/nonlinear random vibration problems, are consistent with the theoretical findings in the paper.

Given a set of facilities and clients, and costs to open facilities, the classic facility location problem seeks to open a set of facilities and assign each client to one open facility to minimize the cost of opening the chosen facilities and the total distance of the clients to their assigned open facilities. Such an objective may induce an unequal cost over certain socioeconomic groups of clients (i.e., total distance traveled by clients in such a group). This is important when planning the location of socially relevant facilities such as emergency rooms. In this work, we consider a fair version of the problem where we are given 𝑟 client groups that partition the set of clients, and the distance of a given group is defined as the average distance of clients in the group to their respective open facilities. The objective is to minimize the Minkowski 𝑝-norm of the vector of group distances, to penalize high access costs to open facilities across the 𝑟 groups of clients. This generalizes classic facility location (𝑝 = 1) and the minimization of the maximum group distance (𝑝 = ∞). However, in practice, fairness criteria may not be explicit or even known to a decision maker, and it is often unclear how to select a specific "𝑝" to model the cost of unfairness. To get around this, we study the notion of solution portfolios where, for a fixed problem instance, we seek a small portfolio of solutions such that for any Minkowski norm 𝑝, one of these solutions is an 𝑂(1)-approximation. Using the geometric relationship between various 𝑝-norms, we show the existence of a portfolio of cardinality 𝑂(log 𝑟), and a lower bound of Ω(√log 𝑟). There may not be common structure across different solutions in this portfolio, which can make planning difficult if the notion of fairness changes over time or if the budget to open facilities is disbursed over time. For example, small changes in 𝑝 could lead to a completely different set of open facilities in the portfolio.
Inspired by this, we introduce the notion of refinement, which is a family of solutions, one for each 𝑝-norm, satisfying a combinatorial property. This property requires that (1) the set of facilities open for a higher 𝑝-norm must be a subset of the facilities open for a lower 𝑝-norm, and (2) all clients assigned to an open facility for a lower 𝑝-norm must be assigned to the same open facility for any higher 𝑝-norm. A refinement is 𝛼-approximate if the solution for each 𝑝-norm problem is an 𝛼-approximation for it. We show that it is sufficient to consider only 𝑂(log 𝑟) norms instead of all 𝑝-norms, 𝑝 ∈ [1, ∞], to construct refinements. A natural greedy algorithm for the problem gives a poly(𝑟)-approximate refinement, which we improve to poly(𝑟^(1/√log 𝑟))-approximate using a recursive algorithm. We improve this ratio to 𝑂(log 𝑟) for the special case of tree metrics with uniform facility opening costs. Our recursive algorithm extends to other settings, including to a hierarchical facility location problem that models facility location problems at several levels, such as public works departments and schools. A full version of this paper can be found at https://arxiv.org/abs/2211.14873.
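The portfolio idea can be illustrated on a toy instance: given the group-distance vectors of a few hypothetical solutions, one can check how close a single solution comes, under every 𝑝-norm, to the best of the portfolio. The solutions and values below are invented for illustration and are not from the paper.

```python
import numpy as np

def pnorm_objective(group_dists, p):
    """Minkowski p-norm of the vector of average group distances."""
    g = np.asarray(group_dists, dtype=float)
    if np.isinf(p):
        return g.max()
    return float(np.sum(g ** p)) ** (1.0 / p)

# Group-distance vectors of three hypothetical solutions (r = 4 groups)
solutions = {
    "A": [1.0, 1.0, 1.0, 9.0],   # good on average, bad worst case
    "B": [4.0, 4.0, 4.0, 4.0],   # perfectly balanced
    "C": [2.0, 2.0, 3.0, 5.0],   # a compromise
}

# For each p, how far solution C is from the best of the portfolio
ps = [1, 2, 4, np.inf]
approx_of_C = {
    p: pnorm_objective(solutions["C"], p)
       / min(pnorm_objective(v, p) for v in solutions.values())
    for p in ps
}
```

On this instance the compromise solution C is within a small constant factor of the best portfolio member for every 𝑝, which is the guarantee a portfolio (and, with the extra nesting property, a refinement) is designed to provide simultaneously across norms.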