Numerical solutions of stochastic problems involving random processes X(t), which constitute infinite families of random variables, require representing these processes by finite-dimensional (FD) models X_d(t), i.e., deterministic functions of time depending on a finite number d of random variables. Most available FD models match the mean, correlation, and other global properties of X(t). They provide useful information for a broad range of problems, but cannot be used to estimate extremes or other sample properties of X(t). We develop FD models X_d(t) for processes X(t) with continuous samples and establish conditions under which these models converge weakly to X(t) in the space of continuous functions as d → ∞. These theoretical results are illustrated by numerical examples which show that, under the conditions established in this study, samples and extremes of X(t) can be approximated by samples and extremes of X_d(t), and that the discrepancy between samples and extremes of these processes decreases with d.
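An FD model is a deterministic function of time driven by finitely many random variables. A standard concrete example, sketched below in numpy, is a truncated spectral representation of a stationary Gaussian process; the spectral density s(ν) = e^(−ν), the band [0, 5], and all names are illustrative choices, not necessarily the construction developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fd_process(t, d, rng):
    """FD model X_d(t): truncated spectral representation of a stationary
    Gaussian process with (illustrative) one-sided spectral density
    s(nu) = exp(-nu) on the band [0, 5], built from 2*d random variables:
    X_d(t) = sum_k sigma_k * (A_k cos(nu_k t) + B_k sin(nu_k t))."""
    nu = np.linspace(0.0, 5.0, d, endpoint=False) + 5.0 / (2 * d)  # band midpoints
    sigma = np.sqrt(np.exp(-nu) * (5.0 / d))   # sqrt of spectral mass per band
    A, B = rng.standard_normal(d), rng.standard_normal(d)
    phases = np.outer(t, nu)                   # shape (len(t), d)
    return (sigma * (A * np.cos(phases) + B * np.sin(phases))).sum(axis=1)

t = np.linspace(0.0, 10.0, 1001)
x64 = fd_process(t, 64, rng)                   # one sample path of X_64(t)
```

As d grows, the model variance (the sum of the per-band spectral masses) converges to the variance of the limit process, here ∫₀⁵ e^(−ν) dν = 1 − e^(−5).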
Randomized Dimensionality Reduction for Facility Location and Single-Linkage Clustering
Random dimensionality reduction is a versatile tool for speeding up algorithms for high-dimensional problems. We study its application to two clustering problems: the facility location problem, and the single-linkage hierarchical clustering problem, which is equivalent to computing the minimum spanning tree. We show that if we project the input pointset X onto a random d = O(d_X)-dimensional subspace (where d_X is the doubling dimension of X), then the optimum facility location cost in the projected space approximates the original cost up to a constant factor. We show an analogous statement for the minimum spanning tree, but with the dimension d having an extra log log n term and the approximation factor being arbitrarily close to 1. Furthermore, we extend these results to approximating solutions instead of just their costs. Lastly, we provide experimental results to validate the quality of solutions and the speedup due to the dimensionality reduction. Unlike several previous papers studying this approach in the context of k-means and k-medians, our dimension bound does not depend on the number of clusters but only on the intrinsic dimensionality of X.
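The minimum-spanning-tree claim can be illustrated with a small numpy experiment: build a pointset of low doubling dimension in a high-dimensional ambient space, project it with a scaled Gaussian matrix, and compare MST costs. The target dimension m = 20 and the synthetic data are illustrative choices, not the paper's bound or experiments.

```python
import numpy as np

rng = np.random.default_rng(42)

def mst_cost(P):
    """Cost of the Euclidean minimum spanning tree of pointset P (Prim's algorithm)."""
    n = len(P)
    in_tree = np.zeros(n, dtype=bool)
    dist = np.full(n, np.inf)   # distance from each point to the current tree
    dist[0] = 0.0
    total = 0.0
    for _ in range(n):
        i = np.argmin(np.where(in_tree, np.inf, dist))  # nearest point outside the tree
        total += dist[i]
        in_tree[i] = True
        dist = np.minimum(dist, np.linalg.norm(P - P[i], axis=1))
    return total

# Pointset of low intrinsic dimension: points on a random 3-dim subspace of R^200.
X = rng.standard_normal((300, 3)) @ rng.standard_normal((3, 200))

m = 20                                            # illustrative target dimension
G = rng.standard_normal((200, m)) / np.sqrt(m)    # scaled Gaussian projection
ratio = mst_cost(X @ G) / mst_cost(X)             # should be close to 1
```

The 1/√m scaling makes the projection approximately distance-preserving in expectation, so the MST cost ratio concentrates near 1.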
NSF-PAR ID: 10276628
Journal Name: International Conference on Machine Learning
Page Range / eLocation ID: 7948-7957
Sponsoring Org: National Science Foundation
More Like this


Consider an instance of Euclidean k-means or k-medians clustering. We show that the cost of the optimal solution is preserved up to a factor of (1+ε) under a projection onto a random O(log(k/ε)/ε²)-dimensional subspace. Further, the cost of every clustering is preserved within (1+ε). More generally, our result applies to any dimension reduction map satisfying a mild sub-Gaussian-tail condition. Our bound on the dimension is nearly optimal. Additionally, our result applies to Euclidean k-clustering with the distances raised to the p-th power for any constant p. For k-means, our result resolves an open problem posed by Cohen, Elder, Musco, Musco, and Persu (STOC 2015); for k-medians, it answers a question raised by Kannan.
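The claim that the cost of every clustering (not just the optimal one) is preserved can be checked numerically. The numpy sketch below evaluates the k-means cost of an arbitrary fixed clustering before and after a scaled Gaussian projection; the target dimension m = 40 is an illustrative stand-in for the O(log(k/ε)/ε²) bound, and the synthetic data are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(7)

def kmeans_cost(P, labels):
    """k-means cost of a fixed clustering: sum of squared distances of
    each point to the centroid of its assigned cluster."""
    return sum(((P[labels == c] - P[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

n, D, k = 500, 1000, 10
centers = 5.0 * rng.standard_normal((k, D))
X = centers[rng.integers(k, size=n)] + rng.standard_normal((n, D))
labels = rng.integers(k, size=n)     # an arbitrary clustering, not the optimal one

m = 40                               # illustrative stand-in for O(log(k/eps)/eps^2)
G = rng.standard_normal((D, m)) / np.sqrt(m)    # scaled Gaussian projection
ratio = kmeans_cost(X @ G, labels) / kmeans_cost(X, labels)   # concentrates near 1
```

A scaled Gaussian matrix satisfies the sub-Gaussian-tail condition the abstract refers to, so the cost ratio concentrates around 1 even for this non-optimal clustering.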

Conditions under which samples of continuous stochastic processes X(t) on bounded time intervals [0, τ] can be represented by samples of finite-dimensional (FD) processes X_d(t) are augmented such that samples of Slepian models S_{d,a}(t) of X_d(t) can be used as surrogates for samples of Slepian models S_a(t) of X(t). FD processes are deterministic functions of time and d < ∞ random variables. The discrepancy between target and FD samples is quantified by the metric of the space C[0, τ] of continuous functions. The numerical illustrations, which include Gaussian/non-Gaussian FD processes and solutions of linear/nonlinear random vibration problems, are consistent with the theoretical findings in the paper.
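The C[0, τ] metric used here is the sup-norm discrepancy between a sample of the target process and the corresponding FD sample. A minimal numpy sketch, using the Karhunen-Loeve expansion of Brownian motion on [0, 1] as a standard example (not the paper's setting, and with a high-order truncation standing in for the target process):

```python
import numpy as np

rng = np.random.default_rng(3)

# Karhunen-Loeve expansion of Brownian motion on [0, 1]:
# B(t) = sqrt(2) * sum_k Z_k sin((k - 1/2) pi t) / ((k - 1/2) pi)
t = np.linspace(0.0, 1.0, 2001)
D_ref = 1024                    # high-order truncation, proxy for the target B(t)
Z = rng.standard_normal(D_ref)
freqs = (np.arange(1, D_ref + 1) - 0.5) * np.pi
basis = np.sqrt(2.0) * np.sin(np.outer(t, freqs)) / freqs   # (len(t), D_ref)

def sup_discrepancy(d):
    """C[0, tau] metric: sup over t of |X(t) - X_d(t)|, with X(t)
    approximated by the reference truncation X_{D_ref}(t)."""
    return np.abs(basis[:, d:] @ Z[d:]).max()

errs = {d: sup_discrepancy(d) for d in (4, 16, 64, 256)}   # decreases with d
```

The truncated variance at t = 1, namely 2·Σ_{k≤d} ((k − 1/2)π)^(−2), converges to Var B(1) = 1, mirroring the convergence of the FD samples to the target samples.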

In this work, we formulate and solve a new type of approximate nearest neighbor search (ANNS) problem called ANNS after linear transformation (ANNS-ALT). In ANNS-ALT, we search for the vector (in a dataset) that, after being linearly transformed by a user-specified query matrix, is closest to a query vector. It is a very general mother problem in the sense that a wide range of baby ANNS problems that have important applications in databases and machine learning can be reduced to and solved as ANNS-ALT, or its dual, which we call ANNS-ALTD. We propose a novel and computationally efficient solution, called ONe Index for All Kernels (ONIAK), to ANNS-ALT and all its baby problems when the data dimension d is not too large (say d ≤ 200). In ONIAK, a universal index is built, once and for all, for answering all future ANNS-ALT queries that can have distinct query matrices. We show by experiments that, when d is not too large, ONIAK has better query performance than linear scan on the mother problem (of ANNS-ALT), and has query performance comparable to that of the state-of-the-art solutions on the baby problems. However, the algorithmic technique behind this universal index approach suffers from a so-called dimension blowup problem that can make the indexing time prohibitively long for a large dataset. We propose a novel algorithmic technique, called fast GOE quadratic form (FGoeQF), that completely solves the (prohibitively long indexing time) fallout of the dimension blowup problem. We also propose a Johnson-Lindenstrauss transform (JLT) based ANNS-ALT (and ANNS-ALTD) solution that significantly outperforms any competitor when d is large.
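The linear-scan baseline that indexed solutions are measured against is easy to state precisely: given a query matrix A and query vector q, return the data vector x minimizing ||A x − q||₂. The numpy sketch below is a generic exact implementation with illustrative dimensions and data, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def alt_linear_scan(data, A, q):
    """Exact linear-scan answer to an ALT query: index of the data vector x
    minimizing ||A x - q||_2 for query matrix A and query vector q."""
    residuals = data @ A.T - q          # row i holds A x_i - q
    return int(np.argmin(np.linalg.norm(residuals, axis=1)))

d, m, n = 8, 5, 100
data = rng.standard_normal((n, d))
A = rng.standard_normal((m, d))         # user-specified query matrix
q = A @ data[17]                        # query engineered so index 17 is optimal
best = alt_linear_scan(data, A, q)
```

Setting A to the identity recovers ordinary nearest neighbor search, which is the sense in which ANNS-ALT is a "mother problem" for simpler ANNS variants.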

We present a novel framework for learning cost-efficient latent representations in problems with high-dimensional state spaces through nonlinear dimension reduction. By enriching linear state approximations with low-order polynomial terms, we account for key nonlinear interactions existing in the data, thereby reducing the problem's intrinsic dimensionality. Two methods are introduced for learning the representation of such low-dimensional, polynomial manifolds for embedding the data. The manifold parametrization coefficients can be obtained by regression via either a proper orthogonal decomposition or an alternating minimization based approach. Our numerical results focus on the one-dimensional Korteweg-de Vries equation, where accounting for nonlinear correlations in the data was found to lower the representation error by up to two orders of magnitude compared to linear dimension reduction techniques.
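A minimal numpy sketch of the linear-plus-quadratic idea: a rank-r POD basis is obtained from the SVD of the snapshot matrix, and a matrix mapping quadratic products of the reduced coordinates to the linear residual is fit by least squares. The synthetic quadratic curve and all names are illustrative assumptions; the alternating-minimization variant mentioned in the abstract is not shown.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic snapshots on a quadratic curve in R^50: columns x = a*v1 + a^2*v2.
D, n, r = 50, 200, 1
v1, v2 = np.linalg.qr(rng.standard_normal((D, 2)))[0].T   # orthonormal directions
a = rng.uniform(-1.0, 1.0, n)
X = np.outer(v1, a) + np.outer(v2, a ** 2)                # (D, n) snapshot matrix

# Linear reduction: rank-r POD basis from the SVD of the snapshots.
V = np.linalg.svd(X, full_matrices=False)[0][:, :r]
Zc = V.T @ X                                              # (r, n) reduced coordinates

# Quadratic enrichment: fit W mapping products z_i * z_j to the linear residual.
Q = np.einsum("in,jn->ijn", Zc, Zc).reshape(r * r, n)     # quadratic terms per snapshot
R = X - V @ Zc                                            # residual left by linear POD
W = np.linalg.lstsq(Q.T, R.T, rcond=None)[0].T            # (D, r*r) regression fit

lin_err = np.linalg.norm(R)                 # linear (POD-only) representation error
quad_err = np.linalg.norm(X - V @ Zc - W @ Q)   # error of the quadratic manifold
```

Because the snapshots genuinely lie on a quadratic manifold, the enriched representation captures most of the residual that the linear basis leaves behind.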