Abstract In this paper, we address the problem of estimating transport surplus (a.k.a. matching affinity) in high-dimensional optimal transport problems. Classical optimal transport theory specifies the matching affinity and determines the optimal joint distribution. In contrast, we study the inverse problem of estimating matching affinity based on the observation of the joint distribution, using an entropic regularization of the problem. To accommodate high dimensionality of the data, we propose a novel method that incorporates a nuclear norm regularization that effectively enforces a rank constraint on the affinity matrix. The low-rank matrix estimated in this way reveals the main factors that are relevant for matching.
more »
« less
Statistical Optimal Transport via Factored Couplings
We propose a new method to estimate Wasserstein distances and optimal transport plans between two probability distributions from samples in high dimension. Unlike plug-in rules that simply replace the true distributions by their empirical counterparts, our method promotes couplings with low transport rank, a new structural assumption that is similar to the nonnegative rank of a matrix. Regularizing based on this assumption leads to drastic improvements on high-dimensional data for various tasks, including domain adaptation in single-cell RNA sequencing data. These findings are supported by a theoretical analysis that indicates that the transport rank is key in overcoming the curse of dimensionality inherent to data-driven optimal transport.
more »
« less
- PAR ID:
- 10100611
- Date Published:
- Journal Name:
- Proceedings of Machine Learning Research
- Volume:
- 89
- Page Range / eLocation ID:
- 2454--2465
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Krause, Andreas; Brunskill, Emma; Cho, Kyunghyun; Engelhardt, Barbara; Sabato, Sivan; Scarlett, Jonathan (Ed.)We consider the problem of estimating the optimal transport map between two probability distributions, P and Q in R^d, on the basis of i.i.d. samples. All existing statistical analyses of this problem require the assumption that the transport map is Lipschitz, a strong requirement that, in particular, excludes any examples where the transport map is discontinuous. As a first step towards developing estimation procedures for discontinuous maps, we consider the important special case where the data distribution Q is a discrete measure supported on a finite number of points in R^d. We study a computationally efficient estimator initially proposed by Pooladian & Niles-Weed (2021), based on entropic optimal transport, and show in the semi-discrete setting that it converges at the minimax-optimal rate n^{−1/2}, independent of dimension. Other standard map estimation techniques both lack finite-sample guarantees in this setting and provably suffer from the curse of dimensionality. We confirm these results in numerical experiments, and provide experiments for other settings, not covered by our theory, which indicate that the entropic estimator is a promising methodology for other discontinuous transport map estimation problems.more » « less
-
We consider the problem of estimating the optimal transport map between two probability distributions, P and Q in Rd, on the basis of i.i.d. samples. All existing statistical analyses of this problem require the assumption that the transport map is Lipschitz, a strong requirement that, in particular, excludes any examples where the transport map is discontinuous. As a first step towards developing estimation procedures for discontinuous maps, we consider the important special case where the data distribution Q is a discrete measure supported on a finite number of points in Rd. We study a computationally efficient estimator initially proposed by Pooladian and Niles-Weed (2021), based on entropic optimal transport, and show in the semi-discrete setting that it converges at the minimax-optimal rate n−1/2, independent of dimension. Other standard map estimation techniques both lack finite-sample guarantees in this setting and provably suffer from the curse of dimensionality. We confirm these results in numerical experiments, and provide experiments for other settings, not covered by our theory, which indicate that the entropic estimator is a promising methodology for other discontinuous transport map estimation problems.more » « less
-
We tackle the problem of recovering a complex signal $$\vx\in\mathbb{C}^n$$ from quadratic measurements of the form $$y_i=\vx^*\vA_i\vx$$, where $$\{\vA_i\}_{i=1}^m$$ is a set of complex iid standard Gaussian matrices. This non-convex problem is related to the well understood phase retrieval problem where $$\vA_i$$ is a rank-1 positive semidefinite matrix. Here we study a general full-rank case which models a number of key applications such as molecular geometry recovery from distance distributions and compound measurements in phaseless diffractive imaging. Most prior work either addresses the rank-1 case or focuses on real measurements. The several papers that address the full-rank complex case adopt the semidefinite relaxation approach and are thus computationally demanding. In this paper we propose a method based on the standard framework comprising a spectral initialization followed by iterative gradient descent updates. We prove that when the number of measurements exceeds the signal's length by some constant factor, a globally optimal solution can be recovered from complex quadratic measurements with high probability. Numerical experiments on simulated data corroborate our theoretical analysis.more » « less
-
Estimating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of n items from a data universe equipped with a total order, the task is to compute a sketch (data structure) of size polylogarithmic in n . Given the sketch and a query item y , one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to y . Most works to date focused on additive ε n error approximation, culminating in the KLL sketch that achieved optimal asymptotic behavior. This paper investigates multiplicative (1 ± ε)-error approximations to the rank. Practical motivation for multiplicative error stems from demands to understand the tails of distributions, and hence for sketches to be more accurate near extreme values. The most space-efficient algorithms due to prior work store either O (log (ε 2 n )/ε 2 ) or O (log 3 (ε n )/ε) universe items. We present a randomized sketch storing O (log 1.5 (ε n )/ε) items that can (1 ± ε)-approximate the rank of each universe item with high constant probability; this space bound is within an \(O(\sqrt {\log (\varepsilon n)}) \) factor of optimal. Our algorithm does not require prior knowledge of the stream length and is fully mergeable, rendering it suitable for parallel and distributed computing environments.more » « less
An official website of the United States government

