Maximum mean discrepancy (MMD), also called energy distance or N-distance in statistics, and the Hilbert–Schmidt independence criterion (HSIC), known as distance covariance in statistics, are among the most popular and successful approaches to quantifying the difference between and the independence of random variables, respectively. Thanks to their kernel-based foundations, MMD and HSIC are applicable to a wide variety of domains. Despite their tremendous success, little is known about when HSIC characterizes independence and when MMD with a tensor product kernel can discriminate probability distributions. In this paper, we answer these questions by studying various notions of the characteristic property of the tensor product kernel.
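As a rough, hypothetical illustration (not taken from the paper above), the following pure-Python sketch estimates the squared MMD between two one-dimensional samples; the Gaussian kernel and its bandwidth are assumptions made for the example.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel on the real line."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    n, m = len(xs), len(ys)
    kxx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / n**2
    kyy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / m**2
    kxy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy

rng = random.Random(0)
same = [rng.gauss(0, 1) for _ in range(200)]
shifted = [rng.gauss(3, 1) for _ in range(200)]
print(mmd_squared(same, same))     # identical samples: exactly 0
print(mmd_squared(same, shifted))  # separated distributions: clearly positive
```

The statistic vanishes only on identical samples here; whether it separates all pairs of distributions is exactly the characteristic-kernel question the paper studies.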
Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples
The Gaussian-smoothed optimal transport (GOT) framework, recently proposed by Goldfeld et al., scales to high dimensions in estimation and provides an alternative to entropy regularization. This paper provides convergence guarantees for estimating the GOT distance under more general settings. For the Gaussian-smoothed $$p$$-Wasserstein distance in $$d$$ dimensions, our results require only the existence of a moment greater than $$d + 2p$$. For the special case of sub-gamma distributions, we quantify the dependence on the dimension $$d$$ and establish a phase transition with respect to the scale parameter. We also prove convergence for dependent samples, only requiring a condition on the pairwise dependence of the samples measured by the covariance of the feature map of a kernel space. A key step in our analysis is to show that the GOT distance is dominated by a family of kernel maximum mean discrepancy (MMD) distances with a kernel that depends on the cost function as well as the amount of Gaussian smoothing. This insight provides further interpretability for the GOT framework and also introduces a class of kernel MMD distances with desirable properties. The theoretical results are supported by numerical experiments.
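As a minimal one-dimensional sketch (not the authors' code), Gaussian smoothing can be emulated by adding independent Gaussian noise to the samples before computing an empirical Wasserstein distance; on the line, the 1-Wasserstein distance between equal-size samples reduces to the mean absolute difference of order statistics.

```python
import random

def smoothed_w1(xs, ys, sigma, rng):
    """1-D sketch of the Gaussian-smoothed 1-Wasserstein distance: convolve
    each empirical measure with N(0, sigma^2) by perturbing the samples, then
    use the order-statistics formula for W1 on the real line."""
    assert len(xs) == len(ys)
    xs_s = sorted(x + rng.gauss(0.0, sigma) for x in xs)
    ys_s = sorted(y + rng.gauss(0.0, sigma) for y in ys)
    return sum(abs(a - b) for a, b in zip(xs_s, ys_s)) / len(xs_s)

rng = random.Random(1)
p = [rng.gauss(0, 1) for _ in range(1000)]
q = [rng.gauss(2, 1) for _ in range(1000)]
print(smoothed_w1(p, q, sigma=0.5, rng=rng))  # roughly 2, the mean shift
```

Smoothing both measures with the same Gaussian leaves the distance between these two shifted Gaussians essentially unchanged, while (as the paper shows in general dimension) it dramatically improves the sample complexity of estimation.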
- Award ID(s):
- 1750362
- PAR ID:
- 10318450
- Date Published:
- Journal Name:
- International Conference on Artificial Intelligence and Statistics
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract: The Gromov–Wasserstein distance—a generalization of the usual Wasserstein distance—permits comparing probability measures defined on possibly different metric spaces. Recently, this notion of distance has found several applications in Data Science and in Machine Learning. With the goal of aiding both the interpretability of dissimilarity measures computed through the Gromov–Wasserstein distance and the assessment of the approximation quality of computational techniques designed to estimate the Gromov–Wasserstein distance, we determine the precise value of a certain variant of the Gromov–Wasserstein distance between unit spheres of different dimensions. Indeed, we consider a two-parameter family $$\{d_{\text{GW}p,q}\}_{p,q=1}^{\infty}$$ of Gromov–Wasserstein distances between metric measure spaces. By exploiting a suitable interaction between specific values of the parameters $$p$$ and $$q$$ and the metric of the underlying spaces, we are able to determine the exact value of the distance $$d_{\text{GW}4,2}$$ between all pairs of unit spheres of different dimensions endowed with their Euclidean distance and their uniform measure.
-
Abstract: The paper introduces a new kernel-based Maximum Mean Discrepancy (MMD) statistic for measuring the distance between two distributions given finitely many multivariate samples. When the distributions are locally low-dimensional, the proposed test can be made more powerful at distinguishing certain alternatives by incorporating local covariance matrices and constructing an anisotropic kernel. The kernel matrix is asymmetric; it computes the affinity between $$n$$ data points and a set of $$n_R$$ reference points, where $$n_R$$ can be drastically smaller than $$n$$. While the proposed statistic can be viewed as a special class of Reproducing Kernel Hilbert Space MMD, the consistency of the test is proved, under mild assumptions on the kernel, as long as $$\|p-q\| \sqrt{n} \to \infty $$, and a finite-sample lower bound of the testing power is obtained. Applications to flow cytometry and diffusion MRI datasets are demonstrated, which motivate the proposed approach to compare distributions.
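The reference-point idea can be sketched as follows. This hypothetical pure-Python example (not the authors' code) uses a plain scalar Gaussian affinity rather than an anisotropic kernel, and contrasts the mean affinities of the two samples at a small set of reference points.

```python
import math
import random

def affinity(x, r, sigma=1.0):
    """Affinity between a data point x and a reference point r; the full
    matrix of these values is n-by-n_R, hence asymmetric."""
    return math.exp(-((x - r) ** 2) / (2.0 * sigma ** 2))

def reference_mmd_stat(xs, ys, refs, sigma=1.0):
    """Crude stand-in for the reference-point MMD statistic: squared
    difference of mean affinities at each of the n_R reference points."""
    stat = 0.0
    for r in refs:
        mx = sum(affinity(x, r, sigma) for x in xs) / len(xs)
        my = sum(affinity(y, r, sigma) for y in ys) / len(ys)
        stat += (mx - my) ** 2
    return stat / len(refs)

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(300)]
ys = [rng.gauss(1, 1) for _ in range(300)]
refs = [-2, -1, 0, 1, 2]                 # n_R = 5, far fewer than n = 300
print(reference_mmd_stat(xs, xs, refs))  # 0 for identical samples
print(reference_mmd_stat(xs, ys, refs))  # positive for shifted samples
```

The cost of the statistic scales with $$n \cdot n_R$$ rather than $$n^2$$, which is the practical point of choosing $$n_R \ll n$$.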
-
Abstract: The seminal result of Benamou and Brenier provides a characterization of the Wasserstein distance as the path of the minimal action in the space of probability measures, where paths are solutions of the continuity equation and the action is the kinetic energy. Here we consider a fundamental modification of the framework where the paths are solutions of nonlocal (jump) continuity equations and the action is a nonlocal kinetic energy. The resulting nonlocal Wasserstein distances are relevant to fractional diffusions and Wasserstein distances on graphs. We characterize the basic properties of the distance and obtain sharp conditions on the (jump) kernel specifying the nonlocal transport that determine whether the topology metrized is the weak or the strong topology. A key result of the paper is the quantitative comparison between the nonlocal and local Wasserstein distances.
-
We focus on the task of learning a single index model with respect to the isotropic Gaussian distribution in $$d$$ dimensions. Prior work has shown that the sample complexity of learning the hidden direction is governed by the information exponent $$k$$ of the link function. Ben Arous et al. showed that $$d^k$$ samples suffice for learning and that this is tight for online SGD. However, the CSQ lower bound for gradient-based methods only shows that $$d^{k/2}$$ samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns the hidden direction with the correct number of samples. We also draw connections to statistical analyses of tensor PCA and to the implicit regularization effects of minibatch SGD on empirical losses.
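As a toy illustration of the setup (not the paper's smoothed-loss algorithm), the sketch below runs online SGD on the sphere for a single index model with the quadratic link $$g(z) = z^2$$ (information exponent $$k = 2$$); the dimension, step size, and correlation objective are assumptions made for the example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(w):
    n = math.sqrt(dot(w, w))
    return [c / n for c in w]

def online_sgd_alignment(d=10, steps=20000, lr=0.01, seed=0):
    """Online SGD for y = g(<w*, x>) with g(z) = z^2: ascend the
    correlation E[y * <w, x>^2] and renormalize to the unit sphere."""
    rng = random.Random(seed)
    wstar = [1.0] + [0.0] * (d - 1)          # hidden direction
    w = normalize([rng.gauss(0, 1) for _ in range(d)])
    for _ in range(steps):
        x = [rng.gauss(0, 1) for _ in range(d)]
        y = dot(wstar, x) ** 2               # label from the hidden direction
        z = dot(w, x)
        w = normalize([c + lr * 2.0 * y * z * xi for c, xi in zip(w, x)])
    return abs(dot(w, wstar))                # |alignment| with +/- w*

print(online_sgd_alignment())  # alignment approaches 1 as w recovers w*
```

Since $$g$$ is even, only $$\pm w^*$$ is identifiable, hence the absolute alignment; the sample-complexity question studied in the paper concerns how many such single-pass steps are needed as $$d$$ and $$k$$ grow.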