skip to main content

Title: The Phylogenetic Kantorovich–Rubinstein Metric for Environmental Sequence Samples

It is now common to survey microbial communities by sequencing nucleic acid material extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, which gives a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that, if we equate a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich–Rubinstein, or earth mover’s, distance between the corresponding empirical distributions. We demonstrate that this Kantorovich–Rubinstein distance and extensions incorporating uncertainty in the sample locations can be written as a readily computable integral over the tree, we develop Lp Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis ‘no difference between two communities’ can be approximated by using a Gaussian process functional. We relate the L2-case to an analysis-of-variance type of decomposition, finding that the distribution of its associated Gaussian functional is that of a computable linear combination of independent X12 random variables.

more » « less
Author(s) / Creator(s):
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Page Range / eLocation ID:
p. 569-592
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    We consider the problem of estimating the Wasserstein distance between the empirical measure and a set of probability measures whose expectations over a class of functions (hypothesis class) are constrained. If this class is sufficiently rich to characterize a particular distribution (e.g., all Lipschitz functions), then our formulation recovers the Wasserstein distance to such a distribution. We establish a strong duality result that generalizes the celebrated Kantorovich-Rubinstein duality. We also show that our formulation can be used to beat the curse of dimensionality, which is well known to affect the rates of statistical convergence of the empirical Wasserstein distance. In particular, examples of infinite-dimensional hypothesis classes are presented, informed by a complex correlation structure, for which it is shown that the empirical Wasserstein distance to such classes converges to zero at the standard parametric rate. Our formulation provides insights that help clarify why, despite the curse of dimensionality, the Wasserstein distance enjoys favorable empirical performance across a wide range of statistical applications. 
    more » « less
  2. Fog computing has been advocated as an enabling technology for computationally intensive services in connected smart vehicles. Most existing works focus on analyzing and opti- mizing the queueing and workload processing latencies, ignoring the fact that the access latency between vehicles and fog/cloud servers can sometimes dominate the end-to-end service latency. This motivates the work in this paper, where we report a five- month urban measurement study of the wireless access latency between a connected vehicle and a fog computing system sup- ported by commercially available multi-operator LTE networks. We propose AdaptiveFog, a novel framework for autonomous and dynamic switching between different LTE operators that implement fog/cloud infrastructure. The main objective here is to maximize the service confidence level, defined as the probability that the tolerable latency threshold for each supported type of service can be guaranteed. AdaptiveFog has been implemented on a smart phone app, running on a moving vehicle. The app periodically measures the round-trip time between the vehicle and fog/cloud servers. An empirical spatial statistic model is established to characterize the spatial variation of the latency across the main driving routes of the city. To quantify the perfor- mance difference between different LTE networks, we introduce the weighted Kantorovich-Rubinstein (K-R) distance. An optimal policy is derived for the vehicle to dynamically switch between LTE operators’ networks while driving. Extensive analysis and simulation are performed based on our latency measurement dataset. Our results show that AdaptiveFog achieves around 30% and 50% improvement in the confidence level of fog and cloud latency, respectively. 
    more » « less
  3. null (Ed.)
    Sensitivity properties describe how changes to the input of a program affect the output, typically by upper bounding the distance between the outputs of two runs by a monotone function of the distance between the corresponding inputs. When programs are probabilistic, the distance between outputs is a distance between distributions. The Kantorovich lifting provides a general way of defining a distance between distributions by lifting the distance of the underlying sample space; by choosing an appropriate distance on the base space, one can recover other usual probabilistic distances, such as the Total Variation distance. We develop a relational pre-expectation calculus to upper bound the Kantorovich distance between two executions of a probabilistic program. We illustrate our methods by proving algorithmic stability of a machine learning algorithm, convergence of a reinforcement learning algorithm, and fast mixing for card shuffling algorithms. We also consider some extensions: using our calculus to show convergence of Markov chains to the uniform distribution over states and an asynchronous extension to reason about pairs of program executions with different control flow. 
    more » « less
  4. Summary

    We introduce an L2-type test for testing mutual independence and banded dependence structure for high dimensional data. The test is constructed on the basis of the pairwise distance covariance and it accounts for the non-linear and non-monotone dependences among the data, which cannot be fully captured by the existing tests based on either Pearson correlation or rank correlation. Our test can be conveniently implemented in practice as the limiting null distribution of the test statistic is shown to be standard normal. It exhibits excellent finite sample performance in our simulation studies even when the sample size is small albeit the dimension is high and is shown to identify non-linear dependence in empirical data analysis successfully. On the theory side, asymptotic normality of our test statistic is shown under quite mild moment assumptions and with little restriction on the growth rate of the dimension as a function of sample size. As a demonstration of good power properties for our distance-covariance-based test, we further show that an infeasible version of our test statistic has the rate optimality in the class of Gaussian distributions with equal correlation.

    more » « less
  5. Abstract

    We present theoretical expectations for infall toward supercluster-scale cosmological filaments, motivated by the Arecibo Pisces–Perseus Supercluster Survey (APPSS) to map the velocity field around the Pisces–Perseus Supercluster (PPS) filament. We use a minimum spanning tree applied to dark matter halos the size of galaxy clusters to identify 236 large filaments within the Millennium simulation. Stacking the filaments along their principal axes, we determine a well-defined, sharp-peaked velocity profile function that can be expressed in terms of the maximum infall rateVmaxand the distanceρmaxbetween the location of maximum infall and the principal axis of the filament. This simple, two-parameter functional form is surprisingly universal across a wide range of linear mass densities.Vmaxis positively correlated with the halo mass per length along the filament, andρmaxis negatively correlated with the degree to which the halos are concentrated along the principal axis. We also assess an alternative, single-parameter method usingV25, the infall rate at a distance of 25 Mpc from the axis of the filament. Filaments similar to the PPS haveVmax=612±116 km s−1,ρmax=8.9±2.1Mpc, andV25= 329 ± 68 km s−1. We create mock observations to model uncertainties associated with viewing angle, lack of three-dimensional velocity information, limited sample size, and distance uncertainties. Our results suggest that it would be especially useful to measure infall for a larger sample of filaments to test our predictions for the shape of the infall profile and the relationships among infall rates and filament properties.

    more » « less