It is now common to survey microbial communities by sequencing nucleic acid material extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, which gives a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that, if we equate a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich–Rubinstein, or earth mover’s, distance between the corresponding empirical distributions. We demonstrate that this Kantorovich–Rubinstein distance and extensions incorporating uncertainty in the sample locations can be written as a readily computable integral over the tree, we develop L^p Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis ‘no difference between two communities’ can be approximated by using a Gaussian process functional. We relate the L^2 case to an analysis-of-variance type of decomposition, finding that the distribution of its associated Gaussian functional is that of a computable linear combination of independent χ₁² random variables.
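The "readily computable integral over the tree" can be illustrated with a minimal sketch. It assumes that each sample places point masses at nodes of a rooted tree, so the integral reduces to a sum over edges of the edge length times the absolute difference in mass below that edge. The function name `kr_distance` and the dictionary-based tree encoding are hypothetical choices for this illustration, not the authors' implementation.

```python
from collections import defaultdict


def kr_distance(parent, length, p, q):
    """Earth mover's (Kantorovich-Rubinstein) distance on a rooted tree.

    parent[v] : parent of node v (None for the root)
    length[v] : length of the edge joining v to parent[v]
    p, q      : dicts mapping node -> mass, each summing to 1
    (This encoding is an assumption made for the sketch.)
    """
    # Build child lists and locate the root.
    children = defaultdict(list)
    root = None
    for v, u in parent.items():
        if u is None:
            root = v
        else:
            children[u].append(v)

    # Root-first traversal order: every node appears before its descendants.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])

    # Signed mass difference at each node, accumulated from the leaves upward.
    diff = {v: p.get(v, 0.0) - q.get(v, 0.0) for v in parent}
    total = 0.0
    for v in reversed(order):
        u = parent[v]
        if u is not None:
            # Mass that must cross this edge, times the edge length.
            total += length[v] * abs(diff[v])
            diff[u] += diff[v]
    return total


# Toy tree: root r with children a and z; a has children x and y.
parent = {"r": None, "a": "r", "z": "r", "x": "a", "y": "a"}
length = {"a": 1.0, "z": 4.0, "x": 2.0, "y": 3.0}
d_xy = kr_distance(parent, length, {"x": 1.0}, {"y": 1.0})
d_mixed = kr_distance(parent, length, {"x": 0.5, "z": 0.5}, {"y": 1.0})
```

In the toy tree, moving all mass from `x` to `y` costs the path length 2 + 3, and splitting the first sample between `x` and `z` adds the cost of carrying half the mass across the root.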
NSF-PAR ID:
 10401253
 Publisher / Repository:
 Oxford University Press
 Date Published:
 Journal Name:
 Journal of the Royal Statistical Society Series B: Statistical Methodology
 Volume:
 74
 Issue:
 3
 ISSN:
 1369-7412
 Page Range / eLocation ID:
 p. 569-592
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

We consider the problem of estimating the Wasserstein distance between the empirical measure and a set of probability measures whose expectations over a class of functions (hypothesis class) are constrained. If this class is sufficiently rich to characterize a particular distribution (e.g., all Lipschitz functions), then our formulation recovers the Wasserstein distance to such a distribution. We establish a strong duality result that generalizes the celebrated Kantorovich-Rubinstein duality. We also show that our formulation can be used to beat the curse of dimensionality, which is well known to affect the rates of statistical convergence of the empirical Wasserstein distance. In particular, examples of infinite-dimensional hypothesis classes are presented, informed by a complex correlation structure, for which it is shown that the empirical Wasserstein distance to such classes converges to zero at the standard parametric rate. Our formulation provides insights that help clarify why, despite the curse of dimensionality, the Wasserstein distance enjoys favorable empirical performance across a wide range of statistical applications.
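For reference, the classical Kantorovich-Rubinstein duality mentioned in this abstract can be stated as follows. The notation ($\mu$, $\nu$, $\Pi(\mu,\nu)$, $d$) is an assumption chosen for this illustration and is not taken from the paper; the generalized duality result itself is not reproduced here.

```latex
% W_1 between probability measures \mu and \nu on a metric space (X, d);
% \Pi(\mu,\nu) denotes the set of couplings with marginals \mu and \nu.
W_1(\mu,\nu)
  \;=\; \inf_{\pi \in \Pi(\mu,\nu)} \int_{X \times X} d(x,y)\,\mathrm{d}\pi(x,y)
  \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}
        \left( \int_X f\,\mathrm{d}\mu \;-\; \int_X f\,\mathrm{d}\nu \right)
```

The supremum runs over all 1-Lipschitz functions, which is the "sufficiently rich" class the abstract cites as recovering the ordinary Wasserstein distance.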

Fog computing has been advocated as an enabling technology for computationally intensive services in connected smart vehicles. Most existing works focus on analyzing and optimizing the queueing and workload processing latencies, ignoring the fact that the access latency between vehicles and fog/cloud servers can sometimes dominate the end-to-end service latency. This motivates the work in this paper, where we report a five-month urban measurement study of the wireless access latency between a connected vehicle and a fog computing system supported by commercially available multi-operator LTE networks. We propose AdaptiveFog, a novel framework for autonomous and dynamic switching between different LTE operators that implement fog/cloud infrastructure. The main objective here is to maximize the service confidence level, defined as the probability that the tolerable latency threshold for each supported type of service can be guaranteed. AdaptiveFog has been implemented as a smartphone app running on a moving vehicle. The app periodically measures the round-trip time between the vehicle and fog/cloud servers. An empirical spatial statistic model is established to characterize the spatial variation of the latency across the main driving routes of the city. To quantify the performance difference between different LTE networks, we introduce the weighted Kantorovich-Rubinstein (KR) distance. An optimal policy is derived for the vehicle to dynamically switch between LTE operators’ networks while driving. Extensive analysis and simulation are performed based on our latency measurement dataset. Our results show that AdaptiveFog achieves around 30% and 50% improvement in the confidence level of fog and cloud latency, respectively.

Sensitivity properties describe how changes to the input of a program affect the output, typically by upper bounding the distance between the outputs of two runs by a monotone function of the distance between the corresponding inputs. When programs are probabilistic, the distance between outputs is a distance between distributions. The Kantorovich lifting provides a general way of defining a distance between distributions by lifting the distance of the underlying sample space; by choosing an appropriate distance on the base space, one can recover other usual probabilistic distances, such as the Total Variation distance. We develop a relational pre-expectation calculus to upper bound the Kantorovich distance between two executions of a probabilistic program. We illustrate our methods by proving algorithmic stability of a machine learning algorithm, convergence of a reinforcement learning algorithm, and fast mixing for card shuffling algorithms. We also consider some extensions: using our calculus to show convergence of Markov chains to the uniform distribution over states and an asynchronous extension to reason about pairs of program executions with different control flow.

We introduce an L2-type test for testing mutual independence and banded dependence structure for high dimensional data. The test is constructed on the basis of the pairwise distance covariance and it accounts for the nonlinear and nonmonotone dependences among the data, which cannot be fully captured by the existing tests based on either Pearson correlation or rank correlation. Our test can be conveniently implemented in practice as the limiting null distribution of the test statistic is shown to be standard normal. It exhibits excellent finite sample performance in our simulation studies even when the sample size is small albeit the dimension is high and is shown to identify nonlinear dependence in empirical data analysis successfully. On the theory side, asymptotic normality of our test statistic is shown under quite mild moment assumptions and with little restriction on the growth rate of the dimension as a function of sample size. As a demonstration of good power properties for our distance-covariance-based test, we further show that an infeasible version of our test statistic has the rate optimality in the class of Gaussian distributions with equal correlation.

We present theoretical expectations for infall toward supercluster-scale cosmological filaments, motivated by the Arecibo Pisces–Perseus Supercluster Survey (APPSS) to map the velocity field around the Pisces–Perseus Supercluster (PPS) filament. We use a minimum spanning tree applied to dark matter halos the size of galaxy clusters to identify 236 large filaments within the Millennium simulation. Stacking the filaments along their principal axes, we determine a well-defined, sharp-peaked velocity profile function that can be expressed in terms of the maximum infall rate $V_{\mathrm{max}}$ and the distance $\rho_{\mathrm{max}}$ between the location of maximum infall and the principal axis of the filament. This simple, two-parameter functional form is surprisingly universal across a wide range of linear mass densities. $V_{\mathrm{max}}$ is positively correlated with the halo mass per length along the filament, and $\rho_{\mathrm{max}}$ is negatively correlated with the degree to which the halos are concentrated along the principal axis. We also assess an alternative, single-parameter method using $V_{25}$, the infall rate at a distance of 25 Mpc from the axis of the filament. Filaments similar to the PPS have $V_{\mathrm{max}} = 612 \pm 116$ km s$^{-1}$, $\rho_{\mathrm{max}} = 8.9 \pm 2.1$ Mpc, and $V_{25} = 329 \pm 68$ km s$^{-1}$. We create mock observations to model uncertainties associated with viewing angle, lack of three-dimensional velocity information, limited sample size, and distance uncertainties. Our results suggest that it would be especially useful to measure infall for a larger sample of filaments to test our predictions for the shape of the infall profile and the relationships among infall rates and filament properties.