Title: Projection‐based techniques for high‐dimensional optimal transport problems
Abstract: Optimal transport (OT) methods seek a transformation map (or plan) between two probability measures such that the transformation has the minimum transportation cost. This minimum transport cost, after a suitable power transform, is called the Wasserstein distance. Recently, OT methods have drawn great attention in statistics, machine learning, and computer science, especially in deep generative neural networks. Despite these broad applications, estimating high‐dimensional Wasserstein distances is a well‐known challenge owing to the curse of dimensionality. A number of cutting‐edge projection‐based techniques tackle high‐dimensional OT problems. Three major families of such techniques are introduced: the slicing approach, the iterative projection approach, and the projection robust OT approach. Open challenges are discussed at the end of the review.

This article is categorized under:
Statistical and Graphical Methods of Data Analysis > Dimension Reduction
Statistical Learning and Exploratory Methods of the Data Sciences > Manifold Learning
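The slicing approach reduces a d-dimensional OT problem to many one-dimensional ones, where the optimal transport cost between equal-sized samples has a closed form (sort both projected samples and match order statistics). A minimal numpy sketch of a Monte Carlo sliced 1-Wasserstein estimate; the function name and defaults are ours, not from the article:

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=200, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance between
    two equal-sized d-dimensional samples X and Y (each n x d).

    Each random direction reduces the d-dimensional problem to a 1-D one,
    where the optimal transport cost is the mean absolute difference of
    the sorted projections."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Draw directions uniformly on the unit sphere.
    thetas = rng.normal(size=(n_projections, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    # Project onto each direction, sort, and average the 1-D costs.
    px = np.sort(X @ thetas.T, axis=0)   # shape (n, n_projections)
    py = np.sort(Y @ thetas.T, axis=0)
    return np.mean(np.abs(px - py))
```

Averaging over random directions is what sidesteps the curse of dimensionality: each one-dimensional subproblem costs only a sort, at the price of a Monte Carlo error that shrinks with the number of projections.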
Award ID(s): 1903226, 1925066, 2124493
PAR ID: 10443887
Author(s) / Creator(s):  ;  ;  ;
Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name: WIREs Computational Statistics
Volume: 15
Issue: 2
ISSN: 1939-5108
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
1. Abstract: Quantifying the structure and dynamics of species interactions in ecological communities is fundamental to studying ecology and evolution. While there are numerous approaches to analysing ecological networks, there is not yet an approach that can (1) quantify dissimilarity in the global structure of ecological networks that range from identical species and interaction composition to zero shared species or interactions and (2) map species between such networks while incorporating additional ecological information, such as species traits or abundances.

To address these challenges, we introduce the use of optimal transport distances to quantify ecological network dissimilarity and functionally equivalent species between networks. Specifically, we describe the Gromov–Wasserstein (GW) and Fused Gromov–Wasserstein (FGW) distances. We apply these optimal transport methods to synthetic and empirical data, using mammal food webs throughout sub‐Saharan Africa for illustration. We showcase the application of GW and FGW distances to identify the most functionally similar species between food webs, incorporate additional trait information into network comparisons and quantify food web dissimilarity among geographic regions.

Our results demonstrate that GW and FGW distances can effectively differentiate ecological networks based on their topological structure while identifying functionally equivalent species, even when networks have different species. The FGW distance further improves node mapping for basal species by incorporating node‐level traits. We show that these methods allow for a more nuanced understanding of the topological similarities in food web networks among geographic regions compared to an alternative measure of network dissimilarity based on species identities.

Optimal transport distances offer a new approach for quantifying functional equivalence between networks and a measure of network dissimilarity suitable for a broader range of uses than existing approaches. OT methods can be harnessed to analyse ecological networks at large spatial scales and compare networks among ecosystems, realms or taxa. Optimal transport‐based distances, therefore, provide a powerful tool for analysing ecological networks with great potential to advance our understanding of ecological community structure and dynamics in a changing world.
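As a rough illustration of the Gromov–Wasserstein idea (a sketch of ours, not the authors' code): the GW discrepancy scores a coupling T between two networks by comparing intra-network distances, summing (C1[i,k] - C2[j,l])² T[i,j] T[k,l] over all pairs of matched nodes. The snippet below only evaluates that objective for a given coupling, using the standard expansion of the squared difference to avoid the four-index sum; solving for the optimal coupling is a harder, non-convex problem:

```python
import numpy as np

def gw_objective(C1, C2, T):
    """Gromov-Wasserstein discrepancy of a coupling T (n x m) between two
    networks with intra-network cost matrices C1 (n x n) and C2 (m x m):

        sum_{i,j,k,l} (C1[i,k] - C2[j,l])**2 * T[i,j] * T[k,l]

    Expanding the square gives three cheap matrix expressions."""
    p = T.sum(axis=1)                    # marginal on network 1
    q = T.sum(axis=0)                    # marginal on network 2
    term1 = p @ (C1**2) @ p              # E[C1^2] under p x p
    term2 = q @ (C2**2) @ q              # E[C2^2] under q x q
    cross = np.sum(C1 * (T @ C2 @ T.T))  # E[C1*C2] under T x T
    return term1 + term2 - 2.0 * cross
```

A coupling that matches a network to itself node-for-node scores zero, which is what makes GW a useful notion of dissimilarity between networks with no shared species.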
2. Abstract: Proliferation of high‐resolution imaging data in recent years has led to substantial improvements in the two popular approaches for analyzing shapes of data objects based on landmarks and/or continuous curves. We provide an expository account of elastic shape analysis of parametric planar curves representing shapes of two‐dimensional (2D) objects, discussing its differences from, and commonalities with, the landmark‐based approach. Particular attention is accorded to the role of reparameterization of a curve, which, in addition to rotation, scaling and translation, represents an important shape‐preserving transformation of a curve. The transition to the curve‐based approach moves the mathematical setting of shape analysis from finite‐dimensional non‐Euclidean spaces to infinite‐dimensional ones. We discuss some of the challenges associated with the infinite‐dimensionality of the shape space, and illustrate the use of geometry‐based methods in the computation of intrinsic statistical summaries and in the definition of statistical models on a 2D imaging dataset consisting of mouse vertebrae. We conclude with an overview of the current state of the art in the field.

This article is categorized under:
Data: Types and Structure > Image and Spatial Data
Applications of Computational Statistics > Computational Mathematics
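The shape-preserving transformations mentioned above (translation, scaling, rotation) can be removed from landmark configurations by ordinary Procrustes alignment. The sketch below is a simplification of ours: it handles 2D landmarks only, ignores reparameterization entirely (the curve-based ingredient), and omits the usual reflection correction:

```python
import numpy as np

def procrustes_align(X, Y):
    """Align landmark configuration Y (k x 2) to X (k x 2) by removing
    translation, scale, and rotation (ordinary Procrustes analysis).
    Returns the aligned copy of Y and the residual Procrustes distance."""
    # Remove translation: center both configurations at the origin.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Remove scale: normalize both to unit centroid size.
    Xc = Xc / np.linalg.norm(Xc)
    Yc = Yc / np.linalg.norm(Yc)
    # Optimal rotation from the SVD of the cross-covariance matrix:
    # maximizing tr(Q^T Yc^T Xc) over orthogonal Q gives Q = U V^T.
    U, _, Vt = np.linalg.svd(Yc.T @ Xc)
    Q = U @ Vt
    Y_aligned = Yc @ Q
    return Y_aligned, np.linalg.norm(Xc - Y_aligned)
```

The residual distance is a finite-dimensional shape metric; the elastic approach discussed in the abstract additionally quotients out reparameterization, which is what pushes the analysis into infinite-dimensional spaces.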
3. Abstract: Since the very first detection of gravitational waves from the coalescence of two black holes in 2015, Bayesian statistical methods have been routinely applied by LIGO and Virgo to extract the signal out of noisy interferometric measurements, obtain point estimates of the physical parameters responsible for producing the signal, and rigorously quantify their uncertainties. Different computational techniques have been devised depending on the source of the gravitational radiation and the gravitational waveform model used. Prominent sources of gravitational waves are binary black hole or neutron star mergers, the only objects that have been observed by detectors to date. But also gravitational waves from core‐collapse supernovae, rapidly rotating neutron stars, and the stochastic gravitational‐wave background are in the sensitivity band of the ground‐based interferometers and expected to be observable in future observation runs. As nonlinearities of the complex waveforms and the high‐dimensional parameter spaces preclude analytic evaluation of the posterior distribution, posterior inference for all these sources relies on computer‐intensive simulation techniques such as Markov chain Monte Carlo methods. A review of state‐of‐the‐art Bayesian statistical parameter estimation methods will be given for researchers in this cross‐disciplinary area of gravitational wave data analysis.

This article is categorized under:
Applications of Computational Statistics > Signal and Image Processing and Coding
Statistical and Graphical Methods of Data Analysis > Markov Chain Monte Carlo (MCMC)
Statistical Models > Time Series Models
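As a toy illustration of the simulation-based posterior inference described here (emphatically not a real gravitational-wave analysis; the signal model, noise level, and tuning constants are all invented for the sketch), a random-walk Metropolis sampler for the amplitude of a known sinusoidal template buried in Gaussian noise might look like:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "signal in noise" problem: d(t) = A * sin(2*pi*f*t) + noise,
# with the frequency f and the noise level assumed known.
t = np.linspace(0.0, 1.0, 256)
f_true, A_true, sigma = 10.0, 1.5, 0.5
template = np.sin(2 * np.pi * f_true * t)
data = A_true * template + rng.normal(0.0, sigma, t.size)

def log_posterior(A):
    """Gaussian log-likelihood plus an (improper) flat prior on A."""
    resid = data - A * template
    return -0.5 * np.sum(resid**2) / sigma**2

# Random-walk Metropolis: propose a Gaussian step, accept with
# probability min(1, posterior ratio).
n_steps, step = 20_000, 0.1
chain = np.empty(n_steps)
A, logp = 0.0, log_posterior(0.0)
for i in range(n_steps):
    A_prop = A + step * rng.normal()
    logp_prop = log_posterior(A_prop)
    if np.log(rng.random()) < logp_prop - logp:   # accept/reject
        A, logp = A_prop, logp_prop
    chain[i] = A

posterior_mean = chain[5000:].mean()   # discard burn-in
```

Real analyses replace the one-dimensional amplitude with a high-dimensional waveform parameter space and an expensive waveform model inside the likelihood, which is exactly why the computational techniques reviewed in the article matter.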
4. Abstract: The rapid development of modeling techniques has brought many opportunities for data‐driven discovery and prediction. However, this also leads to the challenge of selecting the most appropriate model for any particular data task. Information criteria, such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC), have been developed as a general class of model selection methods with profound connections with foundational thoughts in statistics and information theory. Many perspectives and theoretical justifications have been developed to understand when and how to use information criteria, which often depend on particular data circumstances. This review article will revisit information criteria by summarizing their key concepts, evaluation metrics, fundamental properties, interconnections, recent advancements, and common misconceptions to enrich the understanding of model selection in general.

This article is categorized under:
Data: Types and Structure > Traditional Statistical Data
Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods
Statistical and Graphical Methods of Data Analysis > Information Theoretic Methods
Statistical Models > Model Selection
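For a Gaussian linear model the two criteria differ only in how they penalize the parameter count: AIC charges 2 per parameter, BIC charges log n, so BIC penalizes complexity more heavily once n exceeds about 7. A small numpy sketch using the profiled Gaussian log-likelihood (the function name and its convention of counting the error variance as a parameter are ours):

```python
import numpy as np

def gaussian_aic_bic(y, X):
    """AIC and BIC for an ordinary least-squares fit y ~ X (n x k).

    Uses the profiled Gaussian log-likelihood
        logL = -n/2 * (log(2*pi*sigma2_hat) + 1),
    and counts the k regression coefficients plus the error variance."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)           # MLE of the variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    n_params = k + 1                                # coefficients + variance
    aic = 2 * n_params - 2 * loglik                 # penalty 2 per parameter
    bic = n_params * np.log(n) - 2 * loglik         # penalty log(n) per parameter
    return aic, bic
```

Comparing the returned values across candidate designs X (and picking the smallest) is the basic selection recipe; which criterion is "right" depends on the circumstances the article reviews, e.g. whether the goal is prediction accuracy or consistent recovery of the true model.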
5. Abstract: Structured population models are among the most widely used tools in ecology and evolution. Integral projection models (IPMs) use continuous representations of how survival, reproduction and growth change as functions of state variables such as size, requiring fewer parameters to be estimated than projection matrix models (PPMs). Yet, almost all published IPMs make an important assumption that size‐dependent growth transitions are, or can be transformed to be, normally distributed. In fact, many organisms exhibit highly skewed size transitions. Small individuals can grow more than they can shrink, and large individuals may often shrink more dramatically than they can grow. However, the implications of such skew for inference from IPMs have not been explored, nor have general methods been developed to incorporate skewed size transitions into IPMs, or to deal with other aspects of real growth rates, including bounds on possible growth or shrinkage.

Here, we develop a flexible approach to modelling skewed growth data using a modified beta regression model. We propose that sizes first be converted to a (0,1) interval by estimating size‐dependent minimum and maximum sizes through quantile regression. Transformed data can then be modelled using beta regression with widely available statistical tools. We demonstrate the utility of this approach using demographic data for a long‐lived plant, gorgonians and an epiphytic lichen. Specifically, we compare inferences of population parameters from discrete PPMs to those from IPMs that either assume normality or incorporate skew using beta regression or, alternatively, a skewed normal model.

The beta and skewed normal distributions accurately capture the mean, variance and skew of real growth distributions. Incorporating skewed growth into IPMs decreases population growth and estimated life span relative to IPMs that assume normally distributed growth, and more closely approximates the parameters of PPMs that do not assume a particular growth distribution. A bounded distribution, such as the beta, also avoids the eviction problem caused by predicting some growth outside the modelled size range.

Incorporating biologically relevant skew in growth data has important consequences for inference from IPMs. The approaches we outline here are flexible and easy to implement with existing statistical tools.
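The transformation step described above can be sketched as follows. Two simplifications of ours: constant empirical quantiles stand in for the size-dependent quantile-regression bounds, and a plain beta distribution is fitted rather than a full beta regression with covariates:

```python
import numpy as np
from scipy import stats

def fit_bounded_growth(sizes_next, q=0.01):
    """Squeeze observed next-census sizes into (0,1) using empirical
    lower/upper quantiles as stand-ins for quantile-regression bounds
    (a simplifying assumption of this sketch), then fit a beta
    distribution, which can capture skew that a normal kernel cannot."""
    lo = np.quantile(sizes_next, q)
    hi = np.quantile(sizes_next, 1 - q)
    u = (sizes_next - lo) / (hi - lo)
    u = np.clip(u, 1e-4, 1 - 1e-4)       # keep strictly inside (0,1)
    # Fit beta shape parameters with location/scale fixed at (0, 1).
    a, b, _, _ = stats.beta.fit(u, floc=0, fscale=1)
    return u, a, b
```

Because the fitted beta assigns no mass outside (0,1), back-transforming its support to the estimated size bounds keeps all predicted growth inside the modelled size range, which is how the bounded distribution avoids the eviction problem mentioned above.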