Title: Hierarchical Optimal Transport for Multimodal Distribution Alignment
In many machine learning applications, it is necessary to meaningfully aggregate, through alignment, different but related datasets. Optimal transport (OT)-based approaches pose alignment as a divergence minimization problem: the aim is to transform a source dataset to match a target dataset, using the Wasserstein distance as the divergence measure. We introduce a hierarchical formulation of OT that leverages clustered structure in data to improve alignment in noisy, ambiguous, or multimodal settings. To solve it numerically, we propose a distributed ADMM algorithm that also exploits the Sinkhorn distance, so that its computational complexity scales quadratically with the size of the largest cluster rather than the full dataset. When the transformation between the two datasets is unitary, we provide performance guarantees that describe when and how well cluster correspondences can be recovered with our formulation, and we characterize the worst-case dataset geometry for such a strategy. We apply this method to synthetic datasets that model data as mixtures of low-rank Gaussians and study the impact that different geometric properties of the data have on alignment. We then apply our approach to a neural decoding application in which the goal is to predict movement directions and instantaneous velocities from populations of neurons in the macaque primary motor cortex. Our results demonstrate that when clustered structure exists in datasets, and is consistent across trials or time points, a hierarchical alignment strategy that leverages such structure can provide significant improvements in cross-domain alignment.
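The abstract does not spell out the algorithm, so the following is only a minimal NumPy sketch of the two-level idea: a Sinkhorn coupling between cluster centroids, followed by independent Sinkhorn subproblems between strongly coupled cluster pairs. It is not the paper's distributed ADMM consensus scheme, and the function names and the 0.5 matching threshold are hypothetical.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iters=200):
    # Entropy-regularized OT: returns a coupling P with marginals a and b.
    K = np.exp(-(C - C.min()) / eps)   # shift costs for numerical stability
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def hierarchical_align(X_clusters, Y_clusters, eps=0.05):
    # Level 1: couple cluster centroids, weighted by cluster mass.
    cx = np.stack([c.mean(axis=0) for c in X_clusters])
    cy = np.stack([c.mean(axis=0) for c in Y_clusters])
    wx = np.array([len(c) for c in X_clusters], float); wx /= wx.sum()
    wy = np.array([len(c) for c in Y_clusters], float); wy /= wy.sum()
    C_hi = ((cx[:, None, :] - cy[None, :, :]) ** 2).sum(-1)
    P_hi = sinkhorn(wx, wy, C_hi, eps)

    # Level 2: solve small OT problems only between strongly coupled
    # cluster pairs, so each subproblem is at most (largest cluster)^2.
    plans = {}
    for i, j in zip(*np.nonzero(P_hi > 0.5 * P_hi.max())):
        Xi, Yj = X_clusters[i], Y_clusters[j]
        C_lo = ((Xi[:, None, :] - Yj[None, :, :]) ** 2).sum(-1)
        a = np.full(len(Xi), 1.0 / len(Xi))
        b = np.full(len(Yj), 1.0 / len(Yj))
        plans[(i, j)] = sinkhorn(a, b, C_lo, eps)
    return P_hi, plans
```

Because each low-level problem involves a single pair of clusters, the dominant cost in this sketch is quadratic in the size of the largest cluster, matching the scaling stated above.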
Award ID(s):
1755871
NSF-PAR ID:
10208018
Date Published:
Journal Name:
Advances in neural information processing systems
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Gromov-Wasserstein (GW) formalism can be seen as a generalization of the optimal transport (OT) formalism for comparing two distributions associated with different metric spaces. It is a quadratic optimization problem, and solving it usually incurs computational costs that rise sharply once the problem size exceeds a few hundred points. Recently, fast techniques based on entropy regularization have been developed to solve an approximation of the GW problem quickly, but there are issues with the numerical convergence of those regularized approximations to the true GW solution. To circumvent those issues, we introduce a novel strategy for solving the discrete GW problem using methods taken from statistical physics. We build a temperature-dependent free energy function that reflects the GW problem’s constraints. To account for possible differences of scale between the two metric spaces, we introduce a scaling factor s in the definition of the energy. From the extremum of the free energy, we derive a mapping between the two probability measures being compared, as well as a distance between those measures. This distance equals the GW distance when the temperature goes to zero. The optimal scaling factor itself is obtained by minimizing the free energy with respect to s. We illustrate our approach on the problem of comparing shapes defined by unstructured triangulations of their surfaces, using several synthetic and “real life” datasets. We demonstrate the accuracy and automaticity of our approach in non-rigid registration of shapes. We provide numerical evidence that there is a strong correlation between the GW distances computed from low-resolution, surface-based representations of proteins and the analogous distances computed from atomistic models of the same proteins.
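The free-energy construction itself is not given in the abstract; the sketch below instead shows the closely related annealed entropic GW iteration (linearize the quadratic energy at the current coupling, project with Sinkhorn, lower the temperature), which likewise approaches the squared-loss GW solution as the temperature goes to zero. The scaling factor s, all names, and parameter defaults are illustrative, and distance matrices are assumed symmetric.

```python
import numpy as np

def annealed_gw(D1, D2, a, b, s=1.0, T0=1.0, T_min=1e-3, decay=0.9):
    # Linearize the quadratic GW energy at the current coupling, project
    # with a Sinkhorn step at temperature T, then anneal T toward zero.
    P = np.outer(a, b)
    T = T0
    while T > T_min:
        C = -2.0 * (s * D1) @ P @ D2       # cross-term gradient (up to a constant)
        K = np.exp(-(C - C.min()) / T)
        u = np.ones_like(a)
        for _ in range(100):               # Sinkhorn projection onto couplings
            v = b / (K.T @ u)
            u = a / (K @ v)
        P = u[:, None] * K * v[None, :]
        T *= decay
    return P

def gw_energy(D1, D2, P, a, b, s=1.0):
    # Squared-loss GW energy for symmetric distance matrices, expanded
    # using the marginal constraints of P.
    return (a @ ((s * D1) ** 2) @ a + b @ (D2 ** 2) @ b
            - 2.0 * np.sum(((s * D1) @ P @ D2) * P))
```

In this sketch the optimal scale could be approximated by evaluating gw_energy over a grid of candidate s values, echoing the minimization of the free energy with respect to s described above.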
  2. Network alignment is a critical stepping stone for a variety of multi-network mining tasks. Most existing methods essentially optimize a Frobenius-like distance or a ranking-based loss, ignoring the underlying geometry of graph data. Optimal transport (OT), together with Wasserstein distance, has emerged to be a powerful approach accounting for the underlying geometry explicitly. Promising as it might be, the state-of-the-art OT-based alignment methods suffer from two fundamental limitations, including (1) effectiveness due to the insufficient use of topology and consistency information and (2) scalability due to the non-convex formulation and repeated computationally costly loss calculation. In this paper, we propose a position-aware regularized optimal transport framework for network alignment named PARROT. To tackle the effectiveness issue, the proposed PARROT captures topology information by random walk with restart, with three carefully designed consistency regularization terms. To tackle the scalability issue, the regularized OT problem is decomposed into a series of convex subproblems and can be efficiently solved by the proposed constrained proximal point method with guaranteed convergence. Extensive experiments show that our algorithm achieves significant improvements in both effectiveness and scalability, outperforming the state-of-the-art network alignment methods and speeding up existing OT-based methods by up to 100 times.
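As a rough illustration of the position-encoding step (random walk with restart), not of PARROT's consistency regularizers or its constrained proximal point solver, a minimal NumPy sketch might look like this; the function name and parameter defaults are assumptions.

```python
import numpy as np

def rwr_descriptors(A, restart=0.15, n_iters=50):
    # Random walk with restart from every node: column j of R converges to
    # the stationary visiting distribution of a walker that restarts at
    # node j, giving a position-aware descriptor for that node.
    n = A.shape[0]
    W = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)  # column-stochastic
    R = np.eye(n)
    for _ in range(n_iters):
        R = (1.0 - restart) * (W @ R) + restart * np.eye(n)
    return R
```

Comparing such descriptors across the two networks is one natural way to build the position-aware OT cost the abstract refers to.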
  3. Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. Clustering is a classical topic in data mining and learning, and a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single “hotspot” of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed up the proposed approach. Experiment results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.
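Significant DBSCAN+ itself is multi-scale and comes with acceleration techniques not reproduced here; the sketch below only illustrates the underlying idea of statistical rigor, filtering scikit-learn DBSCAN clusters through a Monte Carlo test against complete spatial randomness. All thresholds and parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def significant_dbscan(X, eps=0.3, min_samples=10, n_null=99, alpha=0.05):
    # Keep only DBSCAN clusters larger than the (1 - alpha) quantile of the
    # largest cluster size found on spatially random data of the same extent.
    rng = np.random.default_rng(0)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    lo, hi = X.min(axis=0), X.max(axis=0)
    null_max = []
    for _ in range(n_null):
        Z = rng.uniform(lo, hi, size=X.shape)   # complete spatial randomness
        l = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Z)
        null_max.append(np.bincount(l[l >= 0]).max() if (l >= 0).any() else 0)
    thresh = np.quantile(null_max, 1.0 - alpha)
    keep = [c for c in set(labels) - {-1} if np.sum(labels == c) > thresh]
    return np.where(np.isin(labels, keep), labels, -1)  # -1 = not significant
```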
  4. Abstract

    Phylogenomics seeks to use next‐generation data to robustly infer an organism's evolutionary history. Yet, the practical caveats of phylogenomics motivate investigation of improved efficiency, particularly when the quality of phylogenies is questionable. To achieve improvements, one goal is to maintain or enhance the quality of phylogenetic inference while severely reducing dataset size. We approach this by assessing which kinds of loci in phylogenomic alignments provide the majority of support for a phylogenetic inference of cockroaches in Blaberoidea. We examine locus substitution rate, saturation, evolutionary divergence, rate heterogeneity, stabilizing selection, and a priori information content as traits that may determine optimality. Our controlled experimental design is based on 265 loci for 102 blaberoidean taxa and 22 outgroup species. Loci with high substitution rate, low saturation, low sequence distance, low rate heterogeneity, and strong stabilizing selection derive more support for phylogenetic relationships. We found that some phylogenetic information content estimators may not be meaningful for assessing information content a priori. We use these findings to design concatenated datasets with an optimized subsample of 100 loci. The tree inferred from the optimized subsample alignment was largely identical to that inferred from all 265 loci but with less evidence of long branch attraction, improved statistical support, and potential 4-6x improvements to computation time. Supported by phylogenetic and morphological evidence, we erect three newly named clades (Anallactinae Evangelista & Wipfler subfam. nov., Orkrasomeria tax. nov. Evangelista, Wipfler, & Béthoux, and Hemithyrsocerini Evangelista tribe nov.) and propose other taxonomic modifications. The diagnosis of Pseudophyllodromiidae Grandcolas, 1996 is modified to accommodate Anallactinae and Pseudophyllodromiinae Vickery & Kevan, 1983. The diagnosis of Ectobiidae Brunner von Wattenwyl, 1865 is modified to add novel morphological characters.
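The study's optimality criteria are richer than any single score, but as a loose sketch of how the reported trait preferences could be turned into a 100-locus subsample (the column names and z-score weighting are hypothetical):

```python
import pandas as pd

def select_loci(traits: pd.DataFrame, n_keep: int = 100) -> pd.Index:
    # traits: one row per locus, numeric columns for each measured trait.
    # Z-score the traits, then reward the properties the study associates
    # with more phylogenetic support and penalize those associated with less.
    z = (traits - traits.mean()) / traits.std(ddof=0)
    score = (z["subst_rate"] + z["stabilizing_selection"]
             - z["saturation"] - z["seq_distance"] - z["rate_heterogeneity"])
    return score.nlargest(n_keep).index
```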

     
  5. Abstract

    The fossilized birth–death (FBD) process provides an ideal model for inferring phylogenies from both extant and fossil taxa. Using this approach, fossils are directly integrated into the tree, leading to a statistically coherent prior on divergence times. Since fossils are typically not associated with molecular sequences, additional information is required to place them in the tree. We use simulations to evaluate two different approaches to handling fossil placement in FBD analyses: using topological constraints, where the user specifies monophyletic clades based on established taxonomy, or using total‐evidence analyses, which use a morphological data matrix in addition to the molecular alignment. We also explore how rate variation in fossil recovery or diversification rates impacts these approaches. We find that the extant topology is well recovered under all methods of fossil placement. Divergence times are similarly well recovered across all methods, with the exception of constraints that contain errors. We see similar patterns in datasets that include rate variation; however, relative errors in extant divergence times increase when more variation is included in the dataset for all approaches using topological constraints, and particularly for constraints with errors. Finally, we show that trees recovered under the FBD model are more accurate than those estimated using non‐time‐calibrated inference. Overall, we show that both fossil placement approaches are reliable even when including uncertainty. Our results underscore the importance of core taxonomic research, including morphological data collection and species descriptions, irrespective of the approach to handling phylogenetic uncertainty using the FBD process.

     