skip to main content

This content will become publicly available on September 1, 2024

Title: Sparsification of large ultrametric matrices: insights into the microbial Tree of Life
Ultrametric matrices appear in many domains of mathematics and science; nevertheless, they can be large and dense, making them difficult to store and manipulate, unlike large but sparse matrices. In this manuscript, we exploit that ultrametric matrices can be represented as binary trees to sparsify them via an orthonormal base change based on Haar-like wavelets. We show that, with overwhelmingly high probability, only an asymptotically negligible fraction of the off-diagonal entries in random but large ultrametric matrices remain non-zero after the base change; and develop an algorithm to sparsify such matrices directly from their tree representation. We also identify the subclass of matrices diagonalized by the Haar-like wavelets and supply a sufficient condition to approximate the spectrum of ultrametric matrices outside this subclass. Our methods give computational access to a covariance matrix model of the microbiologists’ Tree of Life, which was previously inaccessible due to its size, and motivate introducing a new wavelet-based (beta-diversity) metric to compare microbial environments. Unlike the established metrics, the new metric may be used to identify internal nodes (i.e. splits) in the Tree that link microbial composition and environmental factors in a statistically significant manner.  more » « less
Award ID(s):
Author(s) / Creator(s):
Publisher / Repository:
The Royal Society Publishing
Date Published:
Journal Name:
Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences
Subject(s) / Keyword(s):
double principal coordinate analysis Haar-like wavelets sparsification phylogenetic covariance matrix ultrametric UniFrac
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Felsenstein's classical model for Gaussian distributions on a phylogenetic tree is shown to be a toric variety in the space of concentration matrices. We present an exact semialgebraic characterization of this model, and we demonstrate how the toric structure leads to exact methods for maximum likelihood estimation. Our results also give new insights into the geometry of ultrametric matrices. 
    more » « less
  2. Abstract

    We exhibit the necessary range for which functions in the Sobolev spaces can be represented as an unconditional sum of orthonormal spline wavelet systems, such as the Battle–Lemarié wavelets. We also consider the natural extensions to Triebel–Lizorkin spaces. This builds upon, and is a generalization of, previous work of Seeger and Ullrich, where analogous results were established for the Haar wavelet system.

    more » « less
  3. Abstract Microbiome data from sequencing experiments contain the relative abundance of a large number of microbial taxa with their evolutionary relationships represented by a phylogenetic tree. The compositional and high-dimensional nature of the microbiome mediator challenges the validity of standard mediation analyses. We propose a phylogeny-based mediation analysis method called PhyloMed to address this challenge. Unlike existing methods that directly identify individual mediating taxa, PhyloMed discovers mediation signals by analyzing subcompositions defined on the phylogenic tree. PhyloMed produces well-calibrated mediation test p -values and yields substantially higher discovery power than existing methods. 
    more » « less
  4. null (Ed.)
    Random projections or sketching are widely used in many algorithmic and learning contexts. Here we study the performance of iterative Hessian sketch for leastsquares problems. By leveraging and extending recent results from random matrix theory on the limiting spectrum of matrices randomly projected with the subsampled randomized Hadamard transform, and truncated Haar matrices, we can study and compare the resulting algorithms to a level of precision that has not been possible before. Our technical contributions include a novel formula for the second moment of the inverse of projected matrices. We also find simple closed-form expressions for asymptotically optimal step-sizes and convergence rates. These show that the convergence rate for Haar and randomized Hadamard matrices are identical, and asymptotically improve upon Gaussian random projections. These techniques may be applied to other algorithms that employ randomized dimension reduction. 
    more » « less
  5. We consider the problem of clustering with the longest-leg path distance (LLPD) metric, which is informative for elongated and irregularly shaped clusters. We prove finite-sample guarantees on the performance of clustering with respect to this metric when random samples are drawn from multiple intrinsically low-dimensional clusters in high-dimensional space, in the presence of a large number of high-dimensional outliers. By combining these results with spectral clustering with respect to LLPD, we provide conditions under which the Laplacian eigengap statistic correctly determines the number of clusters for a large class of data sets, and prove guarantees on the labeling accuracy of the proposed algorithm. Our methods are quite general and provide performance guarantees for spectral clustering with any ultrametric. We also introduce an efficient, easy to implement approximation algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which allows for the runtime of LLPD spectral clustering to be quasilinear in the number of data points. 
    more » « less