

Title: Tree-aggregated predictive modeling of microbiome data
Abstract Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
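The idea of aggregating leaf-level counts along a taxonomic tree can be illustrated with a minimal sketch. This is not the authors' implementation: the toy taxonomy, the names `descendant_matrix`, `parent`, and `counts`, and the pseudocount of 1 are all assumptions made for illustration. The sketch builds a leaf-by-node indicator matrix and uses it to sum leaf values over every subtree, giving one candidate feature per tree node from which a sparse regression could then choose.

```python
import numpy as np

def descendant_matrix(parent, leaves):
    """Return (A, nodes) with A[i, k] = 1 if leaf i is node k or lies below it."""
    nodes = sorted(set(parent) | set(parent.values()))
    col = {n: k for k, n in enumerate(nodes)}
    A = np.zeros((len(leaves), len(nodes)))
    for i, leaf in enumerate(leaves):
        node = leaf
        while node is not None:          # walk from the leaf up to the root
            A[i, col[node]] = 1.0
            node = parent.get(node)      # the root has no parent -> None
    return A, nodes

# Hypothetical toy taxonomy (child -> parent) and a 2-sample count table.
parent = {"asv1": "genusA", "asv2": "genusA", "asv3": "genusB",
          "genusA": "root", "genusB": "root"}
leaves = ["asv1", "asv2", "asv3"]
counts = np.array([[10, 0, 5],
                   [3, 7, 0]])

A, nodes = descendant_matrix(parent, leaves)

# Counts summed over the subtree below every node (leaves, genera, root).
agg_counts = counts @ A

# Log-transformed leaf values summed over each subtree; the pseudocount of 1
# is an assumption used here only to avoid log(0) on sparse counts.
log_features = np.log(counts + 1.0) @ A
```

Each column of `agg_counts` or `log_features` corresponds to one entry of `nodes`, so fixed-rank grouping (for example, keeping only the genus columns) is a special case of choosing a subset of columns.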
Award ID(s):
1748166
NSF-PAR ID:
10378110
Journal Name:
Scientific Reports
Volume:
11
Issue:
1
ISSN:
2045-2322
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce Operational Genomic Unit (OGU), a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent from taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in one synthetic and two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome datasets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project dataset, and more accurate prediction of human age by the gut microbiomes in the Finnish population. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate OGU adoption in future metagenomics studies. Importance Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities.
However, current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution compared to 16S rRNA amplicon sequence variant analysis. To solve these challenges, we introduce Operational Genomic Units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition while (ii) permitting use of phylogeny-aware tools. Our analysis of real-world datasets shows several advantages over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGU as standard practice in metagenomic studies. 
  2. We sampled the respiratory mucus from voluntary blowhole exhalations (“blow”) of three healthy beluga whales (Delphinapterus leucas) under professional human care. Blow samples were collected from three resident belugas, one adult male (M1) and two adult females (F1, F2), with voluntary behaviors via non-invasive methods over three days in July 2021 (four days for M1). Samples were weighed and examined microscopically for the enumeration of eukaryotic and prokaryotic microbes, and then were used to evaluate carbon substrate use and taxonomic diversity of prokaryotic communities in the host respiratory system. Microscopical observations and 18S rRNA gene sequencing indicated the presence of eukaryotic microbiota, the ciliate genera Planilamina and Kyaroikeus, in all three individuals. Exposure of samples to different metabolic carbon substrates indicated significant differences in the number of carbon sources usable by the prokaryotic communities of different whales (range: 11-25 sources), as well as a significantly decreased diversity of carbon sources used by the community in the habitat water (5 sources). Sequencing of the hypervariable V4 region of the 16S rRNA gene revealed 19 amplicon sequence variants (ASVs) that were present in all whale samples. The oldest female D. leucas (F2) had the lowest overall diversity, and was significantly different from M1 and F1 in taxon composition, including an anomalously low ratio of Bacillota:Bacteroidota (0.01) compared to the other whales. In comparisons of microbial community composition, M1 had a significantly higher diversity than F1 and F2. These results suggest that attention should be given to regular microbiome sampling, and indicate a need for the pairing of microbiome and clinical data for animals in aquaria.
Overall, these data contribute to the growing database on the core respiratory microbiota in cohabiting cetaceans under professional human care, indicate the utility of non-invasive sampling, and help characterize a baseline for healthy D. leucas.
  3. 16S rRNA gene profiling (amplicon sequencing) is a popular technique for understanding host-associated and environmental microbial communities. Most protocols for sequencing amplicon libraries follow a standardized pipeline that can differ slightly depending on laboratory facility and user. Given that the same variable region of the 16S gene is targeted, it is generally accepted that sequencing outputs from differing protocols are comparable, and this assumption underlies our ability to identify universal patterns in microbial dynamics through meta-analyses. However, discrepant results from a combined 16S rRNA gene dataset prepared by two labs whose protocols differed only in DNA polymerase and sequencing platform led us to scrutinize the outputs and challenge the idea of confidently combining them for standard microbiome analysis. Using technical replicates of reef-building coral samples from two species, Montipora aequituberculata and Porites lobata, we evaluated the consistency of alpha and beta diversity metrics between data resulting from these highly similar protocols. While we found minimal variation in alpha diversity between platforms, significant differences were revealed with most beta diversity metrics, dependent on host species. These inconsistencies persisted following removal of low abundance taxa and when comparing across higher taxonomic levels, suggesting that bacterial community differences associated with sequencing protocol are likely to be context dependent and difficult to correct without extensive validation work. The results of this study encourage caution in the statistical comparison and interpretation of studies that combine rRNA gene sequence data from distinct protocols and point to a need for further work identifying mechanistic causes of these observed differences.
  4. Johnson, Karyn N. (Ed.)
    ABSTRACT Leeches are found in terrestrial, aquatic, and marine habitats on all continents. Sanguivorous leeches have been used in medicine for millennia. Modern scientific uses include studies of neurons, anticoagulants, and gut microbial symbioses. Hirudo verbana, the European medicinal leech, maintains a gut community dominated by two bacterial symbionts, Aeromonas veronii and Mucinivorans hirudinis, which sometimes account for as much as 97% of the total crop microbiota. The highly simplified gut anatomy and microbiome of H. verbana make it an excellent model organism for studying gut microbial dynamics. The North American medicinal leech, Macrobdella decora, is a hirudinid leech native to Canada and the northern United States. In this study, we show that M. decora symbiont communities are very similar to those in H. verbana. We performed an extensive study using field-caught M. decora and purchased H. verbana from two suppliers. Deep sequencing of the V4 region of the 16S rRNA gene allowed us to determine that the core microbiome of M. decora consists of Bacteroides, Aeromonas, Proteocatella, and Butyricicoccus. The analysis revealed that the compositions of the gut microbiomes of the two leech species were significantly different at all taxonomic levels. The R² value was highest at the genus and amplicon sequence variant (ASV) levels and much lower at the phylum, class, and order levels. The gut and bladder microbial communities were distinct. We propose that M. decora is an alternative to H. verbana for studies of wild-caught animals and provide evidence for the conservation of digestive-tract and bladder symbionts in annelid models. IMPORTANCE Building evidence implicates the gut microbiome in critical animal functions such as regulating digestion, nutrition, immune regulation, and development.
Simplified, phylogenetically diverse models for hypothesis testing are necessary because of the difficulty of assigning causative relationships in complex gut microbiomes. Previous research used Hirudo verbana as a tractable animal model of digestive-tract symbioses. Our data show that Macrobdella decora may work just as well without the drawback of being an endangered organism and with the added advantage of easy access to field-caught specimens. The similarity of the microbial community structures of species from two different continents reveals the highly conserved nature of the microbial symbionts in sanguivorous leeches. 
  5. Despite advances in sequencing, lack of standardization makes comparisons across studies challenging and hampers insights into the structure and function of microbial communities across multiple habitats on a planetary scale. Here we present a multi-omics analysis of a diverse set of 880 microbial community samples collected for the Earth Microbiome Project. We include amplicon (16S, 18S, ITS) and shotgun metagenomic sequence data, and untargeted metabolomics data (liquid chromatography-tandem mass spectrometry and gas chromatography mass spectrometry). We used standardized protocols and analytical methods to characterize microbial communities, focusing on relationships and co-occurrences of microbially related metabolites and microbial taxa across environments, thus allowing us to explore diversity at extraordinary scale. In addition to a reference database for metagenomic and metabolomic data, we provide a framework for incorporating additional studies, enabling the expansion of existing knowledge in the form of an evolving community resource. We demonstrate the utility of this database by testing the hypothesis that every microbe and metabolite is everywhere but the environment selects. Our results show that metabolite diversity exhibits turnover and nestedness related to both microbial communities and the environment, whereas the relative abundances of microbially related metabolites vary and co-occur with specific microbial consortia in a habitat-specific manner. We additionally show the power of certain chemistry, in particular terpenoids, in distinguishing Earth’s environments (for example, terrestrial plant surfaces and soils, freshwater and marine animal stool), as well as that of certain microbes including Conexibacter woesei (terrestrial soils), Haloquadratum walsbyi (marine deposits) and Pantoea dispersa (terrestrial plant detritus). 
This Resource provides insight into the taxa and metabolites within microbial communities from diverse habitats across Earth, informing both microbial and chemical ecology, and provides a foundation and methods for multi-omics microbiome studies of hosts and the environment. 