This content will become publicly available on June 28, 2023

Title: Optimizing UniFrac with OpenACC Yields Greater Than One Thousand Times Speed Increase
ABSTRACT UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another (beta diversity). Striped UniFrac recently added the ability to split the problem into many independent subproblems, exhibiting nearly linear scaling but suffering from memory contention. Here, we adapt UniFrac to graphics processing units using OpenACC, enabling greater than 1,000× computational improvement, and apply it to 307,237 samples, the largest 16S rRNA V4 uniformly preprocessed microbiome data set analyzed to date.
IMPORTANCE UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another. Here, we adapt UniFrac to operate on graphics processing units, enabling a 1,000× computational improvement. To highlight this advance, we perform what may be the largest microbiome analysis to date, applying UniFrac to 307,237 16S rRNA V4 microbiome samples preprocessed with Deblur. These scaling improvements turn UniFrac into a real-time tool for common data sets and unlock new research questions as more microbiome data are collected.
Editors:
Greene, Casey S.
Award ID(s):
2038509
NSF-PAR ID:
10336375
Journal Name:
mSystems
Volume:
7
Issue:
3
ISSN:
2379-5077
Sponsoring Org:
National Science Foundation
More Like this
  1. Moreno-Hagelsieb, Gabriel (Ed.)
    Advances in the analysis of amplicon sequence datasets have introduced a methodological shift in how research teams investigate microbial biodiversity, away from sequence identity-based clustering (producing Operational Taxonomic Units, OTUs) to denoising methods (producing amplicon sequence variants, ASVs). While denoising methods have several inherent properties that make them desirable compared to clustering-based methods, questions remain as to the influence that these pipelines have on the ecological patterns being assessed, especially when compared to other methodological choices made when processing data (e.g., rarefaction) and computing diversity indices. We compared the respective influences of two widely used methods, namely DADA2 (a denoising method) vs. Mothur (a clustering method) on 16S rRNA gene amplicon datasets (hypervariable region V4), and compared such effects to the rarefaction of the community table and OTU identity threshold (97% vs. 99%) on the ecological signals detected. We used a dataset comprising freshwater invertebrate (three Unionidae species) gut and environmental (sediment, seston) communities sampled in six rivers in the southeastern USA. We ranked the respective effects of each methodological choice on alpha and beta diversity, and taxonomic composition. The choice of the pipeline significantly influenced alpha and beta diversities and changed the ecological signal detected, especially on presence/absence indices such as the richness index and unweighted UniFrac. Interestingly, the discrepancy between OTU and ASV-based diversity metrics could be attenuated by the use of rarefaction. The identification of major classes and genera also revealed significant discrepancies across pipelines. Compared to the pipeline’s effect, OTU threshold and rarefaction had a minimal impact on all measurements.
  2. We introduce Operational Genomic Unit (OGU), a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent from taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in one synthetic and two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome datasets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project dataset, and more accurate prediction of human age by the gut microbiomes in the Finnish population. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate OGU adoption in future metagenomics studies.
    IMPORTANCE Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. However, current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution compared to 16S rRNA amplicon sequence variant analysis. To solve these challenges, we introduce Operational Genomic Units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition while (ii) permitting use of phylogeny-aware tools. Our analysis of real-world datasets shows several advantages over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGU as standard practice in metagenomic studies.
  3. Johnson, Karyn N. (Ed.)
    ABSTRACT Leeches are found in terrestrial, aquatic, and marine habitats on all continents. Sanguivorous leeches have been used in medicine for millennia. Modern scientific uses include studies of neurons, anticoagulants, and gut microbial symbioses. Hirudo verbana, the European medicinal leech, maintains a gut community dominated by two bacterial symbionts, Aeromonas veronii and Mucinivorans hirudinis, which sometimes account for as much as 97% of the total crop microbiota. The highly simplified gut anatomy and microbiome of H. verbana make it an excellent model organism for studying gut microbial dynamics. The North American medicinal leech, Macrobdella decora, is a hirudinid leech native to Canada and the northern United States. In this study, we show that M. decora symbiont communities are very similar to those in H. verbana. We performed an extensive study using field-caught M. decora and purchased H. verbana from two suppliers. Deep sequencing of the V4 region of the 16S rRNA gene allowed us to determine that the core microbiome of M. decora consists of Bacteroides, Aeromonas, Proteocatella, and Butyricicoccus. The analysis revealed that the compositions of the gut microbiomes of the two leech species were significantly different at all taxonomic levels. The R² value was highest at the genus and amplicon sequence variant (ASV) levels and much lower at the phylum, class, and order levels. The gut and bladder microbial communities were distinct. We propose that M. decora is an alternative to H. verbana for studies of wild-caught animals and provide evidence for the conservation of digestive-tract and bladder symbionts in annelid models.
    IMPORTANCE Building evidence implicates the gut microbiome in critical animal functions such as regulating digestion, nutrition, immune regulation, and development. Simplified, phylogenetically diverse models for hypothesis testing are necessary because of the difficulty of assigning causative relationships in complex gut microbiomes. Previous research used Hirudo verbana as a tractable animal model of digestive-tract symbioses. Our data show that Macrobdella decora may work just as well without the drawback of being an endangered organism and with the added advantage of easy access to field-caught specimens. The similarity of the microbial community structures of species from two different continents reveals the highly conserved nature of the microbial symbionts in sanguivorous leeches.
  4. Large-scale microbiome studies investigating disease-inducing microbial roles base their findings on differences between microbial count data in contrasting environments (e.g., stool samples between cases and controls). These microbiome survey studies are often impeded by small sample sizes and database bias. Combining data from multiple survey studies often results in obvious batch effects, even when DNA preparation and sequencing methods are identical. Relatedly, predictive models trained on one microbial DNA dataset often do not generalize to outside datasets. In this study, we address these limitations by applying word embedding algorithms (GloVe) and PCA transformation to ASV data from the American Gut Project and generating translation matrices that can be applied to any 16S rRNA V4 region gut microbiome sequencing study. Because these approaches contextualize microbial occurrences in a larger dataset while reducing dimensionality of the feature space, they can improve generalization of predictive models that predict host phenotype from stool-associated gut microbiota. The GMEmbeddings R package contains GloVe and PCA embedding transformation matrices at 50, 100 and 250 dimensions, each learned using ∼15,000 samples from the American Gut Project. It currently supports the alignment, matching, and matrix multiplication to allow users to transform their V4 16S rRNA data into these embedding spaces. We show how to correlate the properties in the new embedding space to KEGG functional pathways for biological interpretation of results. Lastly, we provide benchmarking on six gut microbiome datasets describing three phenotypes to demonstrate the ability of embedding-based microbiome classifiers to generalize to independent datasets. Future iterations of GMEmbeddings will include embedding transformation matrices for other biological systems. Available at: https://github.com/MaudeDavidLab/GMEmbeddings.
  5. Hird, Sarah M. (Ed.)
    The gut microbiome provides vital functions for mammalian hosts, yet research on its variability and function across adult life spans and multiple generations is limited in large mammalian carnivores. Here, we used 16S rRNA gene and metagenomic high-throughput sequencing to profile the bacterial taxonomic composition, genomic diversity, and metabolic function of fecal samples collected from 12 wild spotted hyenas (Crocuta crocuta) residing in the Masai Mara National Reserve, Kenya, over a 23-year period spanning three generations. The metagenomic data came from four of these hyenas and spanned two 2-year periods. With these data, we determined the extent to which host factors predicted variation in the gut microbiome and identified the core microbes present in the guts of hyenas. We also investigated novel genomic diversity in the mammalian gut by reporting the first metagenome-assembled genomes (MAGs) for hyenas. We found that gut microbiome taxonomic composition varied temporally, but despite this, a core set of 14 bacterial genera were identified. The strongest predictors of the microbiome were host identity and age, suggesting that hyenas possess individualized microbiomes and that these may change with age during adulthood. The gut microbiome functional profiles of the four adult hyenas were also individual specific and were associated with prey abundance, indicating that the functions of the gut microbiome vary with host diet. We recovered 149 high-quality MAGs from the hyenas’ guts; some MAGs were classified as taxa previously reported for other carnivores, but many were novel and lacked species-level matches to genomes in existing reference databases.
    IMPORTANCE There is a gap in knowledge regarding the genomic diversity and variation of the gut microbiome across a host’s life span and across multiple generations of hosts in wild mammals. Using two types of sequencing approaches, we found that although gut microbiomes were individualized and temporally variable among hyenas, they correlated similarly to large-scale changes in the ecological conditions experienced by their hosts. We also recovered 149 high-quality MAGs from the hyena gut, greatly expanding the microbial genome repertoire known for hyenas, carnivores, and wild mammals in general. Some MAGs came from genera abundant in the gastrointestinal tracts of canid species and other carnivores, but over 80% of MAGs were novel and from species not previously represented in genome databases. Collectively, our novel body of work illustrates the importance of surveying the gut microbiome of nonmodel wild hosts, using multiple sequencing methods and computational approaches and at distinct scales of analysis.