This content will become publicly available on April 26, 2023

Title: GMEmbeddings: An R Package to Apply Embedding Techniques to Microbiome Data
Large-scale microbiome studies investigating disease-inducing microbial roles base their findings on differences between microbial count data in contrasting environments (e.g., stool samples between cases and controls). These microbiome survey studies are often impeded by small sample sizes and database bias. Combining data from multiple survey studies often results in obvious batch effects, even when DNA preparation and sequencing methods are identical. Relatedly, predictive models trained on one microbial DNA dataset often do not generalize to outside datasets. In this study, we address these limitations by applying word embedding algorithms (GloVe) and PCA transformation to ASV data from the American Gut Project and generating translation matrices that can be applied to any 16S rRNA V4 region gut microbiome sequencing study. Because these approaches contextualize microbial occurrences in a larger dataset while reducing dimensionality of the feature space, they can improve generalization of predictive models that predict host phenotype from stool-associated gut microbiota. The GMEmbeddings R package contains GloVe and PCA embedding transformation matrices at 50, 100 and 250 dimensions, each learned using ∼15,000 samples from the American Gut Project. It currently supports the alignment, matching, and matrix multiplication to allow users to transform their V4 16S rRNA data into these embedding spaces. We show how to correlate the properties in the new embedding space to KEGG functional pathways for biological interpretation of results. Lastly, we provide benchmarking on six gut microbiome datasets describing three phenotypes to demonstrate the ability of embedding-based microbiome classifiers to generalize to independent datasets. Future iterations of GMEmbeddings will include embedding transformation matrices for other biological systems. Available at: https://github.com/MaudeDavidLab/GMEmbeddings.
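The matching and matrix multiplication steps described in the abstract can be illustrated with a minimal sketch. This is not the GMEmbeddings API: the function name `embed_samples`, the toy sequences, and the toy embedding values are all hypothetical, and exact string matching stands in for the sequence alignment the package performs. ASVs without a row in the embedding matrix are simply dropped before multiplying.

```python
def embed_samples(counts, asv_sequences, embedding):
    """Project a samples-by-ASV count table into an embedding space.

    counts: list of rows, one per sample, columns ordered as asv_sequences.
    asv_sequences: list of ASV sequence strings (column labels of counts).
    embedding: dict mapping a reference sequence to its embedding vector.
    Hypothetical helper for illustration only, not part of GMEmbeddings.
    """
    dims = len(next(iter(embedding.values())))
    # Matching step: keep only columns whose sequence has an embedding row.
    # (Exact string equality is an assumption standing in for alignment.)
    matched = [
        (col, embedding[seq])
        for col, seq in enumerate(asv_sequences)
        if seq in embedding
    ]
    embedded = []
    for row in counts:
        # Matrix multiplication step: (1 x ASVs) times (ASVs x dims).
        vec = [0.0] * dims
        for col, weights in matched:
            for d in range(dims):
                vec[d] += row[col] * weights[d]
        embedded.append(vec)
    return embedded


# Toy data: three ASVs, two of which appear in a 2-dimensional embedding.
toy_embedding = {"ACGT": [1.0, 0.0], "TTGA": [0.5, 2.0]}
toy_sequences = ["ACGT", "GGGG", "TTGA"]
toy_counts = [[10, 3, 4], [0, 7, 2]]

embedded = embed_samples(toy_counts, toy_sequences, toy_embedding)
print(embedded)
```

Each sample ends up with one value per embedding dimension regardless of how many ASVs it contained, which is the dimensionality reduction the abstract refers to.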
Award ID(s):
2025457
Publication Date:
NSF-PAR ID:
10386601
Journal Name:
Frontiers in Bioinformatics
Volume:
2
ISSN:
2673-7647
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce Operational Genomic Unit (OGU), a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent from taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in one synthetic and two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome datasets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project dataset, and more accurate prediction of human age by the gut microbiomes in the Finnish population. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate OGU adoption in future metagenomics studies. Importance Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. 
However, current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution compared to 16S rRNA amplicon sequence variant analysis. To solve these challenges, we introduce Operational Genomic Units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition while (ii) permitting use of phylogeny-aware tools. Our analysis of real-world datasets shows several advantages over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGU as standard practice in metagenomic studies.
  2. Johnson, Karyn N. (Ed.)
    ABSTRACT Leeches are found in terrestrial, aquatic, and marine habitats on all continents. Sanguivorous leeches have been used in medicine for millennia. Modern scientific uses include studies of neurons, anticoagulants, and gut microbial symbioses. Hirudo verbana, the European medicinal leech, maintains a gut community dominated by two bacterial symbionts, Aeromonas veronii and Mucinivorans hirudinis, which sometimes account for as much as 97% of the total crop microbiota. The highly simplified gut anatomy and microbiome of H. verbana make it an excellent model organism for studying gut microbial dynamics. The North American medicinal leech, Macrobdella decora, is a hirudinid leech native to Canada and the northern United States. In this study, we show that M. decora symbiont communities are very similar to those in H. verbana. We performed an extensive study using field-caught M. decora and purchased H. verbana from two suppliers. Deep sequencing of the V4 region of the 16S rRNA gene allowed us to determine that the core microbiome of M. decora consists of Bacteroides, Aeromonas, Proteocatella, and Butyricicoccus. The analysis revealed that the compositions of the gut microbiomes of the two leech species were significantly different at all taxonomic levels. The R² value was highest at the genus and amplicon sequence variant (ASV) levels and much lower at the phylum, class, and order levels. The gut and bladder microbial communities were distinct. We propose that M. decora is an alternative to H. verbana for studies of wild-caught animals and provide evidence for the conservation of digestive-tract and bladder symbionts in annelid models. IMPORTANCE Building evidence implicates the gut microbiome in critical animal functions such as regulating digestion, nutrition, immune regulation, and development. 
Simplified, phylogenetically diverse models for hypothesis testing are necessary because of the difficulty of assigning causative relationships in complex gut microbiomes. Previous research used Hirudo verbana as a tractable animal model of digestive-tract symbioses. Our data show that Macrobdella decora may work just as well without the drawback of being an endangered organism and with the added advantage of easy access to field-caught specimens. The similarity of the microbial community structures of species from two different continents reveals the highly conserved nature of the microbial symbionts in sanguivorous leeches.
  3. Methane seep systems along continental margins host diverse and dynamic microbial assemblages, sustained in large part through the microbially mediated process of sulfate-coupled Anaerobic Oxidation of Methane (AOM). This methanotrophic metabolism has been linked to consortia of anaerobic methane-oxidizing archaea (ANME) and sulfate-reducing bacteria (SRB). These two groups are the focus of numerous studies; however, less is known about the wide diversity of other seep associated microorganisms. We selected a hierarchical set of FISH probes targeting a range of Deltaproteobacteria diversity. Using the Magneto-FISH enrichment technique, we then magnetically captured CARD-FISH hybridized cells and their physically associated microorganisms from a methane seep sediment incubation. DNA from nested Magneto-FISH experiments was analyzed using Illumina tag 16S rRNA gene sequencing (iTag). Enrichment success and potential bias with iTag was evaluated in the context of full-length 16S rRNA gene clone libraries, CARD-FISH, functional gene clone libraries, and iTag mock communities. We determined commonly used Earth Microbiome Project (EMP) iTAG primers introduced bias in some common methane seep microbial taxa that reduced the ability to directly compare OTU relative abundances within a sample, but comparison of relative abundances between samples (in nearly all cases) and whole community-based analyses were robust. The iTag dataset was subjected to statistical co-occurrence measures of the most abundant OTUs to determine which taxa in this dataset were most correlated across all samples. Many non-canonical microbial partnerships were statistically significant in our co-occurrence network analysis, most of which were not recovered with conventional clone library sequencing, demonstrating the utility of combining Magneto-FISH and iTag sequencing methods for hypothesis generation of associations within complex microbial communities. 
Network analysis pointed to many co-occurrences containing putatively heterotrophic, candidate phyla such as OD1, Atribacteria, MBG-B, and Hyd24-12 and the potential for complex sulfur cycling involving Epsilon-, Delta-, and Gammaproteobacteria in methane seep ecosystems.
  4. 16S rRNA gene profiling (amplicon sequencing) is a popular technique for understanding host-associated and environmental microbial communities. Most protocols for sequencing amplicon libraries follow a standardized pipeline that can differ slightly depending on laboratory facility and user. Given that the same variable region of the 16S gene is targeted, it is generally accepted that sequencing outputs from differing protocols are comparable and this assumption underlies our ability to identify universal patterns in microbial dynamics through meta-analyses. However, discrepant results from a combined 16S rRNA gene dataset prepared by two labs whose protocols differed only in DNA polymerase and sequencing platform led us to scrutinize the outputs and challenge the idea of confidently combining them for standard microbiome analysis. Using technical replicates of reef-building coral samples from two species, Montipora aequituberculata and Porites lobata, we evaluated the consistency of alpha and beta diversity metrics between data resulting from these highly similar protocols. While we found minimal variation in alpha diversity between platforms, significant differences were revealed with most beta diversity metrics, dependent on host species. These inconsistencies persisted following removal of low abundance taxa and when comparing across higher taxonomic levels, suggesting that bacterial community differences associated with sequencing protocol are likely to be context dependent and difficult to correct without extensive validation work. The results of this study encourage caution in the statistical comparison and interpretation of studies that combine rRNA gene sequence data from distinct protocols and point to a need for further work identifying mechanistic causes of these observed differences.
  5. Background Insects are the most diverse group of animals which have established intricate evolutionary interactions with bacteria. However, the importance of these interactions is still poorly understood. Few studies have focused on a closely related group of insect species, to test the similarities and differences between their microbiota. Heliconius butterflies are a charismatic recent insect radiation that evolved the unique ability to use pollen as a protein source, which affected life history traits and resulted in elevated speciation rates. We hypothesize that different Heliconius butterflies sharing a similar trophic pollen niche harbor a similar gut flora within species, population and sexes. Methods To test our hypothesis, we characterized the microbiota of 38 adult male and female butterflies representing six species of Heliconius butterflies and 2 populations of the same species. We sequenced the V4 region of the 16S rRNA gene with the Roche 454 system and analyzed the data with standard tools for microbiome analysis. Results Overall, we found a low microbial diversity with only 10 OTUs dominating across all individuals, mostly Proteobacteria and Firmicutes, which accounted for 99.5% of the bacterial reads. When rare reads were considered, we identified a total of 406 OTUs across our samples. We identified reads within phylum Chlamydiae, found in 5 butterflies of four species. Interestingly, only three OTUs were shared among all 38 individuals (Bacillus, Enterococcus and Enterobacteriaceae). Altogether, the high individual variation overshadowed species and sex differences. Thus, bacterial communities were not structured randomly with 13% of beta-diversity explained by species, and 40 rare OTUs being significantly different across species. Finally, 13 OTUs, including the intercellular symbiont Spiroplasma, varied significantly in relative abundance between males and females. 
Discussion The Heliconius microbial communities in these 38 individuals show a low diversity with few differences in the rare microbes between females, males, species or populations. Indeed, Heliconius butterflies, similarly to other insects, are dominated by few OTUs, mainly from Proteobacteria and Firmicutes. The overall low microbial diversity observed contrasts with the high intra-species variation in microbiome composition. This could indicate that much of the microbiome may be acquired from their surroundings. The significant differences between species and sexes were restricted to rare taxa, which could be important for microbial community stability under changing conditions as seen in other host-microbiome systems. The presence of symbionts like Spiroplasma or Chlamydiae, identified in this study for the first time in Heliconius, could play a vital role in their behavior and evolution by vertical transmission. Altogether, our study represents a step forward into the description of the microbial diversity in a charismatic group of closely related butterflies.