Title: GMEmbeddings: An R Package to Apply Embedding Techniques to Microbiome Data
Large-scale microbiome studies investigating disease-inducing microbial roles base their findings on differences between microbial count data in contrasting environments (e.g., stool samples between cases and controls). These microbiome survey studies are often impeded by small sample sizes and database bias. Combining data from multiple survey studies often results in obvious batch effects, even when DNA preparation and sequencing methods are identical. Relatedly, predictive models trained on one microbial DNA dataset often do not generalize to outside datasets. In this study, we address these limitations by applying word embedding algorithms (GloVe) and PCA transformation to ASV data from the American Gut Project and generating translation matrices that can be applied to any 16S rRNA V4 region gut microbiome sequencing study. Because these approaches contextualize microbial occurrences in a larger dataset while reducing dimensionality of the feature space, they can improve generalization of predictive models that predict host phenotype from stool-associated gut microbiota. The GMEmbeddings R package contains GloVe and PCA embedding transformation matrices at 50, 100, and 250 dimensions, each learned using ∼15,000 samples from the American Gut Project. It currently supports the alignment, matching, and matrix multiplication steps that allow users to transform their V4 16S rRNA data into these embedding spaces. We show how to correlate the properties in the new embedding space to KEGG functional pathways for biological interpretation of results. Lastly, we provide benchmarking on six gut microbiome datasets describing three phenotypes to demonstrate the ability of embedding-based microbiome classifiers to generalize to independent datasets. Future iterations of GMEmbeddings will include embedding transformation matrices for other biological systems. Available at: https://github.com/MaudeDavidLab/GMEmbeddings
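The matching and matrix multiplication steps mentioned in the abstract can be illustrated with a minimal sketch. This is not the GMEmbeddings API: the function name, the ASV identifiers, and the toy numbers are hypothetical, and matching is done by exact identifier lookup here as a stand-in for the sequence alignment the package is described as performing.

```python
# Sketch only: project a sample-by-ASV count table into an embedding space.
# All names and values are illustrative assumptions, not the package's API.

def embed_samples(counts, sample_asvs, embedding, reference_asvs):
    """Return one embedded vector per sample.

    counts:         list of rows, one per sample, columns ordered as sample_asvs
    sample_asvs:    ASV identifiers for the columns of counts
    embedding:      list of rows, one per reference ASV, each of length d
    reference_asvs: ASV identifiers for the rows of embedding
    """
    ref_index = {asv: i for i, asv in enumerate(reference_asvs)}
    # Matching step: keep only study ASVs that have a row in the embedding matrix.
    matched = [
        (col, ref_index[asv])
        for col, asv in enumerate(sample_asvs)
        if asv in ref_index
    ]
    dims = len(embedding[0])
    embedded = []
    for row in counts:
        # Matrix multiplication step: counts (1 x ASVs) times embedding (ASVs x d).
        vec = [0.0] * dims
        for col, ref_row in matched:
            for k in range(dims):
                vec[k] += row[col] * embedding[ref_row][k]
        embedded.append(vec)
    return embedded


# Toy data: three reference ASVs in a 2-dimensional embedding space.
reference_asvs = ["asvA", "asvB", "asvC"]
embedding = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Two samples; "asvX" has no reference match and is dropped.
sample_asvs = ["asvB", "asvX", "asvA"]
counts = [[2, 7, 4], [0, 3, 1]]

embedded = embed_samples(counts, sample_asvs, embedding, reference_asvs)
print(embedded)
```

In this sketch, unmatched ASVs simply contribute nothing to the embedded vector. How the package itself handles unmatched sequences is not stated in the abstract.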
Award ID(s):
2025457
PAR ID:
10386601
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Frontiers in Bioinformatics
Volume:
2
ISSN:
2673-7647
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Staley, Christopher (Ed.)
    The microbiota-gut-brain axis is a bidirectional circuit that links the neural, endocrine, and immunological systems with gut microbial communities. The gut microbiome plays significant roles in the human mind and behavior, specifically in pain perception, learning capacity, memory, and temperament. Studies have associated disruptions in the gut microbiota with substance use disorders. The interplay of gut microbiota in substance abuse disorders has not been elucidated; however, postmortem microbiome profiles may open promising avenues for future forensic investigations. The goal of the current study was to determine gut microbiome composition in substance abuse disorder cases using transverse colon tissues from 21 drug overdose cases versus 19 non-overdose-related cases. We hypothesized that postmortem samples from cases with the same cause of death would reveal similar microbial taxonomic relationships. We compared microbial diversity profiles using amplicon-based sequencing of the 16S rRNA gene V4 hypervariable region. The results demonstrated that younger cases had significantly more operational taxonomic units than older cases. Using weighted UniFrac analysis, the influence of substances in overdose cases was found to be a significant factor in determining microbiome similarity. The results also revealed that samples of the same cause of death cluster together, showing a high degree of similarity between samples and a low degree of similarity among samples of different causes of death. In conclusion, our examination of human transverse colon microflora in decomposing remains extends emerging literature on postmortem microbial communities, which will ultimately contribute to advanced knowledge of human putrefaction.
  2. We introduce Operational Genomic Unit (OGU), a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent from taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in one synthetic and two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome datasets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project dataset, and more accurate prediction of human age by the gut microbiomes in the Finnish population. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate OGU adoption in future metagenomics studies. IMPORTANCE Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. However, current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution compared to 16S rRNA amplicon sequence variant analysis. To solve these challenges, we introduce Operational Genomic Units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition while (ii) permitting use of phylogeny-aware tools. Our analysis of real-world datasets shows several advantages over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGU as standard practice in metagenomic studies.
  3. Johnson, Karyn N. (Ed.)
    ABSTRACT Leeches are found in terrestrial, aquatic, and marine habitats on all continents. Sanguivorous leeches have been used in medicine for millennia. Modern scientific uses include studies of neurons, anticoagulants, and gut microbial symbioses. Hirudo verbana, the European medicinal leech, maintains a gut community dominated by two bacterial symbionts, Aeromonas veronii and Mucinivorans hirudinis, which sometimes account for as much as 97% of the total crop microbiota. The highly simplified gut anatomy and microbiome of H. verbana make it an excellent model organism for studying gut microbial dynamics. The North American medicinal leech, Macrobdella decora, is a hirudinid leech native to Canada and the northern United States. In this study, we show that M. decora symbiont communities are very similar to those in H. verbana. We performed an extensive study using field-caught M. decora and purchased H. verbana from two suppliers. Deep sequencing of the V4 region of the 16S rRNA gene allowed us to determine that the core microbiome of M. decora consists of Bacteroides, Aeromonas, Proteocatella, and Butyricicoccus. The analysis revealed that the compositions of the gut microbiomes of the two leech species were significantly different at all taxonomic levels. The R² value was highest at the genus and amplicon sequence variant (ASV) levels and much lower at the phylum, class, and order levels. The gut and bladder microbial communities were distinct. We propose that M. decora is an alternative to H. verbana for studies of wild-caught animals and provide evidence for the conservation of digestive-tract and bladder symbionts in annelid models. IMPORTANCE Building evidence implicates the gut microbiome in critical animal functions such as regulating digestion, nutrition, immune regulation, and development. Simplified, phylogenetically diverse models for hypothesis testing are necessary because of the difficulty of assigning causative relationships in complex gut microbiomes. Previous research used Hirudo verbana as a tractable animal model of digestive-tract symbioses. Our data show that Macrobdella decora may work just as well without the drawback of being an endangered organism and with the added advantage of easy access to field-caught specimens. The similarity of the microbial community structures of species from two different continents reveals the highly conserved nature of the microbial symbionts in sanguivorous leeches.
  4. The study of the thyroid is an emerging topic, particularly in postmortem microbiome studies, due to the organ’s ability to affect the endocrine system. The submandibular gland is also a promising, emerging subject of study due to its position relative to the oral cavity. Previous thanatomicrobiome studies have demonstrated that bacteria belonging to the phyla Firmicutes, Proteobacteria, Bacteroides, and Pseudomonadota predominate in internal organs and have been considered important biomarkers for postmortem interval. Further, Clostridium species that dominate in internal organs are linked to the hypoxic change that occurs after death, which leads to a shift toward obligate anaerobes. Therefore, obligate anaerobes dominate the body after death due to their ability to thrive off fermentation products. 16S rRNA gene sequencing has been critical in studies of the thanatomicrobiome, which refers to the human microbiome (microorganisms within the body) after death. The microorganisms associated with the decay of the submandibular and thyroid glands have not yet been elucidated. We hypothesized that through sequencing of the 16S rRNA gene of the submandibular and thyroid glands, the presence of Firmicutes and Proteobacteria will indicate potential biomarkers for postmortem interval. The present study revealed the postmortem microbial signatures of the submandibular and thyroid glands using the 16S rRNA gene, specifically the V3-V4 hypervariable regions, using universal primers 341F and 805R. We investigated a total of 37 cadavers obtained from ongoing criminal casework, 17 submandibular samples and 20 thyroid samples, and found a correlation between microbial abundances in these postmortem glands. The predominating phyla of interest found in both glands were Firmicutes and Proteobacteria. The predominating genera were Paeniclostridium and Streptococcus in both glands, respectively. Further experimentation of the submandibular and thyroid glands will help to link oral thanatomicrobiome communities to “microbial clock” determinations, thus enhancing postmortem interval estimation.
  5. An inherent issue in high-throughput rRNA gene tag sequencing microbiome surveys is that they provide compositional data in relative abundances. This often leads to spurious correlations, making the interpretation of relationships to biogeochemical rates challenging. To overcome this issue, we quantitatively estimated the abundance of microorganisms by spiking in known amounts of internal DNA standards. Using a 3-year sample set of diverse microbial communities from the Western Antarctic Peninsula, we demonstrated that the internal standard method yielded community profiles and taxon co-occurrence patterns substantially different from those derived using relative abundances. We found that the method provided results consistent with the traditional CHEMTAX analysis of pigments and total bacterial counts by flow cytometry. Using the internal standard method, we also showed that chloroplast 16S rRNA gene data in microbial surveys can be used to estimate abundances of certain eukaryotic phototrophs such as cryptophytes and diatoms. In Phaeocystis, scatter in the 16S/18S rRNA gene ratio may be explained by physiological adaptation to environmental conditions. We conclude that the internal standard method, when applied to rRNA gene microbial community profiling, is quantitative and that its application will substantially improve our understanding of microbial ecosystems.
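The internal standard idea in the last abstract can be sketched as a simple scaling. The abstract states only that known amounts of internal DNA standards were spiked in. The formula below (gene copies per read from the recovered standard, applied to every taxon and divided by sample volume) is a common way such a conversion is done and is an assumption here, as are the function name, parameters, and toy numbers.

```python
# Sketch only: convert read counts to absolute abundances using a spiked-in
# DNA standard. The scaling formula and all names are illustrative assumptions.

def absolute_abundance(taxon_reads, standard_reads, standard_copies_added, sample_volume_l):
    """Estimate gene copies per litre for each taxon.

    taxon_reads:           dict mapping taxon name to its read count
    standard_reads:        reads assigned to the internal standard
    standard_copies_added: gene copies of the standard spiked into the sample
    sample_volume_l:       sample volume in litres
    """
    if standard_reads <= 0:
        raise ValueError("no internal-standard reads recovered")
    if sample_volume_l <= 0:
        raise ValueError("sample volume must be positive")
    # How many gene copies one sequencing read represents in this sample.
    copies_per_read = standard_copies_added / standard_reads
    return {
        taxon: reads * copies_per_read / sample_volume_l
        for taxon, reads in taxon_reads.items()
    }


# Toy data: 1,000,000 standard copies added, 500 standard reads recovered.
estimates = absolute_abundance(
    taxon_reads={"taxon_1": 1500, "taxon_2": 250},
    standard_reads=500,
    standard_copies_added=1_000_000,
    sample_volume_l=2.0,
)
print(estimates)
```

Because every taxon in a sample is scaled by the same per-sample factor, relative abundances within a sample are unchanged, while comparisons between samples now reflect differences in total abundance.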