skip to main content

Title: Enabling microbiome research on personal devices
Microbiome studies have recently transitioned from experimental designs with a few hundred samples to designs spanning tens of thousands of samples. Modern studies such as the Earth Microbiome Project (EMP) afford the statistics crucial for untangling the many factors that influence microbial community composition. Analyzing those data used to require access to a compute cluster, making it both expensive and inconvenient. We show that recent improvements in both hardware and software now allow to compute key bioinformatics tasks on EMP-sized data in minutes using a gaming-class laptop, enabling much faster and broader microbiome science insights.
Authors:
; ;
Award ID(s):
2038509 1826967
Publication Date:
NSF-PAR ID:
10335833
Journal Name:
ArXivorg
ISSN:
2331-8422
Sponsoring Org:
National Science Foundation
More Like this
  1. Methane seep systems along continental margins host diverse and dynamic microbial assemblages, sustained in large part through the microbially mediated process of sulfate-coupled Anaerobic Oxidation of Methane (AOM). This methanotrophic metabolism has been linked to consortia of anaerobic methane-oxidizing archaea (ANME) and sulfate-reducing bacteria (SRB). These two groups are the focus of numerous studies; however, less is known about the wide diversity of other seep associated microorganisms. We selected a hierarchical set of FISH probes targeting a range ofDeltaproteobacteriadiversity. Using the Magneto-FISH enrichment technique, we then magnetically captured CARD-FISH hybridized cells and their physically associated microorganisms from a methane seep sediment incubation. DNA from nested Magneto-FISH experiments was analyzed using Illumina tag 16S rRNA gene sequencing (iTag). Enrichment success and potential bias with iTag was evaluated in the context of full-length 16S rRNA gene clone libraries, CARD-FISH, functional gene clone libraries, and iTag mock communities. We determined commonly used Earth Microbiome Project (EMP) iTAG primers introduced bias in some common methane seep microbial taxa that reduced the ability to directly compare OTU relative abundances within a sample, but comparison of relative abundances between samples (in nearly all cases) and whole community-based analyses were robust. The iTag dataset was subjected tomore »statistical co-occurrence measures of the most abundant OTUs to determine which taxa in this dataset were most correlated across all samples. Many non-canonical microbial partnerships were statistically significant in our co-occurrence network analysis, most of which were not recovered with conventional clone library sequencing, demonstrating the utility of combining Magneto-FISH and iTag sequencing methods for hypothesis generation of associations within complex microbial communities. Network analysis pointed to many co-occurrences containing putatively heterotrophic, candidate phyla such as OD1,Atribacteria, MBG-B, and Hyd24-12 and the potential for complex sulfur cycling involvingEpsilon-,Delta-, andGammaproteobacteriain methane seep ecosystems.

    « less
  2. Abstract

    Differential abundance analysis is controversial throughout microbiome research. Gold standard approaches require laborious measurements of total microbial load, or absolute number of microorganisms, to accurately determine taxonomic shifts. Therefore, most studies rely on relative abundance data. Here, we demonstrate common pitfalls in comparing relative abundance across samples and identify two solutions that reveal microbial changes without the need to estimate total microbial load. We define the notion of “reference frames”, which provide deep intuition about the compositional nature of microbiome data. In an oral time series experiment, reference frames alleviate false positives and produce consistent results on both raw and cell-count normalized data. Furthermore, reference frames identify consistent, differentially abundant microbes previously undetected in two independent published datasets from subjects with atopic dermatitis. These methods allow reassessment of published relative abundance data to reveal reproducible microbial changes from standard sequencing output without the need for new assays.

  3. Large-scale microbiome studies investigating disease-inducing microbial roles base their findings on differences between microbial count data in contrasting environments (e.g., stool samples between cases and controls). These microbiome survey studies are often impeded by small sample sizes and database bias. Combining data from multiple survey studies often results in obvious batch effects, even when DNA preparation and sequencing methods are identical. Relatedly, predictive models trained on one microbial DNA dataset often do not generalize to outside datasets. In this study, we address these limitations by applying word embedding algorithms (GloVe) and PCA transformation to ASV data from the American Gut Project and generating translation matrices that can be applied to any 16S rRNA V4 region gut microbiome sequencing study. Because these approaches contextualize microbial occurrences in a larger dataset while reducing dimensionality of the feature space, they can improve generalization of predictive models that predict host phenotype from stool associated gut microbiota. The GMEmbeddings R package contains GloVe and PCA embedding transformation matrices at 50, 100 and 250 dimensions, each learned using ∼15,000 samples from the American Gut Project. It currently supports the alignment, matching, and matrix multiplication to allow users to transform their V4 16S rRNA data into thesemore »embedding spaces. We show how to correlate the properties in the new embedding space to KEGG functional pathways for biological interpretation of results. Lastly, we provide benchmarking on six gut microbiome datasets describing three phenotypes to demonstrate the ability of embedding-based microbiome classifiers to generalize to independent datasets. Future iterations of GMEmbeddings will include embedding transformation matrices for other biological systems. Available at: https://github.com/MaudeDavidLab/GMEmbeddings .« less
  4. Abstract Motivation Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data are high-dimensional and not abundant; leading to a high-dimensional low-sample-size under-determined system. Moreover, microbiome data are often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data; with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from microbiome. Data augmentation consists of building synthetic samples and adding them to the training data and is a technique that has proved helpful for many machine learning tasks. Results In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples augmenting existing ones, addressing issues of low-sample-size. Inmore »generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes. Availability and implementation TADA is available at https://github.com/tada-alg/TADA. Supplementary information Supplementary data are available at Bioinformatics online.« less
  5. Coelho, Luis Pedro (Ed.)
    It is challenging to associate features such as human health outcomes, diet, environmental conditions, or other metadata to microbial community measurements, due in part to their quantitative properties. Microbiome multi-omics are typically noisy, sparse (zero-inflated), high-dimensional, extremely non-normal, and often in the form of count or compositional measurements. Here we introduce an optimized combination of novel and established methodology to assess multivariable association of microbial community features with complex metadata in population-scale observational studies. Our approach, MaAsLin 2 (Microbiome Multivariable Associations with Linear Models), uses generalized linear and mixed models to accommodate a wide variety of modern epidemiological studies, including cross-sectional and longitudinal designs, as well as a variety of data types (e.g., counts and relative abundances) with or without covariates and repeated measurements. To construct this method, we conducted a large-scale evaluation of a broad range of scenarios under which straightforward identification of meta-omics associations can be challenging. These simulation studies reveal that MaAsLin 2’s linear model preserves statistical power in the presence of repeated measures and multiple covariates, while accounting for the nuances of meta-omics features and controlling false discovery. We also applied MaAsLin 2 to a microbial multi-omics dataset from the Integrative Human Microbiome (HMP2) project which,more »in addition to reproducing established results, revealed a unique, integrated landscape of inflammatory bowel diseases (IBD) across multiple time points and omics profiles.« less