

Title: Evaluating supervised and unsupervised background noise correction in human gut microbiome data
The ability to predict human phenotypes and identify biomarkers of disease from metagenomic data is crucial for the development of therapeutics for microbiome-associated diseases. However, metagenomic data is commonly affected by technical variables unrelated to the phenotype of interest, such as sequencing protocol, which can make it difficult to predict phenotype and find biomarkers of disease. Supervised methods to correct for background noise, originally designed for gene expression and RNA-seq data, are commonly applied to microbiome data but may be limited because they cannot account for unmeasured sources of variation. Unsupervised approaches address this issue, but current methods are limited because they are ill-equipped to deal with the unique aspects of microbiome data, which is compositional, highly skewed, and sparse. We perform a comparative analysis of different denoising transformations in combination with supervised correction methods, as well as an unsupervised principal component correction approach that is presently used in other domains but has not been applied to microbiome data to date. We find that the unsupervised principal component correction approach is comparable to the supervised approaches in reducing false discovery of biomarkers, with the added benefit of not needing to know the sources of variation a priori. However, in prediction tasks, it appears to improve prediction only when technical variables contribute the majority of variance in the data. As new and larger metagenomic datasets become increasingly available, background noise correction will become essential for generating reproducible microbiome analyses.
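As a rough illustration of the unsupervised principal component correction idea described above, the following minimal sketch applies a centered log-ratio (CLR) transform to a count table and then removes the leading principal components. It is not the authors' implementation: the choice of CLR, the pseudocount, the number of components removed, and the function names `clr_transform` and `remove_top_pcs` are all assumptions made for the example.

```python
import numpy as np


def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples x taxa count matrix.

    The pseudocount (an assumption of this sketch) avoids log(0) in sparse data.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Subtract each sample's mean log abundance so every row sums to zero.
    return logged - logged.mean(axis=1, keepdims=True)


def remove_top_pcs(data, n_pcs):
    """Return column-centered data with its first n_pcs principal components removed."""
    centered = data - data.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s_kept = s.copy()
    s_kept[:n_pcs] = 0.0  # drop the directions of largest variance
    # Rebuild the matrix from the remaining components only.
    return (u * s_kept) @ vt


# Toy usage: 20 samples x 8 taxa, with a hypothetical batch shift in half the samples.
rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(20, 8)).astype(float)
counts[:10, :4] *= 3  # simulated technical effect on the first batch
clr = clr_transform(counts)
corrected = remove_top_pcs(clr, n_pcs=2)
```

Removing leading components assumes that technical variation dominates them, which matches the abstract's observation that the approach helps prediction only when technical variables account for most of the variance; if a phenotype signal loads on those components, it is removed as well.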
Award ID(s):
1705121
NSF-PAR ID:
10366177
Editor(s):
Segata, Nicola
Journal Name:
PLOS Computational Biology
Volume:
18
Issue:
2
ISSN:
1553-7358
Page Range / eLocation ID:
e1009838
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Background

    As mobile health (mHealth) studies become increasingly productive owing to the advancements in wearable and mobile sensor technology, our ability to monitor and model human behavior will be constrained by participant receptivity. Many health constructs are dependent on subjective responses, and without such responses, researchers are left with little to no ground truth to accompany our ever-growing biobehavioral data. This issue can significantly impact the quality of a study, particularly for populations known to exhibit lower compliance rates. To address this challenge, researchers have proposed innovative approaches that use machine learning (ML) and sensor data to modify the timing and delivery of surveys. However, an overarching concern is the potential introduction of biases or unintended influences on participants’ responses when implementing new survey delivery methods.

    Objective

    This study aims to demonstrate the potential impact of an ML-based ecological momentary assessment (EMA) delivery system (using receptivity as the predictor variable) on the participants’ reported emotional state. We examine the factors that affect participants’ receptivity to EMAs in a 10-day wearable and EMA–based emotional state–sensing mHealth study. We study the physiological relationships indicative of receptivity and affect while also analyzing the interaction between the 2 constructs.

    Methods

    We collected data from 45 healthy participants wearing 2 devices measuring electrodermal activity, accelerometer, electrocardiography, and skin temperature while answering 10 EMAs daily, containing questions about perceived mood. Owing to the nature of our constructs, we can only obtain ground truth measures for both affect and receptivity during responses. Therefore, we used unsupervised and supervised ML methods to infer affect when a participant did not respond. Our unsupervised method used k-means clustering to determine the relationship between physiology and receptivity and then inferred the emotional state during nonresponses. For the supervised learning method, we primarily used random forest and neural networks to predict the affect of unlabeled data points as well as receptivity.

    Results

    Our findings showed that using a receptivity model to trigger EMAs decreased the reported negative affect by >3 points or 0.29 SDs in our self-reported affect measure, scored between 13 and 91. The findings also showed a bimodal distribution of our predicted affect during nonresponses. This indicates that this system initiates EMAs more commonly during states of higher positive emotions.

    Conclusions

    Our results showed a clear relationship between affect and receptivity. This relationship can affect the efficacy of an mHealth study, particularly those that use an ML algorithm to trigger EMAs. Therefore, we propose that future work should focus on a smart trigger that promotes EMA receptivity without influencing affect during sampled time points.

     
  2. Stony coral tissue loss disease, first observed in Florida in 2014, has now spread along the entire Florida Reef Tract and on reefs in many Caribbean countries. The disease affects a variety of coral species with differential outcomes, and in many instances results in whole-colony mortality. We employed untargeted metabolomic profiling of Montastraea cavernosa corals affected by stony coral tissue loss disease to identify metabolic markers of disease. Herein, extracts from apparently healthy, diseased, and recovered Montastraea cavernosa collected at a reef site near Ft. Lauderdale, Florida were subjected to liquid-chromatography mass spectrometry-based metabolomics. Unsupervised principal component analysis reveals wide variation in metabolomic profiles of healthy corals of the same species, which differ from diseased corals. Using a combination of supervised and unsupervised data analysis tools, we describe metabolite features that explain variation between the apparently healthy corals, between diseased corals, and between the healthy and the diseased corals. By employing a culture-based approach, we assign sources of a subset of these molecules to the endosymbiotic dinoflagellates, Symbiodiniaceae. Specifically, we identify various endosymbiont-specific lipid classes, such as betaine lipids, glycolipids, and tocopherols, which differentiate samples taken from apparently healthy corals and diseased corals. Given the variation observed in metabolite fingerprints of corals, our data suggests that metabolomics is a viable approach to link metabolite profiles of different coral species with their susceptibility and resilience to numerous coral diseases spreading through reefs worldwide.
  3. Abstract

    In clinical research and practice, landmark models are commonly used to predict the risk of an adverse future event, using patients' longitudinal biomarker data as predictors. However, these data are often observable only at intermittent visits, making their measurement times irregularly spaced and unsynchronized across different subjects. This poses challenges to conducting dynamic prediction at any post‐baseline time. A simple solution is the last‐value‐carry‐forward method, but this may result in bias for the risk model estimation and prediction. Another option is to jointly model the longitudinal and survival processes with a shared random effects model. However, when dealing with multiple biomarkers, this approach often results in high‐dimensional integrals without a closed‐form solution, and thus the computational burden limits its software development and practical use. In this article, we propose to process the longitudinal data by functional principal component analysis techniques, and then use the processed information as predictors in a class of flexible linear transformation models to predict the distribution of residual time‐to‐event occurrence. The measurement schemes for multiple biomarkers are allowed to be different within subject and across subjects. Dynamic prediction can be performed in a real‐time fashion. The advantages of our proposed method are demonstrated by simulation studies. We apply our approach to the African American Study of Kidney Disease and Hypertension, predicting patients' risk of kidney failure or death by using four important longitudinal biomarkers for renal functions.

     
  4. Large-scale microbiome studies investigating disease-inducing microbial roles base their findings on differences between microbial count data in contrasting environments (e.g., stool samples between cases and controls). These microbiome survey studies are often impeded by small sample sizes and database bias. Combining data from multiple survey studies often results in obvious batch effects, even when DNA preparation and sequencing methods are identical. Relatedly, predictive models trained on one microbial DNA dataset often do not generalize to outside datasets. In this study, we address these limitations by applying word embedding algorithms (GloVe) and PCA transformation to ASV data from the American Gut Project and generating translation matrices that can be applied to any 16S rRNA V4 region gut microbiome sequencing study. Because these approaches contextualize microbial occurrences in a larger dataset while reducing dimensionality of the feature space, they can improve generalization of predictive models that predict host phenotype from stool-associated gut microbiota. The GMEmbeddings R package contains GloVe and PCA embedding transformation matrices at 50, 100 and 250 dimensions, each learned using ∼15,000 samples from the American Gut Project. It currently supports the alignment, matching, and matrix multiplication to allow users to transform their V4 16S rRNA data into these embedding spaces. We show how to correlate the properties in the new embedding space to KEGG functional pathways for biological interpretation of results. Lastly, we provide benchmarking on six gut microbiome datasets describing three phenotypes to demonstrate the ability of embedding-based microbiome classifiers to generalize to independent datasets. Future iterations of GMEmbeddings will include embedding transformation matrices for other biological systems. Available at: https://github.com/MaudeDavidLab/GMEmbeddings
  5. Latent Interacting Variable Effects (LIVE) modeling is a framework to integrate different types of microbiome multi-omics data by combining latent variables from single-omic models into a structured meta-model to determine discriminative, interacting multi-omics features driving disease status. We implemented and tested LIVE modeling in publicly available metagenomics and metabolomics datasets from Crohn’s Disease (CD) and Ulcerative Colitis (UC) patients. Here, LIVE modeling reduced the number of feature correlations from the original data set for CD and UC to tractable numbers and facilitated prioritization of biological associations between microbes, metabolites, enzymes and IBD status through the application of stringent thresholds on generated inferential statistics. We determined that LIVE modeling confirmed previously reported IBD biomarkers and uncovered potentially novel disease mechanisms in IBD. LIVE modeling makes a distinct and complementary contribution to the current methods to integrate microbiome data to predict IBD status because of its flexibility to adapt to different types of microbiome multi-omics data, scalability for large and small cohort studies via reliance on latent variables and dimensionality reduction, and the intuitive interpretability of the linear meta-model integrating -omic data types. The results of LIVE modeling and the biological relationships can be represented in networks that connect local correlation structure of single omic data types with global community and omic structure in the latent variable VIP scores. This model arises as a novel tool that allows researchers to be more selective about omic feature interaction without disrupting the structural correlation framework provided by sPLS-DA interaction effects modeling. It will help form testable hypotheses by identifying potential and unique interactions between metabolome and microbiome that must be considered for future studies.