Title: Evaluating supervised and unsupervised background noise correction in human gut microbiome data
The ability to predict human phenotypes and identify biomarkers of disease from metagenomic data is crucial for the development of therapeutics for microbiome-associated diseases. However, metagenomic data is commonly affected by technical variables unrelated to the phenotype of interest, such as sequencing protocol, which can make it difficult to predict phenotype and find biomarkers of disease. Supervised methods to correct for background noise, originally designed for gene expression and RNA-seq data, are commonly applied to microbiome data but may be limited because they cannot account for unmeasured sources of variation. Unsupervised approaches address this issue, but current methods are ill-equipped to deal with the unique aspects of microbiome data, which is compositional, highly skewed, and sparse. We compare different denoising transformations in combination with supervised correction methods, as well as an unsupervised principal component correction approach that is used in other domains but has not previously been applied to microbiome data. We find that the unsupervised principal component correction approach reduces false discovery of biomarkers about as well as the supervised approaches, with the added benefit of not needing to know the sources of variation a priori. However, in prediction tasks, it appears to improve prediction only when technical variables account for the majority of the variance in the data. As new and larger metagenomic datasets become increasingly available, background noise correction will become essential for generating reproducible microbiome analyses.
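As a rough illustration of the unsupervised principal component correction mentioned in the abstract, the sketch below projects the leading principal components out of a transformed abundance matrix. It is not the authors' implementation: the centered log-ratio transform, the pseudocount of 1, the choice to remove a single component, and the simulated counts are all assumptions made for the example, and the function names `clr` and `remove_top_pcs` are hypothetical.

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples x taxa count matrix.

    The pseudocount (an assumption here) avoids log(0) on sparse data.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

def remove_top_pcs(X, n_remove=1):
    """Project the first n_remove principal components out of X.

    X is samples x features. Returns the column-centered residual matrix.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:n_remove]
    return Xc - Xc @ top.T @ top

# Simulated data: 20 samples, 8 taxa, with a strong technical shift
# applied to half of the samples (a stand-in for a batch effect).
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(20, 8))
batch = np.repeat([0, 1], 10)
counts[batch == 1, :4] *= 10

transformed = clr(counts)
corrected = remove_top_pcs(transformed, n_remove=1)
```

How many components to remove is a tuning decision. Consistent with the abstract's caveat about prediction tasks, removing components can also discard phenotype signal when technical variables do not dominate the variance.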
Award ID(s):
1705121
PAR ID:
10366177
Editor(s):
Segata, Nicola
Journal Name:
PLOS Computational Biology
Volume:
18
Issue:
2
ISSN:
1553-7358
Page Range / eLocation ID:
e1009838
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. With the increasing use of Unmanned Aerial Vehicles (UAVs) in military and civilian applications, the security of this technology has become a critical concern. UAVs' positioning and navigation activities are highly dependent on Global Positioning Systems, which provide accurate locations for these vehicles. However, because civilian GPS signals are open and unencrypted, malicious users can target them in multiple ways, including by launching GPS spoofing attacks. To address this security issue, numerous techniques have been proposed to detect and classify these attacks, including supervised machine learning techniques. However, no studies have focused on unsupervised models to detect these attacks. In this paper, we compare the performance of several supervised models with that of unsupervised models in terms of accuracy, probability of detection, probability of misdetection, probability of false alarm, processing time, training time, prediction time, and memory size. The supervised models are Gaussian Naïve Bayes, Classification and Regression Decision Tree, Logistic Regression, Random Forest, Linear-Support Vector Machine, and Artificial Neural Network. The unsupervised models are Principal Component Analysis, K-means clustering, and Autoencoder. The results show that the Classification and Regression Decision Tree model outperforms the other supervised and unsupervised models in detecting and classifying GPS spoofing attacks.
  2. Latent Interacting Variable Effects (LIVE) modeling is a framework to integrate different types of microbiome multi-omics data by combining latent variables from single-omic models into a structured meta-model to determine discriminative, interacting multi-omics features driving disease status. We implemented and tested LIVE modeling in publicly available metagenomics and metabolomics datasets from Crohn's Disease (CD) and Ulcerative Colitis (UC) patients. Here, LIVE modeling reduced the number of feature correlations from the original data set for CD and UC to tractable numbers and facilitated prioritization of biological associations between microbes, metabolites, enzymes and IBD status through the application of stringent thresholds on generated inferential statistics. We determined that LIVE modeling confirmed previously reported IBD biomarkers and uncovered potentially novel disease mechanisms in IBD. LIVE modeling makes a distinct and complementary contribution to the current methods to integrate microbiome data to predict IBD status because of its flexibility to adapt to different types of microbiome multi-omics data, scalability for large and small cohort studies via reliance on latent variables and dimensionality reduction, and the intuitive interpretability of the linear meta-model integrating -omic data types. The results of LIVE modeling and the biological relationships can be represented in networks that connect local correlation structure of single omic data types with global community and omic structure in the latent variable VIP scores. This model arises as a novel tool that allows researchers to be more selective about omic feature interaction without disrupting the structural correlation framework provided by sPLS-DA interaction effects modeling. It will help form testable hypotheses by identifying potential and unique interactions between metabolome and microbiome that must be considered in future studies.
  3. Stony coral tissue loss disease, first observed in Florida in 2014, has now spread along the entire Florida Reef Tract and on reefs in many Caribbean countries. The disease affects a variety of coral species with differential outcomes, and in many instances results in whole-colony mortality. We employed untargeted metabolomic profiling of Montastraea cavernosa corals affected by stony coral tissue loss disease to identify metabolic markers of disease. Herein, extracts from apparently healthy, diseased, and recovered Montastraea cavernosa collected at a reef site near Ft. Lauderdale, Florida were subjected to liquid-chromatography mass spectrometry-based metabolomics. Unsupervised principal component analysis reveals wide variation in metabolomic profiles of healthy corals of the same species, which differ from diseased corals. Using a combination of supervised and unsupervised data analysis tools, we describe metabolite features that explain variation between the apparently healthy corals, between diseased corals, and between the healthy and the diseased corals. By employing a culture-based approach, we assign sources of a subset of these molecules to the endosymbiotic dinoflagellates, Symbiodiniaceae. Specifically, we identify various endosymbiont-specific lipid classes, such as betaine lipids, glycolipids, and tocopherols, which differentiate samples taken from apparently healthy corals and diseased corals. Given the variation observed in metabolite fingerprints of corals, our data suggest that metabolomics is a viable approach to link metabolite profiles of different coral species with their susceptibility and resilience to numerous coral diseases spreading through reefs worldwide.
  4. Background: Type 1 diabetes (T1D) is a devastating autoimmune disease, and its rising prevalence in the United States and around the world presents a critical problem in public health. While some treatment options exist for patients already diagnosed, individuals considered at risk for developing T1D and who are still in the early stages of their disease pathogenesis without symptoms have no options for any preventive intervention. This is because of the uncertainty in determining their risk level and in predicting with high confidence who will progress, or not, to clinical diagnosis. Biomarkers that assess one's risk with high certainty could address this problem and will inform decisions on early intervention, especially in children where the burden of justifying treatment is high. Single omics approaches (e.g., genomics, proteomics, metabolomics, etc.) have been applied to identify T1D biomarkers based on specific disturbances in association with the disease. However, reliable early biomarkers of T1D have remained elusive to date. To overcome this, we previously showed that parallel multi-omics provides a more comprehensive picture of the disease-associated disturbances and facilitates the identification of candidate T1D biomarkers. Methods: This paper evaluated the use of machine learning (ML), specifically data augmentation and supervised ML methods, to improve the identification of salient patterns in the data and the ultimate extraction of novel biomarker candidates in integrated parallel multi-omics datasets from a limited number of samples. We also examined different stages of data integration (early, intermediate, and late) to assess at which stage supervised parametric models can learn under conditions of high dimensionality and variation in feature counts across different omics. In the late integration scheme, we employed a multi-view ensemble comprising individual parametric models trained over single omics to address the computational challenges posed by the high dimensionality and variation in feature counts across the different yet integrated multi-omics datasets. Results: The multi-view ensemble improves the prediction of case vs. control and finds the most success in flagging a larger consistent set of associated features when compared with chance models, which may eventually be used downstream in identifying a novel composite biomarker signature of T1D risk. Conclusions: The current work demonstrates the utility of supervised ML in exploring integrated parallel multi-omics data in the ongoing quest for early T1D biomarkers, reinforcing the hope for identifying novel composite biomarker signatures of T1D risk via ML and ultimately informing early treatment decisions in the face of the escalating global incidence of this debilitating disease.
  5. Background: Imaging, cognitive and fluid data have been widely studied to identify quantitative biomarkers that can help predict the status and progression of Alzheimer’s disease (AD). However, it is still an underexplored topic whether there exist subpopulations with different genetic profiles across which the biomarker‐based prediction models may vary. We propose to use the Chow test (Chow 1960 Econometrica 28(3)) to perform genetically stratified analyses for identifying SNP‐based subpopulations coupled with precision AD biomarkers with varying effects on future diagnosis in these subpopulations. The investigation of such SNPs and precision biomarkers may eventually pave the way for increased customization of AD care. Method: Participants included 1,324 subjects from the ADNI cohort with both AD biomarker and genotyping data available (http://www.pi4cs.org/qt‐pad‐challenge). 30 significant (P < 1.5E‐278) AD SNPs were sourced from (Jansen 2019 NatGen). Chow tests were performed to determine whether baseline visit measures of each of 16 AD biomarkers predicted AD diagnosis at the three‐year visit with varying slopes when stratifying by the allelic dosage of each of the 30 chosen SNPs. Bonferroni correction (P < 1.04E‐4) was employed to correct for multiple comparisons. Result: Multiple SNP‐biomarker pairs showed significant genetically driven deviations in the regression coefficients when predicting diagnosis in three years using baseline biomarkers (Figure 1). Top SNP hits involved rs769449 (Chr 19, APOE) and rs7561528 (Chr 2, LOC105373605), and almost all 16 studied biomarkers demonstrated differential slopes in different genotype groups to predict diagnosis in three years. To examine the details of these top findings, the regression coefficients calculated for each of the five most significant biomarkers of both SNPs were bootstrapped and plotted in Figure 2. Conclusion: Genetic analysis of AD candidate SNPs in conjunction with AD biomarker data via the Chow test identified several SNPs coupled with precision AD biomarkers with varying prognosis effects in the corresponding genotype groups. These findings provide valuable information to reveal disease heterogeneity and help facilitate precision medicine.
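The Chow test and the Bonferroni threshold mentioned in item 5 can be sketched as follows. This is a minimal two-group illustration on simulated data, not the study's analysis: it assumes an ordinary least-squares fit with an intercept and one predictor (k = 2 parameters), and the helper names `rss` and `chow_f` are hypothetical. The threshold calculation assumes a family-wise alpha of 0.05, which the abstract does not state but which reproduces the reported 1.04E-4 for 16 biomarkers and 30 SNPs.

```python
import numpy as np

def rss(x, y):
    """Residual sum of squares of an OLS fit y ~ intercept + slope * x."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def chow_f(x1, y1, x2, y2, k=2):
    """Chow F statistic for equal regression coefficients in two groups.

    k is the number of parameters per regression (intercept + slope).
    """
    rss_pooled = rss(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
    rss_split = rss(x1, y1) + rss(x2, y2)
    dof = len(x1) + len(x2) - 2 * k
    return ((rss_pooled - rss_split) / k) / (rss_split / dof)

# Bonferroni threshold for 16 biomarkers x 30 SNPs = 480 tests,
# assuming alpha = 0.05.
bonferroni_threshold = 0.05 / (16 * 30)

# Simulated groups: the slope differs between groups in the first
# comparison and is identical in the second.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y1 = 1.0 * x1 + rng.normal(scale=0.5, size=50)
y2 = 3.0 * x2 + rng.normal(scale=0.5, size=50)
y2_same = 1.0 * x2 + rng.normal(scale=0.5, size=50)

f_different = chow_f(x1, y1, x2, y2)
f_same = chow_f(x1, y1, x2, y2_same)
```

Under the null hypothesis of equal coefficients, the statistic follows an F distribution with k and n1 + n2 - 2k degrees of freedom, so a p-value can be obtained from any F-distribution routine and compared against `bonferroni_threshold`. Stratifying by allelic dosage gives up to three genotype groups, which would need the multi-group generalization of this statistic.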