Title: Multitask knowledge-primed neural network for predicting missing metadata and host phenotype based on human microbiome
Abstract:
Motivation: Microbial signatures in the human microbiome are closely associated with various human diseases, driving the development of machine learning models for microbiome-based disease prediction. Despite progress, challenges remain in enhancing prediction accuracy, generalizability, and interpretability. Confounding factors, such as host’s gender, age, and body mass index, significantly influence the human microbiome, complicating microbiome-based predictions.
Results: To address these challenges, we developed MicroKPNN-MT, a unified model for predicting human phenotype based on microbiome data, as well as additional metadata like age and gender. This model builds on our earlier MicroKPNN framework, which incorporates prior knowledge of microbial species into neural networks to enhance prediction accuracy and interpretability. In MicroKPNN-MT, metadata, when available, serves as additional input features for prediction. Otherwise, the model predicts metadata from microbiome data using additional decoders. We applied MicroKPNN-MT to microbiome data collected in mBodyMap, covering healthy individuals and 25 different diseases, and demonstrated its potential as a predictive tool for multiple diseases, which at the same time provided predictions for the missing metadata. Our results showed that incorporating real or predicted metadata helped improve the accuracy of disease predictions, and more importantly, helped improve the generalizability of the predictive models.
Availability and implementation: https://github.com/mgtools/MicroKPNN-MT
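The metadata fallback described in the abstract (use observed metadata as input when it exists, otherwise substitute the output of a metadata decoder) can be illustrated with a minimal sketch. This is not the authors' implementation from the MicroKPNN-MT repository: the class `ToyMultitaskModel`, its layer sizes, the random untrained weights, and the assumption that age is scaled to [0, 1] and gender is encoded as 0/1 are all hypothetical. The knowledge-primed hidden layer of MicroKPNN is replaced here by a plain dense layer.

```python
import math
import random
from typing import Dict, List, Optional


def linear(x: List[float], w: List[List[float]], b: List[float]) -> List[float]:
    """Dense layer: one output per weight row, computed as row . x + bias."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def init_weights(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyMultitaskModel:
    """Hypothetical sketch: shared encoder, two metadata decoders, one phenotype head."""

    def __init__(self, n_taxa: int, n_hidden: int, n_classes: int, seed: int = 0):
        rng = random.Random(seed)
        self.enc_w = init_weights(n_hidden, n_taxa, rng)
        self.enc_b = [0.0] * n_hidden
        # Metadata decoders (assumed: age as regression, gender as a probability).
        self.age_w = init_weights(1, n_hidden, rng)
        self.gender_w = init_weights(1, n_hidden, rng)
        # The phenotype head sees the hidden units plus the two metadata values.
        self.cls_w = init_weights(n_classes, n_hidden + 2, rng)
        self.cls_b = [0.0] * n_classes

    def forward(self, abundances: List[float],
                age: Optional[float] = None,
                gender: Optional[float] = None) -> Dict[str, object]:
        hidden = [math.tanh(v) for v in linear(abundances, self.enc_w, self.enc_b)]
        # The decoders always run, so missing metadata can be predicted.
        pred_age = linear(hidden, self.age_w, [0.0])[0]
        pred_gender = 1.0 / (1.0 + math.exp(-linear(hidden, self.gender_w, [0.0])[0]))
        # Fallback rule: prefer observed metadata, otherwise use the prediction.
        age_used = age if age is not None else pred_age
        gender_used = gender if gender is not None else pred_gender
        probs = softmax(linear(hidden + [age_used, gender_used], self.cls_w, self.cls_b))
        return {
            "class_probs": probs,
            "age_used": age_used,
            "gender_used": gender_used,
            "age_source": "observed" if age is not None else "predicted",
            "gender_source": "observed" if gender is not None else "predicted",
        }


if __name__ == "__main__":
    # 26 classes stands in for healthy plus 25 diseases, as in the abstract.
    model = ToyMultitaskModel(n_taxa=4, n_hidden=3, n_classes=26)
    sample = [0.4, 0.3, 0.2, 0.1]  # made-up relative abundances
    print(model.forward(sample)["age_source"])
    print(model.forward(sample, age=0.5, gender=1.0)["age_source"])
```

In this sketch the same forward pass serves both cases, so a sample with missing age or gender still yields a phenotype prediction along with predicted metadata values.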
Award ID(s):
2025451
PAR ID:
10647931
Editor(s):
Lengauer, Thomas
Publisher / Repository:
Oxford Academics
Journal Name:
Bioinformatics Advances
Volume:
5
Issue:
1
ISSN:
2635-0041
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The microbiota has proved to be a critical factor in many diseases, and researchers have been using microbiome data for disease prediction. However, models trained on one independent microbiome study may not be easily applicable to other independent studies due to the high level of variability in microbiome data. In this study, we developed a method for improving the generalizability and interpretability of machine learning models for predicting three different diseases (colorectal cancer, Crohn’s disease, and immunotherapy response) using nine independent microbiome datasets. Our method involves combining a smaller dataset with a larger dataset, and we found that using at least 25% of the target samples in the source data resulted in improved model performance. We identified random forest as the best-performing model and employed feature selection to identify common and important taxa for disease prediction across the different studies. Our results suggest that this leveraging scheme is a promising approach for improving the accuracy and interpretability of machine learning models for predicting diseases based on microbiome data.
  2. Abstract. The gut microbiome plays a fundamental role in human health and disease. Individual variations in the microbiome and the corresponding functional implications are key considerations to enhance precision health and medicine. Metaproteomics has recently revealed protein expression that might be associated with human health and disease. Existing studies focused on either human proteins or bacterial proteins that can be identified from (meta)proteomics data sets, but not both. In this study, we examined the feasibility of identifying both human and bacterial proteins that are differentially expressed between healthy and diseased individuals from metaproteomics data sets. We further evaluated different strategies of using identified peptides and proteins for building predictive models. By leveraging existing metaproteomics data sets and a tool that we have developed for metaproteomics data analysis (MetaProD), we were able to derive both human and bacterial differentially expressed proteins that could serve as potential biomarkers for all diseases we studied. We also built predictive models using identified peptides and proteins as features for prediction of human diseases. Our results showed that using peptide-based identifications instead of protein-based ones often produces the most accurate models and that feature selection can offer improvements. Prediction accuracy could be further improved, in some cases, by including bacterial identifications, but missing data in bacterial identifications remains problematic.
  3. Abstract. Aims: Elevated blood pressure (BP) is a prevalent modifiable risk factor for cardiovascular diseases and contributes to cognitive decline in late life. Despite the fact that functional changes may precede irreversible structural damage and emerge in an ongoing manner, studies have been predominantly informed by brain structure and group-level inferences. Here, we aim to delineate neurobiological correlates of BP at an individual level using machine learning and functional connectivity. Methods and results: Based on whole-brain functional connectivity from the UK Biobank, we built a machine learning model to identify neural representations for individuals’ past (∼8.9 years before scanning, N = 35 882), current (N = 31 367), and future (∼2.4 years follow-up, N = 3 138) BP levels within a repeated cross-validation framework. We examined the impact of multiple potential covariates, as well as assessed these models’ generalizability across various contexts. The predictive models achieved significant correlations between predicted and actual systolic/diastolic BP and pulse pressure while controlling for multiple confounders. Predictions for participants not on antihypertensive medication were more accurate than for currently medicated patients. Moreover, the models demonstrated robust generalizability across contexts in terms of ethnicities, imaging centres, medication status, participant visits, gender, age, and body mass index. The identified connectivity patterns primarily involved the cerebellum, prefrontal, anterior insula, anterior cingulate cortex, supramarginal gyrus, and precuneus, which are key regions of the central autonomic network, and involved in cognition processing and susceptible to neurodegeneration in Alzheimer’s disease. Results also showed more involvement of default mode and frontoparietal networks in predicting future BP levels and in medicated participants. Conclusion: This study, based on the largest neuroimaging sample currently available and using machine learning, identifies brain signatures underlying BP, providing evidence for meaningful BP-associated neural representations in connectivity profiles.
  4. Abstract. Motivation: The human microbiome, which is linked to various diseases by growing evidence, has a profound impact on human health. Since changes in the composition of the microbiome across time are associated with disease and clinical outcomes, microbiome analysis should be performed in a longitudinal study. However, due to limited sample sizes and differing numbers of timepoints for different subjects, a significant amount of data cannot be utilized, directly affecting the quality of analysis results. Deep generative models have been proposed to address this lack of data issue. Specifically, a generative adversarial network (GAN) has been successfully utilized for data augmentation to improve prediction tasks. Recent studies have also shown improved performance of GAN-based models for missing value imputation in a multivariate time series dataset compared with traditional imputation methods. Results: This work proposes DeepMicroGen, a bidirectional recurrent neural network-based GAN model, trained on the temporal relationship between the observations, to impute the missing microbiome samples in longitudinal studies. DeepMicroGen outperforms standard baseline imputation methods, showing the lowest mean absolute error for both simulated and real datasets. Finally, the proposed model improved the predicted clinical outcome for allergies, by providing imputation for an incomplete longitudinal dataset used to train the classifier. Availability and implementation: DeepMicroGen is publicly available at https://github.com/joungmin-choi/DeepMicroGen.
  5. Abstract. Reusing massive collections of publicly available biomedical data can significantly impact knowledge discovery. However, these public samples and studies are typically described using unstructured plain text, hindering the findability and further reuse of the data. To combat this problem, we propose txt2onto 2.0, a general-purpose method based on natural language processing and machine learning for annotating biomedical unstructured metadata to controlled vocabularies of diseases and tissues. Compared to the previous version (txt2onto 1.0), which uses numerical embeddings as features, this new version uses words as features, resulting in improved interpretability and performance, especially when few positive training instances are available. Txt2onto 2.0 uses embeddings from a large language model during prediction to deal with unseen-yet-relevant words related to each disease and tissue term being predicted from the input text, thereby explaining the basis of every annotation. We demonstrate the generalizability of txt2onto 2.0 by accurately predicting disease annotations for studies from independent datasets, using proteomics and clinical trials as examples. Overall, our approach can annotate biomedical text regardless of experimental types or sources. Code, data, and trained models are available at https://github.com/krishnanlab/txt2onto2.0.