This content will become publicly available on January 1, 2026

Title: Machine learning‐based identification of general transcriptional predictors for plant disease
This study investigated the generalizability of Arabidopsis thaliana immune responses across diverse pathogens, including Botrytis cinerea, Sclerotinia sclerotiorum, and Pseudomonas syringae, using a data-driven, machine learning approach. Machine learning models were trained to predict disease development from early transcriptional responses. Feature selection techniques based on network science and topology were used to train models employing only a fraction of the transcriptome. Machine learning models trained on one pathosystem were then validated by predicting disease development in new pathosystems. The identified feature selection gene sets were enriched for pathways related to biotic, abiotic, and stress responses, though the specific genes involved differed between feature sets. This suggests common immune responses to diverse pathogens that operate via different gene sets. The study demonstrates that machine learning can uncover both established and novel components of the plant's immune response, offering insights into disease resistance mechanisms. These predictive models highlight the potential to advance our understanding of multigenic outcomes in plant immunity and can be further refined for applications in disease prediction.
Award ID(s):
1932620
PAR ID:
10564793
Author(s) / Creator(s):
Publisher / Repository:
Wiley
Date Published:
Journal Name:
New Phytologist
Edition / Version:
1
Volume:
245
Issue:
2
ISSN:
0028-646X
Page Range / eLocation ID:
785 to 806
Subject(s) / Keyword(s):
Arabidopsis thaliana, feature selection, general stress response, machine learning, network science, plant–pathogen interaction, predictive biology.
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. DNA methylation is a process that can affect gene accessibility and therefore gene expression. In this study, a machine learning pipeline is proposed for the prediction of breast cancer and the identification of significant genes that contribute to the prediction. The current study utilized breast cancer methylation data from The Cancer Genome Atlas (TCGA), specifically the TCGA-BRCA dataset. Feature engineering techniques have been utilized to reduce data volume and make deep learning scalable. A comparative analysis of the proposed approach on Illumina 27K and 450K methylation data reveals that deep learning methodologies for cancer prediction can be coupled with feature selection models to enhance prediction accuracy. Prediction using 450K methylation markers can be accomplished in less than 13 s with an accuracy of 98.75%. Of the list of 685 genes in the feature-selected 27K dataset, 578 were mapped to Ensembl Gene IDs. This reduced set was significantly (FDR < 0.05) enriched in five biological processes and one molecular function. Of the list of 1572 genes in the feature-selected 450K dataset, 1290 were mapped to Ensembl Gene IDs. This reduced set was significantly (FDR < 0.05) enriched in 95 biological processes and 17 molecular functions. Seven oncogene/tumor suppressor genes were common between the 27K and 450K feature-selected gene sets. These genes were RTN4IP1, MYO18B, ANP32A, BRF1, SETBP1, NTRK1, and IGF2R. Our bioinformatics deep learning workflow, incorporating imputation and data balancing methods, is able to identify important methylation markers related to functionally important genes in breast cancer with high accuracy compared to deep learning or statistical models alone.
  2. The microbiota has proved to be one of the critical factors for many diseases, and researchers have been using microbiome data for disease prediction. However, models trained on one independent microbiome study may not be easily applicable to other independent studies due to the high level of variability in microbiome data. In this study, we developed a method for improving the generalizability and interpretability of machine learning models for predicting three different diseases (colorectal cancer, Crohn’s disease, and immunotherapy response) using nine independent microbiome datasets. Our method involves combining a smaller dataset with a larger dataset, and we found that using at least 25% of the target samples in the source data resulted in improved model performance. We determined random forest as our top model and employed feature selection to identify common and important taxa for disease prediction across the different studies. Our results suggest that this leveraging scheme is a promising approach for improving the accuracy and interpretability of machine learning models for predicting diseases based on microbiome data. 
  3. Mapder, Tarunendu (Ed.)
    Reef-building corals contain a complex consortium of organisms, a holobiont, which responds dynamically to disease, making pathogen identification difficult. While coral transcriptomics and microbiome communities have previously been characterized, similarities and differences in their responses to different pathogenic sources have not yet been assessed. In this study, we inoculated four genets of the Caribbean branching coral Acropora palmata with a known coral pathogen (Serratia marcescens) and white band disease. We then characterized the coral's transcriptomic and prokaryotic microbiomes' (prokaryiome) responses to the disease inoculations, as well as how these responses were affected by a short-term heat stress prior to disease inoculation. We found strong commonality in both the transcriptomic and prokaryiomes' responses, regardless of disease inoculation. Differences, however, were observed between inoculated corals that either remained healthy or developed active disease signs. Transcriptomic co-expression analysis identified that corals inoculated with disease increased gene expression of immune, wound healing, and fatty acid metabolic processes. Co-abundance analysis of the prokaryiome identified sets of both healthy- and disease-state bacteria, while co-expression analysis of the prokaryiomes' inferred metagenomic function revealed that infected corals' prokaryiomes shifted from free-living to biofilm states and increased metabolic processes. The short-term heat stress did not increase disease susceptibility for any of the four genets with any of the disease inoculations, and there was only a weak effect captured in the coral hosts' transcriptomic and prokaryiomes' responses. Genet identity, however, was a major driver of the transcriptomic variance, primarily due to differences in baseline immune gene expression.
    Despite genotypic differences in baseline gene expression, we have identified a common response for components of the coral holobiont to different disease inoculations. This work has identified genes and prokaryiome members that can be focused on for future coral disease work, specifically, putative disease diagnostic tools.
  4. Background: At the time of cancer diagnosis, it is crucial to accurately classify malignant gastric tumors and the possibility that patients will survive. Objective: This study aims to investigate the feasibility of identifying and applying a new feature extraction technique to predict the survival of gastric cancer patients. Methods: A retrospective dataset including the computed tomography (CT) images of 135 patients was assembled. Among them, 68 patients survived longer than three years. Several sets of radiomics features were extracted and were incorporated into a machine learning model, and their classification performance was characterized. To improve the classification performance, we further extracted another 27 texture and roughness parameters with 2484 superficial and spatial features to propose a new feature pool. This new feature set was added into the machine learning model and its performance was analyzed. To determine the best model for our experiment, Random Forest (RF) classifier, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Naïve Bayes (NB) (four of the most popular machine learning models) were utilized. The models were trained and tested using the five-fold cross-validation method. Results: Using the area under ROC curve (AUC) as an evaluation index, the model that was generated using the new feature pool yields AUC = 0.98 ± 0.01, which was significantly higher than the models created using the traditional radiomics feature set (p < 0.04). RF classifier performed better than the other machine learning models. Conclusions: This study demonstrated that although radiomics features produced good classification performance, creating new feature sets significantly improved the model performance. 
  5. Alba, Mar (Ed.)
    T cells are a type of white blood cell that play a critical role in the immune response against foreign pathogens through a process called T cell adaptive immunity (TCAI). However, the evolution of the genes and nucleotide sequences involved in TCAI is not well understood. To investigate this, we performed comparative studies of gene annotations and genome assemblies of 28 vertebrate species and identified sets of human genes that are involved in TCAI, carcinogenesis, and aging. We found that these gene sets share interaction pathways, which may have contributed to the evolution of longevity in the vertebrate lineage leading to humans. Our human gene age dating analyses revealed that there was rapid origination of genes with TCAI-related functions prior to the Cretaceous eutherian radiation and these new genes mainly encode negative regulators. We identified no new TCAI-related genes after the divergence of placental mammals, but we did detect an extensive number of amino acid substitutions under strong positive selection in recently evolved human immunity genes, suggesting they are coevolving with adaptive immunity. More specifically, we observed that antigen processing and presentation and checkpoint genes are significantly enriched among new genes evolving under positive selection. These observations reveal evolutionary processes of TCAI that were associated with rapid gene duplication in the early stages of vertebrates and subsequent sequence changes in TCAI-related genes. The analysis of vertebrate genomes provides evidence that a "big bang" of adaptive immune genes occurred 300-500 million years ago. These processes together suggest an early genetic construction of the vertebrate immune system and subsequent molecular adaptation to diverse antigens.