DNA methylation is a process that can affect gene accessibility and therefore gene expression. In this study, a machine learning pipeline is proposed for the prediction of breast cancer and the identification of significant genes that contribute to the prediction. The study used breast cancer methylation data from The Cancer Genome Atlas (TCGA), specifically the TCGA-BRCA dataset. Feature engineering techniques were used to reduce data volume and make deep learning scalable. A comparative analysis of the proposed approach on Illumina 27K and 450K methylation data reveals that deep learning methodologies for cancer prediction can be coupled with feature selection models to enhance prediction accuracy. Prediction using 450K methylation markers can be accomplished in less than 13 s with an accuracy of 98.75%. Of the 685 genes in the feature-selected 27K dataset, 578 were mapped to Ensembl gene IDs. This reduced set was significantly (FDR < 0.05) enriched in five biological processes and one molecular function. Of the 1572 genes in the feature-selected 450K dataset, 1290 were mapped to Ensembl gene IDs. This reduced set was significantly (FDR < 0.05) enriched in 95 biological processes and 17 molecular functions. Seven oncogenes/tumor suppressor genes were common to the 27K and 450K feature-selected gene sets: RTN4IP1, MYO18B, ANP32A, BRF1, SETBP1, NTRK1, and IGF2R. Our bioinformatics deep learning workflow, incorporating imputation and data balancing methods, identifies important methylation markers related to functionally important genes in breast cancer with high accuracy compared to deep learning or statistical models alone.
Machine learning‐based identification of general transcriptional predictors for plant disease
This study investigated the generalizability of Arabidopsis thaliana immune responses across diverse pathogens, including Botrytis cinerea, Sclerotinia sclerotiorum, and Pseudomonas syringae, using a data-driven, machine learning approach. Machine learning models were trained to predict disease development from early transcriptional responses. Feature selection techniques based on network science and topology were used to train models employing only a fraction of the transcriptome. Machine learning models trained on one pathosystem were then validated by predicting disease development in new pathosystems. The identified feature selection gene sets were enriched for pathways related to biotic, abiotic, and stress responses, though the specific genes involved differed between feature sets. This suggests common immune responses to diverse pathogens that operate via different gene sets. The study demonstrates that machine learning can uncover both established and novel components of the plant's immune response, offering insights into disease resistance mechanisms. These predictive models highlight the potential to advance our understanding of multigenic outcomes in plant immunity and can be further refined for applications in disease prediction.
- Award ID(s): 1932620
- PAR ID: 10564793
- Publisher / Repository: Wiley
- Date Published:
- Journal Name: New Phytologist
- Edition / Version: 1
- Volume: 245
- Issue: 2
- ISSN: 0028-646X
- Page Range / eLocation ID: 785 to 806
- Subject(s) / Keyword(s): Arabidopsis thaliana, feature selection, general stress response, machine learning, network science, plant–pathogen interaction, predictive biology.
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
ABSTRACT Vaccine development requires innovative approaches to improve immune responses while reducing the number of immunizations. In this study, we explore the impact of controlled antigen release on immune activation and regulation using programmable infusion pumps and biodegradable biomaterials in OT‐II and wild‐type mice to understand the adaptive immune response through controlled antigen delivery in the absence of adjuvant. Ovalbumin (OVA) was delivered via an exponentially decreasing profile, mimicking clearance of infection, and an exponentially increasing profile, mimicking induction of infection. Exponentially decreasing OVA delivery through infusion pumps promoted regulatory T‐cell (Treg) activation in secondary lymphoid organs and suppressed pro‐inflammatory T‐helper type 17 (Th17) responses in blood. An exponentially increasing OVA profile enhanced central memory T‐cell (TCM) populations in submandibular blood and humoral immune responses in cardiac blood serum, demonstrating distinct immune modulation based on release kinetics. OVA was also delivered using a biodegradable PLGA‐PEG‐PLGA (PPP) depot, which provided controlled OVA release in an exponentially decreasing pattern. PPP‐OVA treatment significantly reduced the frequency of pro‐inflammatory T‐helper type 1 (Th1) cells while increasing CD25+FOXP3+ Treg cells in the spleen. Moreover, to identify T‐cell populations that most accurately characterize the divergence in Treg and T‐helper response to OVA kinetics, a Sequential Feature Selection (SFS) algorithm with Machine Learning (ML) models was used. ML algorithms identified gMFI of RORγt+ as a key feature in submandibular blood and the ratio of gMFI of FOXP3+ to GATA3+ as the marker that was significantly changed by treatments in inguinal lymph nodes (iLN) when infusion pumps were used to deliver OVA. In addition, ML‐based SFS identified CD25+FOXP3+ regulatory T cells as the most important feature, influencing the expression of other cell types in both iLN and spleen when PPP was used to deliver OVA. This finding suggests that the exponentially decreasing profile may generate anti‐inflammatory responses. Overall, these findings suggest that controlled antigen delivery enhances immune regulation and memory T cells, providing new insights into immune responses mediated by the release kinetics.
-
The microbiota has proved to be a critical factor in many diseases, and researchers have been using microbiome data for disease prediction. However, models trained on one independent microbiome study may not be easily applicable to other independent studies due to the high level of variability in microbiome data. In this study, we developed a method for improving the generalizability and interpretability of machine learning models for predicting three different diseases (colorectal cancer, Crohn’s disease, and immunotherapy response) using nine independent microbiome datasets. Our method involves combining a smaller dataset with a larger dataset, and we found that using at least 25% of the target samples in the source data resulted in improved model performance. We selected random forest as our top model and employed feature selection to identify common and important taxa for disease prediction across the different studies. Our results suggest that this leveraging scheme is a promising approach for improving the accuracy and interpretability of machine learning models for predicting diseases based on microbiome data.
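The leveraging scheme described in this abstract can be illustrated with a minimal sketch. The function name `leverage_split`, its parameters, and the list-based sample representation are assumptions made for illustration, not the authors' code; the sketch only shows how a fraction of target-study samples could be moved into the source training set while the remaining target samples are held out for testing.

```python
import math
import random


def leverage_split(source, target, fraction=0.25, seed=0):
    """Hypothetical helper: borrow a fraction of target samples for training.

    `source` and `target` are lists of samples from two independent studies.
    Returns (train, test): train holds every source sample plus the borrowed
    target samples; test holds the target samples that were not borrowed.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(target)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    # Round up so that at least `fraction` of the target set is borrowed.
    n_borrowed = math.ceil(len(shuffled) * fraction)
    train = list(source) + shuffled[:n_borrowed]
    test = shuffled[n_borrowed:]
    return train, test
```

With 100 source samples and 20 target samples, the default fraction moves 5 target samples into training and leaves 15 for evaluation. Any classifier, such as the random forest named in the abstract, would then be fit on `train` and scored on `test`.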
-
Mapder, Tarunendu (Ed.) Reef-building corals contain a complex consortium of organisms, a holobiont, which responds dynamically to disease, making pathogen identification difficult. While coral transcriptomics and microbiome communities have previously been characterized, similarities and differences in their responses to different pathogenic sources have not yet been assessed. In this study, we inoculated four genets of the Caribbean branching coral Acropora palmata with a known coral pathogen (Serratia marcescens) and white band disease. We then characterized the coral’s transcriptomic and prokaryotic microbiomes’ (prokaryiome) responses to the disease inoculations, as well as how these responses were affected by a short-term heat stress prior to disease inoculation. We found strong commonality in both the transcriptomic and prokaryiome responses, regardless of disease inoculation. Differences, however, were observed between inoculated corals that either remained healthy or developed active disease signs. Transcriptomic co-expression analysis identified that corals inoculated with disease increased gene expression of immune, wound healing, and fatty acid metabolic processes. Co-abundance analysis of the prokaryiome identified sets of both healthy- and disease-state bacteria, while co-expression analysis of the prokaryiomes’ inferred metagenomic function revealed that infected corals’ prokaryiomes shifted from free-living to biofilm states and increased metabolic processes. The short-term heat stress did not increase disease susceptibility for any of the four genets with any of the disease inoculations, and there was only a weak effect captured in the coral hosts’ transcriptomic and prokaryiome responses. Genet identity, however, was a major driver of the transcriptomic variance, primarily due to differences in baseline immune gene expression. Despite genotypic differences in baseline gene expression, we have identified a common response for components of the coral holobiont to different disease inoculations. This work has identified genes and prokaryiome members that can be focused on for future coral disease work, specifically, putative disease diagnostic tools.
-
Background: At the time of cancer diagnosis, it is crucial to accurately classify malignant gastric tumors and estimate the likelihood that patients will survive. Objective: This study aims to investigate the feasibility of identifying and applying a new feature extraction technique to predict the survival of gastric cancer patients. Methods: A retrospective dataset including the computed tomography (CT) images of 135 patients was assembled. Among them, 68 patients survived longer than three years. Several sets of radiomics features were extracted and incorporated into a machine learning model, and their classification performance was characterized. To improve the classification performance, we further extracted another 27 texture and roughness parameters with 2484 superficial and spatial features to propose a new feature pool. This new feature set was added to the machine learning model and its performance was analyzed. To determine the best model for our experiment, four of the most popular machine learning models were utilized: Random Forest (RF) classifier, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Naïve Bayes (NB). The models were trained and tested using the five-fold cross-validation method. Results: Using the area under the ROC curve (AUC) as an evaluation index, the model that was generated using the new feature pool yields AUC = 0.98 ± 0.01, which was significantly higher than the models created using the traditional radiomics feature set (p < 0.04). The RF classifier performed better than the other machine learning models. Conclusions: This study demonstrated that although radiomics features produced good classification performance, creating new feature sets significantly improved the model performance.
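The evaluation protocol named in this abstract, five-fold cross-validation scored by AUC, can be sketched as follows. The helper names `auc` and `k_fold_indices` are assumptions for illustration and do not come from the study; the split shown is a plain shuffled split without class stratification, and the classifier itself is left out.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    `labels` are 0/1 class labels; `scores` are classifier outputs where a
    higher score means "more likely positive". Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, folds[i]
```

For a cohort of 135 patients, each of the five folds holds out 27 samples and trains on the other 108. A model would be fit on each training split, scored with `auc` on the held-out split, and the five AUC values summarized as a mean and standard deviation.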

