DNA methylation is a process that can affect gene accessibility and therefore gene expression. In this study, a machine learning pipeline is proposed for the prediction of breast cancer and the identification of significant genes that contribute to the prediction. The current study utilized breast cancer methylation data from The Cancer Genome Atlas (TCGA), specifically the TCGA-BRCA dataset. Feature engineering techniques have been utilized to reduce data volume and make deep learning scalable. A comparative analysis of the proposed approach on Illumina 27K and 450K methylation data reveals that deep learning methodologies for cancer prediction can be coupled with feature selection models to enhance prediction accuracy. Prediction using 450K methylation markers can be accomplished in less than 13 s with an accuracy of 98.75%. Of the list of 685 genes in the feature selected 27K dataset, 578 were mapped to Ensemble Gene IDs. This reduced set was significantly (FDR < 0.05) enriched in five biological processes and one molecular function. Of the list of 1572 genes in the feature selected 450K data set, 1290 were mapped to Ensemble Gene IDs. This reduced set was significantly (FDR < 0.05) enriched in 95 biological processes and 17 molecular functions. Seven oncogene/tumor suppressor genes were common between the 27K and 450K feature selected gene sets. These genes were RTN4IP1, MYO18B, ANP32A, BRF1, SETBP1, NTRK1, and IGF2R. Our bioinformatics deep learning workflow, incorporating imputation and data balancing methods, is able to identify important methylation markers related to functionally important genes in breast cancer with high accuracy compared to deep learning or statistical models alone.
more »
« less
Feature Reduction Method Comparison Towards Explainability and Efficiency in Cybersecurity Intrusion Detection Systems
In the realm of cybersecurity, intrusion detection systems (IDS) detect and prevent attacks based on collected computer and network data. In recent research, IDS models have been constructed using machine learning (ML) and deep learning (DL) methods such as Random Forest (RF) and deep neural networks (DNN). Feature selection (FS) can be used to construct faster, more interpretable, and more accurate models. We look at three different FS techniques; RF information gain (RF-IG), correlation feature selection using the Bat Algorithm (CFS-BA), and CFS using the Aquila Optimizer (CFS-AO). Our results show CFS-BA to be the most efficient of the FS methods, building in 55% of the time of the best RF-IG model while achieving 99.99% of its accuracy. This reinforces prior contributions attesting to CFS-BA’s accuracy while building upon the relationship between subset size, CFS score, and RF-IG score in final results.
more »
« less
- Award ID(s):
- 2100729
- PAR ID:
- 10436879
- Date Published:
- Journal Name:
- 21st IEEE ICMLA (International Conference on Machine Learning and Applications)
- Page Range / eLocation ID:
- 1326-1333
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Distributed denial-of-service (DDoS) attack is a malicious cybersecurity attack that has become a global threat. Machine learning (ML) as an advanced technology has been proven to be an effective way against DDoS attacks. Feature selection is a crucial step in ML, and researchers have put endless efforts to mitigate the “Curse of Dimensionality”. Feature selection is also causing problems to ML models, such as a decrease in prediction accuracy. Four supervised classification techniques, namely, Decision Tree (DT), k-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF), are tested using mutual information score ranking to study the necessity of feature selection in DDoS detection.more » « less
-
Background and Objectives: Sepsis is a leading cause of mortality in intensive care units (ICUs). The development of a robust prognostic model utilizing patients’ clinical data could significantly enhance clinicians’ ability to make informed treatment decisions, potentially improving outcomes for septic patients. This study aims to create a novel machine-learning framework for constructing prognostic tools capable of predicting patient survival or mortality outcome. Methods: A novel dataset is created using concatenated triples of static data, temporal data, and clinical outcomes to expand data size. This structured input trains five machine learning classifiers (KNN, Logistic Regression, SVM, RF, and XGBoost) with advanced feature engineering. Models are evaluated on an independent cohort using AUROC and a new metric, 𝛾, which incorporates the F1 score, to assess discriminative power and generalizability. Results: We developed five prognostic models using the concatenated triple dataset with 10 dynamic features from patient medical records. Our analysis shows that the Extreme Gradient Boosting (XGBoost) model (AUROC = 0.777, F1 score = 0.694) and the Random Forest (RF) model (AUROC = 0.769, F1 score = 0.647), when paired with an ensemble under-sampling strategy, outperform other models. The RF model improves AUROC by 6.66% and reduces overfitting by 54.96%, while the XGBoost model shows a 0.52% increase in AUROC and a 77.72% reduction in overfitting. These results highlight our framework’s ability to enhance predictive accuracy and generalizability, particularly in sepsis prognosis. Conclusion: This study presents a novel modeling framework for predicting treatment outcomes in septic patients, designed for small, imbalanced, and high-dimensional datasets. By using temporal feature encoding, advanced sampling, and dimension reduction techniques, our approach enhances standard classifier performance. The resulting models show improved accuracy with limited data, offering valuable prognostic tools for sepsis management. This framework demonstrates the potential of machine learning in small medical datasets.more » « less
-
Traditional machine learning approaches for recognizing modes of transportation rely heavily on hand-crafted feature extraction methods which require domain knowledge. So, we propose a hybrid deep learning model: Deep Convolutional Bidirectional-LSTM (DCBL) which combines convolutional and bidirectional LSTM layers and is trained directly on raw sensor data to predict the transportation modes. We compare our model to the traditional machine learning approaches of training Support Vector Machines and Multilayer Perceptron models on extracted features. In our experiments, DCBL performs better than the feature selection methods in terms of accuracy and simplifies the data processing pipeline. The models are trained on the Sussex-Huawei Locomotion-Transportation (SHL) dataset. The submission of our team, Vahan, to SHL recognition challenge uses an ensemble of DCBL models trained on raw data using the different combination of sensors and window sizes and achieved an F1-score of 0.96 on our test data.more » « less
-
Limestone calcined clay cement (LC3) is a sustainable alternative to ordinary Portland cement, capable of reducing the binder’s carbon footprint by 40% while satisfying all key performance metrics. The inherent compositional heterogeneity in select components of LC3, combined with their convoluted chemical interactions, poses challenges to conventional analytical models when predicting mechanical properties. Although some studies have employed machine learning (ML) to predict the mechanical properties of LC3, many have overlooked the pivotal role of feature selection. Proper feature selection not only refines and simplifies the structure of ML models but also enhances these models’ prediction performance and interpretability. This research harnesses the power of the random forest (RF) model to predict the compressive strength of LC3. Three feature reduction methods—Pearson correlation, SHapley Additive exPlanations, and variable importance—are employed to analyze the influence of LC3 components and mixture design on compressive strength. Practical guidelines for utilizing these methods on cementitious materials are elucidated. Through the rigorous screening of insignificant variables from the database, the RF model conserves computational resources while also producing high-fidelity predictions. Additionally, a feature enhancement method is utilized, consolidating numerous input variables into a singular feature while feeding the RF model with richer information, resulting in a substantial improvement in prediction accuracy. Overall, this study provides a novel pathway to apply ML to LC3, emphasizing the need to tailor ML models to cement chemistry rather than employing them generically.more » « less
An official website of the United States government

