BackgroundMetastatic cancer remains one of the leading causes of cancer-related mortality worldwide. Yet, the prediction of survivability in this population remains limited by heterogeneous clinical presentations and high-dimensional molecular features. Advances in machine learning (ML) provide an opportunity to integrate diverse patient- and tumor-level factors into explainable predictive ML models. Leveraging large real-world datasets and modern ML techniques can enable improved risk stratification and precision oncology. ObjectiveThis study aimed to develop and interpret ML models for predicting overall survival in patients with metastatic cancer using the Memorial Sloan Kettering-Metastatic (MSK-MET) dataset and to identify key prognostic biomarkers through explainable artificial intelligence techniques. MethodsWe performed a retrospective analysis of the MSK-MET cohort, comprising 25,775 patients across 27 tumor types. After data cleaning and balancing, 20,338 patients were included. Overall survival was defined as deceased versus living at last follow-up. Five classifiers (extreme gradient boosting [XGBoost], logistic regression, random forest, decision tree, and naive Bayes) were trained using an 80/20 stratified split and optimized via grid search with 5-fold cross-validation. Model performance was assessed using accuracy, area under the curve (AUC), precision, recall, and F1-score. Model explainability was achieved using Shapley additive explanations (SHAP). Survival analyses included Kaplan-Meier estimates, Cox proportional hazards models, and an XGBoost-Cox model for time-to-event prediction. The positive predictive value and negative predictive value were calculated at the Youden index–optimal threshold. ResultsXGBoost achieved the highest performance (accuracy=0.74; AUC=0.82), outperforming other classifiers. In survival analyses, the XGBoost-Cox model with a concordance index (C-index) of 0.70 exceeded the traditional Cox model (C-index=0.66). SHAP analysis and Cox models consistently identified metastatic site count, tumor mutational burden, fraction of genome altered, and the presence of distant liver and bone metastases as among the strongest prognostic factors, a pattern that held at both the pan-cancer level and recurrently across cancer-specific models. At the cancer-specific level, performance varied; prostate cancer achieved the highest predictive accuracy (AUC=0.88), while pancreatic cancer was notably more challenging (AUC=0.68). Kaplan-Meier analyses demonstrated marked survival separation between patients with and without metastases (80-month survival: approximately 0.80 vs 0.30). At the Youden-optimal threshold, positive predictive value and negative predictive value were approximately 70% and 80%, respectively, supporting clinical use for risk stratification. ConclusionsExplainable ML models, particularly XGBoost combined with SHAP, can strongly predict survivability in metastatic cancers while highlighting clinically meaningful features. These findings support the use of ML-based tools for patient counseling, treatment planning, and integration into precision oncology workflows. Future work should include external validation on independent cohorts, integration with electronic health records via Fast Healthcare Interoperability Resources–based dashboards, and prospective clinician-in-the-loop evaluation to assess real-world use.
more »
« less
Detecting survival-associated biomarkers from heterogeneous populations
Abstract Detection of prognostic factors associated with patients’ survival outcome helps gain insights into a disease and guide treatment decisions. The rapid advancement of high-throughput technologies has yielded plentiful genomic biomarkers as candidate prognostic factors, but most are of limited use in clinical application. As the price of the technology drops over time, many genomic studies are conducted to explore a common scientific question in different cohorts to identify more reproducible and credible biomarkers. However, new challenges arise from heterogeneity in study populations and designs when jointly analyzing the multiple studies. For example, patients from different cohorts show different demographic characteristics and risk profiles. Existing high-dimensional variable selection methods for survival analysis, however, are restricted to single study analysis. We propose a novel Cox model based two-stage variable selection method called “Cox-TOTEM” to detect survival-associated biomarkers common in multiple genomic studies. Simulations showed our method greatly improved the sensitivity of variable selection as compared to the separate applications of existing methods to each study, especially when the signals are weak or when the studies are heterogeneous. An application of our method to TCGA transcriptomic data identified essential survival associated genes related to the common disease mechanism of five Pan-Gynecologic cancers.
more »
« less
- Award ID(s):
- 2014971
- PAR ID:
- 10232645
- Date Published:
- Journal Name:
- Scientific Reports
- Volume:
- 11
- Issue:
- 1
- ISSN:
- 2045-2322
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract BackgroundPrognostic indices for patients with brain metastases (BM) are needed to individualize treatment and stratify clinical trials. Two frequently used tools to estimate survival in patients with BM are the recursive partitioning analysis (RPA) and the diagnosis-specific graded prognostic assessment (DS-GPA). Given recent advances in therapies and improved survival for patients with BM, this study aims to validate and analyze these 2 models in a modern cohort. MethodsPatients diagnosed with BM were identified via our institution’s Tumor Board meetings. Data were retrospectively collected from the date of diagnosis with BM. The concordance of the RPA and GPA was calculated using Harrell’s C index. A Cox proportional hazards model with backwards elimination was used to generate a parsimonious model predictive of survival. ResultsOur study consisted of 206 patients diagnosed with BM between 2010 and 2019. The RPA had a prediction performance characterized by Harrell’s C index of 0.588. The DS-GPA demonstrated a Harrell’s C index of 0.630. A Cox proportional hazards model assessing the effect of age, presence of lung, or liver metastases, and Eastern Cooperative Oncology Group (ECOG) performance status score of 3/4 on survival yielded a Harrell’s C index of 0.616. Revising the analysis with an uncategorized ECOG demonstrated a C index of 0.648. ConclusionsWe found that the performance of the RPA remains unchanged from previous validation studies a decade earlier. The DS-GPA outperformed the RPA in predicting overall survival in our modern cohort. Analyzing variables shared by the RPA and DS-GPA produced a model that performed analogously to the DS-GPA.more » « less
-
Subgroup analysis has emerged as an important tool to identify unknown subgroup memberships in the presence of heterogeneity. However, much of the existing work focused on the low-dimensional scenario where only a few candidate variables are considered for modeling the subgroup membership. In this paper, we propose a two-component structured mixture model with a Bayesian variable selection approach for identifying predictive and prognostic variables separately in the high-dimensional setting. By employing spike and slab priors, we achieve the selection of predictive and prognostic variables and the estimation of the treatment effect in the selected subgroup simultaneously. We establish theoretical properties by showing strong variable selection consistency and posterior contraction behavior of our method, and demonstrate its performance using simulation studies. Finally, we apply the proposed method to data from the National Supported Work and the AIDS Clinical Trials Group 320 study, identifying predictive and prognostic variables associated with subgroups exhibiting differential treatment effects.more » « less
-
BACKGROUND: Lung transplantation is the gold standard for a carefully selected patient population with end-stage lung disease. We sought to create a unique risk stratification model using only preoperative recipient data to predict one-year postoperative mortality during our pre-transplant assessment. METHODS: Data of lung transplant recipients at Houston Methodist Hospital (HMH) from 1/2009 to 12/2014 were extracted from the United Network for Organ Sharing (UNOS) database. Patients were randomly divided into development and validation cohorts. Cox proportional-hazards models were conducted. Variables associated with 1-year mortality post-transplant were assigned weights based on the beta coefficients, and risk scores were derived. Patients were stratified into low-, medium- and high-risk categories. Our model was validated using the validation dataset and data from other US transplant centers in the UNOS database RESULTS: We randomized 633 lung recipients from HMH into the development (n=317 patients) and validation cohort (n=316). One-year survival after transplant was significantly different among risk groups: 95% (low-risk), 84% (medium-risk), and 72% (high-risk) (p<0.001) with a C-statistic of 0.74. Patient survival in the validation cohort was also significantly different among risk groups (85%, 77% and 65%, respectively, p<0.001). Validation of the model with the UNOS dataset included 9,920 patients and found 1-year survival to be 91%, 86% and 82%, respectively (p < 0.001). CONCLUSIONS: Using only recipient data collected at the time of pre-listing evaluation, our simple scoring system has good discrimination power and can be a practical tool in the assessment and selection of potential lung transplant recipients.more » « less
-
Abstract With advances in biomedical research, biomarkers are becoming increasingly important prognostic factors for predicting overall survival, while the measurement of biomarkers is often censored due to instruments' lower limits of detection. This leads to two types of censoring: random censoring in overall survival outcomes and fixed censoring in biomarker covariates, posing new challenges in statistical modeling and inference. Existing methods for analyzing such data focus primarily on linear regression ignoring censored responses or semiparametric accelerated failure time models with covariates under detection limits (DL). In this paper, we propose a quantile regression for survival data with covariates subject to DL. Comparing to existing methods, the proposed approach provides a more versatile tool for modeling the distribution of survival outcomes by allowing covariate effects to vary across conditional quantiles of the survival time and requiring no parametric distribution assumptions for outcome data. To estimate the quantile process of regression coefficients, we develop a novel multiple imputation approach based on another quantile regression for covariates under DL, avoiding stringent parametric restrictions on censored covariates as often assumed in the literature. Under regularity conditions, we show that the estimation procedure yields uniformly consistent and asymptotically normal estimators. Simulation results demonstrate the satisfactory finite‐sample performance of the method. We also apply our method to the motivating data from a study of genetic and inflammatory markers of Sepsis.more » « less
An official website of the United States government

