Background: Metastatic cancer remains one of the leading causes of cancer-related mortality worldwide. Yet, the prediction of survivability in this population remains limited by heterogeneous clinical presentations and high-dimensional molecular features. Advances in machine learning (ML) provide an opportunity to integrate diverse patient- and tumor-level factors into explainable predictive ML models. Leveraging large real-world datasets and modern ML techniques can enable improved risk stratification and precision oncology. Objective: This study aimed to develop and interpret ML models for predicting overall survival in patients with metastatic cancer using the Memorial Sloan Kettering-Metastatic (MSK-MET) dataset and to identify key prognostic biomarkers through explainable artificial intelligence techniques. Methods: We performed a retrospective analysis of the MSK-MET cohort, comprising 25,775 patients across 27 tumor types. After data cleaning and balancing, 20,338 patients were included. Overall survival was defined as deceased versus living at last follow-up. Five classifiers (extreme gradient boosting [XGBoost], logistic regression, random forest, decision tree, and naive Bayes) were trained using an 80/20 stratified split and optimized via grid search with 5-fold cross-validation. Model performance was assessed using accuracy, area under the curve (AUC), precision, recall, and F1-score. Model explainability was achieved using Shapley additive explanations (SHAP). Survival analyses included Kaplan-Meier estimates, Cox proportional hazards models, and an XGBoost-Cox model for time-to-event prediction. The positive predictive value and negative predictive value were calculated at the Youden index–optimal threshold. Results: XGBoost achieved the highest performance (accuracy=0.74; AUC=0.82), outperforming other classifiers. In survival analyses, the XGBoost-Cox model with a concordance index (C-index) of 0.70 exceeded the traditional Cox model (C-index=0.66). SHAP analysis and Cox models consistently identified metastatic site count, tumor mutational burden, fraction of genome altered, and the presence of distant liver and bone metastases as among the strongest prognostic factors, a pattern that held at both the pan-cancer level and recurrently across cancer-specific models. At the cancer-specific level, performance varied; prostate cancer achieved the highest predictive accuracy (AUC=0.88), while pancreatic cancer was notably more challenging (AUC=0.68). Kaplan-Meier analyses demonstrated marked survival separation between patients with and without metastases (80-month survival: approximately 0.80 vs 0.30). At the Youden-optimal threshold, positive predictive value and negative predictive value were approximately 70% and 80%, respectively, supporting clinical use for risk stratification. Conclusions: Explainable ML models, particularly XGBoost combined with SHAP, can strongly predict survivability in metastatic cancers while highlighting clinically meaningful features. These findings support the use of ML-based tools for patient counseling, treatment planning, and integration into precision oncology workflows. Future work should include external validation on independent cohorts, integration with electronic health records via Fast Healthcare Interoperability Resources–based dashboards, and prospective clinician-in-the-loop evaluation to assess real-world use.
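The abstract above reports positive and negative predictive values at the Youden index–optimal threshold. As a rough illustration of what that step involves (not the authors' code), the sketch below picks the score cutoff that maximizes J = sensitivity + specificity − 1 and then computes PPV and NPV at that cutoff. The function names and the toy labels and scores are hypothetical and chosen only for this example.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when 'score >= threshold' means predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    Candidate thresholds are the distinct observed scores; ties keep the
    lowest threshold (an arbitrary choice made for this sketch).
    """
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion_counts(y_true, scores, t)
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def ppv_npv(y_true, scores, threshold):
    """Positive and negative predictive value at a given threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Toy data (hypothetical): 1 = deceased, 0 = living; scores are model outputs.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
model_scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
cutoff, j_stat = youden_threshold(labels, model_scores)
print(cutoff, j_stat, ppv_npv(labels, model_scores, cutoff))
```

Unlike sensitivity and specificity, PPV and NPV depend on the outcome prevalence in the evaluated cohort, so values computed this way do not transfer directly to populations with a different event rate.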
Social determinants of health and the prediction of missed breast imaging appointments
Abstract Background: Predictive models utilizing social determinants of health (SDH), demographic data, and local weather data were trained to predict missed imaging appointments (MIA) among breast imaging patients at the Boston Medical Center (BMC). Patients were characterized by many different variables, including social needs, demographics, imaging utilization, appointment features, and weather conditions on the date of the appointment. Methods: This HIPAA-compliant retrospective cohort study was IRB approved. Informed consent was waived. After data preprocessing steps, the dataset contained 9,970 patients and 36,606 appointments from 1/1/2015 to 12/31/2019. We identified 57 potentially impactful variables used in the initial prediction model and assessed each patient for MIA. We then developed a parsimonious model via recursive feature elimination, which identified the 25 most predictive variables. We utilized linear and non-linear models including support vector machines (SVM), logistic regression (LR), and random forest (RF) to predict MIA and compared their performance. Results: The highest-performing full model was the nonlinear RF, which achieved an area under the ROC curve (AUC) of 76% and an average F1 score of 85%. Models limited to the most predictive variables were able to attain AUC and F1 scores comparable to models with all variables included. The variables most predictive of missed appointments included timing, prior appointment history, referral department of origin, and socioeconomic factors such as household income and access to caregiving services. Conclusions: Prediction of MIA with the data available is inherently limited by the complex, multifactorial nature of MIA. However, the algorithms presented achieved acceptable performance and demonstrated that socioeconomic factors were useful predictors of MIA. In contrast with non-modifiable demographic factors, SDH can be addressed to decrease the incidence of MIA.
- PAR ID: 10424876
- Date Published:
- Journal Name: BMC Health Services Research
- Volume: 22
- Issue: 1
- ISSN: 1472-6963
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Objective: To develop predictive models of coronavirus disease 2019 (COVID-19) outcomes, elucidate the influence of socioeconomic factors, and assess algorithmic racial fairness using a racially diverse patient population with high social needs. Materials and Methods: Data included 7,102 patients with a positive RT-PCR test for severe acute respiratory syndrome coronavirus 2 at a safety-net system in Massachusetts. Linear and nonlinear classification methods were applied. A score based on a recurrent neural network and a transformer architecture was developed to capture the dynamic evolution of vital signs. Combined with patient characteristics, clinical variables, and hospital occupancy measures, this dynamic vital score was used to train predictive models. Results: Hospitalizations can be predicted with an area under the receiver-operating characteristic curve (AUC) of 92% using symptoms, hospital occupancy, and patient characteristics, including social determinants of health. Parsimonious models to predict intensive care, mechanical ventilation, and mortality that used the most recent labs and vitals exhibited AUCs of 92.7%, 91.2%, and 94%, respectively. Early predictive models, using labs and vital signs closer to admission, had AUCs of 81.1%, 84.9%, and 92%, respectively. Discussion: The most accurate models exhibit racial bias, being more likely to falsely predict that Black patients will be hospitalized. Models that are based only on the dynamic vital score exhibited accuracies close to the best parsimonious models, although the latter also used laboratory results. Conclusions: This large study demonstrates that COVID-19 severity may be accurately predicted using a score that accounts for the dynamic evolution of vital signs. Further, race, social determinants of health, and hospital occupancy play an important role.
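The Discussion above describes bias as a model being more likely to falsely predict hospitalization for one group of patients. One common way to quantify that kind of disparity is to compare false positive rates across groups. The sketch below is an illustration under that assumption, not the study's fairness procedure; the function name, group labels, and toy data are hypothetical.

```python
from collections import defaultdict


def false_positive_rate_by_group(y_true, y_pred, groups):
    """Per-group false positive rate: FP / (FP + TN).

    y_true and y_pred hold 0/1 labels (1 = hospitalized); groups holds a
    group identifier for each patient. Groups with no true negatives are
    omitted because their rate is undefined.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for y, p, g in zip(y_true, y_pred, groups):
        if y == 0:
            negatives[g] += 1
            if p == 1:
                false_positives[g] += 1
    return {g: false_positives[g] / negatives[g] for g in negatives}


# Toy data (hypothetical): five patients in each of two groups.
actual = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
group_ids = ["A"] * 5 + ["B"] * 5
print(false_positive_rate_by_group(actual, predicted, group_ids))
```

A large gap between the per-group rates suggests the model's errors fall unevenly, even when overall accuracy or AUC looks similar across groups.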
-
Objectives: Early identification of sepsis is critical, as delayed diagnosis significantly increases morbidity and mortality. We aimed to develop and validate AI/ML models for early sepsis prediction using structured electronic health record (EHR) data, waveform data, and a combination of both. Methods: We conducted a retrospective observational study using the AIM-AHEAD60 subset of the CHoRUS dataset. Adult patients (≥18 years) with a final diagnosis of sepsis were included. Structured EHR data (demographics, initial vital signs, laboratory results) and waveform data (continuous vital signs) from the first hour of hospital arrival were extracted. Three algorithms (i.e., XGBoost, LightGBM, and HistGB) were developed with a focus on maximizing the performance metric of recall. Other performance metrics were also assessed, including accuracy, precision, F1 score, and the area under the receiver operating characteristic curve (AUROC). Results: A total of 11,312 unique patients met the inclusion criteria, among whom 2,245 individuals (19.85%) were diagnosed with sepsis at least once. Using structured EHR data alone, laboratory variables such as lactate and leukocyte count were most predictive. Waveform models identified respiratory rate, systolic blood pressure, and temperature trends in the first hour as key predictors. Combined models highlighted mean temperature and mean systolic blood pressure as top features. XGBoost achieved the highest AUROC (0.922) across all data configurations, with a recall above 80%, demonstrating robust performance despite substantial missing data. Conclusions: High-performing AI/ML models for early sepsis prediction can be developed from publicly available datasets using only first-hour clinical information. XGBoost models demonstrate strong potential for real-time clinical screening.
-
Abstract Background: Neglected tropical diseases affect the most vulnerable populations and cause chronic and debilitating disorders. Socioeconomic vulnerability is a well-known and important determinant of neglected tropical diseases. For example, poverty and sanitation could influence parasite transmission. Nevertheless, the quantitative impact of socioeconomic conditions on disease transmission risk remains poorly explored. Methods: This study investigated the role of socioeconomic variables in the predictive capacity of risk models of neglected tropical zoonoses using a decade of epidemiological data (2007–2018) from Brazil. Vector-borne diseases investigated in this study included dengue, malaria, Chagas disease, leishmaniasis, and Brazilian spotted fever, while directly transmitted zoonotic diseases included schistosomiasis, leptospirosis, and hantaviruses. Environmental and socioeconomic predictors were combined with infectious disease data to build environmental and socioenvironmental sets of ecological niche models, and their performances were compared. Results: Socioeconomic variables were found to be as important as environmental variables in influencing the estimated likelihood of disease transmission across large spatial scales. The combination of socioeconomic and environmental variables improved overall model accuracy (or predictive power) by 10% on average (P < 0.01), reaching a maximum of 18% in the case of dengue fever. Gross domestic product was the most important socioeconomic variable (37% relative variable importance; all individual models exhibited P < 0.00), showing a decreasing relationship with disease and indicating poverty as a major factor for disease transmission. Loss of natural vegetation cover between 2008 and 2018 was the most important environmental variable (42% relative variable importance, P < 0.05) among environmental models, exhibiting a decreasing relationship with disease probability. This shows that these diseases are especially prevalent in areas where natural ecosystem destruction is in its initial stages and less prevalent where ecosystem destruction is in more advanced stages. Conclusions: Destruction of natural ecosystems coupled with low income explains macro-scale neglected tropical and zoonotic disease probability in Brazil. Adding socioeconomic variables in tandem with environmental variables improves transmission risk forecasts. Our results highlight that to efficiently address neglected tropical diseases, public health strategies must target both reduction of poverty and cessation of destruction of natural forests and savannas.
-
Abstract The strain on healthcare resources brought forth by the recent COVID-19 pandemic has highlighted the need for efficient resource planning and allocation through the prediction of future consumption. Machine learning can predict resource utilization such as the need for hospitalization based on past medical data stored in electronic medical records (EMR). We conducted this study on 3194 patients (46% male with mean age 56.7 (±16.8), 56% African American, 7% Hispanic) flagged as COVID-19 positive cases in 12 centers in the Emory Healthcare network from February 2020 to September 2020, to assess whether a COVID-19 positive patient’s need for hospitalization can be predicted at the time of the RT-PCR test using the EMR data prior to the test. Five main modalities of EMR, i.e., demographics, medication, past medical procedures, comorbidities, and laboratory results, were used as features for predictive modeling, both individually and fused together using late, middle, and early fusion. Models were evaluated in terms of precision, recall, and F1-score (with 95% confidence intervals). The early fusion model is the most effective predictor, with an 84% overall F1-score [CI 82.1–86.1]. The predictive performance of the model drops by 6% when using recent clinical data while omitting the long-term medical history. Feature importance analysis indicates that history of cardiovascular disease, emergency room visits in the year prior to testing, and demographic factors are predictive of the disease trajectory. We conclude that fusion modeling using medical history and current treatment data can forecast the need for hospitalization for patients infected with COVID-19 at the time of the RT-PCR test.
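The abstract above compares early and late fusion of EMR modalities without defining them. In the usual sense of those terms, early fusion concatenates per-modality features into one input for a single model, while late fusion combines the outputs of separate per-modality models. The sketch below illustrates only that distinction; it is not the study's pipeline, middle fusion is not covered, and the function names, modality keys, and numbers are hypothetical.

```python
def early_fusion(modality_features):
    """Concatenate per-modality feature vectors into one input vector.

    modality_features maps a modality name to a list of numeric features.
    Modalities are joined in sorted-name order so the layout is stable.
    """
    fused = []
    for name in sorted(modality_features):
        fused.extend(modality_features[name])
    return fused


def late_fusion(modality_probabilities, weights=None):
    """Combine per-modality predicted probabilities by weighted average.

    modality_probabilities maps a modality name to the probability produced
    by a model trained on that modality alone. Equal weights are assumed
    unless a weights mapping is supplied.
    """
    names = sorted(modality_probabilities)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    weighted_sum = sum(weights[name] * modality_probabilities[name] for name in names)
    return weighted_sum / total_weight


# Early fusion: one vector that a single downstream classifier would consume.
features = {"demographics": [56.7, 1.0], "labs": [0.3]}
print(early_fusion(features))

# Late fusion: average the outputs of two hypothetical per-modality models.
print(late_fusion({"demographics": 0.2, "labs": 0.6}))
```

The trade-off this exposes is that early fusion lets one model learn interactions across modalities, while late fusion keeps each modality's model independent and tolerates a missing modality more easily.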

