Abstract Background: Smoking has not been an established risk factor for prostate cancer (PCa) and has not been emphasized in PCa prevention. However, recent studies have shown increasing evidence of a higher risk of biochemical recurrence, PCa mortality, and metastasis among current smokers, presenting an urgent need to re-evaluate the association between smoking and aggressive PCa. This study aimed to determine whether smoking increases the likelihood of developing a more aggressive prostate cancer. Methods: Equal numbers of African Americans (AAs) and European Americans (EAs) by smoking status (never/former/current), matched on PCa aggressiveness, BMI, 5-year age group, and year of baseline recruitment, totaling 480 participants, were included in the metabolomics study. For the metabolomics analysis, univariate analysis used fold change and BH-adjusted p-values from age-adjusted t-tests, and multivariate analysis used age-adjusted PCA and supervised PLS-DA, to decipher the underlying metabolomic patterns and identify significantly dysregulated metabolites for the variables of interest. Results: AA participants were significantly younger (mean=61.4, SD=7.7) compared with EAs (mean=63.5, SD=7.5). Current smokers had a 2.4 times higher risk of high aggressive PCa. When stratified by race, the risk diminished for EAs but increased for AAs. Global metabolic profiles detected a total of 1,487 compounds of known identity. After excluding metabolites with missing values in more than 20% of the samples and with small standard variation, we observed a distinct cluster of participants from AA aggressive PCa patients and current smokers that was separated from EAs and never smokers. With BH-adjusted p-value < 0.05 and fold change > 2, we identified 10 significantly dysregulated metabolites between AA and EA among high aggressive PCa and current smokers. 
Further, 36 metabolites between current and never smokers among AA high aggressive PCa were significantly dysregulated, but none of them are annotated as tobacco metabolites. Conclusion: Our study presented distinctive metabolomics profiles specific to AA current smokers who had high aggressive PCa. Furthermore, the distinctive patterns were not driven by the tobacco metabolites, with the potential to identify metabolites that might help to understand the relationships between smoking and aggressive PCa in AA. Citation Format: Se-Ran Jun, L. Joseph Su, Eryn Matich, Ping-Ching Hsu. Distinctive metabolomics profiles associated with African American current smokers who have high aggressive prostate cancer [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 3680.
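The selection rule described in the abstract above (BH-adjusted p-value < 0.05 and fold change > 2) can be illustrated with a minimal sketch. The function names and the toy inputs are hypothetical, and treating "fold change > 2" as symmetric (above 2 or below 1/2) is an assumption, since the abstract does not say how down-regulated metabolites were handled.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_dysregulated(names, pvalues, fold_changes, alpha=0.05, min_fc=2.0):
    """Keep metabolites with adjusted p < alpha and a fold change beyond
    min_fc in either direction (symmetry is an assumption of this sketch)."""
    adjusted = bh_adjust(pvalues)
    return [
        name
        for name, q, fc in zip(names, adjusted, fold_changes)
        if q < alpha and (fc > min_fc or fc < 1.0 / min_fc)
    ]


# Hypothetical toy data, not values from the study.
names = ["m1", "m2", "m3", "m4"]
pvalues = [0.01, 0.04, 0.03, 0.005]
fold_changes = [3.0, 1.5, 0.4, 2.5]
selected = select_dysregulated(names, pvalues, fold_changes)
```

In this toy input every adjusted p-value falls below 0.05, so the fold-change filter alone decides which metabolites are kept.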
Race adjustments in clinical algorithms can help correct for racial disparities in data quality
Despite ethical and historical arguments for removing race from clinical algorithms, the consequences of removal remain unclear. Here, we highlight a largely undiscussed consideration in this debate: varying data quality of input features across race groups. For example, family history of cancer is an essential predictor in cancer risk prediction algorithms but is less reliably documented for Black participants and may therefore be less predictive of cancer outcomes. Using data from the Southern Community Cohort Study, we assessed whether race adjustments could allow risk prediction models to capture varying data quality by race, focusing on colorectal cancer risk prediction. We analyzed 77,836 adults with no history of colorectal cancer at baseline. The predictive value of self-reported family history was greater for White participants than for Black participants. We compared two cancer risk prediction algorithms: a race-blind algorithm, which included standard colorectal cancer risk factors but not race, and a race-adjusted algorithm, which additionally included race. Relative to the race-blind algorithm, the race-adjusted algorithm improved predictive performance, as measured by goodness of fit in a likelihood ratio test (P-value: <0.001) and area under the receiver operating characteristic curve among Black participants (P-value: 0.006). Because the race-blind algorithm underpredicted risk for Black participants, the race-adjusted algorithm increased the fraction of Black participants among the predicted high-risk group, potentially increasing access to screening. More broadly, this study shows that race adjustments may be beneficial when the data quality of key predictors in clinical algorithms differs by race group.
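The goodness-of-fit comparison between the race-blind and race-adjusted algorithms is a likelihood ratio test between nested models. A minimal sketch follows, assuming the larger model adds exactly one parameter (one degree of freedom); the abstract does not state how many race-related terms were added, and the log-likelihood values below are hypothetical.

```python
import math


def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models differing by ONE parameter.

    Returns (statistic, p_value). For one degree of freedom the chi-square
    survival function equals erfc(sqrt(statistic / 2)).
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    if statistic < 0:
        raise ValueError("the full model cannot fit worse than the reduced one")
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods, not values from the study.
stat, p = likelihood_ratio_test_1df(loglik_reduced=-4210.0, loglik_full=-4202.0)
```

With more than one added parameter, the p-value would need the chi-square distribution with the matching degrees of freedom rather than this one-parameter shortcut.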
- Award ID(s): 2516270
- PAR ID: 10594909
- Publisher / Repository: PNAS
- Date Published:
- Journal Name: Proceedings of the National Academy of Sciences
- Volume: 121
- Issue: 34
- ISSN: 0027-8424
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Background: Metastatic cancer remains one of the leading causes of cancer-related mortality worldwide. Yet, the prediction of survivability in this population remains limited by heterogeneous clinical presentations and high-dimensional molecular features. Advances in machine learning (ML) provide an opportunity to integrate diverse patient- and tumor-level factors into explainable predictive ML models. Leveraging large real-world datasets and modern ML techniques can enable improved risk stratification and precision oncology. Objective: This study aimed to develop and interpret ML models for predicting overall survival in patients with metastatic cancer using the Memorial Sloan Kettering-Metastatic (MSK-MET) dataset and to identify key prognostic biomarkers through explainable artificial intelligence techniques. Methods: We performed a retrospective analysis of the MSK-MET cohort, comprising 25,775 patients across 27 tumor types. After data cleaning and balancing, 20,338 patients were included. Overall survival was defined as deceased versus living at last follow-up. Five classifiers (extreme gradient boosting [XGBoost], logistic regression, random forest, decision tree, and naive Bayes) were trained using an 80/20 stratified split and optimized via grid search with 5-fold cross-validation. Model performance was assessed using accuracy, area under the curve (AUC), precision, recall, and F1-score. Model explainability was achieved using Shapley additive explanations (SHAP). Survival analyses included Kaplan-Meier estimates, Cox proportional hazards models, and an XGBoost-Cox model for time-to-event prediction. The positive predictive value and negative predictive value were calculated at the Youden index–optimal threshold. Results: XGBoost achieved the highest performance (accuracy=0.74; AUC=0.82), outperforming other classifiers. In survival analyses, the XGBoost-Cox model with a concordance index (C-index) of 0.70 exceeded the traditional Cox model (C-index=0.66). 
SHAP analysis and Cox models consistently identified metastatic site count, tumor mutational burden, fraction of genome altered, and the presence of distant liver and bone metastases as among the strongest prognostic factors, a pattern that held at both the pan-cancer level and recurrently across cancer-specific models. At the cancer-specific level, performance varied; prostate cancer achieved the highest predictive accuracy (AUC=0.88), while pancreatic cancer was notably more challenging (AUC=0.68). Kaplan-Meier analyses demonstrated marked survival separation between patients with and without metastases (80-month survival: approximately 0.80 vs 0.30). At the Youden-optimal threshold, positive predictive value and negative predictive value were approximately 70% and 80%, respectively, supporting clinical use for risk stratification. Conclusions: Explainable ML models, particularly XGBoost combined with SHAP, can strongly predict survivability in metastatic cancers while highlighting clinically meaningful features. These findings support the use of ML-based tools for patient counseling, treatment planning, and integration into precision oncology workflows. Future work should include external validation on independent cohorts, integration with electronic health records via Fast Healthcare Interoperability Resources–based dashboards, and prospective clinician-in-the-loop evaluation to assess real-world use.
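The abstract above reports positive and negative predictive values at the Youden index–optimal threshold. A minimal sketch of that calculation is given below, assuming binary labels (1 = event) and a "predict positive when score >= threshold" rule; the function names and toy data are hypothetical and are not taken from the MSK-MET analysis.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, tn, fn) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def youden_threshold(y_true, scores):
    """Threshold maximizing J = sensitivity + specificity - 1.

    Assumes both classes are present; ties keep the lowest threshold.
    """
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion_counts(y_true, scores, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def ppv_npv(y_true, scores, threshold):
    """Positive and negative predictive value at a given threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Hypothetical toy data, not values from the study.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
threshold, j_stat = youden_threshold(y_true, scores)
ppv, npv = ppv_npv(y_true, scores, threshold)
```

Because PPV and NPV depend on outcome prevalence, values obtained this way describe the evaluated sample and may shift in a population with a different event rate.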
-
Abstract Accurate prediction of suicide risk among children and adolescents within an actionable time frame is an important but challenging task. Very few studies have comprehensively considered the clinical risk factors available to produce quantifiable risk scores for estimation of short- and long-term suicide risk for the pediatric population. In this paper, we built machine learning models for predicting suicidal behavior among children and adolescents based on their longitudinal clinical records, and determined short- and long-term risk factors. This retrospective study used deidentified structured electronic health records (EHR) from the Connecticut Children’s Medical Center covering the period from 1 October 2011 to 30 September 2016. Clinical records of 41,721 young patients (10–18 years old) were included for analysis. Candidate predictors included demographics, diagnosis, laboratory tests, and medications. Different prediction windows ranging from 0 to 365 days were adopted. For each prediction window, candidate predictors were first screened by univariate statistical tests, and then a predictive model was built via a sequential forward feature selection procedure. We grouped the selected predictors and estimated their contributions to risk prediction at different prediction window lengths. The developed predictive models predicted suicidal behavior across all prediction windows with AUCs varying from 0.81 to 0.86. For all prediction windows, the models detected 53–62% of suicide-positive subjects with 90% specificity. The models performed better with shorter prediction windows, and predictor importance varied across prediction windows, illustrating short- and long-term risks. Our findings demonstrated that routinely collected EHRs can be used to create accurate predictive models for suicide risk among children and adolescents.
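The "detected 53–62% of suicide-positive subjects with 90% specificity" figure is a sensitivity read off at a fixed specificity. A minimal sketch of that reading follows, assuming a "predict positive when score > threshold" rule; the function name and toy data are hypothetical.

```python
def sensitivity_at_specificity(y_true, scores, target_spec=0.90):
    """Highest sensitivity among thresholds whose specificity >= target_spec.

    A subject is predicted positive when score > threshold. Candidate
    thresholds are the observed scores; using the maximum score as the
    threshold predicts nobody positive, so a valid candidate always exists.
    Returns (sensitivity, threshold).
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    best_sens, best_t = -1.0, None
    for t in sorted(set(scores)):
        spec = sum(s <= t for s in negatives) / len(negatives)
        if spec < target_spec:
            continue
        sens = sum(s > t for s in positives) / len(positives)
        if sens > best_sens:
            best_sens, best_t = sens, t
    return best_sens, best_t


# Hypothetical toy data: ten negatives scored 1..10 and four positives.
y_true = [0] * 10 + [1] * 4
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9.5, 10.5, 11, 4]
sens, thr = sensitivity_at_specificity(y_true, scores, target_spec=0.90)
```

Fixing specificity first keeps the false-positive rate at a level chosen in advance, which is why the sensitivity is then reported as the quantity that varies across prediction windows.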
-
Abstract Background: The clinical utility of machine-learning (ML) algorithms for breast cancer risk prediction and screening practices is unknown. We compared classification of lifetime breast cancer risk based on ML and the BOADICEA model. We explored the differences in risk classification and their clinical impact on screening practices. Methods: We used three different ML algorithms and the BOADICEA model to estimate lifetime breast cancer risk in a sample of 112,587 individuals from 2481 families from the Oncogenetic Unit, Geneva University Hospitals. Performance of algorithms was evaluated using the area under the receiver operating characteristic (AU-ROC) curve. Risk reclassification was compared for 36,146 breast cancer-free women of ages 20–80. The impact on recommendations for mammography surveillance was based on the Swiss Surveillance Protocol. Results: The predictive accuracy of ML-based algorithms (0.843 ≤ AU-ROC ≤ 0.889) was superior to BOADICEA (AU-ROC = 0.639) and reclassified 35.3% of women in different risk categories. The largest reclassification (20.8%) was observed in women characterised as ‘near population’ risk by BOADICEA. Reclassification had the largest impact on screening practices of women younger than 50. Conclusion: ML-based reclassification of lifetime breast cancer risk occurred in approximately one in three women. Reclassification is important for younger women because it impacts clinical decision-making for the initiation of screening.
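The AU-ROC used to compare the algorithms above can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case. A minimal sketch of that pairwise definition is shown below; the function name and toy data are hypothetical, and the quadratic loop is meant for illustration rather than for a cohort of this size.

```python
def auc_pairwise(y_true, scores):
    """Area under the ROC curve as the fraction of (case, non-case) pairs
    in which the case has the higher score; ties count as half a pair."""
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    non_cases = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))


# Hypothetical toy data, not values from the study.
auc = auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to scores that do not separate cases from non-cases, and 1.0 to perfect separation.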
-
There is a profound need to identify modifiable risk factors to screen and prevent pancreatic cancer. Air pollution, including fine particulate matter (PM2.5), is increasingly recognized as a risk factor for cancer. We conducted a case-control study using data from the electronic health record (EHR) of Duke University Health System, 15-year residential history, NASA satellite fine particulate matter (PM2.5), and neighborhood socioeconomic data. Using deterministic and probabilistic linkage algorithms, we linked residential history and EHR data to quantify long-term PM2.5 exposure. Logistic regression models quantified the association between a 1 interquartile range (IQR) increase in PM2.5 concentration and pancreatic cancer risk. The study included 203 cases and 5027 controls (median age of 59 years, 62% female, 26% Black). Individuals with pancreatic cancer had higher average annual exposure (9.4 μg/m3) as compared to controls, and an IQR increase in average annual PM2.5 was associated with greater odds of pancreatic cancer (odds ratio = 1.20; 95% CI, 1.00-1.44). These findings highlight the link between elevated PM2.5 exposure and increased pancreatic cancer risk. They may inform screening strategies for high-risk populations and guide air pollution policies to mitigate exposure. This article is part of a Special Collection on Environmental Epidemiology.
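An odds ratio "per IQR increase" is a logistic regression coefficient rescaled from one exposure unit to the width of the interquartile range. A minimal sketch of that rescaling follows, together with the standard error implied by the reported interval. The per-unit coefficient, its standard error, and the IQR width used in the first call are hypothetical, since the abstract does not report them; only the odds ratio of 1.20 and the interval 1.00-1.44 come from the text.

```python
import math


def odds_ratio_per_increment(beta_per_unit, se_per_unit, increment, z=1.96):
    """Odds ratio and Wald 95% CI for an `increment`-unit rise in exposure."""
    log_or = beta_per_unit * increment
    half_width = z * se_per_unit * increment
    return (
        math.exp(log_or),
        math.exp(log_or - half_width),
        math.exp(log_or + half_width),
    )


def se_from_ci(lower, upper, z=1.96):
    """Standard error of the log odds ratio implied by a Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


# Hypothetical per-unit coefficient, standard error, and IQR width.
or_iqr, ci_low, ci_high = odds_ratio_per_increment(
    beta_per_unit=math.log(1.2) / 2.0, se_per_unit=0.05, increment=2.0
)

# Standard error of the log odds ratio implied by the reported 1.00-1.44 interval.
se_reported = se_from_ci(1.00, 1.44)
```

Reporting per IQR rather than per unit makes the effect size refer to a contrast actually observed in the study population, at the cost of being specific to that population's exposure distribution.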

