Generalizability of a Machine Learning Model for Improving Utilization of Parathyroid Hormone-Related Peptide Testing across Multiple Clinical Centers

Yang, He S; Pan, Weishen; Wang, Yingheng; Zaydman, Mark A; Spies, Nicholas C; Zhao, Zhen; Guise, Theresa A; Meng, Qing H; Wang, Fei

doi:10.1093/clinchem/hvad141

Citation Details

Generalizability of a Machine Learning Model for Improving Utilization of Parathyroid Hormone-Related Peptide Testing across Multiple Clinical Centers

Abstract BackgroundMeasuring parathyroid hormone-related peptide (PTHrP) helps diagnose the humoral hypercalcemia of malignancy, but is often ordered for patients with low pretest probability, resulting in poor test utilization. Manual review of results to identify inappropriate PTHrP orders is a cumbersome process. MethodsUsing a dataset of 1330 patients from a single institute, we developed a machine learning (ML) model to predict abnormal PTHrP results. We then evaluated the performance of the model on two external datasets. Different strategies (model transporting, retraining, rebuilding, and fine-tuning) were investigated to improve model generalizability. Maximum mean discrepancy (MMD) was adopted to quantify the shift of data distributions across different datasets. ResultsThe model achieved an area under the receiver operating characteristic curve (AUROC) of 0.936, and a specificity of 0.842 at 0.900 sensitivity in the development cohort. Directly transporting this model to two external datasets resulted in a deterioration of AUROC to 0.838 and 0.737, with the latter having a larger MMD corresponding to a greater data shift compared to the original dataset. Model rebuilding using site-specific data improved AUROC to 0.891 and 0.837 on the two sites, respectively. When external data is insufficient for retraining, a fine-tuning strategy also improved model utility. ConclusionsML offers promise to improve PTHrP test utilization while relieving the burden of manual review. Transporting a ready-made model to external datasets may lead to performance deterioration due to data distribution shift. Model retraining or rebuilding could improve generalizability when there are enough data, and model fine-tuning may be favorable when site-specific data is limited. more »