Revisiting the Trustworthiness of Saliency Methods in Radiology AI

Zhang, Jiajin; Chao, Hanqing; Dasegowda, Giridhar; Wang, Ge; Kalra, Mannudeep K.; Yan, Pingkun

doi:10.1148/ryai.220221

Purpose: To determine if saliency maps in radiology artificial intelligence (AI) are vulnerable to subtle perturbations of the input, which could potentially lead to misleading interpretations, using Prediction-Saliency Correlation (PSC) for evaluating the sensitivity and robustness of saliency methods. Materials and Methods: In this retrospective study, locally trained deep learning models and a research prototype provided by a commercial vender were systematically evaluated on 191,229 chest radiographs from the CheXpert dataset(1,2) and 7,022 MRI images of human brain tumor classification dataset(3). Two radiologists performed a reader study on 270 chest radiographs pairs. A model-agnostic approach for computing the PSC coefficient was used to evaluate the sensitivity and robustness of seven commonly used saliency methods. Results: Leveraging locally trained model parameters, we revealed the saliency methods’ low sensitivity (maximum PSC = 0.25, 95% CI: 0.12, 0.38) and weak robustness (maximum PSC = 0.12, 95% CI: 0.0, 0.25) on the CheXpert dataset. Without model specifics, we also showed that the saliency maps from a commercial prototype could be irrelevant to the model output (area under the receiver operating characteristic curve dropped by 8.6% without affecting the saliency map). The human observer studies confirmed that is difficult for experts to identify the perturbed images, who had less than 44.8% correctness. Conclusion: Popular saliency methods scored low PSC values on the two datasets of perturbed chest radiographs, indicating weak sensitivity and robustness. The proposed PSC metric provides a valuable quantification tool for validating the trustworthiness of medical AI explainability. Abbreviations: AI = artificial intelligence, PSC = prediction-saliency correlation, AUC = area under the receiver operating characteristic curve, SSIM = structural similarity index measure. Summary: Systematic evaluation of saliency methods through subtle perturbations in chest radiographs and brain MRI images demonstrated low sensitivity and robustness of those methods, warranting caution when using saliency methods that may misrepresent changes in AI model prediction.

More Like this