Abstract BackgroundLung cancer is the deadliest and second most common cancer in the United States due to the lack of symptoms for early diagnosis. Pulmonary nodules are small abnormal regions that can be potentially correlated to the occurrence of lung cancer. Early detection of these nodules is critical because it can significantly improve the patient's survival rates. Thoracic thin‐sliced computed tomography (CT) scanning has emerged as a widely used method for diagnosing and prognosis lung abnormalities. PurposeThe standard clinical workflow of detecting pulmonary nodules relies on radiologists to analyze CT images to assess the risk factors of cancerous nodules. However, this approach can be error‐prone due to the various nodule formation causes, such as pollutants and infections. Deep learning (DL) algorithms have recently demonstrated remarkable success in medical image classification and segmentation. As an ever more important assistant to radiologists in nodule detection, it is imperative ensure the DL algorithm and radiologist to better understand the decisions from each other. This study aims to develop a framework integrating explainable AI methods to achieve accurate pulmonary nodule detection. MethodsA robust and explainable detection (RXD) framework is proposed, focusing on reducing false positives in pulmonary nodule detection. Its implementation is based on an explanation supervision method, which uses nodule contours of radiologists as supervision signals to force the model to learn nodule morphologies, enabling improved learning ability on small dataset, and enable small dataset learning ability. In addition, two imputation methods are applied to the nodule region annotations to reduce the noise within human annotations and allow the model to have robust attributions that meet human expectations. The 480, 265, and 265 CT image sets from the public Lung Image Database Consortium and Image Database Resource Initiative (LIDC‐IDRI) dataset are used for training, validation, and testing. ResultsUsing only 10, 30, 50, and 100 training samples sequentially, our method constantly improves the classification performance and explanation quality of baseline in terms of Area Under the Curve (AUC) and Intersection over Union (IoU). In particular, our framework with a learnable imputation kernel improves IoU from baseline by 24.0% to 80.0%. A pre‐defined Gaussian imputation kernel achieves an even greater improvement, from 38.4% to 118.8% from baseline. Compared to the baseline trained on 100 samples, our method shows less drop in AUC when trained on fewer samples. A comprehensive comparison of interpretability shows that our method aligns better with expert opinions. ConclusionsA pulmonary nodule detection framework was demonstrated using public thoracic CT image datasets. The framework integrates the robust explanation supervision (RES) technique to ensure the performance of nodule classification and morphology. The method can reduce the workload of radiologists and enable them to focus on the diagnosis and prognosis of the potential cancerous pulmonary nodules at the early stage to improve the outcomes for lung cancer patients.
more »
« less
Confidence-Aware Severity Assessment of Lung Disease from Chest X-Rays Using Deep Neural Network on a Multi-Reader Dataset
Abstract In this study, we present a method based on Monte Carlo Dropout (MCD) as Bayesian neural network (BNN) approximation for confidence-aware severity classification of lung diseases in COVID-19 patients using chest X-rays (CXRs). Trained and tested on 1208 CXRs from Hospital 1 in the USA, the model categorizes severity into four levels (i.e., normal, mild, moderate, and severe) based on lung consolidation and opacity. Severity labels, determined by the median consensus of five radiologists, serve as the reference standard. The model’s performance is internally validated against evaluations from an additional radiologist and two residents that were excluded from the median. The performance of the model is further evaluated on additional internal and external datasets comprising 2200 CXRs from the same hospital and 1300 CXRs from Hospital 2 in South Korea. The model achieves an average area under the curve (AUC) of 0.94 ± 0.01 across all classes in the primary dataset, surpassing human readers in each severity class and achieves a higher Kendall correlation coefficient (KCC) of 0.80 ± 0.03. The performance of the model is consistent across varied datasets, highlighting its generalization. A key aspect of the model is its predictive uncertainty (PU), which is inversely related to the level of agreement among radiologists, particularly in mild and moderate cases. The study concludes that the model outperforms human readers in severity assessment and maintains consistent accuracy across diverse datasets. Its ability to provide confidence measures in predictions is pivotal for potential clinical use, underscoring the BNN’s role in enhancing diagnostic precision in lung disease analysis through CXR.
more »
« less
- Award ID(s):
- 2205152
- PAR ID:
- 10535155
- Publisher / Repository:
- Springer Science + Business Media
- Date Published:
- Journal Name:
- Journal of Imaging Informatics in Medicine
- Volume:
- 38
- Issue:
- 2
- ISSN:
- 2948-2933
- Format(s):
- Medium: X Size: p. 793-803
- Size(s):
- p. 793-803
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Brankov, Jovan G; Anastasio, Mark A (Ed.)Artificial intelligence (AI) tools are designed to improve the efficacy and efficiency of data analysis and interpretation by the human decision maker. However, we know little about the optimal ways to present AI output to providers. This study used radiology image interpretation with AI-based decision support to explore the impact of different forms of AI output on reader performance. Readers included 5 experienced radiologists and 3 radiology residents reporting on a series of COVID chest x-ray images. Four different forms (1 word summarizing diagnoses (normal, mild, moderate, severe), probability graph, heatmap, heatmap plus probability graph) of AI outputs (plus no AI feedback) were evaluated. Results reveal that most decisions regarding presence/absence of COVID without AI were correct and overall remained unchanged across all types of AI outputs. Fewer than 1% of decisions that were changed as a function of seeing the AI output were negative (true positive to false negative or true negative to false positive) regarding presence/absence of COVID; and about 1% were positive (false negative to true positive, false positive to true negative). More complex output formats (e.g., heat map plus a probability graph) tend to increase reading time and the number of scans between the clinical image and the AI outputs as revealed through eyetracking. The key to the success of AI tools in medical imaging will be to incorporate the human into the overall process to optimize and synergize the human-computer dyad, since at least for the foreseeable future, the human is and will be the ultimate decision maker. Our results demonstrate that the form of the AI output is important as it can impact clinical decision making and efficiency.more » « less
-
Pollard, Tom J. (Ed.)Modern predictive models require large amounts of data for training and evaluation, absence of which may result in models that are specific to certain locations, populations in them and clinical practices. Yet, best practices for clinical risk prediction models have not yet considered such challenges to generalizability. Here we ask whether population- and group-level performance of mortality prediction models vary significantly when applied to hospitals or geographies different from the ones in which they are developed. Further, what characteristics of the datasets explain the performance variation? In this multi-center cross-sectional study, we analyzed electronic health records from 179 hospitals across the US with 70,126 hospitalizations from 2014 to 2015. Generalization gap, defined as difference between model performance metrics across hospitals, is computed for area under the receiver operating characteristic curve (AUC) and calibration slope. To assess model performance by the race variable, we report differences in false negative rates across groups. Data were also analyzed using a causal discovery algorithm “Fast Causal Inference” that infers paths of causal influence while identifying potential influences associated with unmeasured variables. When transferring models across hospitals, AUC at the test hospital ranged from 0.777 to 0.832 (1st-3rd quartile or IQR; median 0.801); calibration slope from 0.725 to 0.983 (IQR; median 0.853); and disparity in false negative rates from 0.046 to 0.168 (IQR; median 0.092). Distribution of all variable types (demography, vitals, and labs) differed significantly across hospitals and regions. The race variable also mediated differences in the relationship between clinical variables and mortality, by hospital/region. In conclusion, group-level performance should be assessed during generalizability checks to identify potential harms to the groups. Moreover, for developing methods to improve model performance in new environments, a better understanding and documentation of provenance of data and health processes are needed to identify and mitigate sources of variation.more » « less
-
Medical imaging data annotation is expensive and time-consuming. Supervised deep learning approaches may encounter overfitting if trained with limited medical data, and further affect the robustness of computer-aided diagnosis (CAD) on CT scans collected by various scanner vendors. Additionally, the high false-positive rate in automatic lung nodule detection methods prevents their applications in daily clinical routine diagnosis. To tackle these issues, we first introduce a novel self-learning schema to train a pre-trained model by learning rich feature representatives from large-scale unlabeled data without extra annotation, which guarantees a consistent detection performance over novel datasets. Then, a 3D feature pyramid network ( 3DFPN ) is proposed for high-sensitivity nodule detection by extracting multi-scale features, where the weights of the backbone network are initialized by the pre-trained model and then fine-tuned in a supervised manner. Further, a High Sensitivity and Specificity ( HS 2 ) network is proposed to reduce false positives by tracking the appearance changes among continuous CT slices on Location History Images (LHI) for the detected nodule candidates. The proposed method’s performance and robustness are evaluated on several publicly available datasets, including LUNA16, SPIE-AAPM, LungTIME, and HMS. Our proposed detector achieves the state-of-the-art result of 90.6 % sensitivity at 1 / 8 false positive per scan on the LUNA16 dataset. The proposed framework’s generalizability has been evaluated on three additional datasets (i.e., SPIE-AAPM, LungTIME, and HMS) captured by different types of CT scanners.more » « less
-
We introduce an active, semisupervised algorithm that utilizes Bayesian experimental design to address the shortage of annotated images required to train and validate Artificial Intelligence (AI) models for lung cancer screening with computed tomography (CT) scans. Our approach incorporates active learning with semisupervised expectation maximization to emulate the human in the loop for additional ground truth labels to train, evaluate, and update the neural network models. Bayesian experimental design is used to intelligently identify which unlabeled samples need ground truth labels to enhance the model’s performance. We evaluate the proposed Active Semi-supervised Expectation Maximization for Computer aided diagnosis (CAD) tasks (ASEM-CAD) using three public CT scans datasets: the National Lung Screening Trial (NLST), the Lung Image Database Consortium (LIDC), and Kaggle Data Science Bowl 2017 for lung cancer classification using CT scans. ASEM-CAD can accurately classify suspicious lung nodules and lung cancer cases with an area under the curve (AUC) of 0.94 (Kaggle), 0.95 (NLST), and 0.88 (LIDC) with significantly fewer labeled images compared to a fully supervised model. This study addresses one of the significant challenges in early lung cancer screenings using low-dose computed tomography (LDCT) scans and is a valuable contribution towards the development and validation of deep learning algorithms for lung cancer screening and other diagnostic radiology examinations.more » « less
An official website of the United States government
