skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Confidence-Aware Severity Assessment of Lung Disease from Chest X-Rays Using Deep Neural Network on a Multi-Reader Dataset
Abstract In this study, we present a method based on Monte Carlo Dropout (MCD) as Bayesian neural network (BNN) approximation for confidence-aware severity classification of lung diseases in COVID-19 patients using chest X-rays (CXRs). Trained and tested on 1208 CXRs from Hospital 1 in the USA, the model categorizes severity into four levels (i.e., normal, mild, moderate, and severe) based on lung consolidation and opacity. Severity labels, determined by the median consensus of five radiologists, serve as the reference standard. The model’s performance is internally validated against evaluations from an additional radiologist and two residents that were excluded from the median. The performance of the model is further evaluated on additional internal and external datasets comprising 2200 CXRs from the same hospital and 1300 CXRs from Hospital 2 in South Korea. The model achieves an average area under the curve (AUC) of 0.94 ± 0.01 across all classes in the primary dataset, surpassing human readers in each severity class and achieves a higher Kendall correlation coefficient (KCC) of 0.80 ± 0.03. The performance of the model is consistent across varied datasets, highlighting its generalization. A key aspect of the model is its predictive uncertainty (PU), which is inversely related to the level of agreement among radiologists, particularly in mild and moderate cases. The study concludes that the model outperforms human readers in severity assessment and maintains consistent accuracy across diverse datasets. Its ability to provide confidence measures in predictions is pivotal for potential clinical use, underscoring the BNN’s role in enhancing diagnostic precision in lung disease analysis through CXR.  more » « less
Award ID(s):
2205152
PAR ID:
10535155
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Journal of Imaging Informatics in Medicine
Volume:
38
Issue:
2
ISSN:
2948-2933
Format(s):
Medium: X Size: p. 793-803
Size(s):
p. 793-803
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract BackgroundLung cancer is the deadliest and second most common cancer in the United States due to the lack of symptoms for early diagnosis. Pulmonary nodules are small abnormal regions that can be potentially correlated to the occurrence of lung cancer. Early detection of these nodules is critical because it can significantly improve the patient's survival rates. Thoracic thin‐sliced computed tomography (CT) scanning has emerged as a widely used method for diagnosing and prognosis lung abnormalities. PurposeThe standard clinical workflow of detecting pulmonary nodules relies on radiologists to analyze CT images to assess the risk factors of cancerous nodules. However, this approach can be error‐prone due to the various nodule formation causes, such as pollutants and infections. Deep learning (DL) algorithms have recently demonstrated remarkable success in medical image classification and segmentation. As an ever more important assistant to radiologists in nodule detection, it is imperative ensure the DL algorithm and radiologist to better understand the decisions from each other. This study aims to develop a framework integrating explainable AI methods to achieve accurate pulmonary nodule detection. MethodsA robust and explainable detection (RXD) framework is proposed, focusing on reducing false positives in pulmonary nodule detection. Its implementation is based on an explanation supervision method, which uses nodule contours of radiologists as supervision signals to force the model to learn nodule morphologies, enabling improved learning ability on small dataset, and enable small dataset learning ability. In addition, two imputation methods are applied to the nodule region annotations to reduce the noise within human annotations and allow the model to have robust attributions that meet human expectations. The 480, 265, and 265 CT image sets from the public Lung Image Database Consortium and Image Database Resource Initiative (LIDC‐IDRI) dataset are used for training, validation, and testing. ResultsUsing only 10, 30, 50, and 100 training samples sequentially, our method constantly improves the classification performance and explanation quality of baseline in terms of Area Under the Curve (AUC) and Intersection over Union (IoU). In particular, our framework with a learnable imputation kernel improves IoU from baseline by 24.0% to 80.0%. A pre‐defined Gaussian imputation kernel achieves an even greater improvement, from 38.4% to 118.8% from baseline. Compared to the baseline trained on 100 samples, our method shows less drop in AUC when trained on fewer samples. A comprehensive comparison of interpretability shows that our method aligns better with expert opinions. ConclusionsA pulmonary nodule detection framework was demonstrated using public thoracic CT image datasets. The framework integrates the robust explanation supervision (RES) technique to ensure the performance of nodule classification and morphology. The method can reduce the workload of radiologists and enable them to focus on the diagnosis and prognosis of the potential cancerous pulmonary nodules at the early stage to improve the outcomes for lung cancer patients. 
    more » « less
  2. Pollard, Tom J. (Ed.)
    Modern predictive models require large amounts of data for training and evaluation, absence of which may result in models that are specific to certain locations, populations in them and clinical practices. Yet, best practices for clinical risk prediction models have not yet considered such challenges to generalizability. Here we ask whether population- and group-level performance of mortality prediction models vary significantly when applied to hospitals or geographies different from the ones in which they are developed. Further, what characteristics of the datasets explain the performance variation? In this multi-center cross-sectional study, we analyzed electronic health records from 179 hospitals across the US with 70,126 hospitalizations from 2014 to 2015. Generalization gap, defined as difference between model performance metrics across hospitals, is computed for area under the receiver operating characteristic curve (AUC) and calibration slope. To assess model performance by the race variable, we report differences in false negative rates across groups. Data were also analyzed using a causal discovery algorithm “Fast Causal Inference” that infers paths of causal influence while identifying potential influences associated with unmeasured variables. When transferring models across hospitals, AUC at the test hospital ranged from 0.777 to 0.832 (1st-3rd quartile or IQR; median 0.801); calibration slope from 0.725 to 0.983 (IQR; median 0.853); and disparity in false negative rates from 0.046 to 0.168 (IQR; median 0.092). Distribution of all variable types (demography, vitals, and labs) differed significantly across hospitals and regions. The race variable also mediated differences in the relationship between clinical variables and mortality, by hospital/region. In conclusion, group-level performance should be assessed during generalizability checks to identify potential harms to the groups. Moreover, for developing methods to improve model performance in new environments, a better understanding and documentation of provenance of data and health processes are needed to identify and mitigate sources of variation. 
    more » « less
  3. Medical imaging data annotation is expensive and time-consuming. Supervised deep learning approaches may encounter overfitting if trained with limited medical data, and further affect the robustness of computer-aided diagnosis (CAD) on CT scans collected by various scanner vendors. Additionally, the high false-positive rate in automatic lung nodule detection methods prevents their applications in daily clinical routine diagnosis. To tackle these issues, we first introduce a novel self-learning schema to train a pre-trained model by learning rich feature representatives from large-scale unlabeled data without extra annotation, which guarantees a consistent detection performance over novel datasets. Then, a 3D feature pyramid network ( 3DFPN ) is proposed for high-sensitivity nodule detection by extracting multi-scale features, where the weights of the backbone network are initialized by the pre-trained model and then fine-tuned in a supervised manner. Further, a High Sensitivity and Specificity ( HS 2 ) network is proposed to reduce false positives by tracking the appearance changes among continuous CT slices on Location History Images (LHI) for the detected nodule candidates. The proposed method’s performance and robustness are evaluated on several publicly available datasets, including LUNA16, SPIE-AAPM, LungTIME, and HMS. Our proposed detector achieves the state-of-the-art result of 90.6 % sensitivity at 1 / 8 false positive per scan on the LUNA16 dataset. The proposed framework’s generalizability has been evaluated on three additional datasets (i.e., SPIE-AAPM, LungTIME, and HMS) captured by different types of CT scanners. 
    more » « less
  4. We introduce an active, semisupervised algorithm that utilizes Bayesian experimental design to address the shortage of annotated images required to train and validate Artificial Intelligence (AI) models for lung cancer screening with computed tomography (CT) scans. Our approach incorporates active learning with semisupervised expectation maximization to emulate the human in the loop for additional ground truth labels to train, evaluate, and update the neural network models. Bayesian experimental design is used to intelligently identify which unlabeled samples need ground truth labels to enhance the model’s performance. We evaluate the proposed Active Semi-supervised Expectation Maximization for Computer aided diagnosis (CAD) tasks (ASEM-CAD) using three public CT scans datasets: the National Lung Screening Trial (NLST), the Lung Image Database Consortium (LIDC), and Kaggle Data Science Bowl 2017 for lung cancer classification using CT scans. ASEM-CAD can accurately classify suspicious lung nodules and lung cancer cases with an area under the curve (AUC) of 0.94 (Kaggle), 0.95 (NLST), and 0.88 (LIDC) with significantly fewer labeled images compared to a fully supervised model. This study addresses one of the significant challenges in early lung cancer screenings using low-dose computed tomography (LDCT) scans and is a valuable contribution towards the development and validation of deep learning algorithms for lung cancer screening and other diagnostic radiology examinations. 
    more » « less
  5. null (Ed.)
    Computer-aided diagnosis (CAD) systems must constantly cope with the perpetual changes in data distribution caused by different sensing technologies, imaging protocols, and patient populations. Adapting these systems to new domains often requires significant amounts of labeled data for re-training. This process is labor-intensive and time-consuming. We propose a memory-augmented capsule network for the rapid adaptation of CAD models to new domains. It consists of a capsule network that is meant to extract feature embeddings from some high-dimensional input, and a memory-augmented task network meant to exploit its stored knowledge from the target domains. Our network is able to efficiently adapt to unseen domains using only a few annotated samples. We evaluate our method using a large-scale public lung nodule dataset (LUNA), coupled with our own collected lung nodules and incidental lung nodules datasets. When trained on the LUNA dataset, our network requires only 30 additional samples from our collected lung nodule and incidental lung nodule datasets to achieve clinically relevant performance (0.925 and 0.891 area under receiving operating characteristic curves (AUROC), respectively). This result is equivalent to using two orders of magnitude less labeled training data while achieving the same performance. We further evaluate our method by introducing heavy noise, artifacts, and adversarial attacks. Under these severe conditions, our network’s AUROC remains above 0.7 while the performance of state-of-the-art approaches reduce to chance level 
    more » « less