Title: Risk-Aware Machine Learning Classifier for Skin Lesion Diagnosis
Knowing when a machine learning system is not confident about its prediction is crucial in medical domains where safety is critical. Ideally, a machine learning algorithm should make a prediction only when it is highly certain about its competency, and refer the case to physicians otherwise. In this paper, we investigate how Bayesian deep learning can improve the performance of the machine–physician team in the skin lesion classification task. We used the publicly available HAM10000 dataset, which includes samples from seven common skin lesion categories: Melanoma (MEL), Melanocytic Nevi (NV), Basal Cell Carcinoma (BCC), Actinic Keratoses and Intraepithelial Carcinoma (AKIEC), Benign Keratosis (BKL), Dermatofibroma (DF), and Vascular (VASC) lesions. Our experimental results show that Bayesian deep networks can boost the diagnostic performance of the standard DenseNet-169 model from 81.35% to 83.59% without incurring additional parameters or heavy computation. More importantly, a hybrid physician–machine workflow reaches a classification accuracy of 90% while referring only 35% of the cases to physicians. The findings are expected to generalize to other medical diagnosis applications. We believe that the availability of risk-aware machine learning methods will enable a wider adoption of machine learning technology in clinical settings.
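As an illustration of the risk-aware workflow described above, the sketch below uses Monte Carlo dropout, one common way to obtain Bayesian-style uncertainty from a standard network at essentially no extra parameter cost, to decide which lesion images to refer to a physician. The dropout placement on the DenseNet-169 head, the number of stochastic passes, and the entropy threshold are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch: Monte Carlo dropout uncertainty with a referral rule.
# The dropout head, number of stochastic passes, and entropy threshold are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F
from torchvision.models import densenet169

NUM_CLASSES = 7   # MEL, NV, BCC, AKIEC, BKL, DF, VASC
MC_SAMPLES = 20   # stochastic forward passes per image

# DenseNet-169 backbone with an explicit dropout layer in the classifier head.
model = densenet169(weights=None)
model.classifier = torch.nn.Sequential(
    torch.nn.Dropout(p=0.3),
    torch.nn.Linear(1664, NUM_CLASSES),  # DenseNet-169 produces 1664 features
)

def enable_dropout(module: torch.nn.Module) -> None:
    """Keep dropout stochastic while the rest of the network stays in eval mode."""
    for layer in module.modules():
        if isinstance(layer, torch.nn.Dropout):
            layer.train()

@torch.no_grad()
def predict_with_uncertainty(images: torch.Tensor):
    """Return mean class probabilities and predictive entropy per image."""
    model.eval()
    enable_dropout(model)
    probs = torch.stack(
        [F.softmax(model(images), dim=1) for _ in range(MC_SAMPLES)]
    )                                    # shape: (MC_SAMPLES, batch, classes)
    mean_probs = probs.mean(dim=0)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=1)
    return mean_probs, entropy

def triage(images: torch.Tensor, entropy_threshold: float = 0.5):
    """Predict automatically when uncertainty is low; refer the rest."""
    mean_probs, entropy = predict_with_uncertainty(images)
    predictions = mean_probs.argmax(dim=1)
    refer_to_physician = entropy > entropy_threshold
    return predictions, refer_to_physician

Tuning the threshold trades the fraction of cases referred against the accuracy of the automatically handled cases, which is the trade-off behind the 90% accuracy at a 35% referral rate reported above.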
Award ID(s):
1910973
PAR ID:
10282839
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Journal of Clinical Medicine
Volume:
8
Issue:
8
ISSN:
2077-0383
Page Range / eLocation ID:
1241
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Inadequate at-home management and self-awareness of heart failure (HF) exacerbations are known to be leading causes of the more than 1 million estimated HF-related hospitalizations in the USA alone. Most current at-home HF management protocols include paper guidelines or exploratory health applications that lack rigor and validation at the level of the individual patient. We report on a novel triage methodology that uses machine learning predictions for real-time detection and assessment of exacerbations. Medical specialist opinions on statistically and clinically comprehensive, simulated patient cases were used to train and validate prediction algorithms. Model performance was assessed by comparison to physician panel consensus in a representative, out-of-sample validation set of 100 vignettes. Algorithm prediction accuracy and safety indicators surpassed all individual specialists in identifying consensus opinion on existence/severity of exacerbations and appropriate treatment response. The algorithms also scored the highest sensitivity, specificity, and PPV when assessing the need for emergency care. Lay summary: Here we develop a machine-learning approach for providing real-time decision support to adults diagnosed with congestive heart failure. The algorithm achieves higher exacerbation and triage classification performance than any individual physician when compared to physician consensus opinion.
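    As a small, self-contained illustration of the validation described above, the sketch below scores hypothetical triage predictions against a physician-panel consensus label for the emergency-care decision; the function names and placeholder data are assumptions, not the study's actual vignettes.

# Illustrative sketch: scoring triage predictions against physician-panel
# consensus on a held-out vignette set. Labels and data are placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix

def emergency_care_metrics(consensus_labels, model_predictions):
    """Sensitivity, specificity, and PPV for the binary emergency-care call."""
    tn, fp, fn, tp = confusion_matrix(
        consensus_labels, model_predictions, labels=[0, 1]
    ).ravel()
    sensitivity = tp / (tp + fn)   # exacerbations the model catches
    specificity = tn / (tn + fp)   # stable cases correctly kept at home
    ppv = tp / (tp + fp)           # precision of emergency referrals
    return sensitivity, specificity, ppv

# Placeholder 100-vignette validation set (1 = needs emergency care).
rng = np.random.default_rng(0)
consensus = rng.integers(0, 2, size=100)
predictions = consensus.copy()     # stand-in for model output
print(emergency_care_metrics(consensus, predictions))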
  2. Background: Online physician reviews are an important source of information for prospective patients. In addition, they represent an untapped resource for studying the effects of gender on the doctor-patient relationship. Understanding gender differences in online reviews is important because it may impact the value of those reviews to patients. Documenting gender differences in patient experience may also help to improve the doctor-patient relationship. This is the first large-scale study of physician reviews to extensively investigate gender bias in online reviews or offer recommendations for improvements to online review systems to correct for gender bias and aid patients in selecting a physician. Objective: This study examines 154,305 reviews from across the United States for all medical specialties. Our analysis includes a qualitative and quantitative examination of review content and physician rating with regard to doctor and reviewer gender. Methods: A total of 154,305 reviews were sampled from Google Place reviews. Reviewer and doctor gender were inferred from names. Reviews were coded for overall patient experience (negative or positive) by collapsing a 5-star scale and coded for general categories (process, positive/negative soft skills), which were further subdivided into themes. Computational text processing methods were employed to apply this codebook to the entire data set, rendering it tractable to quantitative methods. Specifically, we estimated binary regression models to examine relationships between physician rating, patient experience themes, physician gender, and reviewer gender. Results: Female reviewers wrote 60% more reviews than men. Male reviewers were more likely to give negative reviews (odds ratio [OR] 1.15, 95% CI 1.10-1.19; P<.001). Reviews of female physicians were considerably more negative than those of male physicians (OR 1.99, 95% CI 1.94-2.14; P<.001). Soft skills were more likely to be mentioned in reviews written by female reviewers and about female physicians. Negative reviews of female doctors were more likely to mention candor (OR 1.61, 95% CI 1.42-1.82; P<.001) and amicability (OR 1.63, 95% CI 1.47-1.90; P<.001). Disrespect was associated with both female physicians (OR 1.42, 95% CI 1.35-1.51; P<.001) and female reviewers (OR 1.27, 95% CI 1.19-1.35; P<.001). Female patients were less likely to report disrespect from female doctors than expected from the base ORs (OR 1.19, 95% CI 1.04-1.32; P=.008), but this effect overrode only the effect for female reviewers. Conclusions: This work reinforces findings in the extensive literature on gender differences and gender bias in patient-physician interaction. Its novel contribution lies in highlighting gender differences in online reviews. These reviews inform patients’ choice of doctor and thus affect both patients and physicians. The evidence of gender bias documented here suggests review sites may be improved by providing information about gender differences, controlling for gender when presenting composite ratings for physicians, and helping users write less biased reviews.
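    To make the modeling step concrete, the sketch below fits a logistic regression of the kind referred to above as a binary regression model and reports exponentiated coefficients as odds ratios with confidence intervals; the file name, column names, and formula are hypothetical, not the study's actual variables.

# Illustrative sketch: odds ratios from a logistic (binary) regression on
# review outcomes. The data file, column names, and formula are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

reviews = pd.read_csv("physician_reviews.csv")  # hypothetical extract
# negative_review: 1 if the collapsed star rating is negative, else 0
# doctor_female, reviewer_female: gender indicators inferred from names
model = smf.logit(
    "negative_review ~ doctor_female * reviewer_female",
    data=reviews,
).fit()

odds_ratios = np.exp(model.params)     # exponentiated coefficients = ORs
ci = np.exp(model.conf_int())          # 95% confidence intervals, OR scale
print(pd.concat([odds_ratios, ci], axis=1))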
  3. Sleep staging has a very important role in diagnosing patients with sleep disorders. In general, this task is very time-consuming for physicians to perform. Deep learning shows great potential to automate this process and remove physician bias from decision making. In this study, we aim to identify recent trends in performance improvement and the causes of these trends. Recent papers on sleep stage classification and interpretability are investigated to explore different modeling and data manipulation techniques, their efficiency, and recent advances. We identify an improvement in performance of up to 12% on standard datasets over the last 5 years. The improvements in performance do not appear to be necessarily correlated to the size of the models, but instead seem to be caused by incorporating new architectural components, such as the use of transformers and contrastive learning.
  4. Explainability is essential for AI models, especially in clinical settings where understanding the model’s decisions is crucial. Despite their impressive performance, black-box AI models are unsuitable for clinical use if their operations cannot be explained to clinicians. While deep neural networks (DNNs) represent the forefront of model performance, their explanations are often not easily interpreted by humans. On the other hand, hand-crafted features extracted to represent different aspects of the input data, combined with traditional machine learning models, are generally more understandable; however, they often lack the effectiveness of advanced models due to human limitations in feature design. To address this, we propose the explainable shallow convolutional neural network (ExShall-CNN), a novel model for medical image processing that combines the interpretability of hand-crafted features with the performance of advanced deep convolutional networks such as U-Net for medical image segmentation. Built on recent advances in machine learning, ExShall-CNN incorporates widely used kernels while ensuring transparency, making its decisions visually interpretable by physicians and clinicians. This balanced approach offers both the accuracy of deep learning models and the explainability needed for clinical applications.
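    The sketch below illustrates the general idea of a shallow, interpretable convolutional model: a single visible convolutional layer seeded with widely used hand-crafted kernels (Sobel gradients, Gaussian blur), followed by a 1x1 convolution that combines the resulting feature maps into per-pixel class scores. The kernel choices and layer sizes are assumptions for illustration, not the published ExShall-CNN architecture.

# Illustrative sketch of a shallow, explainable CNN: one visible convolution
# seeded with familiar hand-crafted kernels, then a 1x1 convolution that maps
# the inspectable feature maps to per-pixel class scores. All sizes and kernel
# choices are assumptions, not the published ExShall-CNN configuration.
import torch
import torch.nn as nn

sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
sobel_y = sobel_x.t()                                          # vertical gradient
gaussian = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
HANDCRAFTED_KERNELS = torch.stack([sobel_x, sobel_y, gaussian])

class ShallowExplainableCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.feature_conv = nn.Conv2d(
            1, HANDCRAFTED_KERNELS.shape[0], kernel_size=3, padding=1, bias=False
        )
        # Seed the visible layer with interpretable kernels so each feature map
        # stays attributable to a recognizable image operation.
        with torch.no_grad():
            self.feature_conv.weight.copy_(HANDCRAFTED_KERNELS.unsqueeze(1))
        self.combine = nn.Conv2d(HANDCRAFTED_KERNELS.shape[0], num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = torch.relu(self.feature_conv(x))   # inspectable feature maps
        return self.combine(features)                 # per-pixel class scores

# Example: a grayscale 128x128 image yields a 2-channel segmentation map.
logits = ShallowExplainableCNN()(torch.randn(1, 1, 128, 128))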
  5. While deep networks have achieved broad success in analyzing natural images, when applied to medical scans, they often fail in unexpected situations. This study investigates model sensitivity to domain shifts, such as data sampled from different hospitals or confounded by demographic variables like sex and race, focusing on chest X-rays and skin lesion images. The key finding is that existing visual backbones lack an appropriate prior for reliable generalization in these settings. Inspired by medical training, the authors propose incorporating explicit medical knowledge communicated in natural language into deep networks. They introduce Knowledge-enhanced Bottlenecks (KnoBo), a class of concept bottleneck models that integrate knowledge priors, enabling reasoning with clinically relevant factors found in medical textbooks or PubMed. KnoBo utilizes retrieval-augmented language models to design an appropriate concept space, paired with an automatic training procedure for recognizing these concepts. Evaluations across 20 datasets demonstrate that KnoBo outperforms fine-tuned models on confounded datasets by 32.4% on average. Additionally, PubMed is identified as a promising resource for enhancing model robustness to domain shifts, outperforming other resources in both information diversity and prediction performance. 
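    As a generic illustration of the concept-bottleneck idea described above, the sketch below scores an image embedding against a handful of clinically phrased concepts and predicts the label from those concept scores alone, so each decision can be read in terms of named medical factors. The concept wording, feature dimensions, and layer shapes are assumptions, not the published KnoBo implementation.

# Minimal sketch of a concept bottleneck classifier: images are scored against
# a small set of named clinical concepts, and the final label is a linear
# function of those scores, exposing the factors behind each prediction.
# Concepts, dimensions, and shapes are assumptions, not the KnoBo code.
import torch
import torch.nn as nn

CONCEPTS = [
    "the lung fields appear hyperinflated",
    "there is a focal opacity suggestive of consolidation",
    "the cardiac silhouette is enlarged",
]

class ConceptBottleneck(nn.Module):
    def __init__(self, image_dim: int = 512, num_classes: int = 2):
        super().__init__()
        # Concept grounding: one score per concept from the image embedding.
        self.concept_head = nn.Linear(image_dim, len(CONCEPTS))
        # Interpretable predictor: the label depends only on concept scores.
        self.classifier = nn.Linear(len(CONCEPTS), num_classes)

    def forward(self, image_embedding: torch.Tensor):
        concept_scores = torch.sigmoid(self.concept_head(image_embedding))
        logits = self.classifier(concept_scores)
        return logits, concept_scores      # scores expose the reasoning

# Example: a batch of 4 precomputed image embeddings (e.g., from a frozen
# vision backbone) yields class logits plus per-concept evidence.
logits, evidence = ConceptBottleneck()(torch.randn(4, 512))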