Title: Evaluating GPT-4V (GPT-4 with Vision) on Detection of Radiologic Findings on Chest Radiographs
This study examined the application of GPT-4 with vision (GPT-4V), a multimodal large language model with visual recognition, in detecting radiologic findings from a set of 100 chest radiographs, and suggests that GPT-4V is currently not ready for real-world diagnostic use in interpreting chest radiographs.
Award ID(s):
2145640
PAR ID:
10580229
Author(s) / Creator(s):
Publisher / Repository:
RSNA
Date Published:
Journal Name:
Radiology
Volume:
311
Issue:
2
ISSN:
0033-8419
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Purpose: To determine whether saliency maps in radiology artificial intelligence (AI) are vulnerable to subtle perturbations of the input, which could potentially lead to misleading interpretations, using the Prediction-Saliency Correlation (PSC) metric to evaluate the sensitivity and robustness of saliency methods. Materials and Methods: In this retrospective study, locally trained deep learning models and a research prototype provided by a commercial vendor were systematically evaluated on 191,229 chest radiographs from the CheXpert dataset (1,2) and 7,022 MRI images from a human brain tumor classification dataset (3). Two radiologists performed a reader study on 270 chest radiograph pairs. A model-agnostic approach for computing the PSC coefficient was used to evaluate the sensitivity and robustness of seven commonly used saliency methods. Results: Leveraging locally trained model parameters, we revealed the saliency methods' low sensitivity (maximum PSC = 0.25, 95% CI: 0.12, 0.38) and weak robustness (maximum PSC = 0.12, 95% CI: 0.0, 0.25) on the CheXpert dataset. Without model specifics, we also showed that the saliency maps from a commercial prototype could be irrelevant to the model output (the area under the receiver operating characteristic curve dropped by 8.6% without affecting the saliency map). The human observer studies confirmed that it is difficult for experts to identify the perturbed images: reader correctness was below 44.8%. Conclusion: Popular saliency methods scored low PSC values on the two datasets of perturbed chest radiographs, indicating weak sensitivity and robustness. The proposed PSC metric provides a valuable quantification tool for validating the trustworthiness of medical AI explainability. Abbreviations: AI = artificial intelligence, PSC = prediction-saliency correlation, AUC = area under the receiver operating characteristic curve, SSIM = structural similarity index measure.
Summary: Systematic evaluation of saliency methods through subtle perturbations in chest radiographs and brain MRI images demonstrated low sensitivity and robustness of those methods, warranting caution when using saliency methods that may misrepresent changes in AI model prediction. 
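The core PSC idea, correlating how much a model's prediction changes with how much its saliency map changes across perturbed inputs, can be sketched as a plain Pearson correlation. This is an illustrative sketch, not the authors' implementation; the function name and input representation are assumptions.

```python
import numpy as np

def psc_coefficient(pred_changes, saliency_changes):
    """Pearson correlation between per-image prediction changes and
    per-image saliency-map changes under input perturbation.

    A high PSC means saliency maps track prediction changes (sensitive);
    a low PSC means the maps can stay the same while predictions move.
    """
    pred = np.asarray(pred_changes, dtype=float)
    sal = np.asarray(saliency_changes, dtype=float)
    # Center both series, then compute the normalized dot product.
    pred_c = pred - pred.mean()
    sal_c = sal - sal.mean()
    denom = np.linalg.norm(pred_c) * np.linalg.norm(sal_c)
    return float(pred_c @ sal_c / denom)
```

In practice the prediction change might be a drop in class probability and the saliency change one minus a similarity measure (such as SSIM) between the original and perturbed saliency maps.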
  2. Abstract. Background: This study outlines an image processing algorithm for accurate and consistent lung segmentation in chest radiographs of critically ill adults and children, which are typically obscured by medical equipment. In particular, this work focuses on applications in analysis of acute respiratory distress syndrome (ARDS), a critical illness with a mortality rate of 40% that affects 200,000 patients in the United States and 3 million globally each year. Methods: Chest radiographs were obtained from critically ill adults (n = 100), adults diagnosed with ARDS (n = 25), and children (n = 100) hospitalized at Michigan Medicine. Physicians annotated the lung field of each radiograph to establish the ground truth. A Total Variation-based Active Contour (TVAC) lung segmentation algorithm was developed and compared to multiple state-of-the-art methods, including a deep learning model (U-Net), a random walker algorithm, and an active spline model, using the Sørensen–Dice coefficient to measure segmentation accuracy. Results: The TVAC algorithm accurately segmented lung fields in all patients in the study. For the adult cohort, an average Dice coefficient of 0.86 ±0.04 (min: 0.76) was reported for TVAC, 0.89 ±0.12 (min: 0.01) for U-Net, 0.74 ±0.19 (min: 0.15) for the random walker algorithm, and 0.64 ±0.17 (min: 0.20) for the active spline model. For the pediatric cohort, an average Dice coefficient of 0.85 ±0.04 (min: 0.75) was reported for TVAC, 0.87 ±0.09 (min: 0.56) for U-Net, 0.67 ±0.18 (min: 0.18) for the random walker algorithm, and 0.61 ±0.18 (min: 0.18) for the active spline model. Conclusion: The proposed algorithm demonstrates the most consistent performance of all segmentation methods tested. These results suggest that TVAC can accurately identify lung fields in chest radiographs of critically ill adults and children.
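The Sørensen–Dice coefficient used to score these segmentations is a standard overlap measure, 2|A∩B| / (|A| + |B|), for binary masks. A minimal sketch (the function name is our own, not from the paper):

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask):
    """Sørensen–Dice overlap between two binary masks.

    Returns 1.0 for identical masks, 0.0 for disjoint ones.
    By convention, two empty masks score 1.0.
    """
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    total = pred.sum() + true.sum()
    return 2.0 * intersection / total if total > 0 else 1.0
```

A minimum Dice of 0.01 (as reported for U-Net on the adult cohort) thus indicates an almost completely failed segmentation on at least one radiograph, which is why the minimum is reported alongside the average.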
  3. Abstract. Decoder-only Transformer models such as Generative Pre-trained Transformers (GPT) have demonstrated exceptional performance in text generation by autoregressively predicting the next token. However, the efficiency of running GPT on current hardware systems is bounded by a low compute-to-memory ratio and high memory access. In this work, we propose a process-in-memory (PIM) GPT accelerator, PIM-GPT, which achieves end-to-end acceleration of GPT inference with high performance and high energy efficiency. PIM-GPT leverages DRAM-based PIM designs to execute multiply-accumulate (MAC) operations directly in the DRAM chips, eliminating the need to move matrix data off-chip. Non-linear functions and data communication are supported by an application-specific integrated circuit (ASIC). At the software level, mapping schemes are designed to maximize data locality and computation parallelism. Overall, PIM-GPT achieves 41–137× and 631–1074× speedup, and 123–383× and 320–602× energy efficiency, over GPU and CPU baselines, respectively, on 8 GPT models with up to 1.4 billion parameters.
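The data-locality idea behind DRAM-based PIM can be illustrated with a toy row-partitioned matrix-vector multiply: each "bank" computes MACs over only the weight rows it holds, so weight data never leaves its bank. This is a conceptual sketch in plain Python, not the PIM-GPT mapping scheme; the partitioning and names are assumptions.

```python
import numpy as np

def banked_matvec(W, x, n_banks=4):
    """Toy model of row-wise PIM mapping for y = W @ x.

    Rows of W are partitioned across n_banks; each bank performs its
    multiply-accumulate operations locally on its own rows, so only the
    small partial result vectors need to be gathered, not the matrix.
    """
    partials = []
    for rows in np.array_split(np.arange(W.shape[0]), n_banks):
        partials.append(W[rows] @ x)  # MACs executed "inside" the bank
    return np.concatenate(partials)
```

In a real PIM design the win comes from avoiding off-chip movement of W, which dominates memory traffic for the low compute-to-memory-ratio matrix-vector products of autoregressive GPT inference.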
  4. This study explores the use of Large Language Models (LLMs), specifically GPT, for different safety management applications in the construction industry. Many studies have explored the integration of GPT in construction safety for various applications; their primary focus has been on the feasibility of such integration, often using GPT models for specific applications rather than thoroughly evaluating GPT's limitations and capabilities. In contrast, this study aims to provide a comprehensive assessment of GPT's performance based on established key criteria. Using structured use cases, this study explores GPT's strengths and weaknesses in four construction safety areas: (1) delivering personalized safety training and educational content tailored to individual learner needs; (2) automatically analyzing post-accident reports to identify root causes and suggest preventive measures; (3) generating customized safety guidelines and checklists to support site compliance; and (4) providing real-time assistance for managing daily safety tasks and decision-making on construction sites. LLMs and NLP have already been employed in each of these four areas, making them suitable for further investigation. GPT demonstrated acceptable performance in delivering evidence-based, regulation-aligned responses, making it valuable for scaling personalized training, automating accident analyses, and developing safety protocols. Additionally, it provided real-time safety support through interactive dialogues. However, the model showed limitations in deeper critical analysis, extrapolating information, and adapting to dynamic environments. The study concludes that while GPT holds significant promise for enhancing construction safety, further refinement is necessary. This includes fine-tuning for more relevant safety-specific outcomes, integrating real-time data for contextual awareness, and developing a nuanced understanding of safety risks. These improvements, coupled with human oversight, could make GPT a robust tool for safety management.
  5. The success of GPT with coding tasks has made it important to consider the impact of GPT and similar models on teaching programming. Students' use of GPT to solve programming problems can hinder their learning. However, they might also get significant benefits such as quality feedback on programming style, explanations of how a given piece of code works, help with debugging code, and the ability to see valuable alternatives to their code solutions. We propose a new design for interacting with GPT called Mediated GPT with the goals of (a) providing students with access to GPT but allowing instructors to programmatically modify responses to prevent hindrances to student learning and combat common GPT response concerns, (b) helping students generate and learn to create effective prompts to GPT, and (c) tracking how students use GPT to get help on programming exercises. We demonstrate a first-pass implementation of this design called NotebookGPT.
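The mediation idea of goal (a), instructors programmatically transforming a model's response before students see it, can be sketched as a pipeline of response filters. This is a hypothetical illustration of the concept, not the NotebookGPT implementation; all names and the example policy are assumptions.

```python
def mediate_response(raw_response, filters):
    """Pass a raw LLM response through instructor-defined filters in order.

    Each filter is a function str -> str; the mediated result is what
    the student actually sees.
    """
    response = raw_response
    for f in filters:
        response = f(response)
    return response

def redact_code_blocks(text):
    """Example filter: hide fenced code blocks so students get the
    explanation but not a complete copy-pasteable solution."""
    out, in_code = [], False
    for line in text.splitlines():
        if line.strip().startswith("```"):
            in_code = not in_code
            if in_code:
                out.append("[code hidden by instructor policy]")
            continue
        if not in_code:
            out.append(line)
    return "\n".join(out)
```

Because filters compose, an instructor could chain this with others, for example one that appends a prompt-writing hint, addressing goal (b), or one that logs the exchange for the usage tracking in goal (c).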