
Title: Detecting Spurious Correlations With Sanity Tests for Artificial Intelligence Guided Radiology Systems
Artificial intelligence (AI) has been successful at solving numerous problems in machine perception. In radiology, AI systems are rapidly evolving and show progress in guiding treatment decisions, diagnosing and localizing disease on medical images, and improving radiologists' efficiency. A critical component of deploying AI in radiology is gaining confidence in a developed system's efficacy and safety. The current gold-standard approach is to conduct an analytical validation of performance on a generalization dataset from one or more institutions, followed by a clinical validation study of the system's efficacy during deployment. Clinical validation studies are time-consuming, and best practices dictate limited re-use of analytical validation data, so it is ideal to know ahead of time whether a system is likely to fail analytical or clinical validation. In this paper, we describe a series of sanity tests to identify when a system performs well on development data for the wrong reasons. We illustrate the sanity tests' value by designing a deep learning system to classify pancreatic cancer seen in computed tomography scans.
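To make the idea concrete, here is a minimal sketch of one kind of sanity test in this spirit: occlude the anatomically relevant region and re-score the model. If accuracy barely drops, the model is likely keying on confounds (scanner artifacts, burned-in text, body habitus) rather than the organ of interest. The paper's actual test suite is not reproduced here; `model`, `scans`, `labels`, and `masks` are hypothetical placeholders.

```python
import numpy as np

def accuracy(model, scans, labels):
    preds = np.array([model(s) for s in scans])
    return float(np.mean(preds == np.asarray(labels)))

def occlusion_sanity_test(model, scans, labels, masks, min_drop=0.15):
    """Fail when masking the region of interest costs less than min_drop accuracy."""
    baseline = accuracy(model, scans, labels)
    occluded = [np.where(m, s.mean(), s) for s, m in zip(scans, masks)]
    masked_acc = accuracy(model, occluded, labels)
    return baseline, masked_acc, (baseline - masked_acc) >= min_drop

# Toy demonstration: a "model" that reads a planted corner pixel (a spurious
# cue) keeps its accuracy when the central region is occluded, so it fails.
rng = np.random.default_rng(0)
scans = [rng.normal(size=(64, 64)) for _ in range(100)]
labels = rng.integers(0, 2, size=100)
for s, y in zip(scans, labels):
    s[0, 0] = y                      # spurious cue outside the "pancreas"
masks = []
for _ in scans:
    m = np.zeros((64, 64), dtype=bool)
    m[16:48, 16:48] = True           # central region of interest
    masks.append(m)

def shortcut(s):
    return int(s[0, 0] > 0.5)        # classifier that only reads the cue

base, occ, passed = occlusion_sanity_test(shortcut, scans, labels, masks)
print(f"baseline={base:.2f} occluded={occ:.2f} passed={passed}")  # passed=False
```

The threshold `min_drop` is an illustrative knob: a model that truly depends on the region of interest should lose substantial accuracy when that region is hidden.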
Award ID(s):
1909696
PAR ID:
10350727
Author(s) / Creator(s):
Date Published:
Journal Name:
Frontiers in Digital Health
Volume:
3
ISSN:
2673-253X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Brankov, Jovan G; Anastasio, Mark A (Ed.)
    Artificial intelligence (AI) tools are designed to improve the efficacy and efficiency of data analysis and interpretation by the human decision maker. However, we know little about the optimal ways to present AI output to providers. This study used radiology image interpretation with AI-based decision support to explore the impact of different forms of AI output on reader performance. Readers included 5 experienced radiologists and 3 radiology residents reporting on a series of COVID chest x-ray images. Four forms of AI output were evaluated, plus a no-feedback condition: a one-word severity summary (normal, mild, moderate, severe), a probability graph, a heatmap, and a heatmap combined with a probability graph. Results reveal that most decisions regarding the presence or absence of COVID were correct without AI and remained largely unchanged across all types of AI output. Fewer than 1% of decisions changed for the worse after readers saw the AI output (true positive to false negative or true negative to false positive), and about 1% changed for the better (false negative to true positive or false positive to true negative). Eye tracking revealed that more complex output formats (e.g., a heatmap plus a probability graph) tended to increase reading time and the number of gaze scans between the clinical image and the AI output. The key to the success of AI tools in medical imaging will be incorporating the human into the overall process to optimize and synergize the human-computer dyad, since, at least for the foreseeable future, the human is and will be the ultimate decision maker. Our results demonstrate that the form of the AI output is important, as it can impact both clinical decision making and efficiency.
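As a concrete illustration of the decision-change bookkeeping described above, the sketch below classifies each read by whether the post-AI call corrected an error or overturned a correct call. The data layout and toy values are hypothetical, not the study's.

```python
# Each read is (ground_truth, call_without_AI, call_with_AI), coded 0/1.
# A "positive" change corrects an error (FN->TP or FP->TN); a "negative"
# change overturns a correct call (TP->FN or TN->FP).
from collections import Counter

def classify_changes(reads):
    tally = Counter()
    for truth, before, after in reads:
        if before == after:
            tally["unchanged"] += 1
        elif after == truth:
            tally["positive"] += 1
        else:
            tally["negative"] += 1
    return tally

reads = [(1, 1, 1), (1, 0, 1), (0, 0, 1), (0, 0, 0)]  # toy data
print(classify_changes(reads))
# Counter({'unchanged': 2, 'positive': 1, 'negative': 1})
```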
  2. Purpose: Despite tremendous gains from deep learning and the promise of AI in medicine to improve diagnosis and save costs, a large translational gap remains in implementing and using AI products in real-world clinical situations. Adoption of standards like the TRIPOD, CONSORT, and CLAIM checklists is increasing, with the aim of improving the peer-review process and the reporting of AI tools. However, no such standards exist for product-level review. Methods: A review of clinical trials shows a paucity of evidence for radiology AI products; we therefore developed a 10-question assessment tool for reviewing AI products, with an emphasis on their validation and result dissemination. We applied the assessment tool to commercial and open-source diagnostic algorithms to extract evidence on the tools' clinical utility. Results: We find that there is limited technical information on methodologies for FDA-approved algorithms compared with open-source products, likely due to intellectual-property concerns. Furthermore, we find that FDA-approved products use much smaller datasets than open-source AI tools, as the terms of use of public datasets are limited to academic and non-commercial entities, which precludes their use in commercial products. Conclusion: Overall, we observe a broad spectrum of maturity and clinical use among AI products, but a large gap remains in exploring the actual performance of AI tools in clinical practice.
  3. Objectives: To evaluate the proficiency of a HIPAA-compliant version of GPT-4 in identifying actionable, incidental findings from unstructured radiology reports of Emergency Department patients, and to assess the appropriateness of artificial intelligence (AI)-generated, patient-facing summaries of these findings. Materials and Methods: Radiology reports extracted from the electronic health record of a large academic medical center were manually reviewed to identify non-emergent, incidental findings with a high likelihood of requiring follow-up, further sub-stratified as "definitely actionable" (DA) or "possibly actionable—clinical correlation" (PA-CC). Instruction prompts to GPT-4 were developed and iteratively optimized using a validation set of 50 reports. The optimized prompt was then applied to a test set of 430 unseen reports. GPT-4 performance was graded primarily on accuracy in identifying either DA or PA-CC findings, and secondarily on DA findings alone. Outputs were reviewed for hallucinations, and AI-generated patient-facing summaries were assessed for appropriateness via Likert scale. Results: For the primary outcome (DA or PA-CC), GPT-4 achieved 99.3% recall, 73.6% precision, and an F1 score of 84.5%. For the secondary outcome (DA only), GPT-4 demonstrated 95.2% recall, 77.3% precision, and an F1 score of 85.3%. No findings were "hallucinated" outright; however, 2.8% of cases included generated text about recommendations that were inferred without specific reference. The majority of true-positive AI-generated summaries required no or minor revision. Conclusion: GPT-4 demonstrates proficiency in detecting actionable, incidental findings after refined instruction prompting. AI-generated patient instructions were most often appropriate, though a small fraction included inferred recommendations. While this technology shows promise for augmenting diagnostics, active clinician oversight via "human-in-the-loop" workflows remains critical for clinical implementation.
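The reported scores are internally consistent: F1 is the harmonic mean of precision and recall, and plugging in the abstract's own numbers reproduces both figures.

```python
# F1 = 2 * precision * recall / (precision + recall)
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f"DA or PA-CC: F1 = {f1(0.736, 0.993):.1%}")  # 84.5%
print(f"DA only:     F1 = {f1(0.773, 0.952):.1%}")  # 85.3%
```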
  4. Verification and validation of AI systems, particularly learning-enabled systems, is hard because they often lack formal specifications and rely instead on incomplete data and subjective human feedback. Aligning the behavior of such systems with the intended objectives and values of human designers and stakeholders is very challenging, and deploying AI systems that are misaligned can be risky. We propose to use both existing and new forms of explanations to improve the verification and validation of AI systems. Toward that goal, we present a framework in which the agent explains its behavior and a critic signals whether the explanation passes a test. In cases where the explanation fails, the agent offers alternative explanations to gather feedback, which is then used to improve the system's alignment. We discuss examples where this approach proved effective, how to extend the scope of explanations, and how to minimize the human effort involved in this process.
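The abstract only summarizes the framework, but its core loop is straightforward to sketch: the agent proposes explanations, a critic tests each one, and failures are fed back as alignment signals. Everything below (names, interfaces, the toy critic) is a hypothetical illustration, not the paper's implementation.

```python
from typing import Callable, Iterable, Optional

def explanation_loop(
    explanations: Iterable[str],
    critic: Callable[[str], bool],
    apply_feedback: Callable[[str], None],
) -> Optional[str]:
    """Return the first explanation the critic accepts, feeding back failures."""
    for candidate in explanations:
        if critic(candidate):
            return candidate          # explanation passes the critic's test
        apply_feedback(candidate)     # failure becomes an alignment signal
    return None                       # no acceptable explanation found

# Toy usage: the critic rejects explanations that never mention the goal.
accepted = explanation_loop(
    ["I moved left to save time", "I moved left to reach the charger"],
    critic=lambda e: "charger" in e,
    apply_feedback=lambda e: print("rejected:", e),
)
print("accepted:", accepted)
```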
  5. Despite major advances in HIV testing, ultrasensitive detection of early infection remains challenging, especially for the viral capsid protein p24, an early virological biomarker of HIV-1 infection. Here, to improve p24 detection in patients missed by the immunological tests that dominate the diagnostics market, we show a click chemistry amplified nanopore (CAN) assay for ultrasensitive quantitative detection. This strategy achieves a 20.8 fM (0.5 pg/ml) limit of detection for HIV-1 p24 antigen in human serum, demonstrating 20- to 100-fold higher analytical sensitivity than nanocluster-based immunoassays and the clinically used enzyme-linked immunosorbent assay, respectively. Clinical validation of the CAN assay in a pilot cohort shows p24 quantification at an ultra-low concentration range and correlation with CD4 count and viral load. We believe that this strategy can improve the utility of p24 antigen in detecting early infection and in monitoring HIV progression and treatment efficacy, and that it can be readily modified to detect other infectious diseases.
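A quick back-of-the-envelope check confirms that the two reported limit-of-detection figures agree, assuming the roughly 24 kDa molecular weight implied by the p24 name:

```python
# Convert the mass-based limit of detection to molarity:
# 0.5 pg/ml = 5e-10 g/L; dividing by ~24,000 g/mol gives ~2.08e-14 M.
MW_P24_G_PER_MOL = 24_000            # ~24 kDa capsid protein
lod_g_per_l = 0.5e-12 / 1e-3         # 0.5 pg/ml expressed in g/L
lod_molar = lod_g_per_l / MW_P24_G_PER_MOL
print(f"{lod_molar / 1e-15:.1f} fM")  # 20.8 fM, matching the abstract
```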