Title: Evaluation of GPT-4 ability to identify and generate patient instructions for actionable incidental radiology findings
Abstract

Objectives: To evaluate the proficiency of a HIPAA-compliant version of GPT-4 in identifying actionable, incidental findings from unstructured radiology reports of Emergency Department patients, and to assess the appropriateness of artificial intelligence (AI)-generated, patient-facing summaries of these findings.

Materials and Methods: Radiology reports extracted from the electronic health record of a large academic medical center were manually reviewed to identify non-emergent, incidental findings with a high likelihood of requiring follow-up, further sub-stratified as “definitely actionable” (DA) or “possibly actionable—clinical correlation” (PA-CC). Instruction prompts to GPT-4 were developed and iteratively optimized using a validation set of 50 reports. The optimized prompt was then applied to a test set of 430 unseen reports. GPT-4 performance was graded primarily on accuracy in identifying either DA or PA-CC findings, and secondarily on DA findings alone. Outputs were reviewed for hallucinations. AI-generated patient-facing summaries were assessed for appropriateness on a Likert scale.

Results: For the primary outcome (DA or PA-CC), GPT-4 achieved 99.3% recall, 73.6% precision, and 84.5% F-1. For the secondary outcome (DA only), GPT-4 demonstrated 95.2% recall, 77.3% precision, and 85.3% F-1. No findings were “hallucinated” outright. However, 2.8% of cases included generated text about recommendations that were inferred without specific reference. The majority of True Positive AI-generated summaries required no or only minor revision.

Conclusion: GPT-4 demonstrates proficiency in detecting actionable, incidental findings after refined instruction prompting. AI-generated patient instructions were most often appropriate, but in rare cases included inferred recommendations. While this technology shows promise to augment diagnostics, active clinician oversight via “human-in-the-loop” workflows remains critical for clinical implementation.
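As a quick consistency check on the reported metrics, the F-1 values can be recomputed from the stated precision and recall. The sketch below assumes F-1 is the standard harmonic mean of precision and recall (the abstract does not spell out the formula); the function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, converted to fractions.
primary_f1 = f1_score(precision=0.736, recall=0.993)    # DA or PA-CC
secondary_f1 = f1_score(precision=0.773, recall=0.952)  # DA only

print(f"primary F-1:   {primary_f1:.1%}")    # 84.5%
print(f"secondary F-1: {secondary_f1:.1%}")  # 85.3%
```

Both recomputed values match the figures reported in the abstract to one decimal place.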
Award ID(s):
2129076 1928614
PAR ID:
10534839
Publisher / Repository:
Oxford University Press
Journal Name:
Journal of the American Medical Informatics Association
Volume:
31
Issue:
9
ISSN:
1067-5027
Format(s):
Medium: X
Size(s):
p. 1983-1993
Sponsoring Org:
National Science Foundation
More Like this
  1. As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports are not only long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Cohere. We find that GPT-3.5 and Cohere fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summary and identify a position bias in LLMs. This position bias disappears after shuffling the input for Claude, which suggests that Claude seems to recognize important information. We also conduct a comprehensive investigation on the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4. The generated summaries and evaluation code are available at https://github.com/ChicagoHAI/characterizing-multimodal-long-form-summarization. 
  2. The concentration of dopamine (DA) and tyrosine (Tyr) reflects the condition of patients with Parkinson's disease, whereas moderate paracetamol (PA) can help relieve their pain. Therefore, real‐time measurements of these bioanalytes have important clinical implications for patients with Parkinson's disease. However, previous sensors suffer from either limited sensitivity or complex fabrication and integration processes. This work introduces a simple and cost‐effective method to prepare high‐quality, flexible titanium dioxide (TiO2) thin films with highly reactive (001)‐facets. The as‐fabricated TiO2 film supported by a carbon cloth electrode (i.e., TiO2–CC) allows excellent electrochemical specificity and sensitivity to DA (1.390 µA µM−1 cm−2), Tyr (0.126 µA µM−1 cm−2), and PA (0.0841 µA µM−1 cm−2). More importantly, accurate DA concentration in varied pH conditions can be obtained by decoupling them within a single differential pulse voltammetry measurement without additional sensing units. The TiO2–CC electrochemical sensor can be integrated into a smart diaper to detect trace amounts of DA, or into a skin‐interfaced patch with microfluidic sampling and wireless transmission units for real‐time detection of sweat Tyr and PA concentrations. The wearable sensor based on TiO2–CC prepared by facile manufacturing methods holds great potential in the daily health monitoring and care of patients with neurological disorders.
  3. Motivation: Thousands of genomes are publicly available; however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally characterized protein activities and activities deposited in databases. This activity deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models presents an opportunity to speed up the text-mining of protein activities for biocuration. Results: We developed FuncFetch—a workflow that integrates NCBI E-Utilities, OpenAI’s GPT-4, and Zotero—to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall of GPT-4 in determining whether the abstract of a given paper indicates the presence of a characterized enzyme activity in that paper. Provided the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates, and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine large plant enzyme families. Screening 26 543 papers, FuncFetch retrieved 32 605 entries from 5459 selected papers. We also identified multiple extraction errors including incorrect associations, nontarget enzymes, and hallucinations, which highlight the need for further manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting the functions of uncharacterized enzymes.
Availability and implementation: Code and minimally curated activities are available at: https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb.
  4. Objectives: To identify what patient-related characteristics have been reported to be associated with the occurrence of shared decision-making (SDM) about treatment. Design: Scoping review. Eligibility criteria: Peer-reviewed articles in English or Dutch reporting on associations between patient-related characteristics and the occurrence of SDM for actual treatment decisions. Information sources: COCHRANE Library, Embase, MEDLINE, PsycInfo, PubMed and Web of Science were systematically searched for articles published until 25 March 2019. Results: The search yielded 5289 hits, of which 53 were retained. Multiple categories of patient characteristics were identified: (1) sociodemographic characteristics (eg, gender), (2) general health and clinical characteristics (eg, symptom severity), (3) psychological characteristics and coping with illness (eg, self-efficacy) and (4) SDM style or preference. Many characteristics showed no association or unclear relationships with SDM occurrence. For example, for female gender, positive, negative and, most frequently, non-significant associations were seen. Conclusions: A large variety of patient-related characteristics have been studied, but for many the association with SDM occurrence remains unclear. The results caution against often-made assumptions about associations and provide an important step toward targeting effective interventions to foster SDM with all patients.
  5. Large language models, particularly GPT-3, are able to produce high quality summaries of general domain news articles in few- and zero-shot settings. However, it is unclear if such models are similarly capable in more specialized domains such as biomedicine. In this paper we enlist domain experts (individuals with medical training) to evaluate summaries of biomedical articles generated by GPT-3, given no supervision. We consider both single- and multi-document settings. In the former, GPT-3 is tasked with generating regular and plain-language summaries of articles describing randomized controlled trials; in the latter, we assess the degree to which GPT-3 is able to synthesize evidence reported across a collection of articles. We design an annotation scheme for evaluating model outputs, with an emphasis on assessing the factual accuracy of generated summaries. We find that while GPT-3 is able to summarize and simplify single biomedical articles faithfully, it struggles to provide accurate aggregations of findings over multiple documents. We release all data, code, and annotations used in this work. 