This content will become publicly available on September 1, 2026

Title: Critical Assessment of Large Language Models’ (ChatGPT) Performance in Data Extraction for Systematic Reviews: Exploratory Study
Abstract

Background: Systematic literature reviews (SLRs) are foundational for synthesizing evidence across diverse fields and are especially important in guiding research and practice in health and biomedical sciences. However, they are labor intensive due to manual data extraction from multiple studies. As large language models (LLMs) gain attention for their potential to automate research tasks and extract basic information, understanding their ability to accurately extract explicit data from academic papers is critical for advancing SLRs.

Objective: Our study aimed to explore the capability of LLMs to extract both explicitly outlined study characteristics and deeper, more contextual information requiring nuanced evaluation, using ChatGPT (GPT-4).

Methods: We screened the full text of a sample of COVID-19 modeling studies and analyzed three basic measures of study settings (ie, analysis location, modeling approach, and analyzed interventions) and three complex measures of behavioral components in models (ie, mobility, risk perception, and compliance). Two researchers independently extracted 60 data elements on these measures using manual coding and compared them with ChatGPT's responses to 420 queries spanning 7 iterations.

Results: ChatGPT's accuracy improved as prompts were refined, increasing by 33% and 23% between the initial and final iterations for study settings and behavioral components, respectively. With the initial prompts, 26 (43.3%) of 60 ChatGPT responses were correct. In the final iteration, ChatGPT correctly extracted 43 (71.7%) of the 60 data elements, performing better on explicitly stated study settings (28/30, 93.3%) than on subjective behavioral components (15/30, 50%). Nonetheless, the varying accuracy across measures highlights its limitations.

Conclusions: Our findings underscore LLMs' utility in extracting basic as well as explicit data in SLRs when effective prompts are used. However, the results reveal significant limitations in handling nuanced, subjective criteria, emphasizing the necessity for human oversight.
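As a rough illustration of the extraction step described in the Methods, the sketch below shows how a single refined prompt might be sent to GPT-4 and its answer compared against a manually coded gold standard. The prompt wording, helper names, and model settings are assumptions for illustration, not the study's actual protocol.

```python
# Minimal sketch (not the authors' actual prompts) of querying an LLM to
# extract one explicit study-setting measure from a paper's full text.
# Assumes the `openai` Python client and an illustrative prompt wording.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_measure(full_text: str, measure: str) -> str:
    """Ask the model for a single, explicitly stated study characteristic."""
    prompt = (
        f"From the study text below, report the {measure}. "
        "Answer with the value only; reply 'not reported' if it is absent.\n\n"
        f"{full_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output simplifies comparison with manual coding
    )
    return response.choices[0].message.content.strip()

# Example: compare the model's answer with the manually coded gold standard.
# answer = extract_measure(paper_text, "analysis location (country or region)")
# correct = (answer.lower() == manual_code.lower())
```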
Award ID(s):
2229819
PAR ID:
10635616
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
JMIR
Date Published:
Journal Name:
JMIR AI
Volume:
4
ISSN:
2817-1705
Page Range / eLocation ID:
e68097 to e68097
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Objective: This study explores subjective and objective driving style similarity to identify how similarity can be used to develop driver-compatible vehicle automation. Background: Similarity in the ways that interaction partners perform tasks can be measured subjectively, through questionnaires, or objectively by characterizing each agent's actions. Although subjective measures have advantages in prediction, objective measures are more useful when operationalizing interventions based on these measures. Showing how objective and subjective similarity are related is therefore prudent for aligning future machine performance with human preferences. Methods: A driving simulator study was conducted with stop-and-go scenarios. Participants experienced conservative, moderate, and aggressive automated driving styles and rated the similarity between their own driving style and that of the automation. Objective similarity between the manual and automated driving speed profiles was calculated using three distance measures: dynamic time warping, Euclidean distance, and time alignment measure. Linear mixed effects models were used to examine how different components of the stopping profile and the three objective similarity measures predicted subjective similarity. Results: Objective similarity using Euclidean distance best predicted subjective similarity. However, this was only observed for participants' approach to the intersection and not their departure. Conclusion: Developing driving styles that drivers perceive to be similar to their own is an important step toward driver-compatible automation. In determining what constitutes similarity, it is important to (a) use measures that reflect the driver's perception of similarity, and (b) understand what elements of the driving style govern subjective similarity.
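As a sketch of the objective similarity calculations mentioned above, the snippet below computes a Euclidean distance and a textbook dynamic-time-warping distance between two speed profiles; it is an illustration under assumed data shapes, not the study's analysis code.

```python
# Illustrative sketch (not the study's code) of two of the objective similarity
# measures described above, applied to manual vs. automated speed profiles.
# Assumes the profiles are 1-D arrays of speed samples.
import numpy as np

def euclidean_similarity(manual: np.ndarray, automated: np.ndarray) -> float:
    """Euclidean distance between equal-length speed profiles (lower = more similar)."""
    return float(np.linalg.norm(manual - automated))

def dtw_distance(manual: np.ndarray, automated: np.ndarray) -> float:
    """Classic dynamic-time-warping distance; profiles may differ in length."""
    n, m = len(manual), len(automated)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(manual[i - 1] - automated[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# manual, automated = np.asarray(manual_speeds), np.asarray(automated_speeds)
# print(euclidean_similarity(manual, automated), dtw_distance(manual, automated))
```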
  2. The advanced capabilities of Large Language Models (LLMs) have made them invaluable across various applications, from conversational agents and content creation to data analysis, research, and innovation. However, their effectiveness and accessibility also render them susceptible to abuse for generating malicious content, including phishing attacks. This study explores the potential of using four popular commercially available LLMs, i.e., ChatGPT (GPT 3.5 Turbo), GPT 4, Claude, and Bard, to generate functional phishing attacks using a series of malicious prompts. We discover that these LLMs can generate both phishing websites and emails that can convincingly imitate well-known brands and also deploy a range of evasive tactics that are used to elude detection mechanisms employed by anti-phishing systems. These attacks can be generated using unmodified or “vanilla” versions of these LLMs without requiring any prior adversarial exploits such as jailbreaking. We evaluate the LLMs' performance in generating these attacks and find that they can also be utilized to create malicious prompts that, in turn, can be fed back to the model to generate phishing scams, thus massively reducing the prompt-engineering effort required by attackers to scale these threats. As a countermeasure, we build a BERT-based automated detection tool that can be used for the early detection of malicious prompts to prevent LLMs from generating phishing content. Our model is transferable across all four commercial LLMs, attaining an average accuracy of 96% for phishing website prompts and 94% for phishing email prompts. We also disclose these vulnerabilities to the affected LLM vendors, with Google acknowledging them as a severe issue. Our detection model is available on Hugging Face, as well as via a ChatGPT Actions plugin.
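A minimal sketch of how such a BERT-based prompt screen could be wired in front of an LLM is shown below; the checkpoint name and label scheme are placeholders, since the abstract does not specify the published model's identifiers.

```python
# Sketch of screening a user prompt with a fine-tuned BERT classifier before it
# reaches an LLM, in the spirit of the detector described above.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-org/phishing-prompt-detector",  # hypothetical checkpoint name
)

def is_malicious(prompt: str) -> bool:
    """Return True when the classifier flags the prompt as phishing-related."""
    result = classifier(prompt, truncation=True)[0]
    return result["label"] == "MALICIOUS" and result["score"] > 0.5  # label scheme assumed

# if is_malicious(user_prompt):
#     refuse the request instead of forwarding it to the LLM
```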
  3. Abstract. Objective: Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow leveraging large language models (LLMs) that mimics the real-world 2-reviewer process. Materials and Methods: A dataset of 10 trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The dataset was split into prompt development (n = 5) and held-out test sets (n = 17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the 2 LLMs were considered concordant if they were the same for a given variable. The discordant responses from each LLM were provided to the other LLM for cross-critique. Accuracy, ie, the total number of correct responses divided by the total number of responses, was computed to assess performance. Results: In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, increasing accuracy to 0.76. Discussion: Concordant responses by the LLMs are likely to be accurate. In instances of discordant responses, cross-critique can further increase the accuracy. Conclusion: Large language models, when simulated in a collaborative, 2-reviewer workflow, can extract data with reasonable performance, enabling truly “living” systematic reviews.
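The concordance-and-cross-critique logic described here can be sketched in a few lines; the function and variable names below are illustrative assumptions, not the published pipeline.

```python
# Schematic sketch of the 2-reviewer concordance check: per-variable responses
# from two LLMs are compared, and discordant items are routed to cross-critique.
def normalise(answer: str) -> str:
    return answer.strip().lower()

def triage(gpt4_answers: dict, claude_answers: dict) -> tuple[dict, list]:
    """Split extracted variables into concordant values and discordant variable names."""
    concordant, discordant = {}, []
    for variable, gpt_value in gpt4_answers.items():
        claude_value = claude_answers.get(variable, "")
        if normalise(gpt_value) == normalise(claude_value):
            concordant[variable] = gpt_value   # accepted without further review
        else:
            discordant.append(variable)        # send both answers for cross-critique
    return concordant, discordant

# concordant, discordant = triage(gpt4_extraction, claude_extraction)
# For each discordant variable, each model would be shown the other's answer
# and asked to defend or revise its own response.
```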
  4. Abstract. Qualitative coding, or content analysis, is more than just labeling text: it is a reflexive interpretive practice that shapes research questions, refines theoretical insights, and illuminates subtle social dynamics. As large language models (LLMs) become increasingly adept at nuanced language tasks, questions arise about whether—and how—they can assist in large-scale coding without eroding the interpretive depth that distinguishes qualitative analysis from traditional machine learning and other quantitative approaches to natural language processing. In this paper, we present a hybrid approach that preserves hermeneutic value while incorporating LLMs to scale the application of codes to large data sets that are impractical for manual coding. Our workflow retains the traditional cycle of codebook development and refinement, adding an iterative step to adapt definitions for machine comprehension, before ultimately replacing manual with automated text categorization. We demonstrate how to rewrite code descriptions for LLM interpretation, as well as how structured prompts and prompting the model to explain its coding decisions (chain-of-thought) can substantially improve fidelity. Empirically, our case study of socio-historical codes highlights the promise of frontier AI language models to reliably interpret paragraph-long passages representative of a humanistic study. Throughout, we emphasize ethical and practical considerations, preserving space for critical reflection, and the ongoing need for human researchers' interpretive leadership. These strategies can guide both traditional and computational scholars aiming to harness automation effectively and responsibly—maintaining the creative, reflexive rigor of qualitative coding while capitalizing on the efficiency afforded by LLMs.
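To make the structured-prompt idea concrete, the sketch below assembles a coding prompt that asks the model to explain its reasoning before giving a final code decision; the codebook entry and wording are invented for illustration, not drawn from the study.

```python
# Illustrative sketch of a structured coding prompt with a chain-of-thought style
# explanation. The codebook entry is a made-up example.
CODEBOOK_ENTRY = (
    "Code 'collective_action': the passage describes people organising "
    "together to pursue a shared social or political goal."
)

def build_coding_prompt(passage: str) -> str:
    """Assemble a prompt that pairs a codebook definition with one passage."""
    return (
        f"Codebook definition:\n{CODEBOOK_ENTRY}\n\n"
        f"Passage:\n{passage}\n\n"
        "First explain, step by step, whether the definition applies to the passage. "
        "Then answer on a final line with exactly 'APPLY' or 'DO NOT APPLY'."
    )

# The prompt would be sent to the chosen LLM; the final line is parsed as the
# code decision, and the explanation is retained for human audit.
```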
  5. Purpose: Telepractice is a growing service model that delivers aural rehabilitation to deaf and hard-of-hearing children via telecommunications technology. Despite known benefits of telepractice, this delivery approach may increase patients' listening effort (LE), characterized as an allocation of cognitive resources toward an auditory task. The study tested techniques for collecting physiological measures of LE in normal-hearing (NH) children during remote (referred to as tele-) and in-person communication using the wearable Empatica E4 wristband. Method: Participants were 10 children (age range: 9–12 years) who came to two tele- and two in-person weekly sessions, order counterbalanced. During each session, the children heard a short passage read by the clinical provider, completed an auditory passage comprehension task, and self-rated their effort as part of the larger study. Measures of electrodermal activity and blood volume pulse amplitude were collected from the child's E4 wristband. Results: No differences in children's subjective or physiological measures of LE or in passage comprehension scores were found between in-person sessions and telesessions. However, an effect of treatment duration on subjective and physiological measures of LE was identified. Children self-reported a significant increase in LE over time, whereas their physiological measures showed a trend toward decreasing LE. A significant association between subjective measures and the passage comprehension task was found, suggesting that children who reported more effort demonstrated a higher proportion of correct responses. Conclusions: The study demonstrated the feasibility of collecting physiological measures of LE in NH children during remote and in-person communication using the E4 wristband. The results suggest that measures of LE are multidimensional and may reflect different sources of, or cognitive responses to, increased listening demand. Supplemental Material: https://doi.org/10.23641/asha.27122064
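As a rough sketch of how the wristband signals might be summarized per session before comparing delivery modes, consider the following; the function and the choice of summary statistics are assumptions for illustration, not the study's analysis.

```python
# Minimal sketch (data layout and summary statistics assumed, not taken from the
# study) of condensing E4 wristband signals into per-session indices.
import numpy as np

def session_summary(eda: np.ndarray, bvp: np.ndarray) -> dict:
    """Summarise electrodermal activity and blood volume pulse for one session."""
    return {
        "mean_eda": float(np.mean(eda)),               # average skin-conductance level
        "bvp_amplitude": float(np.mean(np.abs(bvp))),  # crude pulse-amplitude proxy
    }

# Summaries from tele- and in-person sessions could then be compared alongside
# the children's self-reported effort ratings.
```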