Collaborative large language models for automated data extraction in living systematic reviews

Khan, Muhammad Ali; Ayub, Umair; Naqvi, Syed Arsalan Ahmed; Khakwani, Kaneez Zahra Rubab; Sipra, Zaryab bin Riaz; Raina, Ammad; Zhou, Sihan; He, Huan; Saeidi, Amir; Hasan, Bashar; Rumble, Robert Bryan; Bitterman, Danielle S; Warner, Jeremy L; Zou, Jia; Tevaarwerk, Amye J; Leventakos, Konstantinos; Kehl, Kenneth L; Palmer, Jeanne M; Murad, Mohammad Hassan; Baral, Chitta; Riaz, Irbaz bin

doi:10.1093/jamia/ocae325

Abstract

Objective: Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow leveraging large language models (LLMs) that mimics the real-world 2-reviewer process.

Materials and Methods: A dataset of 10 trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The dataset was split into prompt development (n = 5) and held-out test sets (n = 17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the 2 LLMs were considered concordant if they were the same for a given variable. The discordant responses from each LLM were provided to the other LLM for cross-critique. Accuracy, ie, the total number of correct responses divided by the total number of responses, was computed to assess performance.

Results: In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, increasing accuracy to 0.76.

Discussion: Concordant responses by the LLMs are likely to be accurate. In instances of discordant responses, cross-critique can further increase the accuracy.

Conclusion: Large language models, when simulated in a collaborative, 2-reviewer workflow, can extract data with reasonable performance, enabling truly “living” systematic reviews.
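The concordance check, cross-critique step, and accuracy metric described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the `(publication_id, variable)` keying, and the whitespace and case normalization used to decide whether two responses are "the same" are all assumptions made for illustration, and the two critique callables are hypothetical stand-ins for a second call to each LLM.

```python
from typing import Callable, Dict, Tuple

# Hypothetical layout: (publication_id, variable_name) -> extracted value.
Key = Tuple[str, str]
Responses = Dict[Key, str]
# Hypothetical critique signature: (key, own_answer, other_answer) -> revised answer.
Critique = Callable[[Key, str, str], str]


def normalize(value: str) -> str:
    """Assumed notion of 'the same': ignore case and extra whitespace."""
    return " ".join(value.lower().split())


def split_by_concordance(a: Responses, b: Responses):
    """Separate responses on which both extractors agree from those where they differ."""
    concordant: Responses = {}
    discordant: Dict[Key, Tuple[str, str]] = {}
    for key in a.keys() & b.keys():
        if normalize(a[key]) == normalize(b[key]):
            concordant[key] = a[key]
        else:
            discordant[key] = (a[key], b[key])
    return concordant, discordant


def cross_critique(discordant, critique_a: Critique, critique_b: Critique):
    """Show each extractor the other's answer and re-check agreement afterwards."""
    resolved: Responses = {}
    unresolved: Dict[Key, Tuple[str, str]] = {}
    for key, (val_a, val_b) in discordant.items():
        new_a = critique_a(key, val_a, val_b)
        new_b = critique_b(key, val_b, val_a)
        if normalize(new_a) == normalize(new_b):
            resolved[key] = new_a
        else:
            unresolved[key] = (new_a, new_b)
    return resolved, unresolved


def accuracy(responses: Responses, gold: Responses) -> float:
    """Correct responses divided by total responses, as defined in the abstract."""
    if not responses:
        raise ValueError("accuracy is undefined for an empty response set")
    correct = sum(normalize(v) == normalize(gold[k]) for k, v in responses.items())
    return correct / len(responses)


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    gold = {("pub1", "phase"): "Phase 3", ("pub1", "n_randomized"): "120"}
    model_a = {("pub1", "phase"): "Phase 3", ("pub1", "n_randomized"): "120"}
    model_b = {("pub1", "phase"): "phase 3", ("pub1", "n_randomized"): "102"}

    concordant, discordant = split_by_concordance(model_a, model_b)
    # Stub critiques: extractor A keeps its answer, extractor B adopts A's.
    resolved, unresolved = cross_critique(
        discordant,
        critique_a=lambda key, own, other: own,
        critique_b=lambda key, own, other: other,
    )
    print("concordant accuracy:", accuracy(concordant, gold))
    print("resolved after critique:", len(resolved), "unresolved:", len(unresolved))
```

As a consistency check on the reported counts, and assuming every one of the 17 test publications contributed all 23 variables, the test set would hold 17 × 23 = 391 responses, which matches the 342 concordant plus 49 discordant responses in the Results.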
