Medical systematic reviews play a vital role in healthcare decision making and policy. However, their production is time-consuming, limiting the availability of high-quality and up-to-date evidence summaries. Recent advancements in LLMs offer the potential to automatically generate literature reviews on demand, addressing this issue. Yet LLMs sometimes generate inaccurate (and potentially misleading) texts through hallucination or omission. In healthcare, this can make LLMs unusable at best and dangerous at worst. We conducted 16 interviews with international systematic review experts to characterize the perceived utility and risks of LLMs in the specific context of medical evidence reviews. Experts indicated that LLMs can assist in the writing process by drafting summaries, generating templates, distilling information, and cross-checking information. They also raised concerns regarding confidently composed but inaccurate LLM outputs and other potential downstream harms, including decreased accountability and proliferation of low-quality reviews. Informed by this qualitative analysis, we identify criteria for rigorous evaluation of biomedical LLMs aligned with domain expert views.
                    This content will become publicly available on February 3, 2026
                            
                            LLM-TA: An LLM-Enhanced Thematic Analysis Pipeline for Transcripts from Parents of Children with Congenital Heart Disease
                        
                    
    
            Thematic Analysis (TA) is a fundamental method in healthcare research for analyzing transcript data, but it is resource-intensive and difficult to scale for large, complex datasets. This study investigates the potential of large language models (LLMs) to augment the inductive TA process in high-stakes healthcare settings. Focusing on interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we propose an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline. Our pipeline integrates an affordable state-of-the-art LLM (GPT-4o mini), LangChain, and prompt engineering with chunking techniques to analyze nine detailed transcripts following the inductive TA framework. We evaluate the LLM-generated themes against human-generated results using thematic similarity metrics, LLM-assisted assessments, and expert reviews. Results demonstrate that our pipeline outperforms existing LLM-assisted TA methods significantly. While the pipeline alone has not yet reached human-level quality in inductive TA, it shows great potential to improve scalability, efficiency, and accuracy while reducing analyst workload when working collaboratively with domain experts. We provide practical recommendations for incorporating LLMs into high-stakes TA workflows and emphasize the importance of close collaboration with domain experts to address challenges related to real-world applicability and dataset complexity. 
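The chunking step of such a pipeline can be sketched roughly as follows. This is an illustrative outline, not the authors' code: the names (`chunk_transcript`, `THEME_PROMPT`) are assumptions, and the actual LLM-TA pipeline uses LangChain with GPT-4o mini rather than the placeholder call shown here.

```python
# Minimal sketch of splitting a long transcript into overlapping chunks
# that fit an LLM context window, then building a per-chunk theme prompt.
# All names are illustrative assumptions, not from the paper.

def chunk_transcript(text: str, max_chars: int = 1200, overlap: int = 200) -> list[str]:
    """Split a transcript into overlapping character-based chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return chunks

THEME_PROMPT = (
    "You are assisting an inductive thematic analysis.\n"
    "Read the transcript excerpt below and propose candidate themes,\n"
    "each with a short label and a supporting quote.\n\n"
    "Excerpt:\n{chunk}"
)

if __name__ == "__main__":
    transcript = "Parent: We were terrified at diagnosis. " * 100
    for chunk in chunk_transcript(transcript):
        prompt = THEME_PROMPT.format(chunk=chunk)
        # response = llm.invoke(prompt)  # e.g., a LangChain chat-model call
```

In a real pipeline the per-chunk theme candidates would then be merged and deduplicated across chunks before comparison with human-generated themes.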
        
    
- Award ID(s):
- 2505865
- PAR ID:
- 10631497
- Publisher / Repository:
- arXiv (https://doi.org/10.48550/arXiv.2502.01620)
- Date Published:
- arXiv ID:
- 2502.01620
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            As AI-assisted decision making becomes increasingly prevalent, individuals often fail to use AI-based decision aids appropriately, especially when AI explanations are absent, perhaps because they do not reflect critically on the AI's recommendations. Large language models (LLMs), with their exceptional conversational and analytical capabilities, present great opportunities to enhance AI-assisted decision making in the absence of AI explanations by providing natural-language analysis of the AI's recommendation, e.g., how each feature of a decision making task might contribute to it. In this paper, via a randomized experiment, we first show that presenting LLM-powered analysis of each task feature, either sequentially or concurrently, does not significantly improve people's AI-assisted decision performance. To enable decision makers to better leverage LLM-powered analysis, we then propose an algorithmic framework that characterizes the effects of LLM-powered analyses on human decisions and dynamically decides which analysis to present. Our evaluation with human subjects shows that this approach effectively improves decision makers' appropriate reliance on AI in AI-assisted decision making.
- 
            As demand grows for job-ready data science professionals, there is increasing recognition that traditional training often falls short in cultivating the higher-order reasoning and real-world problem-solving skills essential to the field. A foundational step toward addressing this gap is the identification and organization of knowledge components (KCs) that underlie data science problem solving (DSPS). KCs represent conditional knowledge, i.e., knowing which actions are appropriate in particular contexts or conditions, and correspond to the critical decisions data scientists must make throughout the problem-solving process. While existing taxonomies in data science education support curriculum development, they often lack the granularity and focus needed to support the assessment and development of DSPS skills. In this paper, we present a novel framework that combines the strengths of large language models (LLMs) and human expertise to identify, define, and organize KCs specific to DSPS. We treat LLMs as "knowledge engineering assistants" capable of generating candidate KCs by drawing on their extensive training data, which includes a vast amount of domain knowledge and diverse sets of real-world DSPS cases. Our process involves prompting multiple LLMs to generate decision points, synthesizing and refining KC definitions across models, and using sentence-embedding models to infer the underlying structure of the resulting taxonomy. Human experts then review and iteratively refine the taxonomy to ensure validity. This human-AI collaborative workflow offers a scalable and efficient proof-of-concept for LLM-assisted knowledge engineering. The resulting KC taxonomy lays the groundwork for developing fine-grained assessment tools and adaptive learning systems that support deliberate practice in DSPS.
Furthermore, the framework illustrates the potential of LLMs not just as content generators but as partners in structuring domain knowledge to inform instructional design. Future work will involve extending the framework by generating a directed graph of KCs based on their input-output dependencies and validating the taxonomy through expert consensus and learner studies. This approach contributes to both the practical advancement of DSPS coaching in data science education and the broader methodological toolkit for AI-supported knowledge engineering.
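The embedding-based structuring step might look roughly like the sketch below. The greedy single-pass grouping with a cosine-similarity threshold is an illustrative assumption, as are the toy vectors and all names; the paper uses sentence-embedding models whose exact clustering procedure is not reproduced here.

```python
# Illustrative sketch (not the authors' code) of grouping candidate knowledge
# components (KCs) by embedding similarity. In practice the toy vectors would
# come from a sentence-embedding model.
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def group_kcs(embeddings: dict[str, list[float]], threshold: float = 0.8) -> list[list[str]]:
    """Greedy single-pass grouping: each KC joins the first group whose
    seed KC it is sufficiently similar to, otherwise it starts a new group."""
    groups: list[list[str]] = []
    for kc, vec in embeddings.items():
        for group in groups:
            if cosine(embeddings[group[0]], vec) >= threshold:
                group.append(kc)
                break
        else:
            groups.append([kc])
    return groups
```

Groups produced this way would then be handed to human experts for review and refinement, as the abstract describes.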
- 
            Introduction: Recent AI advances, particularly the introduction of large language models (LLMs), have expanded the capacity to automate various tasks, including the analysis of text. This capability may be especially helpful in education research, where lack of resources often hampers the ability to perform various kinds of analyses, particularly those requiring a high level of expertise in a domain and/or a large set of textual data. For instance, we recently coded approximately 10,000 state K-12 computer science standards, requiring over 200 hours of work by subject matter experts. If LLMs are capable of completing a task such as this, the savings in human resources would be immense. Research Questions: This study explores two research questions: (1) How do LLMs compare to humans in the performance of an education research task? and (2) What do errors in LLM performance on this task suggest about current LLM capabilities and limitations? Methodology: We used a random sample of state K-12 computer science standards. We compared the output of three LLMs (ChatGPT, Llama, and Claude) to the work of human subject matter experts in coding the relationship between each state standard and a set of national K-12 standards. Specifically, the LLMs and the humans determined whether each state standard was identical to, similar to, based on, or different from the national standards and (if it was not different) which national standard it resembled. Results: Each of the LLMs identified a different national standard than the subject matter expert in about half of instances. When the LLM identified the same standard, it usually categorized the type of relationship (i.e., identical to, similar to, based on) in the same way as the human expert. However, the LLMs sometimes misidentified 'identical' standards. Discussion: Our results suggest that LLMs are not currently capable of matching human performance on the task of classifying learning standards.
The misidentification of some state standards as identical to national standards, when they clearly were not, is an interesting error, given that traditional computing technologies can easily identify identical text. Similarly, some of the mismatches between the LLM and human performance indicate clear errors on the part of the LLMs. However, some of the mismatches are difficult to assess, given the ambiguity inherent in this task and the potential for human error. We conclude the paper with recommendations for the use of LLMs in education research based on these findings.
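The agreement figures described in the Results could be computed along these lines. The data layout, labels, and function name below are hypothetical, invented purely for illustration of the comparison, not taken from the study.

```python
# Hypothetical sketch: compare LLM codings against human expert codings, where
# each state standard maps to (matched national standard, relationship label).

def agreement(human: dict[str, tuple[str, str]],
              llm: dict[str, tuple[str, str]]) -> dict[str, float]:
    """Fraction of state standards where the LLM picked the same national
    standard, and, among those, the same relationship category."""
    same_std = [s for s in human if llm[s][0] == human[s][0]]
    same_rel = [s for s in same_std if llm[s][1] == human[s][1]]
    return {
        "standard_match": len(same_std) / len(human),
        "relationship_match_given_std": (
            len(same_rel) / len(same_std) if same_std else 0.0
        ),
    }
```

This mirrors the two-level finding reported above: standard-level matches were right only about half the time, while relationship labels usually agreed once the standard matched.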
- 
            Geospatial data visualization on a map is an essential component of modern data exploration tools. However, these tools require users to manually configure the visualization style, including color scheme and attribute selection, a process that is both complex and domain-specific. Large Language Models (LLMs) provide an opportunity to intelligently assist in styling based on the underlying data distribution and characteristics. This paper demonstrates LASEK, an LLM-assisted visualization framework that automates attribute selection and styling in large-scale spatio-temporal datasets. The system leverages LLMs to determine which attributes should be highlighted for visual distinction and even suggests how to integrate them into styling options, improving interpretability and efficiency. We demonstrate our approach through interactive visualization scenarios, showing how LLM-driven attribute selection enhances clarity, reduces manual effort, and provides data-driven justifications for styling decisions.
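A prompt-construction step for such LLM-driven attribute selection might be sketched as follows. The function name, prompt wording, and statistics format are assumptions for illustration, not the LASEK system's actual interface.

```python
# Hedged sketch: summarize per-attribute statistics of a spatio-temporal
# dataset and ask an LLM which attributes to highlight and how to style them.
# Everything here is an illustrative assumption, not LASEK's real API.
import json

def build_styling_prompt(columns: dict[str, dict]) -> str:
    """Build an attribute-selection prompt from per-column summary stats."""
    summary = json.dumps(columns, indent=2)
    return (
        "Given these attribute statistics for a spatio-temporal dataset,\n"
        "choose up to two attributes to visually highlight on the map and\n"
        "suggest a color scheme for each, with a one-line justification.\n\n"
        f"{summary}"
    )

if __name__ == "__main__":
    stats = {"temperature_c": {"min": -5, "max": 41, "dtype": "float"},
             "station_id": {"distinct": 312, "dtype": "str"}}
    prompt = build_styling_prompt(stats)
    # response = llm.invoke(prompt)  # model call omitted in this sketch
```

The LLM's answer would then be parsed into concrete styling options (attribute choice, color ramp) applied by the map renderer.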
 An official website of the United States government