Surgical pathology reports contain, in free-text form, essential diagnostic information required for cancer staging, treatment planning, and cancer registry documentation. However, their unstructured nature and variability across tumor types and institutions pose challenges for automated data extraction. We present a consensus-driven, reasoning-based framework that uses multiple locally deployed large language models (LLMs) to extract six key diagnostic variables: site, laterality, histology, stage, grade, and behavior. Each LLM produces structured outputs with accompanying justifications, which are evaluated for accuracy and coherence by a separate reasoning model. Final consensus values are determined through aggregation, and expert validation is conducted by board-certified or equivalent pathologists. The framework was applied to over 4,000 pathology reports from The Cancer Genome Atlas (TCGA) and Moffitt Cancer Center. Expert review confirmed high agreement in the TCGA dataset for behavior (100.0%), histology (98.5%), site (95.2%), and grade (95.6%), with lower performance for stage (87.6%) and laterality (84.8%). In the Moffitt pathology reports (brain, breast, and lung), accuracy remained high across variables, with histology (95.6%), behavior (98.3%), and stage (92.4%) achieving strong agreement. Certain challenges emerged, however, such as inconsistent mention of sentinel lymph node details and anatomical ambiguity in biopsy site interpretations. Statistical analyses revealed significant main effects of model type, variable, and organ system, as well as model × variable × organ interactions, underscoring the role of clinical context in model performance. These results highlight the importance of stratified, multi-organ evaluation frameworks for benchmarking LLMs in clinical applications. Textual justifications enhanced interpretability and enabled human reviewers to audit model outputs. Overall, this consensus-based approach demonstrates that locally deployed LLMs can provide a transparent, accurate, and auditable solution for integrating AI-driven data extraction into real-world pathology workflows, including cancer registry abstraction and synoptic reporting.
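As a rough illustration of the consensus step described above, the sketch below aggregates the six structured variables returned by several locally deployed models with a simple majority vote and flags low-agreement cases for expert review. The function names, normalization, and tie handling are hypothetical assumptions for illustration, not the paper's implementation.

```python
from collections import Counter

# Minimal sketch (assumed interface): each locally deployed LLM returns a
# structured prediction plus a free-text justification for the six variables;
# a simple majority vote picks the consensus value per variable.
VARIABLES = ["site", "laterality", "histology", "stage", "grade", "behavior"]

def normalize(value: str) -> str:
    """Light normalization so trivially different strings can still agree."""
    return value.strip().lower()

def consensus(model_outputs: list[dict]) -> dict:
    """Aggregate per-variable predictions from several models.

    model_outputs: one dict per model, e.g.
        {"site": {"value": "Left upper lobe of lung", "justification": "..."}}
    Returns the majority value per variable, flagging ties or low agreement
    for expert review.
    """
    result = {}
    for var in VARIABLES:
        votes = Counter(
            normalize(out[var]["value"])
            for out in model_outputs
            if var in out and out[var].get("value")
        )
        if not votes:
            result[var] = {"value": None, "needs_review": True}
            continue
        value, count = votes.most_common(1)[0]
        total = sum(votes.values())
        result[var] = {
            "value": value,
            "agreement": count / total,
            "needs_review": count <= total / 2,  # no clear majority across models
        }
    return result
```

In this sketch, disagreement is surfaced rather than silently resolved, mirroring the paper's emphasis on auditable outputs and downstream pathologist validation.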
This content will become publicly available on April 12, 2026

Reasoning Beyond Accuracy: Expert Evaluation of Large Language Models in Diagnostic Pathology
Abstract

Background: Diagnostic pathology depends on complex, structured reasoning to interpret clinical, histologic, and molecular data. Replicating this cognitive process algorithmically remains a significant challenge. As large language models (LLMs) gain traction in medicine, it is critical to determine whether they have clinical utility by providing reasoning in highly specialized domains such as pathology.

Methods: We evaluated the performance of four reasoning LLMs (OpenAI o1, OpenAI o3-mini, Gemini 2.0 Flash Thinking Experimental, and DeepSeek-R1 671B) on 15 board-style open-ended pathology questions. Responses were independently reviewed by 11 pathologists using a structured framework that assessed language quality (accuracy, relevance, coherence, depth, and conciseness) and seven diagnostic reasoning strategies. Scores were normalized and aggregated for analysis. We also evaluated inter-observer agreement to assess scoring consistency. Model comparisons were conducted using one-way ANOVA and Tukey's Honestly Significant Difference (HSD) test.

Results: Gemini and DeepSeek significantly outperformed OpenAI o1 and OpenAI o3-mini in overall reasoning quality (p < 0.05), particularly in analytical depth and coherence. While all models achieved comparable accuracy, only Gemini and DeepSeek consistently applied expert-like reasoning strategies, including algorithmic, inductive, and Bayesian approaches. Performance varied by reasoning type: models performed best in algorithmic and deductive reasoning and poorest in heuristic reasoning and pattern recognition. Inter-observer agreement was highest for Gemini (p < 0.05), indicating greater consistency and interpretability. Models with more in-depth reasoning (Gemini and DeepSeek) were generally less concise.

Conclusion: Advanced LLMs such as Gemini and DeepSeek can approximate aspects of expert-level diagnostic reasoning in pathology, particularly in algorithmic and structured approaches. However, limitations persist in contextual reasoning, heuristic decision-making, and consistency across questions. Addressing these gaps, along with the trade-off between depth and conciseness, will be essential for the safe and effective integration of AI tools into clinical pathology workflows.
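As a rough illustration of the model-comparison step described in the Methods, the sketch below runs a one-way ANOVA across models on normalized reviewer scores and follows up with Tukey's HSD. The long-format layout and the column names ("model", "score") are assumptions made for illustration, not the study's actual analysis code.

```python
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_models(scores: pd.DataFrame) -> None:
    """Compare reasoning quality across LLMs.

    scores: long-format table with one row per (reviewer, question, model)
    and columns 'model' and 'score' (normalized reasoning-quality score).
    """
    # Omnibus test: does mean score differ across the four models?
    groups = [g["score"].values for _, g in scores.groupby("model")]
    f_stat, p_value = f_oneway(*groups)
    print(f"one-way ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

    # Post-hoc pairwise comparison only if the omnibus test is significant.
    if p_value < 0.05:
        tukey = pairwise_tukeyhsd(scores["score"], scores["model"], alpha=0.05)
        print(tukey.summary())
```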
- Award ID(s): 2234468
- PAR ID: 10599123
- Publisher / Repository: medRxiv
- Date Published:
- Format(s): Medium: X
- Institution: medRxiv
- Sponsoring Org: National Science Foundation
More Like this
- Large Language Models (LLMs) have made significant strides in various intelligent tasks but still struggle with complex action reasoning tasks that require systematic search. To address this limitation, we introduce a method that bridges the natural language understanding capability of LLMs with the symbolic reasoning capability of action languages (formal languages for reasoning about actions). Our approach, termed LLM+AL, leverages the LLM's strengths in semantic parsing and commonsense knowledge generation alongside the action language's expertise in automated reasoning based on encoded knowledge. We compare LLM+AL against state-of-the-art LLMs, including ChatGPT-4, Claude 3 Opus, Gemini Ultra 1.0, and o1-preview, using benchmarks for complex reasoning about actions. Our findings indicate that while all methods exhibit various errors, LLM+AL, with relatively simple human corrections, consistently leads to correct answers, whereas using LLMs alone does not yield improvements even after human intervention. LLM+AL also contributes to the automated generation of action languages.
- As large language models (LLMs) show great promise in generating a wide spectrum of educational materials, robust yet cost-effective assessment of the quality and effectiveness of such materials becomes an important challenge. Traditional approaches, including expert-based quality assessment and student-centered evaluation, are resource-consuming and do not scale efficiently. In this work, we explored the use of pre-existing student learning data as a promising approach to evaluate LLM-generated learning materials. Specifically, we used a dataset where students were completing program construction challenges by picking the correct answers among human-authored distractors to evaluate the quality of LLM-generated distractors for the same challenges. The dataset included responses from 1,071 students across 22 classes taught from Fall 2017 to Spring 2023. We evaluated five prominent LLMs (OpenAI-o1, GPT-4, GPT-4o, GPT-4o-mini, and Llama-3.1-8b) across three different prompts to see which combinations result in more effective distractors, i.e., those that are plausible (often picked by students) and potentially based on common misconceptions. Our results suggest that GPT-4o was the most effective model, matching close to 50% of the functional distractors originally authored by humans. At the same time, all of the evaluated LLMs generated many novel distractors, i.e., those that did not match the pre-existing human-authored ones. Our preliminary analysis shows that those appear to be promising. Establishing their effectiveness in real-world classroom settings is left for future work.
- Large Language Models (LLMs) are increasingly applied across business, education, and cybersecurity domains. However, LLMs can yield varied outputs for the same query due to differences in architecture, training data, and response generation mechanisms. This paper examines model variability and uncertainty by comparing the responses of three LLMs (ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3) to a query on ranking practical intrusion detection systems (IDS). The analysis highlights key similarities and differences in the models' outputs, offering insight into their respective reasoning and consistency.
- Abstract Purpose: Measurable results of efforts to teach empathy to engineering students are sparse and somewhat mixed. This study's objectives are (O1) to understand how empathy training affects students' professional development relative to other educational experiences, (O2) to track empathy changes due to training over multiple years, and (O3) to understand how and what students learn in empathy training environments. Methods: Students in a multiple-semester empathy course completed surveys ranking the career development impact of the empathy program against other college experiences (O1), rating learning of specific empathy skills (O2), and ranking program elements' impact on empathy skills (O3). Intervention and control groups completed the Interpersonal Reactivity Index and Jefferson Scale of Empathy at four time points (O2). Cohort students participated in post-program interviews (O1, O3). Results: O1: Empathy training impacted career development more than several typical college activities but less than courses in major. O2: Students reported gains in four taught empathy skills. Cohort students showed significant increases in the Jefferson Scale while the control group did not. There were no significant changes in Interpersonal Reactivity Index scores. O3: Interactive exercises had a significant effect on students' learning of all empathy skills, while interactions with people with disabilities had a significant effect on learning to encounter others with genuineness. Students valued building a safe in-class community facilitating their success in experiential environments. Conclusions: This study highlights empathy skills' importance in engineering students' development, shows gains in empathy with training, and uncovers key factors in students' learning experience that can be incorporated into engineering curricula.