skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on June 1, 2026

Title: Large Language Model and Traditional Machine Learning Scoring of Evolutionary Explanations: Benefits and Drawbacks
Few studies have compared Large Language Models (LLMs) to traditional Machine Learning (ML)-based automated scoring methods in terms of accuracy, ethics, and economics. Using a corpus of 1000 expert-scored and interview-validated scientific explanations derived from the ACORNS instrument, this study employed three LLMs and the ML-based scoring engine, EvoGrader. We measured scoring reliability (percentage agreement, kappa, precision, recall, F1), processing time, and explored contextual factors like ethics and cost. Results showed that with very basic prompt engineering, ChatGPT-4o achieved the highest performance across LLMs. Proprietary LLMs outperformed open-weight LLMs for most concepts. GPT-4o achieved robust but less accurate scoring than EvoGrader (~500 additional scoring errors). Ethical concerns over data ownership, reliability, and replicability over time were LLM limitations. EvoGrader offered superior accuracy, reliability, and replicability, but required, in its development a large, high-quality, human-scored corpus, domain expertise, and restricted assessment items. These findings highlight the diversity of considerations that should be used when considering LLM and ML scoring in science education. Despite impressive LLM advances, ML approaches may remain valuable in some contexts, particularly those prioritizing precision, reliability, replicability, privacy, and controlled implementation.  more » « less
Award ID(s):
2318346
PAR ID:
10630501
Author(s) / Creator(s):
;
Publisher / Repository:
Multidisciplinary Digital Publishing Institute
Date Published:
Journal Name:
Education Sciences
Volume:
15
Issue:
6
ISSN:
2227-7102
Page Range / eLocation ID:
676
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The burgeoning sophistication of Artificial Intelligence (AI) has catalyzed the rapid proliferation of Large Language Models (LLMs) within software development. These models are increasingly employed to automate the generation of functionally correct code, address complex computational problems, and facilitate the debugging of existing software systems. However, LLM-generated code often faces challenges due to inherent inefficiencies, including redundant logical structures, factually inconsistent content (hallucinations), and programming errors. To address this issue, our research rigorously evaluated the computational efficiency of Python code generated by three prominent LLMs: GPT-4o-Mini, GPT-3.5-Turbo, and GPT-4-Turbo. The evaluation metrics encompass execution time, memory utilization, and peak memory consumption, while maintaining the functional correctness of the generated code. Leveraging the EffiBench benchmark datasets within the Google Vertex AI Workbench environment, across a spectrum of machine configurations, the study implemented a consistent seed parameter to ensure experimental reproducibility. Furthermore, we investigated the impact of two distinct optimization strategies: Chain-of-Thought (CoT) prompting and model fine-tuning. Our findings reveal a significant enhancement in efficiency metrics for GPT-4o-Mini and GPT-3.5-Turbo when employing CoT prompting; however, this trend was not observed for GPT-4-Turbo. Based on its promising performance with CoT prompting, we selected the GPT-4o-Mini model for subsequent fine-tuning, aiming to further enhance both its computational efficiency and accuracy. However, contrary to our expectations, fine-tuning the GPT-4o-Mini model led to a discernible degradation in both its accuracy and computational efficiency. In conclusion, this study provides empirical evidence suggesting that the deployment of high-CPU machine configurations, in synergy with the utilization of the GPT-4o-Mini model and CoT prompting techniques, yields demonstrably more efficient and accurate LLM-generated Python code, particularly within computationally intensive application scenarios. 
    more » « less
  2. As large language models (LLMs) show great promise in generating a wide spectrum of educational materials, robust yet cost-effective assessment of the quality and effectiveness of such materials becomes an important challenge. Traditional approaches, including expert-based quality assessment and student-centered evaluation, are resource-consuming, and do not scale efficiently. In this work, we explored the use of pre-existing student learning data as a promising approach to evaluate LLM-generated learning materials. Specifically, we used a dataset where students were completing the program construction challenges by picking the correct answers among human-authored distractors to evaluate the quality of LLM-generated distractors for the same challenges. The dataset included responses from 1,071 students across 22 classes taught from Fall 2017 to Spring 2023. We evaluated five prominent LLMs (OpenAI-o1, GPT-4, GPT-4o, GPT-4o-mini, and Llama-3.1-8b) across three different prompts to see which combinations result in more effective distractors, i.e., those that are plausible (often picked by students), and potentially based on common misconceptions. Our results suggest that GPT-4o was the most effective model, matching close to 50% of the functional distractors originally authored by humans. At the same time, all of the evaluated LLMs generated many novel distractors, i.e., those that did not match the pre-existing human-authored ones. Our preliminary analysis shows that those appear to be promising. Establishing their effectiveness in real-world classroom settings is left for future work. 
    more » « less
  3. Hedges allow speakers to mark utterances as provisional, whether to signal non-prototypicality or “fuzziness”, to indicate a lack of commitment to an utterance, to attribute responsibility for a statement to someone else, to invite input from a partner, or to soften critical feedback in the service of face management needs. Here we focus on hedges in an experimentally parameterized corpus of 63 Roadrunner cartoon narratives spontaneously produced from memory by 21 speakers for co-present addressees, transcribed to text (Galati and Brennan, 2010). We created a gold standard of hedges annotated by human coders (the Roadrunner-Hedge corpus) and compared three LLM-based approaches for hedge detection: fine-tuning BERT, and zero and few-shot prompting with GPT-4o and LLaMA-3. The best-performing approach was a fine-tuned BERT model, followed by few-shot GPT-4o. After an error analysis on the top performing approaches, we used an LLM-in-the-Loop approach to improve the gold standard coding, as well as to highlight cases in which hedges are ambiguous in linguistically interesting ways that will guide future research. This is the first step in our research program to train LLMs to interpret and generate collateral signals appropriately and meaningfully in conversation. 
    more » « less
  4. Thematic Analysis (TA) is a fundamental method in healthcare research for analyzing transcript data, but it is resource-intensive and difficult to scale for large, complex datasets. This study investigates the potential of large language models (LLMs) to augment the inductive TA process in high-stakes healthcare settings. Focusing on interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we propose an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline. Our pipeline integrates an affordable state-of-the-art LLM (GPT-4o mini), LangChain, and prompt engineering with chunking techniques to analyze nine detailed transcripts following the inductive TA framework. We evaluate the LLM-generated themes against human-generated results using thematic similarity metrics, LLM-assisted assessments, and expert reviews. Results demonstrate that our pipeline outperforms existing LLM-assisted TA methods significantly. While the pipeline alone has not yet reached human-level quality in inductive TA, it shows great potential to improve scalability, efficiency, and accuracy while reducing analyst workload when working collaboratively with domain experts. We provide practical recommendations for incorporating LLMs into high-stakes TA workflows and emphasize the importance of close collaboration with domain experts to address challenges related to real-world applicability and dataset complexity. 
    more » « less
  5. Novice programmers often face challenges in designing computational artifacts and fixing code errors, which can lead to task abandonment and over-reliance on external support. While research has explored effective meta-cognitive strategies to scaffold novice programmers' learning, it is essential to first understand and assess students' conceptual, procedural, and strategic/conditional programming knowledge at scale. To address this issue, we propose a three-model framework that leverages Large Language Models (LLMs) to simulate, classify, and correct student responses to programming questions based on the SOLO Taxonomy. The SOLO Taxonomy provides a structured approach for categorizing student understanding into four levels: Pre-structural, Uni-structural, Multi-structural, and Relational. Our results showed that GPT-4o achieved high accuracy in generating and classifying responses for the Relational category, with moderate accuracy in the Uni-structural and Pre-structural categories, but struggled with the Multi-structural category. The model successfully corrected responses to the Relational level. Although further refinement is needed, these findings suggest that LLMs hold significant potential for supporting computer science education by assessing programming knowledge and guiding students toward deeper cognitive engagement. 
    more » « less