Title: Grading explanations of problem-solving process and generating feedback using large language models at human-level accuracy
This study examines the feasibility and potential advantages of using large language models, in particular GPT-4o, to perform partial credit grading of large numbers of student-written responses to introductory-level physics problems. Students were instructed to write down verbal explanations of their reasoning process when solving one conceptual and two numerical calculation problems on two exams. The explanations were then graded according to a three-item rubric with each item graded as binary (1 or 0). We first demonstrate that machine grading using GPT-4o with no examples or reference answers can reliably agree with human graders in 70%–80% of all cases, which is equal to or higher than the level at which two human graders agree with each other. Two methods are essential for achieving this level of accuracy: (i) adding explanation language to each rubric item that targets the errors of initial machine grading, and (ii) running the grading process five times and taking the most frequent outcome. Next, we show that the variation in outcomes across five machine grading attempts can serve as a grading confidence index. The index allows a human expert to identify 40% of all potentially incorrect gradings by reviewing just the 10%–15% of responses with the highest variation. Finally, we show that it is straightforward to use GPT-4o to write a clear and detailed explanation of the partial credit grading outcome. Those explanations can be used as feedback for students, which will allow students to understand their grades and raise objections when necessary. Almost all feedback messages generated were rated three or above on a five-point scale by two instructors who had taught the course multiple times. The entire grading and feedback generating process costs roughly $5 per 100 student answers, which shows immense promise for automating the labor-intensive grading process through a combination of machine grading with human input and supervision.
Published by the American Physical Society 2025
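The two ideas in the abstract, majority voting over repeated grading runs and using run-to-run variation as a confidence index, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-item voting, and the disagreement measure (fraction of scores that differ from the majority) are all assumptions made for illustration, and the abstract does not specify how the index is actually computed.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One binary score (1 or 0) per rubric item; three items as in the abstract.
Grade = Tuple[int, ...]


def majority_grade(runs: List[Grade]) -> Grade:
    """Most frequent score for each rubric item across repeated runs.

    Assumption: voting is done per rubric item. With binary scores and an
    odd number of runs (e.g. five) there can be no ties.
    """
    return tuple(Counter(item).most_common(1)[0][0] for item in zip(*runs))


def disagreement(runs: List[Grade]) -> float:
    """Hypothetical confidence index: share of scores differing from the majority."""
    final = majority_grade(runs)
    differing = sum(
        score != final[i] for run in runs for i, score in enumerate(run)
    )
    return differing / (len(runs) * len(final))


def flag_for_review(all_runs: Dict[str, List[Grade]], fraction: float = 0.15) -> List[str]:
    """Return the responses with the highest disagreement for human review."""
    ranked = sorted(all_runs, key=lambda rid: disagreement(all_runs[rid]), reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k]


if __name__ == "__main__":
    # Made-up example data: five grading runs for each of three responses.
    example = {
        "resp_a": [(1, 1, 0)] * 5,
        "resp_b": [(1, 1, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 0)],
        "resp_c": [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1), (0, 0, 0)],
    }
    for rid, runs in example.items():
        print(rid, majority_grade(runs), disagreement(runs))
    print("review first:", flag_for_review(example))
```

In this sketch a response whose five runs all agree has a disagreement of 0 and is never prioritized, while responses with split votes rise to the top of the review queue. The `fraction` parameter mirrors the 10%–15% review budget mentioned in the abstract.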
Award ID(s):
1845436
PAR ID:
10580008
Publisher / Repository:
American Physical Society
Journal Name:
Physical Review Physics Education Research
Volume:
21
Issue:
1
ISSN:
2469-9896
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Martin, Fred; Norouzi, Narges; Rosenthal, Stephanie (Ed.)
    This paper examines the use of LLMs to support the grading and explanation of short-answer formative assessments in K12 science topics. While significant work has been done on programmatically scoring well-structured student assessments in math and computer science, many of these approaches produce a numerical score and stop short of providing teachers and students with explanations for the assigned scores. In this paper, we investigate few-shot, in-context learning with chain-of-thought reasoning and active learning using GPT-4 for automated assessment of students’ answers in a middle school Earth Science curriculum. Our findings from this human-in-the-loop approach demonstrate success in scoring formative assessment responses and in providing meaningful explanations for the assigned score. We then perform a systematic analysis of the advantages and limitations of our approach. This research provides insight into how we can use human-in-the-loop methods for the continual improvement of automated grading for open-ended science assessments. 
  2. The ability of students to “Explain in Plain English” (EiPE) the purpose of code is a critical skill for students in introductory programming courses to develop. EiPE questions serve as a mechanism for students to both develop and demonstrate code comprehension skills. However, evaluating this skill has been challenging as manual grading is time consuming and not easily automated. The process of constructing a prompt for the purposes of code generation for a Large Language Model, such as OpenAI’s GPT-4, bears a striking resemblance to constructing EiPE responses. In this paper, we explore the potential of using test cases run on code generated by GPT-4 from students’ EiPE responses as a grading mechanism for EiPE questions. We applied this proposed grading method to a corpus of EiPE responses collected from past exams, then measured agreement between the results of this grading method and human graders. Overall, we find moderate agreement between the human raters and the results of the unit tests run on the generated code. This appears to be attributable to GPT-4’s code generation being more lenient than human graders on low-level descriptions of code.
  3. Human-conducted rating tasks are resource-intensive and demand significant time and financial commitments. As Large Language Models (LLMs) like GPT emerge and exhibit prowess across various domains, their potential in automating such evaluation tasks becomes evident. In this research, we leveraged four prominent LLMs: GPT-4, GPT-3.5, Vicuna, and PaLM 2, to scrutinize their aptitude in evaluating teacher-authored mathematical explanations. We utilized a detailed rubric that encompassed accuracy, explanation clarity, the correctness of mathematical notation, and the efficacy of problem-solving strategies. During our investigation, we unexpectedly discerned the influence of HTML formatting on these evaluations. Notably, GPT-4 consistently favored explanations formatted with HTML, whereas the other models displayed mixed inclinations. When gauging Inter-Rater Reliability (IRR) among these models, only Vicuna and PaLM 2 demonstrated high IRR using the conventional Cohen’s Kappa metric for explanations formatted with HTML. Intriguingly, when a more relaxed version of the metric was applied, all model pairings showcased robust agreement. These revelations not only underscore the potential of LLMs in providing feedback on student-generated content but also illuminate new avenues, such as reinforcement learning, which can harness the consistent feedback from these models. 
  4. Fair and consistent assessment of student learning is critical in educational settings, particularly when evaluating the impact of instructional innovations. Although widely used for efficiency, output-based auto-grading often falls short in capturing partial understanding—limiting its effectiveness for measuring learning gains. This paper presents an empirical evaluation of a rubric-based, question-focused, double-grading protocol for written-response (WR) coding questions in pre- and post-tests from a large introductory programming course. This work provides both methodological insights and practical guidance for scaling reliable grading of WR coding questions. To balance efficiency and accuracy, each grader scored a specific question item across all submissions, with two graders assigned per item. Adjudication was triggered when score differences exceeded a 20% threshold. Intraclass Correlation Coefficient (ICC) analysis identified two items with initially low inter-rater reliability. After rubric clarification and regrading, reliability improved substantially, with ICC values ranging from 0.892 to 0.967 (all data) and 0.831 to 0.875 (excluding zero scores). We describe the iterative development of the assessment process and show how this structured approach—combined with ICC analysis as a diagnostic tool and targeted adjudication—achieves strong inter-grader reliability. The framework is scalable and robust for WR coding question evaluation in CS1 settings and is adaptable to a range of instructional contexts. These findings support instructors and researchers seeking consistent, practical methods for assessing WR student work in programming courses. 
  5. Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from the University of California, Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of handwritten, in-person proctored free-response quiz submissions from nearly 800 students included in the paper’s empirical analysis. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable real-course deployment. Building on the dataset and evaluation framework developed here, we outline a path toward a future standardized benchmark for AI grading of handwritten mathematics to support reproducible evaluation, transparent comparison, reliable deployment, and future research. 