This content will become publicly available on September 1, 2026

Title: Enhancing Automated Grading in Science Education through LLM-Driven Causal Reasoning and Multimodal Analysis
Automated assessment of open responses in K–12 science education poses significant challenges due to the multimodal nature of student work, which often integrates textual explanations, drawings, and handwritten elements. Traditional evaluation methods that focus solely on textual analysis fail to capture the full breadth of student reasoning and are susceptible to biases such as handwriting neatness or answer length. In this paper, we propose a novel LLM-augmented multimodal evaluation framework that addresses these limitations through a comprehensive, bias-corrected grading system. Our approach leverages LLMs to generate causal knowledge graphs that encapsulate the essential conceptual relationships in student responses, comparing these graphs with those derived automatically from the rubrics and submissions. Experimental results demonstrate that our framework improves grading accuracy and consistency over deep supervised learning and few-shot LLM baselines.
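The framework's central step is comparing a causal graph extracted from the student's response against one derived from the rubric. Below is a minimal sketch of that comparison, assuming an upstream LLM has already turned each text into a set of (cause, effect) edges; the function name, edge format, and water-cycle example are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: score a student's causal graph against a rubric graph.
# Assumes an upstream LLM step has already extracted (cause, effect) edges;
# names and data here are illustrative, not the paper's actual pipeline.

def causal_graph_f1(rubric_edges: set[tuple[str, str]],
                    student_edges: set[tuple[str, str]]) -> float:
    """F1 overlap between rubric-expected and student-stated causal edges."""
    if not rubric_edges or not student_edges:
        return 0.0
    matched = rubric_edges & student_edges
    precision = len(matched) / len(student_edges)
    recall = len(matched) / len(rubric_edges)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical water-cycle rubric: heat -> evaporation -> condensation -> rain.
rubric = {("heat", "evaporation"), ("evaporation", "condensation"),
          ("condensation", "rain")}
student = {("heat", "evaporation"), ("evaporation", "rain")}
print(f"graph overlap score: {causal_graph_f1(rubric, student):.2f}")  # 0.40
```

An edge-level score like this rewards conceptual structure rather than surface features such as answer length, which is one way a graph comparison can counteract the biases the abstract describes.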
Award ID(s):
2530474, 2212174
PAR ID:
10637445
Author(s) / Creator(s):
Publisher / Repository:
International Joint Conferences on Artificial Intelligence Organization
Date Published:
ISBN:
978-1-956792-06-5
Page Range / eLocation ID:
10352 to 10360
Format(s):
Medium: X
Location:
Montreal, Canada
Sponsoring Org:
National Science Foundation
More Like This
  1. Mills, Caitlin; Alexandron, Giora; Taibi, Davide; Lo Bosco, Giosuè; Paquette, Luc (Ed.)
    Short-answer assessment is a vital component of science education, allowing evaluation of students' complex three-dimensional understanding. Large language models (LLMs), with their human-like ability on linguistic tasks, are increasingly used to assist human graders and reduce their workload. However, LLMs' limited domain knowledge restricts their grasp of task-specific requirements and hinders their performance. Retrieval-augmented generation (RAG) emerges as a promising solution by enabling LLMs to access relevant domain-specific knowledge during assessment. In this work, we propose an adaptive RAG framework for automated grading that dynamically retrieves and incorporates domain-specific knowledge based on the question and student-answer context. Our approach combines semantic search with curated educational sources to retrieve valuable reference materials. Experimental results on a science education dataset demonstrate that our system improves grading accuracy over baseline LLM approaches. The findings suggest that RAG-enhanced grading systems can serve as reliable grading support while delivering efficient performance gains.
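The retrieval step this abstract describes can be sketched as follows: embed the grading context, rank curated passages by cosine similarity, and prepend the top hits to the grading prompt. The `embed` placeholder, the passages, and the prompt wording are all assumptions for illustration; the paper's actual retriever and sources are not specified here.

```python
# Sketch of semantic retrieval feeding a RAG grading prompt. `embed` is a
# stand-in for a real sentence-embedding model; swap in your own.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: deterministic random unit vector per text.
    Replace with a real sentence-embedding model for meaningful retrieval."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank curated passages by cosine similarity to the grading context."""
    q = embed(query)  # unit vectors, so dot product == cosine similarity
    return sorted(passages, key=lambda p: float(q @ embed(p)), reverse=True)[:k]

question = "Why does a metal spoon feel colder than a wooden one?"
answer = "Metal moves heat away from your hand faster."
passages = [
    "Thermal conductivity measures how quickly a material transfers heat.",
    "Photosynthesis converts light energy into chemical energy.",
]
context = "\n".join(retrieve(f"{question} {answer}", passages))
prompt = (f"Reference material:\n{context}\n\n"
          f"Question: {question}\nStudent answer: {answer}\n"
          f"Using the reference material, assign a score from 0 to 2.")
```

Making the retrieval adaptive, as the abstract proposes, would mean conditioning what is retrieved (and how much) on the specific question and answer rather than issuing a fixed corpus query.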
  2. We present and evaluate a machine learning based system that automatically grades audio recordings of students speaking a foreign language. Automated systems that aid the assessment of student performance hold great promise for augmenting a teacher's ability to provide meaningful feedback and instruction. Teachers spend a significant amount of time grading student work, and such tools can recover much of that time, which could instead be used to give personalized attention to each student. Significant prior research has focused on the grading of closed-form problems, open-ended essays, and textual content, but little has addressed audio content, which is far more prevalent in language-learning education. In this paper, we explore the development of automated assessment tools for audio responses in a college-level Chinese language-learning course. We analyze several challenges posed by data of this type, as well as the generation and extraction of features for building machine learning models to aid in the assessment of student language learning.
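As a rough illustration of the feature-extraction stage such a system needs, the sketch below computes a few plausible acoustic features with librosa; the paper does not name its toolkit or features, so both are assumptions here.

```python
# Hypothetical acoustic features for a spoken-response grader, via librosa.
import librosa
import numpy as np

def audio_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)          # mono waveform at 16 kHz
    duration = len(y) / sr                        # response length in seconds
    voiced = librosa.effects.split(y, top_db=30)  # non-silent sample intervals
    speech_time = sum(end - start for start, end in voiced) / sr
    pause_ratio = 1.0 - speech_time / max(duration, 1e-6)  # hesitation proxy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # timbral summary
    return np.concatenate([[duration, pause_ratio],
                           mfcc.mean(axis=1), mfcc.std(axis=1)])
```

Per-response vectors like these can then be fed to any standard classifier or regressor trained against teacher-assigned grades.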
  3. Teachers often rely on a range of open-ended problems to assess students' understanding of mathematical concepts. Beyond traditional conceptions of open-ended student work, commonly textual short-answer or essay responses, figures, tables, number lines, graphs, and pictographs are other examples of open-ended work common in mathematics. While recent developments in natural language processing and machine learning have led to automated methods for scoring student open-ended work, these methods have largely been limited to textual answers. Several computer-based learning systems allow students to take pictures of handwritten work and include such images in their answers to open-ended questions. However, few existing solutions support the auto-scoring of handwritten or drawn answers. In this work, we build upon an existing method for auto-scoring textual student answers and explore the use of OpenAI/CLIP, a deep learning embedding method designed to represent both images and text, as well as Optical Character Recognition (OCR), to improve model performance. We evaluate our method on a dataset of student open responses containing both text- and image-based answers, and find a reduction in model error in the presence of images when controlling for other answer-level features.
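A minimal sketch of the embedding step follows, using the Hugging Face port of OpenAI/CLIP plus pytesseract for OCR. The checkpoint choice and the simple concatenation of text and image embeddings are assumptions for illustration, not the authors' exact architecture.

```python
# Combine CLIP text/image embeddings with OCR-recovered text for an
# image-bearing student answer; fusion by concatenation is an assumption.
import pytesseract
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def answer_features(text: str, image_path: str) -> torch.Tensor:
    image = Image.open(image_path).convert("RGB")
    ocr_text = pytesseract.image_to_string(image)  # recover handwritten text
    txt_in = processor(text=[f"{text} {ocr_text}"], return_tensors="pt",
                       truncation=True)
    img_in = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        txt_emb = model.get_text_features(**txt_in)   # shape (1, 512)
        img_emb = model.get_image_features(**img_in)  # shape (1, 512)
    # Concatenated features; a downstream scorer maps this to a grade.
    return torch.cat([txt_emb, img_emb], dim=-1).squeeze(0)
```

A regression or classification head trained on these concatenated vectors is one plausible way to realize the error reduction the abstract reports.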