In this study, we have explored the effectiveness of two instructional approaches in the context of the motion of objects falling at terminal speed in the presence of air resistance. We ground these instructional approaches in dual-process theories of reasoning, which assert that human cognition relies on two thinking processes. Dual-process theories suggest multiple possible avenues by which instruction might impact student reasoning. In this paper, we compare two possible instructional approaches: one designed to reinforce the normative approach (improving the outputs of the intuitive process) and another that guides students to reflect on and analyze their initial ideas (supporting the analytic process). The results suggest that for students who have already demonstrated a minimum level of requisite knowledge, instruction that supports analysis of their likely intuitive mental model leads to greater learning benefits in the short term than instruction that focuses solely on providing practice with the normative mindware. These results have implications for the design of instructional materials and help to demonstrate how dual-process theories can be leveraged to explain the success of existing research-based materials. Published by the American Physical Society, 2024.
This content will become publicly available on March 1, 2026
Grading explanations of problem-solving process and generating feedback using large language models at human-level accuracy
This study examines the feasibility and potential advantages of using large language models, in particular GPT-4o, to perform partial credit grading of large numbers of student written responses to introductory level physics problems. Students were instructed to write down verbal explanations of their reasoning process when solving one conceptual and two numerical calculation problems on two exams. The explanations were then graded according to a three-item rubric, with each item graded as binary (1 or 0). We first demonstrate that machine grading using GPT-4o with no examples or reference answers can reliably agree with human graders in 70%–80% of all cases, which is equal to or higher than the rate at which two human graders agree with each other. Two methods are essential for achieving this level of accuracy: (i) adding explanation language to each rubric item that targets the errors of initial machine grading and (ii) running the grading process five times and taking the most frequent outcome. Next, we show that the variation in outcomes across five machine grading attempts can serve as a grading confidence index. The index allows a human expert to identify potentially incorrect gradings by reviewing just the 10%–15% of responses with the highest variation. Finally, we show that it is straightforward to use GPT-4o to write a clear and detailed explanation of the partial credit grading outcome. Those explanations can be used as feedback for students, allowing them to understand their grades and raise objections when necessary. Almost all feedback messages generated were rated three or above on a five-point scale by two instructors who had taught the course multiple times. The entire grading and feedback generation process costs roughly $5 per 100 student answers, which shows immense promise for automating a labor-intensive grading process through a combination of machine grading with human input and supervision. Published by the American Physical Society, 2025.
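The repeated-grading scheme described in the abstract (several independent grading runs, majority vote on the outcome, and run-to-run variation as a confidence index) can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the rubric text, prompt wording, and output format are invented placeholders, and it assumes the openai Python client with an API key set in the environment.

```python
# Hypothetical sketch of repeated GPT-4o grading with majority vote.
# Rubric, prompts, and output parsing are placeholders, not the study's.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Score each item 1 or 0:
1. Identifies the correct physical principle.
2. Sets up the equations correctly.
3. Explains the reasoning behind each step."""


def grade_once(response_text: str) -> str:
    """Ask GPT-4o for a binary score on each rubric item, e.g. '1,0,1'."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You grade physics explanations."},
            {"role": "user", "content": f"{RUBRIC}\n\nStudent answer:\n"
                                        f"{response_text}\n\n"
                                        "Reply with only the three scores, "
                                        "comma-separated."},
        ],
    )
    return completion.choices[0].message.content.strip()


def grade_with_confidence(response_text: str, runs: int = 5):
    """Return (majority outcome, agreement fraction) over repeated runs."""
    outcomes = Counter(grade_once(response_text) for _ in range(runs))
    outcome, count = outcomes.most_common(1)[0]
    return outcome, count / runs  # low agreement flags answers for review
```

Under this scheme, responses whose agreement fraction falls below some threshold would be the 10%–15% routed to a human reviewer.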
- Award ID(s): 1845436
- PAR ID: 10580008
- Publisher / Repository: American Physical Society
- Date Published:
- Journal Name: Physical Review Physics Education Research
- Volume: 21
- Issue: 1
- ISSN: 2469-9896
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Martin Fred; Norouzi, Narges; Rosenthal, Stephanie (Eds.) This paper examines the use of LLMs to support the grading and explanation of short-answer formative assessments in K-12 science topics. While significant work has been done on programmatically scoring well-structured student assessments in math and computer science, many of these approaches produce a numerical score and stop short of providing teachers and students with explanations for the assigned scores. In this paper, we investigate few-shot, in-context learning with chain-of-thought reasoning and active learning using GPT-4 for automated assessment of students' answers in a middle school Earth Science curriculum. Our findings from this human-in-the-loop approach demonstrate success in scoring formative assessment responses and in providing meaningful explanations for the assigned score. We then perform a systematic analysis of the advantages and limitations of our approach. This research provides insight into how we can use human-in-the-loop methods for the continual improvement of automated grading for open-ended science assessments.
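As a rough illustration of the few-shot, chain-of-thought prompting this abstract describes, a grading prompt might be assembled from worked examples as below. Everything here (the score scale, example answers, and rationales) is an invented placeholder, not the study's materials.

```python
# Hypothetical few-shot, chain-of-thought grading prompt builder.
# Examples and rubric scale are fabricated for illustration only.
FEW_SHOT_EXAMPLES = [
    {
        "answer": "Erosion moved the soil because water flowed downhill.",
        "reasoning": "Names the mechanism (water flow) and links it to "
                     "the observed change, so the response is complete.",
        "score": 2,
    },
    {
        "answer": "The soil moved.",
        "reasoning": "Restates the observation with no mechanism.",
        "score": 0,
    },
]


def build_prompt(student_answer: str) -> str:
    """Show worked gradings (answer, reasoning, score) before the new case."""
    parts = ["Grade the answer on a 0-2 scale. Think step by step."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Answer: {ex['answer']}\n"
                     f"Reasoning: {ex['reasoning']}\nScore: {ex['score']}")
    parts.append(f"Answer: {student_answer}\nReasoning:")
    return "\n\n".join(parts)
```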
-
Human-conducted rating tasks are resource-intensive and demand significant time and financial commitments. As Large Language Models (LLMs) like GPT emerge and exhibit prowess across various domains, their potential in automating such evaluation tasks becomes evident. In this research, we leveraged four prominent LLMs: GPT-4, GPT-3.5, Vicuna, and PaLM 2, to scrutinize their aptitude in evaluating teacher-authored mathematical explanations. We utilized a detailed rubric that encompassed accuracy, explanation clarity, the correctness of mathematical notation, and the efficacy of problem-solving strategies. During our investigation, we unexpectedly discerned the influence of HTML formatting on these evaluations. Notably, GPT-4 consistently favored explanations formatted with HTML, whereas the other models displayed mixed inclinations. When gauging Inter-Rater Reliability (IRR) among these models, only Vicuna and PaLM 2 demonstrated high IRR using the conventional Cohen's Kappa metric for explanations formatted with HTML. Intriguingly, when a more relaxed version of the metric was applied, all model pairings showcased robust agreement. These revelations not only underscore the potential of LLMs in providing feedback on student-generated content but also illuminate new avenues, such as reinforcement learning, which can harness the consistent feedback from these models.
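For reference, the pairwise inter-rater reliability analysis this abstract mentions can be computed with scikit-learn's Cohen's kappa implementation. The sketch below is illustrative only; the score arrays are fabricated stand-ins for the model ratings.

```python
# Pairwise Cohen's kappa between raters; ratings are invented placeholders.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

ratings = {
    "gpt4": [1, 0, 1, 1, 0, 1],
    "vicuna": [1, 0, 1, 0, 0, 1],
    "palm2": [1, 0, 1, 0, 0, 1],
}

for a, b in combinations(ratings, 2):
    kappa = cohen_kappa_score(ratings[a], ratings[b])
    print(f"{a} vs {b}: kappa = {kappa:.2f}")
```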
-
We propose and analyze deterministic protocols to generate qudit photonic graph states from quantum emitters. We show that our approach can be applied to generate any qudit graph state and we exemplify it by constructing protocols to generate one- and two-dimensional qudit cluster states, absolutely maximally entangled states, and logical states of quantum error-correcting codes. Some of these protocols make use of time-delayed feedback, while others do not. The only additional resource requirement compared to the qubit case is the ability to control multilevel emitters. These results significantly broaden the range of multiphoton entangled states that can be produced deterministically from quantum emitters. Published by the American Physical Society, 2024.
-
Over the course of the introductory calculus-based physics course, students are often expected to build conceptual understanding and develop and refine skills in problem solving and qualitative inferential reasoning. Many of the research-based materials developed over the past 30 years by the physics education research community use sequences of scaffolded questions to step students through a qualitative inferential reasoning chain. It is often tacitly assumed that, in addition to building conceptual understanding, such materials improve qualitative reasoning skills. However, clear documentation of the impact of such materials on qualitative reasoning skills is critical. New methodologies are needed to better study reasoning processes and to disentangle, to the extent possible, processes related to physics content from processes general to all human reasoning. As a result, we have employed network analysis methodologies to examine student responses to reasoning-related tasks in order to gain deeper insight into the nature of student reasoning in physics. In this paper, we show that network analysis metrics are both interpretable and valuable when applied to student reasoning data generated from reasoning chain construction tasks. We also demonstrate that documentation of improvements in the articulation of specific lines of reasoning can be obtained from a network analysis of responses to such tasks. Published by the American Physical Society, 2024.
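The network-analysis idea this abstract describes can be illustrated with networkx: code each response as an ordered chain of reasoning elements, aggregate the chains into a weighted directed graph, and read off standard metrics. A minimal sketch follows; the element labels and responses are invented placeholders, not the study's data.

```python
# Build a directed graph from coded reasoning chains and compute metrics.
# Responses and element labels are fabricated for illustration only.
import networkx as nx

# Each response is a sequence of coded reasoning elements.
responses = [
    ["net force", "acceleration", "velocity change"],
    ["net force", "velocity change"],
    ["acceleration", "velocity change"],
]

G = nx.DiGraph()
for chain in responses:
    for src, dst in zip(chain, chain[1:]):
        # Edge weights count how often one element follows another.
        if G.has_edge(src, dst):
            G[src][dst]["weight"] += 1
        else:
            G.add_edge(src, dst, weight=1)

print("density:", nx.density(G))
print("in-degree:", dict(G.in_degree()))
```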