Search for: All records where Creators/Authors contains: "Croteau, E"


  1. Human-conducted rating tasks are resource-intensive, demanding substantial time and money. As Large Language Models (LLMs) such as GPT demonstrate strong performance across many domains, their potential for automating such evaluation tasks becomes evident. In this research, we used four prominent LLMs (GPT-4, GPT-3.5, Vicuna, and PaLM 2) to evaluate teacher-authored mathematical explanations against a detailed rubric covering accuracy, explanation clarity, correctness of mathematical notation, and efficacy of problem-solving strategies. During this investigation, we unexpectedly found that HTML formatting influenced the evaluations: GPT-4 consistently favored explanations formatted with HTML, whereas the other models showed mixed preferences. When measuring Inter-Rater Reliability (IRR) among the models, only Vicuna and PaLM 2 showed high IRR under the conventional Cohen's Kappa metric for HTML-formatted explanations; under a more relaxed version of the metric, all model pairings showed strong agreement. These findings underscore the potential of LLMs to provide feedback on student-generated content and point to new directions, such as reinforcement learning, that could harness the models' consistent feedback. (A sketch of the kappa computation appears after this list.)
  2. The use of Bayesian Knowledge Tracing (BKT) models to predict student learning and mastery, especially in mathematics, is a well-established approach in learning analytics. In this work, we report on our analysis of the generalizability of BKT models across academic years, a concern attributed to "detector rot." We assess generalizability by comparing each model's performance at predicting student knowledge within its own academic year against its performance across academic years. Models were trained on data from two popular open-source curricula available through Open Educational Resources. We observed that the models were generally highly performant at predicting student learning within an academic year, while models trained on some academic years generalized better than others. We posit that Knowledge Tracing (KT) models are relatively stable in performance across academic years yet remain susceptible to systemic changes and shifts in underlying learner behavior. Based on the evidence in this paper, learning platforms leveraging KT models should be mindful of systemic changes or drastic shifts in certain user demographics. (A sketch of the BKT update appears after this list.)
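As context for the IRR analysis in item 1, the following is a minimal sketch of how agreement between two LLM raters could be computed with Cohen's kappa. The abstract does not specify its "relaxed" metric; a linearly weighted kappa, which gives partial credit to near-misses on an ordinal rubric scale, is shown here as one common relaxation. The rating arrays are illustrative placeholders, not data from the study.

    # Minimal sketch: inter-rater reliability between two LLM raters
    # scoring explanations on a hypothetical 1-5 rubric scale.
    from sklearn.metrics import cohen_kappa_score

    vicuna_scores = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]  # hypothetical ratings
    palm2_scores  = [5, 4, 3, 3, 5, 2, 4, 4, 5, 4]  # hypothetical ratings

    # Conventional (unweighted) Cohen's kappa: only exact matches
    # count as agreement.
    kappa = cohen_kappa_score(vicuna_scores, palm2_scores)

    # One common relaxation: linearly weighted kappa, which penalizes
    # a 4-vs-5 disagreement less than a 1-vs-5 disagreement. (The
    # paper's relaxed metric is unspecified; this is one standard choice.)
    kappa_linear = cohen_kappa_score(vicuna_scores, palm2_scores,
                                     weights="linear")

    print(f"unweighted kappa: {kappa:.3f}")
    print(f"linearly weighted kappa: {kappa_linear:.3f}")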
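For item 2, here is a minimal sketch of the standard BKT update, assuming the usual four-parameter formulation (prior, learn, guess, slip). The parameter values and response sequence are illustrative, not fitted to the curricula analyzed in the paper.

    # Minimal sketch of a standard Bayesian Knowledge Tracing step.
    def bkt_update(p_know: float, correct: bool,
                   p_guess: float, p_slip: float, p_learn: float) -> float:
        """Bayes update on the observed response, then apply the
        learning transition."""
        if correct:
            # P(known | correct response)
            posterior = (p_know * (1 - p_slip)) / (
                p_know * (1 - p_slip) + (1 - p_know) * p_guess)
        else:
            # P(known | incorrect response)
            posterior = (p_know * p_slip) / (
                p_know * p_slip + (1 - p_know) * (1 - p_guess))
        # Learning transition: an unknown skill may become known.
        return posterior + (1 - posterior) * p_learn

    # Illustrative parameters and response sequence (hypothetical).
    p_know = 0.30  # P(L0): prior probability the skill is known
    for obs in [True, False, True, True]:
        p_know = bkt_update(p_know, obs,
                            p_guess=0.20, p_slip=0.10, p_learn=0.15)
        # P(correct next) = P(known)(1 - slip) + P(unknown) * guess
        p_correct = p_know * (1 - 0.10) + (1 - p_know) * 0.20
        print(f"P(know) = {p_know:.3f}, P(correct next) = {p_correct:.3f}")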