Title: Assessing the Quality of Large Language Models in Generating Mathematics Explanations
The development and measurable improvements in performance of large language models on natural language tasks [12] open up the opportunity to utilize large language models in an educational setting to replicate human tutoring, which is often costly and inaccessible. We are particularly interested in large language models from the GPT series, created by OpenAI [7]. In a prior study we found that the quality of explanations generated with GPT-3.5 was poor, where two different approaches to generating explanations resulted in 43% and 10% success rates. In this replication study, we were interested in whether the measurable improvements in GPT-4 performance [6] led to a higher rate of success for generating valid explanations compared to GPT-3.5. A replication of the original study was conducted by using GPT-4 to generate explanations for the same problems given to GPT-3.5. Using GPT-4, explanation correctness dramatically improved to a success rate of 94%. We were further interested in evaluating whether GPT-4 explanations were positively perceived compared to human-written explanations. A preregistered, single-blinded study was implemented in which 10 evaluators were asked to rate the quality of randomized GPT-4 and teacher-created explanations. Even with 4% of problems containing some amount of incorrect content, GPT-4 explanations were preferred over human explanations. The implications of our significant results at Learning @ Scale are that digital platforms can start A/B testing the effects of GPT-4-generated explanations on student learning, implementing explanations at scale, and using prompt programming to test different education theories, e.g., social emotional learning factors [5].
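As a rough illustration of the generation setup described in the abstract, the sketch below shows how a step-by-step explanation could be requested from GPT-4 through the OpenAI Python SDK. The prompt wording, model identifier, and example problem are illustrative assumptions, not the exact prompts or items used in the study.

# Minimal sketch (not the study's actual prompt) of requesting a math
# explanation from GPT-4 with the OpenAI Python SDK (v1.x).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

problem = "Solve for x: 3x + 7 = 22"  # hypothetical practice problem

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are a math tutor. Explain, step by step, how to solve "
                    "the given problem for a middle-school student."},
        {"role": "user", "content": problem},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)  # candidate explanation to be vetted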
Award ID(s):
2118725
PAR ID:
10451149
Author(s) / Creator(s):
Date Published:
Journal Name:
Learning @ Scale (L@S ’23), July 20–22, 2023, Copenhagen, Denmark.
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The development and measurable improvements in performance of large language models on natural language tasks open up the opportunity to utilize large language models in an educational setting to replicate human tutoring, which is often costly and inaccessible. We are particularly interested in large language models from the GPT series, created by OpenAI. In the original study we found that the quality of explanations generated with GPT-3.5 was poor, where two different approaches to generating explanations resulted in 43% and 10% success rates. In a replication study, we were interested in whether the measurable improvements in GPT-4 performance led to a higher rate of success for generating valid explanations compared to GPT-3.5. A replication of the original study was conducted by using GPT-4 to generate explanations for the same problems given to GPT-3.5. Using GPT-4, explanation correctness dramatically improved to a success rate of 94%. We were further interested in evaluating whether GPT-4 explanations were positively perceived compared to human-written explanations. A preregistered, follow-up study was implemented in which 10 evaluators were asked to rate the quality of randomized GPT-4 and teacher-created explanations. Even with 4% of problems containing some amount of incorrect content, GPT-4 explanations were preferred over human explanations.
  2. Human-conducted rating tasks are resource-intensive and demand significant time and financial commitments. As Large Language Models (LLMs) like GPT emerge and exhibit prowess across various domains, their potential in automating such evaluation tasks becomes evident. In this research, we leveraged four prominent LLMs: GPT-4, GPT-3.5, Vicuna, and PaLM 2, to scrutinize their aptitude in evaluating teacher-authored mathematical explanations. We utilized a detailed rubric that encompassed accuracy, explanation clarity, the correctness of mathematical notation, and the efficacy of problem-solving strategies. During our investigation, we unexpectedly discerned the influence of HTML formatting on these evaluations. Notably, GPT-4 consistently favored explanations formatted with HTML, whereas the other models displayed mixed inclinations. When gauging Inter-Rater Reliability (IRR) among these models, only Vicuna and PaLM 2 demonstrated high IRR using the conventional Cohen’s Kappa metric for explanations formatted with HTML. Intriguingly, when a more relaxed version of the metric was applied, all model pairings showcased robust agreement. These revelations not only underscore the potential of LLMs in providing feedback on student-generated content but also illuminate new avenues, such as reinforcement learning, which can harness the consistent feedback from these models. 
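To make the inter-rater reliability comparison concrete, the sketch below computes Cohen's Kappa between two hypothetical model raters with scikit-learn. The rubric scores are made up, and since the abstract does not specify the relaxed metric, linearly weighted kappa is shown as one plausible relaxation that gives partial credit for near-miss scores.

# Hypothetical ordinal rubric scores (1-5) from two model raters; not real data.
from sklearn.metrics import cohen_kappa_score

vicuna_scores = [5, 4, 4, 3, 5, 2, 4, 5]
palm2_scores  = [5, 4, 3, 3, 5, 2, 5, 5]

# Conventional Cohen's Kappa: only exact score matches count as agreement.
strict_kappa = cohen_kappa_score(vicuna_scores, palm2_scores)

# One possible "relaxed" variant: linearly weighted kappa, where adjacent
# scores (e.g., 4 vs. 5) are penalized less than distant ones.
relaxed_kappa = cohen_kappa_score(vicuna_scores, palm2_scores, weights="linear")

print(f"strict kappa:  {strict_kappa:.2f}")
print(f"relaxed kappa: {relaxed_kappa:.2f}")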
  3. Kochmar, E ; Bexte, M ; Burstein, J ; Horbach, A ; Laarmann-Quante, R ; Tack, A ; Yaneva, V ; Yuan, Z (Ed.)
    The practice of soliciting self-explanations from students is widely recognized for its pedagogical benefits. However, the labor-intensive effort required to manually assess students’ explanations makes it impractical for classroom settings. As a result, many current solutions to gauge students’ understanding during class are often limited to multiple choice or fill-in-the-blank questions, which are less effective at exposing misconceptions or helping students to understand and integrate new concepts. Recent advances in large language models (LLMs) present an opportunity to assess student explanations in real-time, making explanation-based classroom response systems feasible for implementation. In this work, we investigate LLM-based approaches for assessing the correctness of students’ explanations in response to undergraduate computer science questions. We investigate alternative prompting approaches for multiple LLMs (i.e., Llama 2, GPT-3.5, and GPT-4) and compare their performance to fine-tuned FLAN-T5 models. The results suggest that the highest accuracy and weighted F1 score were achieved by fine-tuning FLAN-T5, while an in-context learning approach with GPT-4 attained the highest macro F1 score.
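Because the headline results differ by F1 averaging scheme, the short sketch below illustrates how weighted and macro F1 can diverge on imbalanced correctness labels; the labels are fabricated for illustration and are not the paper's data.

# Weighted F1 averages per-class F1 weighted by class frequency, so the
# majority class dominates; macro F1 averages classes equally, so the rare
# "incorrect" class counts just as much. Labels below are fabricated.
from sklearn.metrics import f1_score

y_true = ["correct"] * 6 + ["incorrect"] * 2
y_pred = ["correct", "correct", "correct", "correct", "correct", "incorrect",
          "incorrect", "correct"]

print("weighted F1: %.2f" % f1_score(y_true, y_pred, average="weighted"))
print("macro F1:    %.2f" % f1_score(y_true, y_pred, average="macro"))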
  4. Planning in a text-based environment continues to be a significant challenge for AI systems. Recent approaches have utilized language models to predict planning domain definitions (e.g., PDDL) but have only been evaluated in closed-domain simulated environments. To address this, we present Proc2PDDL, the first dataset containing open-domain procedural texts paired with expert-annotated PDDL representations. Using this dataset, we evaluate the task of predicting domain actions (parameters, preconditions, and effects). We experiment with various large language models (LLMs) and prompting mechanisms, including a novel instruction inspired by the zone of proximal development (ZPD), which reconstructs the task as incremental basic skills. Our results demonstrate that Proc2PDDL is highly challenging for end-to-end LLMs, with GPT-3.5’s success rate close to 0% and GPT-4o’s 38%. With ZPD instructions, GPT-4o’s success rate increases to 45%, outperforming regular chain-of-thought prompting’s 34%. Our analysis systematically examines both syntactic and semantic errors, providing insights into the strengths and weaknesses of language models in generating domain-specific programs. 
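For readers unfamiliar with PDDL, the snippet below shows the kind of action definition, with parameters, preconditions, and effects, that the models are asked to predict from procedural text; the domain, action name, and predicates are invented for this sketch and do not come from Proc2PDDL.

# Invented example of a PDDL action (not drawn from the Proc2PDDL dataset),
# held in a Python string to show the three predicted components.
pddl_action = """
(:action boil-water
  :parameters (?w - water ?p - pot ?s - stove)
  :precondition (and (in ?w ?p) (on ?p ?s) (stove-on ?s))
  :effect (boiled ?w))
"""

# A ZPD-style instruction would elicit these pieces incrementally, e.g.,
# parameters first, then preconditions, then effects.
print(pddl_action)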
  5. This paper compares different methods of using a large language model (GPT-3.5) for creating synthetic training data for a retrieval-based conversational character. The training data are in the form of linked questions and answers, which allow a classifier to retrieve a pre-recorded answer to an unseen question; the intuition is that a large language model could predict what human users might ask, thus saving the effort of collecting real user questions as training data. Results show small improvements in test performance for all synthetic datasets. However, a classifier trained on only small amounts of collected user data resulted in a higher F-score than the classifiers trained on much larger amounts of synthetic data generated using GPT-3.5. Based on these results, we see a potential in using large language models for generating training data, but at this point it is not as valuable as collecting actual user data for training.
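The sketch below makes the pipeline from the comparison above concrete: GPT-3.5 invents questions a user might ask for each pre-recorded answer, and a simple classifier then maps unseen questions to answer IDs. The prompt, answers, and classifier choice are assumptions for illustration, not the paper's actual setup.

# Generate synthetic questions with GPT-3.5 and train a retrieval classifier
# that maps a new question to the ID of a pre-recorded answer.
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

client = OpenAI()

answers = {  # hypothetical pre-recorded character answers
    "a1": "I grew up in a small town by the sea.",
    "a2": "My favorite food is spicy noodles.",
}

questions, labels = [], []
for answer_id, answer in answers.items():
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "List 5 questions a user might ask that this "
                              "answer would address, one per line:\n" + answer}],
    )
    for line in reply.choices[0].message.content.splitlines():
        if line.strip():
            questions.append(line.strip())
            labels.append(answer_id)

# Retrieval step: classify an unseen question into one of the answer IDs.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(questions, labels)
print(answers[clf.predict(["Where are you from?"])[0]])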

     