skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, October 10 until 2:00 AM ET on Friday, October 11 due to maintenance. We apologize for the inconvenience.


Title: Comparing Different Approaches to Generating Mathematics Explanations Using Large Language Models
Large language models have recently been able to perform well in a wide variety of circumstances. In this work, we explore the possibility of large language models, specifically GPT-3, to write explanations for middle-school mathematics problems, with the goal of eventually using this process to rapidly generate explanations for the mathematics problems of new curricula as they emerge, shortening the time to integrate new curricula into online learning platforms. To generate explanations, two approaches were taken. The first approach attempted to summarize the salient advice in tutoring chat logs between students and live tutors. The second approach attempted to generate explanations using few-shot learning from explanations written by teachers for similar mathematics problems. After explanations were generated, a survey was used to compare their quality to that of explanations written by teachers. We test our methodology using the GPT-3 language model. Ultimately, the synthetic explanations were unable to outperform teacher written explanations. In the future more powerful large language models may be employed, and GPT-3 may still be effective as a tool to augment teachers’ process for writing explanations, rather than as a tool to replace them. The explanations, survey results, analysis code, and a dataset of tutoring chat logs are all available at https://osf.io/wh5n9/.  more » « less
Award ID(s):
2118725
NSF-PAR ID:
10451148
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Date Published:
Journal Name:
AIED 2023
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Large language models have recently been able to perform well in a wide variety of circumstances. In this work, we explore the possibility of large language models, specifically GPT-3, to write explanations for middle-school mathematics problems, with the goal of eventually using this process to rapidly generate explanations for the mathematics problems of new curricula as they emerge, shortening the time to integrate new curricula into online learning platforms. To generate explanations, two approaches were taken. The first approach attempted to summarize the salient advice in tutoring chat logs between students and live tutors. The second approach attempted to generate explanations using few-shot learning from explanations written by teachers for similar mathematics problems. After explanations were generated, a survey was used to compare their quality to that of explanations written by teachers. We test our methodology using the GPT-3 language model. Ultimately, the synthetic explanations were unable to outperform teacher written explanations. In the future more powerful large language models may be employed, and GPT-3 may still be effective as a tool to augment teachers’ process for writing explanations, rather than as a tool to replace them. The explanations, survey results, analysis code, and a dataset of tutoring chat logs are all available at https://osf.io/wh5n9/. 
    more » « less
  2. Large language models have recently been able to perform well in a wide variety of circumstances. In this work, we explore the possi- bility of large language models, specifically GPT-3, to write explanations for middle-school mathematics problems, with the goal of eventually us- ing this process to rapidly generate explanations for the mathematics problems of new curricula as they emerge, shortening the time to inte- grate new curricula into online learning platforms. To generate expla- nations, two approaches were taken. The first approach attempted to summarize the salient advice in tutoring chat logs between students and live tutors. The second approach attempted to generate explanations us- ing few-shot learning from explanations written by teachers for similar mathematics problems. After explanations were generated, a survey was used to compare their quality to that of explanations written by teachers. We test our methodology using the GPT-3 language model. Ultimately, the synthetic explanations were unable to outperform teacher written explanations. In the future more powerful large language models may be employed, and GPT-3 may still be effective as a tool to augment teachers’ process for writing explanations, rather than as a tool to replace them. The prompts, explanations, survey results, analysis code, and a dataset of tutoring chat logs are all available at BLINDED FOR REVIEW. 
    more » « less
  3. Large language models have recently been able to perform well in a wide variety of circumstances. In this work, we explore the possi- bility of large language models, specifically GPT-3, to write explanations for middle-school mathematics problems, with the goal of eventually us- ing this process to rapidly generate explanations for the mathematics problems of new curricula as they emerge, shortening the time to inte- grate new curricula into online learning platforms. To generate expla- nations, two approaches were taken. The first approach attempted to summarize the salient advice in tutoring chat logs between students and live tutors. The second approach attempted to generate explanations us- ing few-shot learning from explanations written by teachers for similar mathematics problems. After explanations were generated, a survey was used to compare their quality to that of explanations written by teachers. We test our methodology using the GPT-3 language model. Ultimately, the synthetic explanations were unable to outperform teacher written explanations. In the future more powerful large language models may be employed, and GPT-3 may still be effective as a tool to augment teach- ers’ process for writing explanations, rather than as a tool to replace them. The explanations, survey results, analysis code, and a dataset of tutoring chat logs are all available at https://osf.io/wh5n9/. 
    more » « less
  4. Large language models have recently been able to perform well in a wide variety of circumstances. In this work, we explore the possi- bility of large language models, specifically GPT-3, to write explanations for middle-school mathematics problems, with the goal of eventually us- ing this process to rapidly generate explanations for the mathematics problems of new curricula as they emerge, shortening the time to inte- grate new curricula into online learning platforms. To generate expla- nations, two approaches were taken. The first approach attempted to summarize the salient advice in tutoring chat logs between students and live tutors. The second approach attempted to generate explanations us- ing few-shot learning from explanations written by teachers for similar mathematics problems. After explanations were generated, a survey was used to compare their quality to that of explanations written by teachers. We test our methodology using the GPT-3 language model. Ultimately, the synthetic explanations were unable to outperform teacher written explanations. In the future more powerful large language models may be employed, and GPT-3 may still be effective as a tool to augment teach- ers’ process for writing explanations, rather than as a tool to replace them. The explanations, survey results, analysis code, and a dataset of tutoring chat logs are all available at https://osf.io/wh5n9/. 
    more » « less
  5. The development and measurable improvements in performance of large language models on natural language tasks opens the opportunity to utilize large language models in an educational setting to replicate human tutoring, which is often costly and inaccessible. We are particularly interested in large language models from the GPT series, created by OpenAI. In the original study we found that the quality of explanations generated with GPT-3.5 was poor, where two different approaches to generating explanations resulted in a 43% and 10% successrate. In a replication study, we were interested in whether the measurable improvements in GPT-4 performance led to a higher rate of success for generating valid explanations compared to GPT-3.5. A replication of the original study was conducted by using GPT-4 to generate explanations for the same problems given to GPT-3.5. Using GPT-4, explanation correctness dramatically improved to a success rate of 94%. We were further interested in evaluating if GPT-4 explanations were positively perceived compared to human-written explanations. A preregistered, follow-up study was implemented where 10 evaluators were asked to rate the quality of randomized GPT-4 and teacher-created explanations. Even with 4% of problems containing some amount of incorrect content, GPT-4 explanations were preferred over human explanations. 
    more » « less