<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Comparing Different Approaches to Generating Mathematics Explanations Using Large Language Models.</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2023 July</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10425014</idno>
					<idno type="doi">10.1007/978-3-031-36336-8_45</idno>
					<title level='j'>AIED 2023</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>E Prihar</author><author>M Lee</author><author>M Hopman</author><author>A Kalai</author><author>S Vempala</author><author>A Wang</author><author>G Wickline</author><author>N Heffernan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Large language models have recently been able to perform well in a wide variety of circumstances. In this work, we explore the possibility of large language models, specifically GPT-3, to write explanations for middle-school mathematics problems, with the goal of eventually using this process to rapidly generate explanations for the mathematics problems of new curricula as they emerge, shortening the time to integrate new curricula into online learning platforms. To generate explanations, two approaches were taken. The first approach attempted to summarize the salient advice in tutoring chat logs between students and live tutors. The second approach attempted to generate explanations using few-shot learning from explanations written by teachers for similar mathematics problems. After explanations were generated, a survey was used to compare their quality to that of explanations written by teachers. We test our methodology using the GPT-3 language model. Ultimately, the synthetic explanations were unable to outperform teacher-written explanations. In the future more powerful large language models may be employed, and GPT-3 may still be effective as a tool to augment teachers’ process for writing explanations, rather than as a tool to replace them. The explanations, survey results, analysis code, and a dataset of tutoring chat logs are all available at https://osf.io/wh5n9/.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Online learning platforms offer students tutoring in a variety of forms, such as one-on-one messaging with real human tutors <ref type="bibr">[1]</ref> or providing expert-written messages for each question that students are required to answer <ref type="bibr">[5]</ref>. These methods, while effective, can be costly and time-consuming to scale. However, recent advances in Language Models (LMs) may provide an opportunity to offset the cost of providing effective tutoring to students.</p><p>In this work, we explore the effectiveness of using LMs to create explanations of mathematics problems for students within the ASSISTments online learning platform <ref type="bibr">[5]</ref>. Recent transformer-based LMs have exhibited breakthrough performance in a number of domains <ref type="bibr">[2,</ref><ref type="bibr">3]</ref>. We perform experiments using one of the most powerful currently available LMs, GPT-3, accessed through OpenAI's API.</p><p>Two different approaches to generate this content were explored. The first approach used few-shot learning <ref type="bibr">[2]</ref> to generate new explanations from a handful of similar mathematics problems with answers and explanations, and the second approach attempted to generate new explanations by using the LM to summarize message logs between students and real human tutors. After each method was used to generate new explanations, these explanations were compared to existing explanations in the ASSISTments online learning platform through surveys given to mathematics teachers. Comparing teachers' evaluations of the quality of the various explanations enabled an empirical evaluation of each LM-based approach, as well as an evaluation of their applicability in a real-world setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Language Models</head><p>LMs are a type of deep learning model trained to generate human-like text. They are trained on a massive dataset of millions of web pages, books, and other written documents, and are capable of generating text that is often indistinguishable from human-written text <ref type="bibr">[2,</ref><ref type="bibr">3]</ref>. In this work, we focus on GPT-3 since it is a powerful LM that is publicly accessible through a paid API. When using GPT-3, one can specify parameters for the text generation such as Frequency Penalty, which penalizes GPT-3 for repeating phrases in its response, Temperature, which increases the frequency of picking a less-than-most-likely word to include in the response, and Max Tokens, which specifies the maximum length of the response <ref type="bibr">[2]</ref>.</p></div>
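<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the effect of Temperature and Frequency Penalty concrete, the following is a minimal toy sketch of how such parameters are commonly described as reshaping a next-token distribution. It is not GPT-3's actual implementation: the function name, the token scores, and the exact penalty formula are assumptions chosen for illustration only.</p>
<ab type="code"><![CDATA[
```python
import math
from collections import Counter


def next_token_probs(logits, temperature=1.0, frequency_penalty=0.0, generated=()):
    """Toy next-token distribution (illustrative only, not GPT-3's code).

    logits: dict mapping token -> raw score (hypothetical values).
    temperature: must be > 0; larger values flatten the distribution.
    frequency_penalty: amount subtracted per prior occurrence of a token.
    generated: tokens already produced in the response so far.
    """
    counts = Counter(generated)
    # Frequency penalty: lower the score of tokens that were already used.
    adjusted = {tok: score - frequency_penalty * counts[tok]
                for tok, score in logits.items()}
    # Temperature: rescale the scores before the softmax.
    scaled = {tok: score / temperature for tok, score in adjusted.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical scores for three candidate next words.
toy_logits = {"seven": 2.0, "eight": 1.0, "nine": 0.0}
cool = next_token_probs(toy_logits, temperature=0.5)
warm = next_token_probs(toy_logits, temperature=1.5)
penalized = next_token_probs(toy_logits, frequency_penalty=1.0,
                             generated=["seven", "seven"])
```
]]></ab>
<p>In this toy setting the most likely word loses probability mass as the temperature rises, and a word that has already appeared twice is pushed below its alternatives by the penalty.</p></div>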
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Data Sources</head><p>The data used to generate explanations using few-shot learning came from the ASSISTments online learning platform <ref type="bibr">[5]</ref>. Within ASSISTments, middle-school mathematics students complete mathematics problem sets assigned to them by their teacher. If students are struggling with their assignment, ASSISTments will provide them with an explanation upon their request. When a student requests an explanation, a message that explains how to solve the mathematics problem they are currently struggling with and the solution to the problem is provided to them.</p><p>The data used to generate explanations from summaries of tutoring chat logs comes from UPchieve, a provider of online tutoring. UPchieve<ref type="foot">foot_0</ref> offers live online tutoring with volunteers through an interface that facilitates sharing of text and images. In ASSISTments, students had the ability to request a chat with a live tutor. When a live tutor was requested, a tutoring session was opened via UPchieve.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Data Processing</head><p>In order to examine the effects of changes to the prompts on the generated explanations in a way that would not bias the results of the analysis, all of the available data for generating explanations was split in half. Half of the data was used for prompt engineering (development set). This data was used iteratively to examine how small variations in the prompt affected the resulting explanations. Once the generated explanations reached a satisfactory level, the most effective prompts were used on the second half of the data (evaluation set). The analysis of the validity and quality of explanations discussed in the results was performed only on this second half of the data, eliminating any bias from the prompt engineering process.</p><p>During the live tutoring partnership period, there were 244 tutoring sessions across 93 students and 110 problems covering various middle-school mathematics skills. Of these tutoring sessions, 2 were excluded because they contained no interaction between student and tutor and 2 were excluded because they were longer than GPT-3's 4,000 token limit. Of the 40,523 problems available, only 914 problems remained after removing the problems that were not of the same skills as the problems in the tutoring sessions and the problems that could not be used in the few-shot learning prompt due to the presence of images within the problems, answers, or explanations. Both the chat logs and the problems and their explanations were evenly partitioned into a development and evaluation set stratified by subject matter.</p></div>
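<div xmlns="http://www.tei-c.org/ns/1.0"><p>The even, stratified partition into development and evaluation sets could be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used for the study: the function name, the record fields, the seed, and the choice to place the extra item of an odd-sized stratum in the evaluation set are all hypothetical.</p>
<ab type="code"><![CDATA[
```python
import random
from collections import defaultdict


def stratified_half_split(items, key, seed=0):
    """Split items into development and evaluation halves within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    development, evaluation = [], []
    for label in sorted(strata):
        group = list(strata[label])
        rng.shuffle(group)
        half = len(group) // 2
        development.extend(group[:half])
        # Assumption: an odd-sized stratum gives its extra item to evaluation.
        evaluation.extend(group[half:])
    return development, evaluation


# Hypothetical records; "skill" stands in for the subject matter of a problem.
records = [{"id": i, "skill": skill}
           for i, skill in enumerate(["ratios"] * 4 + ["fractions"] * 6)]
dev, ev = stratified_half_split(records, key=lambda r: r["skill"])
```
]]></ab>
<p>Splitting inside each subject-matter group keeps both halves representative of the same skills, so prompts tuned on the development half are evaluated on comparable material.</p></div>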
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Tutoring Chat Log Summarization</head><p>Development data was used to engineer a four-step process for generating explanations from tutoring chat logs. The prompts are shown below, with the GPT-3 parameters shown in parentheses as (Frequency Penalty, Temperature, Max Tokens). The text-davinci-003 model was used for all prompts.</p><p>1. Does the tutor successfully help the student in the following chain of messages? [The tutoring chat log.] (0, 0.7, 128) 2. Explain the mathematical concepts the tutor used to help the student, including explanations the tutor gave of these concepts, and ignoring any names. [The tutoring chat log.] (0.25, 0.9, 750) 3. Reword the following explanation to not include references to a tutor or student, and to be in the present tense: [The previously generated explanation.] (0.25, 0.9, 750) 4. Summarize the following explanations, making sure to include the most generalizable math advice. [The previously generated explanations.] (0.25, 0.9, 500)</p></div>
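<div xmlns="http://www.tei-c.org/ns/1.0"><p>The four prompts above can be chained roughly as in the sketch below. The prompt wording and parameter triples follow the text; everything else is an assumption made for illustration: the complete callable is a hypothetical stand-in for a call to a language model, logs whose first answer does not begin with "yes" are assumed to be dropped, and the explanations passed to the final prompt are assumed to be those derived from the chat logs for one problem.</p>
<ab type="code"><![CDATA[
```python
from typing import Callable, List, Optional

# Hypothetical interface: (prompt, frequency_penalty, temperature, max_tokens) -> text
Complete = Callable[[str, float, float, int], str]

STEP_1 = ("Does the tutor successfully help the student in the following "
          "chain of messages?")
STEP_2 = ("Explain the mathematical concepts the tutor used to help the student, "
          "including explanations the tutor gave of these concepts, and ignoring "
          "any names.")
STEP_3 = ("Reword the following explanation to not include references to a tutor "
          "or student, and to be in the present tense:")
STEP_4 = ("Summarize the following explanations, making sure to include the most "
          "generalizable math advice.")


def explain_chat_log(chat_log: str, complete: Complete) -> Optional[str]:
    """Run steps 1 to 3 on a single tutoring chat log."""
    verdict = complete(f"{STEP_1}\n{chat_log}", 0.0, 0.7, 128)
    # Assumption: logs judged unsuccessful are skipped.
    if not verdict.strip().lower().startswith("yes"):
        return None
    explanation = complete(f"{STEP_2}\n{chat_log}", 0.25, 0.9, 750)
    return complete(f"{STEP_3}\n{explanation}", 0.25, 0.9, 750)


def summarize_chat_logs(chat_logs: List[str], complete: Complete) -> Optional[str]:
    """Run step 4 over the explanations that survive steps 1 to 3."""
    candidates = (explain_chat_log(log, complete) for log in chat_logs)
    explanations = [text for text in candidates if text is not None]
    if not explanations:
        return None
    joined = "\n\n".join(explanations)
    return complete(f"{STEP_4}\n{joined}", 0.25, 0.9, 500)
```
]]></ab>
<p>Passing the model call in as a function keeps the sketch independent of any particular API client and makes the chain easy to exercise with a stub.</p></div>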
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Problem-Level Explanation Few-Shot Learning</head><p>Before generating explanations for the 53 problems in the summarization development set, problems that were open-response or not text-based had to be removed. For each of the 40 remaining problems, a prompt was constructed by randomly sampling problems of the same skill from the development set and appending the phrase below, replacing the content in brackets with the problem content, until there were no more same-skill problems or the max token length was reached. For these prompts, the Frequency Penalty was 0, the Temperature was 0.73, the Max Tokens was 256, and the code-davinci-003 model was used.</p><p>Problem: [The text of the problem.] Answer: [The answer to the problem.] Explanation: [The explanation for the problem.]</p></div>
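<div xmlns="http://www.tei-c.org/ns/1.0"><p>The prompt construction described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, record fields, and max_prompt_tokens budget are hypothetical, tokens are approximated by whitespace-separated words, and the prompt is assumed to end with the target problem and answer followed by an empty "Explanation:" for the model to complete.</p>
<ab type="code"><![CDATA[
```python
import random


def format_example(problem, answer, explanation=""):
    """Fill in the Problem/Answer/Explanation template from the text."""
    return f"Problem: {problem}\nAnswer: {answer}\nExplanation: {explanation}".rstrip()


def rough_token_count(text):
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def build_few_shot_prompt(target, pool, max_prompt_tokens, seed=0):
    """Append randomly ordered same-skill examples until none remain
    or the (hypothetical) token budget would be exceeded."""
    rng = random.Random(seed)
    candidates = [p for p in pool if p["skill"] == target["skill"] and p is not target]
    rng.shuffle(candidates)

    # The target problem goes last, with its explanation left blank.
    query = format_example(target["problem"], target["answer"])
    budget = max_prompt_tokens - rough_token_count(query)

    shots = []
    for example in candidates:
        block = format_example(example["problem"], example["answer"],
                               example["explanation"])
        cost = rough_token_count(block)
        if cost > budget:
            break
        shots.append(block)
        budget -= cost
    return "\n\n".join(shots + [query])


# Hypothetical problem records for illustration.
pool = [
    {"skill": "ratios", "problem": "Simplify 4:8.", "answer": "1:2",
     "explanation": "Divide both terms by 4."},
    {"skill": "ratios", "problem": "Simplify 6:9.", "answer": "2:3",
     "explanation": "Divide both terms by 3."},
    {"skill": "area", "problem": "Area of a 2 by 3 rectangle?", "answer": "6",
     "explanation": "Multiply 2 by 3."},
]
target = {"skill": "ratios", "problem": "Simplify 10:15.", "answer": "2:3"}
prompt = build_few_shot_prompt(target, pool, max_prompt_tokens=3000)
```
]]></ab>
<p>Reserving room for the target problem before adding examples ensures the question being asked is never the part that gets cut when the budget is tight.</p></div>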
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Empirical Analysis of Generated Explanations</head><p>After the summarization and few-shot learning processes were completed for the evaluation data using the processes developed with the development data, the explanations from both processes were manually evaluated by subject-matter experts for both structural and mathematical validity. Structural validity required that the explanation be in the format of an explanation. Mathematical validity required that the explanation be mathematically correct.</p><p>The valid generated explanations and teacher-written explanations for the same problems were compiled into a survey. The source of the explanations was blinded, and mathematics teachers were given a picture of each mathematics problem and the text of the explanation and told to rate the explanations on a scale from 1 to 5. A multi-level model <ref type="bibr">[4]</ref> was used to predict the rating of each explanation given random effects for the rater and the mathematics problem, and fixed effects for the source of the explanation. The effects for the sources of explanations were used to determine if there were any statistically significant differences between the sources.</p></div>
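<div xmlns="http://www.tei-c.org/ns/1.0"><p>One common way to write a multi-level model of this kind is given below. The text does not state the exact specification, so the notation, the crossed random intercepts, and the normality assumptions are illustrative assumptions rather than the fitted model.</p>
<ab type="formula"><![CDATA[
```latex
% r_{ijk}: rating given by rater i to explanation k of problem j
% s(k):    source of explanation k (teacher-written, summarization, few-shot)
\begin{align}
  r_{ijk} &= \beta_0 + \beta_{s(k)} + u_i + v_j + \varepsilon_{ijk}, \\
  u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad
  v_j \sim \mathcal{N}(0, \sigma_v^2), \qquad
  \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
\end{align}
```
]]></ab>
<p>Here the fixed effect for each source captures how ratings differ by source, while the random intercepts for rater and problem absorb differences in rater leniency and in problem difficulty.</p></div>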
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head><p>Performing the summarization process on the evaluation data resulted in 57 explanations. Expert review found 14 structurally and 17 mathematically invalid explanations. For the 61 problems available in the summarization evaluation set, only 33 explanations could be generated using the few-shot learning approach due to the presence of images in the problems. Expert review found 1 structurally and 26 mathematically invalid explanations.</p><p>In total, 26 summarization, 6 few-shot learning, and 10 ASSISTments explanations were included for a total of 42 survey questions. Five current or former middle-school or high-school mathematics teachers completed the survey. Once survey results were collected, two different models were fit: one that only included teachers' ratings of valid explanations, and one that included all the generated explanations, with a rating of 1 for explanations that were invalid. The effects and 95% confidence intervals of the different sources of explanations are shown in Figure <ref type="figure">1</ref>. ASSISTments explanations were rated the highest, with an average rating of about 4.2. Summarization-based explanations were statistically significantly worse than ASSISTments explanations, with an average rating of about 2.6 for the valid explanations and 1.7 for all explanations. Qualitatively, teachers reported that the summarization-based explanations used terms that the students did not necessarily know, and tended to give advice that was too general. Few-shot learning-based explanations received an average rating of about 3.6 for valid explanations, which was not statistically significantly worse than ASSISTments explanations, but only 6 of the few-shot learning-based explanations were valid. Few-shot learning-based explanations received an average rating of about 1.6 when invalid explanations were included in the model, which is statistically significantly worse than ASSISTments explanations.</p></div>
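<div xmlns="http://www.tei-c.org/ns/1.0"><p>The counts reported above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the helper name is hypothetical, and it assumes that the structurally and mathematically invalid explanations are counted without overlap, which is consistent with the reported survey totals.</p>
<ab type="code"><![CDATA[
```python
def valid_count(generated, structurally_invalid, mathematically_invalid):
    # Assumes the two kinds of invalid explanation do not overlap.
    return generated - structurally_invalid - mathematically_invalid


summarization_valid = valid_count(57, 14, 17)
few_shot_valid = valid_count(33, 1, 26)

# Valid generated explanations plus the 10 ASSISTments explanations.
survey_questions = summarization_valid + few_shot_valid + 10

# Share of valid few-shot explanations, relative to two different bases.
few_shot_rate_generated = few_shot_valid / 33  # of the 33 generated explanations
few_shot_rate_problems = few_shot_valid / 61   # of the 61 evaluation problems
```
]]></ab>
<p>The valid few-shot explanations amount to roughly 18% of the 33 explanations that were generated and roughly 10% of the 61 problems in the evaluation set.</p></div>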
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>Overall, it seems that GPT-3-based explanations do not compare in quality to those created by teachers. Fundamentally, GPT-3 was trained to understand language, but not mathematics, and while the structure of what GPT-3 generated made proper use of the English language, it often generated incorrect mathematical content, or simply failed to generate content in the proper format. Summarizing tutors' advice to students created explanations that were significantly worse than teacher-written explanations, and while the valid explanations generated through few-shot learning were not significantly worse than teacher-written explanations, the few-shot approach produced a valid explanation for only 6 of the 61 evaluation problems (about 10%). Ultimately, GPT-3 does not seem to have the grasp of mathematics necessary to generate high-quality explanations.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0"><p>https://upchieve.org/</p></note>
		</body>
		</text>
</TEI>
