<?xml version="1.0" encoding="UTF-8"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcq="http://purl.org/dc/terms/"><records count="1" morepages="false" start="1" end="1"><record rownumber="1"><dc:product_type>Conference Paper</dc:product_type><dc:title>Evaluating Large Language Model Code Generation as an Autograding Mechanism for "Explain in Plain English" Questions</dc:title><dc:creator>Smith, David H (ORCID:0000-0002-6572-4347); Zilles, Craig (ORCID:0000-0003-4601-4398)</dc:creator><dc:corporate_author/><dc:editor/><dc:description>The ability of students to “Explain in Plain English” (EiPE) the
purpose of code is a critical skill to develop in introductory programming courses. EiPE questions serve as a mechanism for students to both develop and demonstrate code comprehension
skills. However, evaluating this skill has been challenging as manual
grading is time-consuming and not easily automated. The process
of constructing a prompt for the purposes of code generation for
a Large Language Model, such as OpenAI’s GPT-4, bears a striking
resemblance to constructing EiPE responses. In this paper, we explore the potential of using test cases run on code generated by
GPT-4 from students’ EiPE responses as a grading mechanism for
EiPE questions. We applied this proposed grading method to a corpus of EiPE responses collected from past exams, then measured
agreement between the results of this grading method and human
graders. Overall, we find moderate agreement between the human
raters and the results of the unit tests run on the generated code.
This appears to be attributable to GPT-4’s code generation being
more lenient than human graders on low-level descriptions of code.</dc:description><dc:publisher>ACM</dc:publisher><dc:date>2024-03-14</dc:date><dc:nsf_par_id>10644112</dc:nsf_par_id><dc:journal_name/><dc:journal_volume/><dc:journal_issue/><dc:page_range_or_elocation>1824 to 1825</dc:page_range_or_elocation><dc:issn/><dc:isbn/><dc:doi>https://doi.org/10.1145/3626253.3635542</dc:doi><dcq:identifierAwardId>2121424</dcq:identifierAwardId><dc:subject>GPT-4</dc:subject><dc:subject>Large Language Models</dc:subject><dc:subject>EiPE</dc:subject><dc:subject>Autograding</dc:subject><dc:version_number/><dc:location/><dc:rights/><dc:institution/><dc:sponsoring_org>National Science Foundation</dc:sponsoring_org></record></records></rdf:RDF>