This study examines the feasibility and potential advantages of using large language models, in particular GPT-4o, to perform partial credit grading of large numbers of student written responses to introductory-level physics problems. Students were instructed to write down verbal explanations of their reasoning process when solving one conceptual and two numerical calculation problems on two exams. The explanations were then graded according to a three-item rubric, with each item graded as binary (1 or 0). We first demonstrate that machine grading using GPT-4o with no examples or reference answers can reliably agree with human graders in 70%–80% of all cases, which is equal to or higher than the level at which two human graders agree with each other. Two methods are essential for achieving this level of accuracy: (i) adding explanation language to each rubric item that targets the errors of initial machine grading, and (ii) running the grading process five times and taking the most frequent outcome. Next, we show that the variation in outcomes across five machine grading attempts can serve as a grading confidence index. The index allows a human expert to identify all potentially incorrect gradings by reviewing just the 10%–15% of responses with the highest variation. Finally, we show that it is straightforward to use GPT-4o to write a clear and detailed explanation of the partial credit grading outcome. Those explanations can be used as feedback for students, allowing them to understand their grades and raise objections when necessary. Almost all feedback messages generated were rated three or above on a five-point scale by two instructors who had taught the course multiple times. The entire grading and feedback-generation process costs roughly $5 per 100 student answers, which shows immense promise for automating the labor-intensive grading process through a combination of machine grading with human input and supervision.
Published by the American Physical Society, 2025
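The aggregation strategy described above (majority vote over five grading runs, with run-to-run variation as a confidence index) can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the per-run LLM grading call is assumed to exist elsewhere and only the aggregation and review-flagging logic is shown. Function and variable names are hypothetical.

```python
from collections import Counter

def aggregate_grades(runs):
    """Take the most frequent rubric outcome across repeated grading runs.

    `runs` is a list of rubric tuples, e.g. (1, 0, 1), one per grading
    attempt. Returns the majority outcome and a confidence index in
    [0, 1]: the fraction of runs agreeing with the majority (1.0 = unanimous).
    """
    counts = Counter(runs)
    majority, freq = counts.most_common(1)[0]
    return majority, freq / len(runs)

def flag_for_review(graded, review_fraction=0.15):
    """Return the lowest-confidence responses for human expert review.

    `graded` is a list of (response_id, majority_grade, confidence) tuples;
    the abstract suggests reviewing the 10%-15% with the most variation.
    """
    ranked = sorted(graded, key=lambda item: item[2])  # ascending confidence
    k = max(1, int(len(ranked) * review_fraction))
    return ranked[:k]
```

Under this scheme, a unanimous five-run outcome yields confidence 1.0, while a 3-of-5 split yields 0.6 and would be among the first responses routed to a human grader.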
Evaluating Large Language Model Code Generation as an Autograding Mechanism for "Explain in Plain English" Questions
The ability of students to “Explain in Plain English” (EiPE) the purpose of code is a critical skill for students in introductory programming courses to develop. EiPE questions serve as a mechanism for students both to develop and to demonstrate code comprehension skills. However, evaluating this skill has been challenging, as manual grading is time-consuming and not easily automated. The process of constructing a prompt for code generation with a Large Language Model, such as OpenAI’s GPT-4, bears a striking resemblance to constructing EiPE responses. In this paper, we explore the potential of using test cases run on code generated by GPT-4 from students’ EiPE responses as a grading mechanism for EiPE questions. We applied this proposed grading method to a corpus of EiPE responses collected from past exams, then measured agreement between the results of this grading method and human graders. Overall, we find moderate agreement between the human raters and the results of the unit tests run on the generated code. This appears to be attributable to GPT-4’s code generation being more lenient than human graders on low-level descriptions of code.
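The core of the proposed mechanism, running instructor-written test cases against code generated from a student's EiPE response, can be sketched as below. This is a minimal illustration, not the paper's implementation: the GPT-4 code-generation call is assumed to happen elsewhere, and `generated_code` stands in for its output. (A real autograder would also sandbox the `exec` call rather than run untrusted code directly.)

```python
def grade_by_unit_tests(generated_code, test_cases, func_name):
    """Score an EiPE response by testing the code generated from it.

    `test_cases` is a list of (args, expected) pairs for the function
    `func_name` that the generated code is expected to define.
    Returns the fraction of test cases passed.
    """
    namespace = {}
    try:
        exec(generated_code, namespace)  # caution: run sandboxed in practice
        func = namespace[func_name]
    except Exception:
        return 0.0  # code did not compile or lacks the expected function
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed test
    return passed / len(test_cases)
```

A response like "sum the even numbers in the list" would be graded by whether the code generated from it passes tests such as `(([1, 2, 3, 4],), 6)`.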
- Award ID(s):
- 2121424
- PAR ID:
- 10644112
- Publisher / Repository:
- ACM
- Date Published:
- Page Range / eLocation ID:
- 1824 to 1825
- Subject(s) / Keyword(s):
- GPT-4 Large Language Models EiPE Autograding
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
The recent public releases of AI tools such as ChatGPT have forced computer science educators to reconsider how they teach. These tools have demonstrated considerable ability to generate code and answer conceptual questions, rendering them incredibly useful for completing CS coursework. While overreliance on AI tools could hinder students’ learning, we believe they have the potential to be a helpful resource for both students and instructors alike. We propose a novel system for instructor-mediated GPT interaction in a class discussion board. By automatically generating draft responses to student forum posts, GPT can help Teaching Assistants (TAs) respond to student questions in a more timely manner, giving students an avenue to receive fast, quality feedback on their solutions without turning to ChatGPT directly. Additionally, since they are involved in the process, instructors can ensure that the information students receive is accurate, and can provide students with incremental hints that encourage them to engage critically with the material, rather than just copying an AI-generated snippet of code. We utilize Piazza—a popular educational forum where TAs help students via text exchanges—as a venue for GPT-assisted TA responses to student questions. These student questions are sent to GPT-4 alongside assignment instructions and a customizable prompt, both of which are stored in editable instructor-only Piazza posts. We demonstrate an initial implementation of this system, and provide examples of student questions that highlight its benefits.
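The context-assembly step this abstract describes (combining the instructor-editable prompt, the assignment instructions, and the student's post before the model call) can be sketched as below. The function name and prompt layout are illustrative assumptions; the Piazza integration and the GPT-4 call itself are not shown.

```python
def build_ta_draft_prompt(instructor_prompt, assignment_instructions, student_post):
    """Assemble the context sent to the model for a draft TA response.

    The instructor-only prompt and the assignment instructions (both stored
    in editable Piazza posts in the system described above) are prepended
    to the student's forum question.
    """
    return "\n\n".join([
        instructor_prompt,
        "Assignment instructions:\n" + assignment_instructions,
        "Student question:\n" + student_post,
        "Draft a response with incremental hints; do not give full code.",
    ])
```

Because the output is only a draft, a TA reviews and edits it before anything is posted back to the student.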
-
Mills, Caitlin; Alexandron, Giora; Taibi, Davide; Lo_Bosco, Giosuè; Paquette, Luc (Ed.)
Open-text responses provide researchers and educators with rich, nuanced insights that multiple-choice questions cannot capture. When reliably assessed, such responses have the potential to enhance teaching and learning. However, scaling and consistently capturing these nuances remain significant challenges, limiting the widespread use of open-text questions in educational research and assessments. In this paper, we introduce and evaluate GradeOpt, a unified multi-agent automatic short-answer grading (ASAG) framework that leverages large language models (LLMs) as graders for short-answer responses. More importantly, GradeOpt incorporates two additional LLM-based agents—the reflector and the refiner—into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. To assess GradeOpt's effectiveness, we conducted experiments on two representative ASAG datasets, which include items designed to capture key aspects of teachers' pedagogical knowledge and students' learning progress. Our results demonstrate that GradeOpt consistently outperforms representative baselines in both grading accuracy and alignment with human evaluators across different knowledge domains. Finally, comprehensive ablation studies validate the contributions of GradeOpt's individual components, confirming their impact on overall performance.
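The grader–reflector–refiner loop described above can be sketched as follows. This is a schematic in the spirit of the abstract, not GradeOpt's actual implementation: `grade`, `reflect`, and `refine` stand in for the three LLM agents (their prompts and API calls are not shown), and all names are illustrative.

```python
def accuracy(guidelines, labeled_responses, grade):
    """Fraction of labeled responses the grader scores correctly."""
    return sum(grade(guidelines, r) == y for r, y in labeled_responses) / len(labeled_responses)

def optimize_guidelines(guidelines, labeled_responses, grade, reflect, refine, rounds=3):
    """Self-reflective optimization of grading guidelines (sketch).

    Each round, responses the grader gets wrong are passed to the reflector;
    the refiner uses the reflector's error analysis to rewrite the
    guidelines, and the best-scoring version is kept.
    """
    best = guidelines
    best_acc = accuracy(best, labeled_responses, grade)
    for _ in range(rounds):
        errors = [(r, y) for r, y in labeled_responses if grade(best, r) != y]
        if not errors:
            break  # grader already agrees with every label
        candidate = refine(best, reflect(best, errors))
        acc = accuracy(candidate, labeled_responses, grade)
        if acc > best_acc:
            best, best_acc = candidate, acc
    return best, best_acc
```

The loop terminates early once the grader matches all human labels, mirroring the abstract's claim that self-reflection on errors drives the guideline updates.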
-
Creating a biomedical knowledge base by addressing GPT inaccurate responses and benchmarking context
We created GNQA, a generative pre-trained transformer (GPT) knowledge base driven by a performant retrieval-augmented generation (RAG) system with a focus on aging, dementia, Alzheimer’s and diabetes. We uploaded a corpus of three thousand peer-reviewed publications on these topics into the RAG. To address concerns about inaccurate responses and GPT ‘hallucinations’, we implemented a context provenance tracking mechanism that enables researchers to validate responses against the original material and to get references to the original papers. To assess the effectiveness of contextual information, we collected evaluations and feedback from both domain expert users and ‘citizen scientists’ on the relevance of GPT responses. A key innovation of our study is automated evaluation by way of a RAG assessment system (RAGAS). RAGAS combines human expert assessment with AI-driven evaluation to measure the effectiveness of RAG systems. When evaluating the responses to their questions, human respondents give a “thumbs-up” 76% of the time, while RAGAS scores 90% on answer relevance for questions posed by experts and 74% for GPT-generated questions. With RAGAS we created a benchmark that can be used to continuously assess the performance of our knowledge base. Full GNQA functionality is embedded in the free GeneNetwork.org web service, an open-source system containing over 25 years of experimental data on model organisms and humans. The code developed for this study is published under a free and open-source software license at https://git.genenetwork.org/gn-ai/tree/README.md.
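The provenance-tracking idea described above, returning source references alongside each generated answer so responses can be validated against the original papers, can be sketched as follows. This is an illustrative sketch, not the GNQA implementation: `retrieve` and `generate` stand in for the retriever and the GPT model, and all names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """A retrieved passage, tagged with the publication it came from."""
    text: str
    paper_id: str  # reference back to the original peer-reviewed paper

def answer_with_provenance(question, corpus, retrieve, generate, k=3):
    """RAG answer plus context provenance (sketch).

    Returns the generated answer together with the sorted, de-duplicated
    IDs of the source papers backing it, so a researcher can check the
    response against the original material.
    """
    context = retrieve(question, corpus, k)
    answer = generate(question, [c.text for c in context])
    return answer, sorted({c.paper_id for c in context})
```

Keeping the `paper_id` attached to every chunk from retrieval through generation is what makes each answer auditable rather than a free-floating model claim.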
-
Martin, Fred; Norouzi, Narges; Rosenthal, Stephanie (Ed.)
This paper examines the use of LLMs to support the grading and explanation of short-answer formative assessments in K12 science topics. While significant work has been done on programmatically scoring well-structured student assessments in math and computer science, many of these approaches produce a numerical score and stop short of providing teachers and students with explanations for the assigned scores. In this paper, we investigate few-shot, in-context learning with chain-of-thought reasoning and active learning using GPT-4 for automated assessment of students’ answers in a middle school Earth Science curriculum. Our findings from this human-in-the-loop approach demonstrate success in scoring formative assessment responses and in providing meaningful explanations for the assigned score. We then perform a systematic analysis of the advantages and limitations of our approach. This research provides insight into how we can use human-in-the-loop methods for the continual improvement of automated grading for open-ended science assessments.
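The few-shot, chain-of-thought setup this abstract describes, where scored examples with reasoning are placed in context and the model is asked for an explanation alongside the score, can be sketched as below. The prompt layout, function names, and output format are illustrative assumptions; the GPT-4 call is not shown.

```python
def build_grading_prompt(rubric, examples, student_answer):
    """Few-shot, chain-of-thought grading prompt (sketch).

    Each example pairs a past answer with its reasoning and score, showing
    the model how to reason before scoring; the final instruction requests
    an explanation with the score, giving teachers and students more than
    a bare number.
    """
    parts = ["Rubric:\n" + rubric]
    for answer, reasoning, score in examples:
        parts.append(f"Answer: {answer}\nReasoning: {reasoning}\nScore: {score}")
    parts.append(f"Answer: {student_answer}\nThink step by step, then give "
                 "'Score: <n>' and 'Explanation: <why>'.")
    return "\n\n".join(parts)

def parse_score(model_output):
    """Extract the numeric score from a reply; None if no score line found."""
    for line in model_output.splitlines():
        if line.startswith("Score:"):
            return int(line.split(":", 1)[1].strip())
    return None
```

In a human-in-the-loop workflow, responses for which `parse_score` returns `None` or a disputed value would be routed to a teacher, and corrected examples could be added back to the few-shot pool (the active-learning element the abstract mentions).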