Title: Strategies for Deploying Unreliable AI Graders in High-Transparency High-Stakes Exams
We describe the deployment of an imperfect NLP-based automatic short answer grading system on an exam in a large-enrollment introductory college course. We characterize this deployment as both high stakes (the questions were on a midterm exam worth 10% of students’ final grade) and high transparency (the question was graded interactively during the computer-based exam, and correct solutions were shown to students for comparison with their own answers). We study two techniques designed to mitigate the potential student dissatisfaction that results when students are incorrectly denied credit by the imperfect AI grader. We find (1) that providing multiple attempts can eliminate first-attempt false negatives at the cost of additional false positives, and (2) that students who are not granted credit by the algorithm cannot reliably determine whether their answer was mis-scored.
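As a rough illustration of the first finding, the sketch below simulates an imperfect grader under a multiple-attempt policy. The error rates, function names, and the assumption that each attempt is regraded independently are ours, not the paper's; it is a minimal model of why retries push first-attempt false negatives toward zero while accumulating false positives.

```python
# Minimal sketch: an imperfect grader with multiple attempts.
# Error rates are hypothetical, not taken from the paper.
import random

P_FALSE_NEG = 0.05   # assumed rate: a correct answer is denied credit
P_FALSE_POS = 0.03   # assumed rate: an incorrect answer is granted credit

def graded_correct(answer_is_correct):
    """One pass through a hypothetical imperfect AI grader."""
    if answer_is_correct:
        return random.random() > P_FALSE_NEG
    return random.random() < P_FALSE_POS

def credit_granted(answer_is_correct, attempts):
    """Credit is granted if any attempt is scored as correct."""
    return any(graded_correct(answer_is_correct) for _ in range(attempts))

random.seed(0)
N = 100_000
for attempts in (1, 2, 3):
    fn = sum(not credit_granted(True, attempts) for _ in range(N)) / N
    fp = sum(credit_granted(False, attempts) for _ in range(N)) / N
    print(f"attempts={attempts}: false negatives {fn:.4f}, "
          f"false positives {fp:.4f}")
```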
Award ID(s):
1915257
PAR ID:
10200291
Journal Name:
International Conference on Artificial Intelligence in Education
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    We explore how course policies affect students' studying and learning when a second-chance exam is offered. High-stakes, one-off exams remain a de facto standard for assessing student knowledge in STEM, despite compelling evidence that other assessment paradigms, such as mastery learning, can improve student learning. Unfortunately, mastery learning can be costly to implement. We explore the use of optional second-chance testing to sustainably reap the benefits of mastery-based learning at scale. Prior work has shown that course policies affect students' studying and learning but has not compared these effects within the same course context. We conducted a quasi-experimental study in a single course to compare the effect of two grading policies for second-chance exams and the effect of increasing the size of the range of dates for students taking asynchronous exams. The first grading policy, called 90-cap, allowed students to optionally take a second-chance exam that would fully replace their score on a first-chance exam, except that the second-chance exam would be capped at 90% credit. The second grading policy, called 90-10, combined students' first- and second-chance exam scores as a weighted average (90% max score + 10% min score). The 90-10 policy significantly increased the likelihood that marginally competent students would take the second-chance exam. Further, our data suggest that students learned more under the 90-10 policy, providing improved student learning outcomes at no cost to the instructor. Most students took exams on the last day an exam was available, regardless of how many days the exam was available.
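The two policies reduce to simple score formulas. Here is a minimal sketch, assuming a 0-100 scale and that a student keeps the better outcome under 90-cap; the function names are ours.

```python
def ninety_cap(first, second):
    """90-cap: the retake replaces the first score but is capped at 90.
    We assume the better of the two outcomes is kept."""
    return max(first, min(second, 90.0))

def ninety_ten(first, second):
    """90-10: weighted average of 90% of the better score
    plus 10% of the worse score."""
    return 0.9 * max(first, second) + 0.1 * min(first, second)

# Example: a marginal student scores 60, then 95 on the retake.
print(ninety_cap(60, 95))   # 90.0
print(ninety_ten(60, 95))   # 91.5
```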
  2. Computer-based testing is a powerful tool for scaling exams in large lecture classes. The decision to adopt computer-based testing is typically framed as a tradeoff in terms of time: time saved by auto-grading is reallocated as time spent developing problem pools, with significant net savings. This paper examines the tradeoff in terms of accuracy in measuring student understanding. While some exams (e.g., multiple choice) are readily portable to a computer-based format, adequately porting other exam types (e.g., drawings like FBDs or worked problems) can be challenging. A key component of this challenge is to ask, "What is the exam actually able to measure?" In this paper, the authors provide a quantitative and qualitative analysis of student understanding measurements via computer-based testing in a sophomore-level Solid Mechanics course. At Michigan State University, Solid Mechanics is taught using the SMART methodology. SMART stands for Supported Mastery Assessment through Repeated Testing. In a typical semester, students are given 5 exams that test their understanding of the material. Each exam is graded using the SMART rubric, which awards full points for the correct answer, some percentage for non-conceptual errors, and zero points for a solution that has a conceptual error. Every exam is divided into four sections: concept, simple, average, and challenge. Each exam has at least one retake opportunity, for a total of 10 written tests. In the current study, students representing 10% of the class took half of each exam in PrairieLearn, a computer-based auto-grading platform. During this exam, students were given instant feedback on submitted answers (correct or incorrect) and an opportunity to identify their mistakes and resubmit their work. Students were provided with scratch paper to set up the problems and work out solutions. After the exam, the paper-based work was compared with the computer-submitted answers. This paper examines what types of mistakes (conceptual and non-conceptual) students were able to correct when feedback was provided. The answer depends on the type and difficulty of the problem. The analysis also examines whether students taking the computer-based test performed at the same level as their peers who took the paper-based exams. Additionally, student feedback is provided and discussed.
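A minimal sketch of the SMART rubric's scoring logic as described above; the partial-credit fraction is an assumption, since the abstract says only "some percentage".

```python
PARTIAL_CREDIT = 0.7  # hypothetical fraction for non-conceptual errors

def smart_score(max_points, conceptual_error, nonconceptual_error):
    """SMART rubric: conceptual errors earn zero, non-conceptual
    errors earn partial credit, correct answers earn full points."""
    if conceptual_error:
        return 0.0
    if nonconceptual_error:
        return PARTIAL_CREDIT * max_points
    return max_points

print(smart_score(10, False, False))  # 10.0 -> fully correct
print(smart_score(10, False, True))   # 7.0  -> e.g., an algebra slip
print(smart_score(10, True, False))   # 0.0  -> e.g., wrong free-body diagram
```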
  3. Carvalho, Paulo F. (Ed.)
    Evidence-based teaching practices are associated with improved student academic performance. However, these practices encompass a wide range of activities, and determining which type, intensity, or duration of activity is effective at improving student exam performance has been elusive. To address this shortcoming, we used a previously validated classroom observation tool, the Practical Observation Rubric to Assess Active Learning (PORTAAL), to measure the presence, intensity, and duration of evidence-based teaching practices in a retrospective study of upper- and lower-division biology courses. We determined the cognitive challenge of exams by categorizing all exam questions obtained from the courses using Bloom’s Taxonomy of Cognitive Domains. We used structural equation modeling to correlate the PORTAAL practices with exam performance while controlling for the cognitive challenge of exams, students’ GPA at the start of the term, and students’ demographic factors. Small-group activities, randomly calling on students or groups to answer questions, explaining alternative answers, and total time students were thinking, working with others, or answering questions had positive correlations with exam performance. On exams at higher Bloom’s levels, students explaining the reasoning underlying their answers, students working alone, and receiving positive feedback from the instructor also correlated with increased exam performance. Our study is the first to demonstrate a correlation between the intensity or duration of evidence-based PORTAAL practices and student exam performance while controlling for the Bloom’s level of exams, as well as looking more specifically at which practices correlate with performance on exams at low and high Bloom’s levels. This level of detail will provide valuable insights for faculty as they prioritize changes to their teaching. As we found that multiple PORTAAL practices had a positive association with exam performance, it may be encouraging for instructors to realize that there are many ways to benefit students’ learning by incorporating these evidence-based teaching practices.
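The study fit structural equation models; as a simplified stand-in, the sketch below uses an ordinary least squares regression to relate one PORTAAL-style practice measure to exam scores while controlling for incoming GPA and the exam's Bloom's level. The variable names and synthetic data are ours, purely for illustration.

```python
# Simplified stand-in for the study's SEM analysis: a linear model
# with covariates. All data below are synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "gpa": rng.normal(3.0, 0.4, n).clip(0, 4),
    "group_work_minutes": rng.uniform(0, 30, n),  # PORTAAL-style measure
    "high_bloom": rng.integers(0, 2, n),          # 1 = higher-order exam
})
# Synthetic outcome with a small positive effect of group work
df["exam_score"] = (50 + 10 * df.gpa + 0.3 * df.group_work_minutes
                    - 5 * df.high_bloom + rng.normal(0, 8, n))

model = smf.ols("exam_score ~ group_work_minutes + gpa + high_bloom",
                data=df).fit()
print(model.params)  # estimated effect of the practice, net of controls
```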
  4.
    In this paper, we study a computerized exam system that allows students to attempt the same question multiple times. The system permits students either to receive feedback on a submitted answer immediately or to defer feedback and have questions graded in bulk. An analysis of student behavior in three courses across two semesters found similar student behaviors across courses and student groups. We found that only a small minority of students used the deferred-feedback option. A clustering analysis that considered both when students chose to receive feedback and whether they immediately retried incorrect problems or moved on to other unfinished problems identified four main student strategies. These strategies correlated with statistically significant differences in exam scores, but it was not clear whether some strategies improved outcomes or whether stronger students tended to prefer certain strategies.
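As an illustration of the kind of clustering such an analysis might use, here is a minimal sketch with two behavioral features per student and four clusters, matching the four strategies reported; the features and data are ours, not the authors' pipeline.

```python
# Illustrative clustering of exam-taking strategies; data are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# One row per student: fraction of questions graded immediately, and
# fraction of incorrect answers retried right away (hypothetical features).
X = np.column_stack([
    rng.uniform(0, 1, 300),   # immediate-feedback rate
    rng.uniform(0, 1, 300),   # immediate-retry rate
])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
print(np.bincount(labels))  # students per strategy cluster
```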
  5. We study a game-theoretic model of standardized testing for college admissions. Students are of two types: High and Low. A college would like to admit the High-type students. Students take a potentially costly standardized exam that provides a noisy signal of their type. The students come from two populations that are identical in talent (i.e., the type distribution is the same) but differ in their access to resources: the higher-resourced population can, at their option, take the exam multiple times, whereas the lower-resourced population can take the exam only once. We study two models of score reporting, which capture existing policies used by colleges. The first policy (sometimes known as "super-scoring") allows students to report the maximum of the scores they achieve. The other policy requires that all scores be reported. We find in our model that requiring all scores to be reported results in superior outcomes in equilibrium, both from the perspective of the college (the admissions rule is more accurate) and from the perspective of equity across populations: a student's probability of admission is independent of their population, conditional on their type. In particular, the false positive and false negative rates are identical across the highly and poorly resourced student populations in this setting. This holds despite the fact that the more highly resourced students can, at their option, either report a more accurate signal of their type or pool with the lower-resourced population under this policy.
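Below is a naive Monte Carlo sketch of the two reporting policies. It ignores strategic behavior and exam costs, so it illustrates the mechanics rather than reproducing the paper's equilibrium result; the score distributions, retake count, and admission threshold are our assumptions.

```python
# Naive simulation of "super-scoring" (report max) vs. all-scores
# reporting (college averages all attempts here). All parameters
# are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
MU = {"High": 1.0, "Low": 0.0}   # assumed mean score by type
RETAKES = 3                       # attempts for the resourced population

def admit_rates(policy, attempts, threshold=0.5):
    """Return (false-positive, false-negative) admission rates."""
    admitted = {}
    for typ in ("High", "Low"):
        scores = rng.normal(MU[typ], 1.0, size=(N, attempts))
        if policy == "max":      # super-scoring: report the best attempt
            stat = scores.max(axis=1)
        else:                    # all scores reported: average all attempts
            stat = scores.mean(axis=1)
        admitted[typ] = (stat >= threshold).mean()
    return admitted["Low"], 1 - admitted["High"]

for policy in ("max", "all"):
    fp1, fn1 = admit_rates(policy, attempts=1)        # low-resource students
    fpk, fnk = admit_rates(policy, attempts=RETAKES)  # high-resource students
    print(f"{policy:>3}: 1 attempt FP={fp1:.3f} FN={fn1:.3f} | "
          f"{RETAKES} attempts FP={fpk:.3f} FN={fnk:.3f}")
```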