Title: Fair Grading Algorithms for Randomized Exams
This paper studies grading algorithms for randomized exams. In a randomized exam, each student is asked a small number of random questions from a large question bank. The predominant grading rule is simple averaging, i.e., computing each student's grade as the average score on the questions they were asked; this rule is fair ex-ante, over the randomized questions, but not fair ex-post, on the realized questions. The fair grading problem is to estimate each student's average grade over the full question bank. The maximum-likelihood estimator for the Bradley-Terry-Luce model on the bipartite student-question graph is shown to be consistent with high probability when the number of questions asked of each student is at least the cube of the logarithm of the number of students. In an empirical study on exam data and in simulations, our algorithm based on the maximum-likelihood estimator significantly outperforms simple averaging in prediction accuracy and ex-post fairness, even with a small class and exam size.
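The following is a minimal, illustrative sketch (not the paper's implementation) of the approach the abstract describes: fit a Bradley-Terry-Luce/Rasch-style model by maximum likelihood on the bipartite student-question data, then report each student's expected average grade over the full question bank. It assumes binary 0/1 question scores, and all names and numerical choices are assumptions for illustration; the paper's exact model, optimizer, and guarantees may differ.

    # Sketch: MLE for a BTL/Rasch-style model on a sparse student-question
    # score matrix, then prediction of each student's average grade over the
    # full question bank. Assumes binary scores; NaN marks unasked questions.
    import numpy as np
    from scipy.optimize import minimize

    def fair_grades(scores):
        """scores: (n_students, n_questions) array of 0/1 entries, with NaN
        where a student was not asked the question. Returns one grade per student."""
        n, m = scores.shape
        asked = ~np.isnan(scores)
        filled = np.nan_to_num(scores)  # NaNs -> 0; masked out below

        def neg_log_likelihood(params):
            ability, difficulty = params[:n], params[n:]
            logits = ability[:, None] - difficulty[None, :]
            p = 1.0 / (1.0 + np.exp(-logits))
            ll = filled * np.log(p + 1e-12) + (1 - filled) * np.log(1 - p + 1e-12)
            return -(ll * asked).sum()

        result = minimize(neg_log_likelihood, np.zeros(n + m), method="L-BFGS-B")
        ability, difficulty = result.x[:n], result.x[n:]
        # Fair grade: expected score averaged over the *full* question bank,
        # not just the questions the student happened to be asked.
        probs = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
        return probs.mean(axis=1)

Simple averaging, by contrast, would take each student's mean over only the questions they were asked, which is the ex-post unfairness the paper targets.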
Award ID(s):
2229162
PAR ID:
10472227
Author(s) / Creator(s):
; ;
Editor(s):
Kunal Talwar
Publisher / Repository:
Schloss Dagstuhl
Date Published:
Journal Name:
4th Symposium on Foundations of Responsible Computing (FORC 2023)
Volume:
256
ISSN:
1868-8969
ISBN:
978-3-95977-272-3
Page Range / eLocation ID:
7:1--7:22
Subject(s) / Keyword(s):
Ex-ante and Ex-post Fairness; Item Response Theory; Algorithmic Fairness in Education
Format(s):
Medium: X
Location:
Stanford, CA
Sponsoring Org:
National Science Foundation
More Like this
  1.
    We describe the deployment of an imperfect NLP-based automatic short answer grading system on an exam in a large-enrollment introductory college course. We characterize this deployment as both high stakes (the questions were on a midterm exam worth 10% of students' final grade) and high transparency (the question was graded interactively during the computer-based exam, and correct solutions were shown to students, who could compare them to their own answers). We study two techniques designed to mitigate the potential student dissatisfaction that results when students are incorrectly denied credit by the imperfect AI grader. We find (1) that providing multiple attempts can eliminate first-attempt false negatives at the cost of additional false positives, and (2) that students not granted credit by the algorithm cannot reliably determine whether their answer was mis-scored.
  2.
    We explore how course policies affect students' studying and learning when a second-chance exam is offered. High-stakes, one-off exams remain a de facto standard for assessing student knowledge in STEM, despite compelling evidence that other assessment paradigms, such as mastery learning, can improve student learning. Unfortunately, mastery learning can be costly to implement. We explore the use of optional second-chance testing to sustainably reap the benefits of mastery-based learning at scale. Prior work has shown that course policies affect students' studying and learning but has not compared these effects within the same course context. We conducted a quasi-experimental study in a single course to compare the effect of two grading policies for second-chance exams and the effect of increasing the range of dates available for taking asynchronous exams. The first grading policy, called 90-cap, allowed students to optionally take a second-chance exam that would fully replace their score on the first-chance exam, except that the second-chance exam would be capped at 90% credit. The second grading policy, called 90-10, combined students' first- and second-chance exam scores as a weighted average (90% max score + 10% min score). The 90-10 policy significantly increased the likelihood that marginally competent students would take the second-chance exam. Further, our data suggest that students learned more under the 90-10 policy, providing improved student learning outcomes at no cost to the instructor. Most students took exams on the last day an exam was available, regardless of how many days the exam was available.
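    As an illustration only, a literal reading of the two grading policies could be scored as follows (0-100 scale; the function names and the handling of skipped retakes are assumptions, not details reported in the study).

        # Illustrative scoring of the two second-chance policies described above.
        # Rounding and the treatment of skipped retakes are assumptions.
        def grade_90_cap(first_chance, second_chance=None):
            # Optional retake fully replaces the first-chance score, capped at 90.
            if second_chance is None:
                return first_chance
            return min(second_chance, 90.0)

        def grade_90_10(first_chance, second_chance=None):
            # Weighted average: 90% of the higher score plus 10% of the lower one.
            if second_chance is None:
                return first_chance
            return 0.9 * max(first_chance, second_chance) + 0.1 * min(first_chance, second_chance)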
  3. The optimal receiver operating characteristic (ROC) curve, giving the maximum probability of detection as a function of the probability of false alarm, is a key information-theoretic indicator of the difficulty of a binary hypothesis testing problem (BHT). It is well known that the optimal ROC curve for a given BHT, corresponding to the likelihood ratio test, is theoretically determined by the probability distribution of the observed data under each of the two hypotheses. In some cases, these two distributions may be unknown or computationally intractable, but independent samples of the likelihood ratio can be observed. This raises the problem of estimating the optimal ROC curve for a BHT from such samples. The maximum likelihood estimator of the optimal ROC curve is derived, and it is shown to converge to the true optimal ROC curve in the Lévy metric as the number of observations tends to infinity. A classical empirical estimator, based on estimating the two types of error probabilities from two separate sets of samples, is also considered. In simulation experiments, the maximum likelihood estimator is observed to be considerably more accurate than the empirical estimator, especially when the number of samples obtained under one of the two hypotheses is small. The area under the maximum likelihood estimator is also derived; it is a consistent estimator of the area under the true optimal ROC curve.
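    For context, a minimal sketch of the classical empirical estimator mentioned in this abstract (not the paper's maximum-likelihood estimator) could look like the following, assuming one set of likelihood-ratio samples per hypothesis.

        # Classical empirical ROC estimate from likelihood-ratio samples: sweep a
        # threshold and estimate P(false alarm) and P(detection) from the two
        # sample sets drawn under H0 and H1. Shown for comparison only; the
        # paper's maximum-likelihood estimator is a different construction.
        import numpy as np

        def empirical_roc(lr_h0, lr_h1):
            """lr_h0, lr_h1: 1-D arrays of likelihood-ratio samples under H0, H1.
            Returns arrays (p_fa, p_d) tracing the estimated ROC curve."""
            thresholds = np.unique(np.concatenate([lr_h0, lr_h1]))
            p_fa = np.array([(lr_h0 >= t).mean() for t in thresholds])
            p_d = np.array([(lr_h1 >= t).mean() for t in thresholds])
            return p_fa, p_d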
  4. In schools and colleges around the world, open-ended homework assignments are commonly used. However, such assignments require substantial instructor effort to grade and tend not to support opportunities for repeated practice. We propose UpGrade, a novel learnersourcing approach that generates scalable learning opportunities using prior student solutions to open-ended problems. UpGrade creates interactive questions that offer automated, real-time feedback while enabling repeated practice. In a two-week experiment in a college-level HCI course, students answering UpGrade-created questions instead of traditional open-ended assignments achieved indistinguishable learning outcomes in ~30% less time, with no manual grading effort required. To enhance quality control, UpGrade incorporates a psychometric approach that uses crowd workers' answers to automatically prune low-quality questions, resulting in a question bank that exceeds reliability standards for classroom use.
  5. Computer-based testing is a powerful tool for scaling exams in large lecture classes. The decision to adopt computer-based testing is typically framed as a tradeoff in terms of time: time saved by auto-grading is reallocated to developing problem pools, but with significant net savings. This paper examines the tradeoff in terms of accuracy in measuring student understanding. While some exams (e.g., multiple choice) are readily portable to a computer-based format, adequately porting other exam types (e.g., drawings such as FBDs, or worked problems) can be challenging. A key component of this challenge is to ask, "What is the exam actually able to measure?" In this paper the authors provide a quantitative and qualitative analysis of student understanding measured via computer-based testing in a sophomore-level Solid Mechanics course. At Michigan State University, Solid Mechanics is taught using the SMART methodology, where SMART stands for Supported Mastery Assessment through Repeated Testing. In a typical semester, students are given 5 exams that test their understanding of the material. Each exam is graded using the SMART rubric, which awards full points for a correct answer, some percentage for non-conceptual errors, and zero points for a solution that contains a conceptual error. Every exam is divided into four sections: concept, simple, average, and challenge. Each exam has at least one retake opportunity, for a total of 10 written tests. In the current study, students representing 10% of the class took half of each exam in PrairieLearn, a computer-based auto-grading platform. During this exam, students were given instant feedback on submitted answers (correct or incorrect) and an opportunity to identify their mistakes and resubmit their work. Students were provided with scratch paper to set up the problems and work out solutions. After the exam, the paper-based work was compared with the computer-submitted answers. This paper examines what types of mistakes (conceptual and non-conceptual) students were able to correct when feedback was provided; the answer depends on the type and difficulty of the problem. The analysis also examines whether students taking the computer-based test performed at the same level as their peers who took the paper-based exams. Additionally, student feedback is provided and discussed.
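    As a purely illustrative sketch of the SMART rubric described above (the partial-credit fraction is an assumed placeholder, not a value reported by the authors):

        # SMART-style rubric scoring: full credit for a correct solution, partial
        # credit for a non-conceptual error, zero for a conceptual error. The
        # partial-credit fraction here is an assumed placeholder.
        def smart_score(max_points, outcome, partial_fraction=0.5):
            """outcome: 'correct', 'non_conceptual_error', or 'conceptual_error'."""
            if outcome == "correct":
                return max_points
            if outcome == "non_conceptual_error":
                return partial_fraction * max_points
            return 0.0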