skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: What Do Peer Evaluations Represent? A Study of Rater Consensus and Target Personality
When professors assign group work, they assume that peer ratings are a valid source of information, but few studies have evaluated rater consensus in such ratings. We analyzed peer ratings from project teams in a second-year university course to examine consensus. Our first goal was to examine whether members of a team generally agreed on the competence of each team member. Our second goal was to test if a target’s personality traits predicted how well they were rated. Our third goal was to evaluate whether the self-rating of each student correlated with their peer rating. Data were analyzed from 130 students distributed across 21 teams (mean team size = 6.2). The sample was diverse in gender and ethnicity. Social relations model analyses showed that on average 32% of variance in peer-ratings was due to “consensus,” meaning some targets consistently received higher skill ratings than other targets did. Another 20% of the variance was due to “assimilation,” meaning some raters consistently gave higher ratings than other raters did. Thus, peer ratings reflected consensus (target effects), but also assimilation (rater effects) and noise. Among the six HEXACO traits that we examined, only conscientiousness predicted higher peer ratings, suggesting it may be beneficial to assign one highly conscientious person to every team. Lastly, there was an average correlation of.35 between target effects and self-ratings, indicating moderate self-other agreement, which suggests that students were only weakly biased in their self-ratings.  more » « less
Award ID(s):
1730262
PAR ID:
10335853
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Frontiers in Education
Volume:
7
ISSN:
2504-284X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Ratings are present in many areas of assessment including peer review of research proposals and journal articles, teacher observations, university admissions and selection of new hires. One feature present in any rating process with multiple raters is that different raters often assign different scores to the same assessee, with the potential for bias and inconsistencies related to rater or assessee covariates. This paper analyzes disparities in ratings of internal and external applicants to teaching positions using applicant data from Spokane Public Schools. We first test for biases in rating while accounting for measures of teacher applicant qualifications and quality. Then, we develop model-based inter-rater reliability (IRR) estimates that allow us to account for various sources of measurement error, the hierarchical structure of the data, and to test whether covariates, such as applicant status, moderate IRR. We find that applicants external to the district receive lower ratings for job applications compared to internal applicants. This gap in ratings remains significant even after including measures of qualifications and quality such as experience, state licensure scores, or estimated teacher value added. With model-based IRR, we further show that consistency between raters is significantly lower when rating external applicants. We conclude the paper by discussing policy implications and possible applications of our model-based IRR estimate for hiring and selection practices in and out of the teacher labor market. 
    more » « less
  2. Assessing team software development projects is notoriously difficult and typically based on subjective metrics. To help make assessments more rigorous, we conducted an empirical study to explore relationships between subjective metrics based on peer and instructor assessments, and objective metrics based on GitHub and chat data. We studied 23 undergraduate software teams (n= 117 students) from two undergraduate computing courses at two North American research universities. We collected data on teams’ (a) commits and issues from their GitHub code repositories, (b) chat messages from their Slack and Microsoft Teams channels, (c) peer evaluation ratings from the CATME peer evaluation system, and (d) individual assignment grades from the courses. We derived metrics from (a) and (b) to measure both individual team members’contributionsto the team, and theequalityof team members’ contributions. We then performed Pearson analyses to identify correlations among the metrics, peer evaluation ratings, and individual grades. We found significant positive correlations between team members’ GitHub contributions, chat contributions, and peer evaluation ratings. In addition, the equality of teams’ GitHub contributions was positively correlated with teams’ average peer evaluation ratings and negatively correlated with the variance in those ratings. However, no such positive correlations were detected between the equality of teams’ chat contributions and their peer evaluation ratings. Our study extends previous research results by providing evidence that (a) team members’ chat contributions, like their GitHub contributions, are positively correlated with their peer evaluation ratings; (b) team members’ chat contributions are positively correlated with their GitHub contributions; and (c) the equality of team’ GitHub contributions is positively correlated with their peer evaluation ratings. These results lend further support to the idea that combining objective and subjective metrics can make the assessment of team software projects more comprehensive and rigorous. 
    more » « less
  3. Rated preference aggregation is conventionally performed by averaging ratings from multiple evaluators to create a consensus ordering of candidates from highest to lowest average rating. Ideally, the consensus is fair, meaning critical opportunities are not withheld from marginalized groups of candidates, even if group biases may be present in the to-be-combined ratings. Prior work operationalizing fairness in preference aggregation is limited to settings where evaluators provide rankings of candidates (e.g., Joe > Jack > Jill). Yet, in practice, many evaluators assign ratings such as Likert scales or categories (e.g., yes, no, maybe) to each candidate. Ratings convey different information than rankings leading to distinct fairness issues during their aggregation. The existing literature does not characterize these fairness concerns nor provide applicable bias-mitigation solutions. Unlike the ranked setting studied previously, two unique forms of bias arise in rating aggregation. First, biased rating stems from group disparities in to-be-aggregated evaluator ratings. Second, biased tie-breaking occurs because ties in average ratings must be resolved when aggregating ratings into a consensus ranking, and this tie-breaking act can unfairly advantage certain groups. To address this gap, we define the open fair rated preference aggregation problem and introduce the corresponding Fate methodology. Fate offers the first group fairness metric specifically for rated preference data. We propose two Fate algorithms. Fate-Break works in settings when ties need to be broken, explicitly fairness-enhancing such processes without lowering consensus utility. Fate-Rate mitigates disparities in how groups are rated, by using a Markov-chain approach to generate outcomes where groups are, in as much as possible, equally represented. Our experimental study illustrates the FATE methods provide the most bias-mitigation compared to adapting prior methods to fair tie-breaking and rating aggregation. 
    more » « less
  4. This evidence-based practices paper discusses the method employed in validating the use of a project modified version of the PROCESS tool (Grigg, Van Dyken, Benson, & Morkos, 2013) for measuring student problem solving skills. The PROCESS tool allows raters to score students’ ability in the domains of Problem definition, Representing the problem, Organizing information, Calculations, Evaluating the solution, Solution communication, and Self-assessment. Specifically, this research compares student performance on solving traditional textbook problems with novel, student-generated learning activities (i.e. reverse engineering videos in order to then create their own homework problem and solution). The use of student-generated learning activities to assess student problem solving skills has theoretical underpinning in Felder’s (1987) work of “creating creative engineers,” as well as the need to develop students’ abilities to transfer learning and solve problems in a variety of real world settings. In this study, four raters used the PROCESS tool to score the performance of 70 students randomly selected from two undergraduate chemical engineering cohorts at two Midwest universities. Students from both cohorts solved 12 traditional textbook style problems and students from the second cohort solved an additional nine student-generated video problems. Any large scale assessment where multiple raters use a rating tool requires the investigation of several aspects of validity. The many-facets Rasch measurement model (MFRM; Linacre, 1989) has the psychometric properties to determine if there are any characteristics other than “student problem solving skills” that influence the scores assigned, such as rater bias, problem difficulty, or student demographics. Before implementing the full rating plan, MFRM was used to examine how raters interacted with the six items on the modified PROCESS tool to score a random selection of 20 students’ performance in solving one problem. An external evaluator led “inter-rater reliability” meetings where raters deliberated rationale for their ratings and differences were resolved by recourse to Pretz, et al.’s (2003) problem-solving cycle that informed the development of the PROCESS tool. To test the new understandings of the PROCESS tool, raters were assigned to score one new problem from a different randomly selected group of six students. Those results were then analyzed in the same manner as before. This iterative process resulted in substantial increases in reliability, which can be attributed to increased confidence that raters were operating with common definitions of the items on the PROCESS tool and rating with consistent and comparable severity. This presentation will include examples of the student-generated problems and a discussion of common discrepancies and solutions to the raters’ initial use of the PROCESS tool. Findings as well as the adapted PROCESS tool used in this study can be useful to engineering educators and engineering education researchers. 
    more » « less
  5. Human ratings are ubiquitous in creativity research. Yet the process of rating responses to creativity tasks—typically several hundred or thousands of responses, per rater—is often time consuming and expensive. Planned missing data designs, where raters only rate a subset of the total number of responses, have been recently proposed as one possible solution to decrease overall rating time and monetary costs. However, researchers also need ratings that adhere to psychometric standards, such as a certain degree of reliability, and psychometric work with planned missing designs is currently lacking in the literature. In this work, we introduce how judge response theory and simulations can be used to fine-tune planning of missing data designs. We provide open code for the community and illustrate our proposed approach by a cost-effectiveness calculation based on a realistic example. We clearly show that fine tuning helps to save time (to perform the ratings) and monetary costs, while simultaneously targeting expected levels of reliability. 
    more » « less