Title: Disparities in ratings of internal and external applicants: A case for model-based inter-rater reliability
Ratings are present in many areas of assessment, including peer review of research proposals and journal articles, teacher observations, university admissions, and the selection of new hires. One feature of any rating process with multiple raters is that different raters often assign different scores to the same assessee, with the potential for bias and inconsistencies related to rater or assessee covariates. This paper analyzes disparities in ratings of internal and external applicants to teaching positions using applicant data from Spokane Public Schools. We first test for biases in ratings while accounting for measures of teacher applicant qualifications and quality. We then develop model-based inter-rater reliability (IRR) estimates that allow us to account for various sources of measurement error and for the hierarchical structure of the data, and to test whether covariates, such as applicant status, moderate IRR. We find that applicants external to the district receive lower ratings on their job applications than internal applicants. This gap in ratings remains significant even after including measures of qualifications and quality such as experience, state licensure scores, or estimated teacher value added. With model-based IRR, we further show that consistency between raters is significantly lower when rating external applicants. We conclude by discussing policy implications and possible applications of our model-based IRR estimate for hiring and selection practices in and out of the teacher labor market.
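To make the model-based IRR idea concrete, here is a minimal sketch (not the authors' exact specification) that estimates IRR as a variance-component ratio from a crossed random-effects model and then compares internal and external applicants by refitting the model within each group. The data frame and its columns (rating, applicant_id, rater_id, external) are hypothetical.

```python
# Illustrative sketch only: model-based IRR as a variance-component ratio from a
# crossed random-effects model (applicants and raters both treated as random).
# Column names are hypothetical, not taken from the paper's data.
import pandas as pd
import statsmodels.formula.api as smf

def model_based_irr(df: pd.DataFrame) -> float:
    """IRR = applicant variance / (applicant + rater + residual variance)."""
    df = df.copy()
    df["one"] = 1  # single group so both factors enter as crossed variance components
    model = smf.mixedlm(
        "rating ~ 1",
        data=df,
        groups="one",
        re_formula="0",
        vc_formula={
            "applicant": "0 + C(applicant_id)",
            "rater": "0 + C(rater_id)",
        },
    )
    result = model.fit(reml=True)
    var_applicant, var_rater = result.vcomp  # order follows sorted vc_formula keys
    var_resid = result.scale
    return var_applicant / (var_applicant + var_rater + var_resid)

# Compare rater consistency for internal vs. external applicants:
# irr_internal = model_based_irr(ratings[ratings["external"] == 0])
# irr_external = model_based_irr(ratings[ratings["external"] == 1])
```

Splitting the sample by applicant status is a simple stand-in for the paper's approach of letting a covariate moderate IRR within a single model.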
Award ID(s):
1759825
PAR ID:
10081885
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
PLoS ONE
Volume:
13
Issue:
10
ISSN:
1932-6203
Page Range / eLocation ID:
e0203002
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. When professors assign group work, they assume that peer ratings are a valid source of information, but few studies have evaluated rater consensus in such ratings. We analyzed peer ratings from project teams in a second-year university course to examine consensus. Our first goal was to examine whether members of a team generally agreed on the competence of each team member. Our second goal was to test whether a target's personality traits predicted how well they were rated. Our third goal was to evaluate whether the self-rating of each student correlated with their peer rating. Data were analyzed from 130 students distributed across 21 teams (mean team size = 6.2). The sample was diverse in gender and ethnicity. Social relations model analyses showed that, on average, 32% of the variance in peer ratings was due to "consensus," meaning some targets consistently received higher skill ratings than other targets did. Another 20% of the variance was due to "assimilation," meaning some raters consistently gave higher ratings than other raters did. Thus, peer ratings reflected consensus (target effects), but also assimilation (rater effects) and noise. Among the six HEXACO traits that we examined, only conscientiousness predicted higher peer ratings, suggesting it may be beneficial to assign one highly conscientious person to every team. Lastly, there was an average correlation of .35 between target effects and self-ratings, indicating moderate self-other agreement and suggesting that students were only weakly biased in their self-ratings.
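The consensus/assimilation split described above can be illustrated with a naive row/column decomposition of one team's round-robin peer ratings; a minimal sketch with made-up data follows. The full social relations model uses bias-corrected estimators pooled over many groups, so the shares printed here are purely illustrative rather than estimates of the reported 32% and 20%.

```python
# Naive illustration of "consensus" (target) vs. "assimilation" (rater) variance in
# one team's round-robin peer ratings. Raters are rows, targets are columns, and the
# diagonal is missing because students do not peer-rate themselves. Data are made up.
import numpy as np

ratings = np.array([
    [np.nan, 6, 5, 7],   # rater 1's ratings of targets 2-4
    [5, np.nan, 4, 6],
    [7, 7, np.nan, 7],
    [4, 5, 3, np.nan],
], dtype=float)

grand = np.nanmean(ratings)
target_eff = np.nanmean(ratings, axis=0) - grand   # column means: consensus about each target
rater_eff = np.nanmean(ratings, axis=1) - grand    # row means: each rater's leniency/severity
resid = ratings - (grand + rater_eff[:, None] + target_eff[None, :])

var_target = np.nanvar(target_eff)
var_rater = np.nanvar(rater_eff)
var_resid = np.nanvar(resid)
total = var_target + var_rater + var_resid
print(f"consensus (target) share:   {var_target / total:.0%}")
print(f"assimilation (rater) share: {var_rater / total:.0%}")
print(f"residual share:             {var_resid / total:.0%}")
```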
  2. This evidence-based practices paper discusses the method employed in validating the use of a project-modified version of the PROCESS tool (Grigg, Van Dyken, Benson, & Morkos, 2013) for measuring student problem-solving skills. The PROCESS tool allows raters to score students' ability in the domains of Problem definition, Representing the problem, Organizing information, Calculations, Evaluating the solution, Solution communication, and Self-assessment. Specifically, this research compares student performance on solving traditional textbook problems with novel, student-generated learning activities (i.e., reverse-engineering videos in order to create their own homework problem and solution). The use of student-generated learning activities to assess student problem-solving skills has theoretical underpinning in Felder's (1987) work on "creating creative engineers," as well as the need to develop students' abilities to transfer learning and solve problems in a variety of real-world settings. In this study, four raters used the PROCESS tool to score the performance of 70 students randomly selected from two undergraduate chemical engineering cohorts at two Midwest universities. Students from both cohorts solved 12 traditional textbook-style problems, and students from the second cohort solved an additional nine student-generated video problems. Any large-scale assessment in which multiple raters use a rating tool requires the investigation of several aspects of validity. The many-facets Rasch measurement model (MFRM; Linacre, 1989) has the psychometric properties to determine whether any characteristics other than "student problem-solving skills" influence the scores assigned, such as rater bias, problem difficulty, or student demographics. Before implementing the full rating plan, MFRM was used to examine how raters interacted with the six items on the modified PROCESS tool to score a random selection of 20 students' performance in solving one problem. An external evaluator led "inter-rater reliability" meetings in which raters deliberated the rationale for their ratings, and differences were resolved by recourse to Pretz et al.'s (2003) problem-solving cycle, which informed the development of the PROCESS tool. To test the new understandings of the PROCESS tool, raters were assigned to score one new problem from a different randomly selected group of six students. Those results were then analyzed in the same manner as before. This iterative process resulted in substantial increases in reliability, which can be attributed to increased confidence that raters were operating with common definitions of the items on the PROCESS tool and rating with consistent and comparable severity. This presentation will include examples of the student-generated problems and a discussion of common discrepancies in, and solutions to, the raters' initial use of the PROCESS tool. The findings, as well as the adapted PROCESS tool used in this study, can be useful to engineering educators and engineering education researchers.
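As a rough illustration of separating rater severity from student ability and problem difficulty, the sketch below fits a dummy-coded linear decomposition to hypothetical score data. It is a simplified linear stand-in, not the many-facets Rasch model (MFRM) used in the study, and the column names are assumptions.

```python
# Simplified linear "facets" decomposition (an OLS stand-in, NOT the many-facets
# Rasch model used in the study): score ~ student ability + rater severity +
# problem difficulty via dummy coding. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def rater_severity(df: pd.DataFrame) -> pd.Series:
    """Return each rater's estimated severity/leniency relative to the reference rater."""
    fit = smf.ols("score ~ C(student) + C(rater) + C(item)", data=df).fit()
    # Coefficients named like "C(rater)[T.r2]" capture rater effects; a wide spread
    # would flag severity differences worth examining in a full MFRM analysis.
    return fit.params[fit.params.index.str.startswith("C(rater)")]

# Usage (hypothetical long-format data frame with columns: score, student, rater, item):
# print(rater_severity(ratings_long))
```

For the decomposition to be identified, the rating plan must link raters across students and problems rather than assigning each rater a disjoint block.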
  3. With the growing use of mixed reality teaching simulations in teacher education, there is a need for researchers to examine how preservice teacher (PST) learning can be supported when using these simulations. To address this gap, the current study explores how 47 PSTs used an online teaching simulation to facilitate a discussion focused on argumentation with five student avatars in the Mursion™ mixed reality simulated classroom environment. We assessed PSTs' performance in the simulation using rubric-level scores assigned by trained raters and then compared those scores to PSTs' responses to a post-discussion survey that asked them to self-report their goals for the discussion, how successful they thought they were across five dimensions of facilitating high-quality, argumentation-focused discussions, and their overall perceptions of the mixed reality teaching simulation. Findings suggest that PSTs' understanding of the discussion task's learning goals somewhat predicted their success in facilitating the discussion, and that PSTs' self-assessment of their performance was not always consistent with raters' evaluation of that performance. In particular, self-assessment was most consistent with raters' evaluations for PSTs with higher rater-assigned scores and least consistent for those with lower rater-assigned scores. The implications of these findings are as follows: (1) researchers should be cautious in relying on PST self-reports of success when engaging in mixed reality teaching simulations, particularly because low performance may be obscured; (2) teacher educators should be aware that reliance on self-reports from PSTs likely obscures the need for additional support for exactly those PSTs who need it most; and (3) the field, therefore, should expand efforts to measure PSTs' performance when using mixed reality teaching simulations.
  4. In the absence of a gold standard for evaluating the quality of peer review, considerable attention has focused on studying reviewer agreement via inter-rater reliability (IRR), which can be thought of as the correlation between scores given by different reviewers to the same grant proposal. Noting that it is not uncommon for IRR in grant peer review studies to be estimated from some range-restricted subset of submissions, we use statistical methods and analysis of real peer review data to illustrate the behavior of such local IRR estimates when only fractions of top-quality proposal submissions are considered. We demonstrate that local IRR estimates are smaller than those obtained from all submissions and that zero local IRR estimates are quite plausible. We note that, from a measurement perspective, when reviewers are asked to differentiate among grant proposals across the whole range of submissions, only IRR measures that correspond to the complete range of submissions are warranted. We recommend against using local IRR estimates in those situations. Moreover, if review scores are intended to be used for differentiating among top proposals, we recommend that peer review administrators and researchers align review procedures with their intended measurement.
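The range-restriction effect described above is easy to reproduce by simulation. The sketch below uses invented parameter values rather than the paper's data: two raters share a common true-quality component, so their correlation is sizable over all submissions but shrinks sharply when computed only within the top slice.

```python
# Simulation sketch of range-restricted ("local") IRR. Two raters' scores share a
# common true-quality component; IRR is taken here as the Pearson correlation between
# raters. All parameter values are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_quality = rng.normal(size=n)
rater1 = true_quality + rng.normal(scale=1.0, size=n)
rater2 = true_quality + rng.normal(scale=1.0, size=n)

overall_irr = np.corrcoef(rater1, rater2)[0, 1]

# "Local" IRR among the top 10% of submissions by mean score.
mean_score = (rater1 + rater2) / 2
top = mean_score >= np.quantile(mean_score, 0.90)
local_irr = np.corrcoef(rater1[top], rater2[top])[0, 1]

print(f"IRR over all submissions:    {overall_irr:.2f}")  # about 0.5 under these settings
print(f"IRR within the top 10% only: {local_irr:.2f}")    # far smaller; can be near zero or negative
```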
  5. Human ratings are ubiquitous in creativity research. Yet the process of rating responses to creativity tasks, typically several hundred or even thousands of responses per rater, is often time consuming and expensive. Planned missing data designs, where raters rate only a subset of the total number of responses, have recently been proposed as one possible way to decrease overall rating time and monetary costs. However, researchers also need ratings that adhere to psychometric standards, such as a certain degree of reliability, and psychometric work on planned missing designs is currently lacking in the literature. In this work, we show how judge response theory and simulations can be used to fine-tune the planning of missing data designs. We provide open code for the community and illustrate our proposed approach with a cost-effectiveness calculation based on a realistic example. We clearly show that fine-tuning helps to save rating time and monetary costs while simultaneously targeting expected levels of reliability.
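In the spirit of the simulation-based planning described above, the sketch below is a generic Monte Carlo illustration (with invented parameters, not the paper's judge response theory machinery) of how many raters per response a planned missing design needs to reach a target reliability.

```python
# Planned-missing-design sketch: each creativity response is rated by only k of the
# available raters; reliability is summarized as the squared correlation between the
# k-rater mean and the simulated true quality. All parameters are invented.
import numpy as np

rng = np.random.default_rng(1)
n_responses, n_raters = 500, 10
true_quality = rng.normal(size=n_responses)
rater_severity = rng.normal(scale=0.5, size=n_raters)

def simulated_reliability(k: int, error_sd: float = 1.0) -> float:
    means = np.empty(n_responses)
    for i in range(n_responses):
        raters = rng.choice(n_raters, size=k, replace=False)  # planned missingness
        scores = true_quality[i] + rater_severity[raters] + rng.normal(scale=error_sd, size=k)
        means[i] = scores.mean()
    return np.corrcoef(means, true_quality)[0, 1] ** 2        # reliability of the k-rater mean

for k in (1, 2, 3, 5):
    print(f"{k} rater(s) per response -> reliability ~ {simulated_reliability(k):.2f}")
```

Varying k (and the assumed rater and error variances) in such a simulation lets a study team pick the cheapest design that still reaches the reliability they need.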