Title: When zero may not be zero: A cautionary note on the use of inter‐rater reliability in evaluating grant peer review
Abstract: Considerable attention has focused on studying reviewer agreement via inter-rater reliability (IRR) as a way to assess the quality of the peer review process. Inspired by a recent study that reported an IRR of zero in the mock peer review of top-quality grant proposals, we use real data from a complete range of submissions to the National Institutes of Health and to the American Institute of Biological Sciences to bring awareness to two important issues with using IRR for assessing peer review quality. First, we demonstrate that estimating local IRR from subsets of restricted-quality proposals will likely result in zero estimates under many scenarios. In both data sets, we find that zero local IRR estimates are more likely when subsets of top-quality proposals rather than bottom-quality proposals are considered. However, zero estimates from range-restricted data should not be interpreted as indicating arbitrariness in peer review. On the contrary, despite the different scoring scales used by the two agencies, when complete ranges of proposals are considered, IRR estimates are above 0.6, which indicates good reviewer agreement. Furthermore, we demonstrate that, with a small number of reviewers per proposal, zero estimates of IRR are possible even when the true value is not zero.
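To make the range-restriction and small-panel effects concrete, the following minimal simulation sketch (illustrative only, with assumed variance components, a panel of three reviewers per proposal, and a top-20% cut; it does not reproduce the paper's analysis) generates scores under a one-way random-effects model with a true IRR of 0.6 and tallies how often the estimated IRR is exactly zero for the full range of proposals versus the top-quality subset.

# Minimal simulation sketch (not the authors' analysis): how range restriction and
# small reviewer panels can push estimated inter-rater reliability (IRR) to zero
# even when the true value is well above zero.  All parameter values are assumed
# purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_proposals = 100      # proposals per simulated review round
n_reviewers = 3        # reviewers per proposal (small panels are typical)
sigma_b2 = 0.6         # between-proposal (true quality) variance
sigma_w2 = 0.4         # within-proposal (reviewer disagreement) variance
true_irr = sigma_b2 / (sigma_b2 + sigma_w2)   # = 0.6

def icc1(scores):
    """One-way ANOVA estimator of ICC(1); negative estimates truncated to zero."""
    n, k = scores.shape                                                # proposals x reviewers
    grand = scores.mean()
    msb = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)     # between-proposal mean square
    msw = ((scores - scores.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))  # within-proposal
    return max(0.0, (msb - msw) / (msb + (k - 1) * msw))

zero_full, zero_top = 0, 0
for _ in range(2000):
    quality = rng.normal(0.0, np.sqrt(sigma_b2), size=n_proposals)
    scores = quality[:, None] + rng.normal(0.0, np.sqrt(sigma_w2),
                                           size=(n_proposals, n_reviewers))
    top = scores.mean(axis=1) >= np.quantile(scores.mean(axis=1), 0.8)  # top 20% by mean score
    zero_full += icc1(scores) == 0.0
    zero_top += icc1(scores[top]) == 0.0

print(f"true IRR: {true_irr:.2f}")
print(f"zero IRR estimates, full range  : {zero_full / 2000:.1%}")
print(f"zero IRR estimates, top 20% only: {zero_top / 2000:.1%}")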
Award ID(s):
1759825
PAR ID:
10254055
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Journal of the Royal Statistical Society: Series A (Statistics in Society)
ISSN:
0964-1998
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In the absence of a gold standard for evaluating the quality of peer review, considerable attention has been focused on studying reviewer agreement via inter-rater reliability (IRR), which can be thought of as the correlation between scores given by different reviewers to the same grant proposal. Noting that it is not uncommon for IRR in grant peer review studies to be estimated from some range-restricted subset of submissions, we use statistical methods and analysis of real peer review data to illustrate the behavior of such local IRR estimates when only a fraction of top-quality proposal submissions is considered. We demonstrate that local IRR estimates are smaller than those obtained from all submissions and that zero local IRR estimates are quite plausible. We note that, from a measurement perspective, when reviewers are asked to differentiate among grant proposals across the whole range of submissions, only IRR measures that correspond to the complete range of submissions are warranted, and we recommend against using local IRR estimates in those situations. Moreover, if review scores are intended to be used for differentiating among top proposals, we recommend that peer review administrators and researchers align review procedures with their intended measurement.
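For reference, the "correlation between scores of different reviewers given to the same grant proposal" is usually formalized as an intraclass correlation from a one-way random-effects model; the formulation below uses standard textbook notation and is not necessarily the exact parameterization used in the paper.

% One-way random-effects model for the score of reviewer j on proposal i
% (standard notation, assumed here for illustration):
\[
  y_{ij} = \mu + b_i + \varepsilon_{ij},
  \qquad b_i \sim N(0, \sigma_b^2),
  \qquad \varepsilon_{ij} \sim N(0, \sigma_w^2).
\]
% IRR is the intraclass correlation: the correlation between two reviewers'
% scores on the same proposal, i.e. the share of score variance attributable
% to true differences between proposals.
\[
  \mathrm{IRR} = \operatorname{Corr}(y_{ij}, y_{ij'})
               = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2},
  \qquad j \neq j'.
\]
% Restricting attention to a top-quality subset shrinks \sigma_b^2 while leaving
% \sigma_w^2 essentially unchanged, which drives the local IRR toward zero.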
  2. Background: In many grant review settings, proposals are selected for funding on the basis of summary statistics of review ratings. Challenges of this approach (including ties and an unclear ordering of funding preference across proposals) could be mitigated if rankings such as top-k preferences or paired comparisons, which are local evaluations that enforce an ordering across proposals, were also collected and incorporated into the analysis of review ratings. However, until recently, ratings and rankings had not been analyzed simultaneously. This paper describes a practical method for integrating rankings and scores and demonstrates its usefulness for making funding decisions in real-world applications. Methods: We first present the application of our existing joint model for rankings and ratings, the Mallows-Binomial, to obtain an integrated score for each proposal and generate the induced preference ordering. We then apply this methodology to several theoretical "toy" examples of rating and ranking data, designed to demonstrate specific properties of the model. We also describe an innovative protocol for collecting rankings of the top six proposals as an add-on to typical peer review scoring procedures and provide a case study using actual peer review data to exemplify the output and how the model can appropriately resolve judges' evaluations. Results: For the theoretical examples, we show how the model can provide a preference order for equally rated proposals by incorporating rankings, for proposals with ratings and only partial rankings (and how the results differ from a ratings-only approach), and for proposals where judges provide internally inconsistent ratings/rankings or outlier scores. Finally, we discuss how, using real-world panel data, this method can provide information about funding priority with a level of accuracy and in a format well suited to research funding decisions. Conclusions: A methodology is provided to collect and employ both rating and ranking data in peer review assessments of proposal quality, highlighting several advantages over methods relying on ratings alone. This method leverages the available information to distill reviewer opinion into a useful output for making an informed funding decision and is general enough to be applied to settings such as the NIH panel review process.
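To illustrate the general idea of a joint model for ratings and rankings, the toy sketch below combines a Binomial likelihood for the ratings with a Mallows likelihood for the rankings and maximizes their sum by brute force. The miniature data set, the 0-9 rating scale, and the grid search are all assumptions made for this example; the sketch is not the estimation procedure or software described in the abstract above.

# Toy sketch of the idea behind joint rating/ranking models such as the
# Mallows-Binomial: ratings are modelled as Binomial counts around each
# proposal's quality parameter, rankings as a Mallows model centred on the
# ordering of those parameters, and the two log-likelihoods are added.
# The tiny data set and the brute-force grid search are invented for
# illustration; this is NOT the estimation procedure used in the paper.
import itertools
import math
import numpy as np

M = 9                                   # worst possible rating (0 = best); assumed scale
ratings = np.array([                    # rows = judges, columns = proposals A, B, C
    [2, 2, 6],
    [1, 3, 7],
    [2, 2, 5],
    [3, 1, 6],
])
rankings = [                            # each judge's full ranking, best proposal first
    (0, 1, 2),
    (1, 0, 2),
    (0, 1, 2),
    (0, 1, 2),
]
n_props = ratings.shape[1]

def kendall(r1, r2):
    """Number of proposal pairs ordered differently in the two rankings."""
    return sum((r1.index(i) - r1.index(j)) * (r2.index(i) - r2.index(j)) < 0
               for i, j in itertools.combinations(range(n_props), 2))

def log_mallows_normaliser(theta):
    """log Z(theta) for the Mallows model with Kendall distance on n_props items."""
    return sum(math.log(sum(math.exp(-theta * k) for k in range(j + 1)))
               for j in range(n_props))

def joint_loglik(p, theta, consensus):
    """Binomial log-likelihood of ratings plus Mallows log-likelihood of rankings."""
    ll = 0.0
    for judge in range(ratings.shape[0]):
        for prop in range(n_props):
            x = ratings[judge, prop]
            ll += (math.log(math.comb(M, x))
                   + x * math.log(p[prop]) + (M - x) * math.log(1.0 - p[prop]))
        ll += -theta * kendall(consensus, rankings[judge]) - log_mallows_normaliser(theta)
    return ll

# Brute-force search over quality parameters p and dispersion theta; the consensus
# ranking is the ordering implied by p (lower p = better, since 0 is the best rating).
best_ll, best_fit = -math.inf, None
for p in itertools.product(np.linspace(0.1, 0.9, 9), repeat=n_props):
    consensus = tuple(int(i) for i in np.argsort(p))
    for theta in np.linspace(0.1, 3.0, 15):
        ll = joint_loglik(p, theta, consensus)
        if ll > best_ll:
            best_ll = ll
            best_fit = (tuple(round(float(q), 2) for q in p), round(theta, 2), consensus)

print("quality parameters, dispersion, consensus order:", best_fit)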
  3. Ratings are present in many areas of assessment, including peer review of research proposals and journal articles, teacher observations, university admissions, and the selection of new hires. One feature present in any rating process with multiple raters is that different raters often assign different scores to the same assessee, with the potential for bias and inconsistencies related to rater or assessee covariates. This paper analyzes disparities in ratings of internal and external applicants to teaching positions using applicant data from Spokane Public Schools. We first test for biases in rating while accounting for measures of teacher applicant qualifications and quality. Then, we develop model-based inter-rater reliability (IRR) estimates that allow us to account for various sources of measurement error and the hierarchical structure of the data, and to test whether covariates, such as applicant status, moderate IRR. We find that applicants external to the district receive lower ratings for job applications compared to internal applicants. This gap in ratings remains significant even after including measures of qualifications and quality such as experience, state licensure scores, or estimated teacher value added. With model-based IRR, we further show that consistency between raters is significantly lower when rating external applicants. We conclude the paper by discussing policy implications and possible applications of our model-based IRR estimate for hiring and selection practices in and out of the teacher labor market.
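As a rough illustration of a model-based IRR, the sketch below fits a random-intercept model to simulated ratings nested within applicants and reports the share of rating variance attributable to applicants, computed separately for internal and external applicants. The simulated data, the statsmodels-based fit, and the per-group split are assumptions for this example rather than the hierarchical model or data used in the paper.

# Sketch of a model-based IRR: ratings nested within applicants are fit with a
# random-intercept model, and IRR is the share of rating variance attributable
# to applicants.  The simulated data and the simple split by applicant status
# are assumptions for illustration, not the model or data from the paper.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

def simulate(n_applicants, sigma_b, sigma_w, n_raters=3, label="internal"):
    """Ratings = applicant effect + rater noise; larger sigma_w means less rater agreement."""
    rows = []
    for a in range(n_applicants):
        quality = rng.normal(0.0, sigma_b)
        for _ in range(n_raters):
            rows.append({"applicant": f"{label}_{a}", "status": label,
                         "rating": quality + rng.normal(0.0, sigma_w)})
    return pd.DataFrame(rows)

# Assumed scenario: raters agree less about external applicants (larger within-applicant noise).
data = pd.concat([simulate(80, sigma_b=1.0, sigma_w=0.7, label="internal"),
                  simulate(80, sigma_b=1.0, sigma_w=1.3, label="external")],
                 ignore_index=True)

def model_based_irr(df):
    """Random-intercept model for applicants; IRR = applicant variance / total variance."""
    fit = smf.mixedlm("rating ~ 1", df, groups=df["applicant"]).fit()
    var_applicant = float(fit.cov_re.iloc[0, 0])   # between-applicant variance
    var_residual = float(fit.scale)                # within-applicant (rater) variance
    return var_applicant / (var_applicant + var_residual)

for status, group in data.groupby("status"):
    print(f"{status}: model-based IRR = {model_based_irr(group):.2f}")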
  4. This paper describes the Engineering Education Research (EER) Peer Review Training (PERT) project, which is designed to develop EER scholars’ peer review skills through mentored reviewing experiences. Supported by the National Science Foundation, the overall programmatic goals of the PERT project are to establish and evaluate a mentored reviewer program for 1) EER journal manuscripts and 2) EER grant proposals. Concurrently, the project seeks to explore how EER scholars develop schema for evaluating EER scholarship, whether these schema are shared in the community, and how schema influence recommendations made to journal editors during the peer review process. To accomplish these goals, the PERT project leveraged the previously established Journal of Engineering Education (JEE) Mentored Reviewer Program, where two researchers with little reviewing experience are paired with an experienced mentor to complete three manuscript reviews collaboratively. In this paper we report on focus group and exit survey findings from the JEE Mentored Reviewer Program and discuss revisions to the program in response to those findings. 