The explosion of conference paper submissions in AI and related fields has underscored the need to improve many aspects of the peer review process, especially the matching of papers and reviewers. Recent work argues that the key to improving this matching is to modify aspects of the bidding phase itself, to ensure that the set of bids over papers is balanced and, in particular, to avoid orphan papers, i.e., papers that receive no bids. In an attempt to understand and mitigate this problem, we have developed a flexible bidding platform to test adaptations to the bidding process. Using this platform, we performed a field experiment during the bidding phase of a medium-size international workshop that compared two bidding methods. We further examined, via controlled experiments on Amazon Mechanical Turk, various factors that affect bidding, in particular the order in which papers are presented [11, 17] and information on paper demand [33]. Our results suggest that several simple adaptations, which can be added to any existing platform, may significantly reduce the skew in bids, thereby improving the allocation for both reviewers and conference organizers.
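The abstract does not describe the specific adaptations tested. As one hypothetical illustration of the kind of lightweight, platform-agnostic change it refers to (the class names, scoring rule, and weights below are assumptions, not the authors' method), a bidding interface could reorder the papers shown to each reviewer so that papers with few bids so far surface earlier:

```python
# Hypothetical sketch of a demand-aware paper ordering for a bidding interface.
# Not the authors' implementation; all names and the scoring rule are illustrative.

from dataclasses import dataclass


@dataclass
class Paper:
    paper_id: str
    bid_count: int = 0       # bids received so far across all reviewers
    relevance: float = 0.0   # e.g., a text-similarity score for this reviewer


def order_for_reviewer(papers, demand_weight=1.0):
    """Sort papers so low-bid papers are promoted while relevant ones stay near the top.

    Lower key = shown earlier; demand_weight trades off bid balance against relevance.
    """
    return sorted(papers, key=lambda p: demand_weight * p.bid_count - p.relevance)


if __name__ == "__main__":
    pool = [
        Paper("P1", bid_count=9, relevance=0.8),
        Paper("P2", bid_count=0, relevance=0.5),  # orphan candidate, gets promoted
        Paper("P3", bid_count=3, relevance=0.9),
    ]
    print([p.paper_id for p in order_for_reviewer(pool)])  # ['P2', 'P3', 'P1']
```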
Design and analysis of the NIPS 2016 review process
                        
                    
    
Neural Information Processing Systems (NIPS) is a top-tier annual conference in machine learning. The 2016 edition of the conference comprised more than 2,400 paper submissions, 3,000 reviewers, and 8,000 attendees. This represents a growth of nearly 40% in terms of submissions, 96% in terms of reviewers, and over 100% in terms of attendees as compared to the previous year. The massive scale as well as the rapid growth of the conference calls for a thorough quality assessment of the peer-review process and novel means of improvement. In this paper, we analyze several aspects of the data collected during the review process, including an experiment investigating the efficacy of collecting ordinal rankings from reviewers. We make a number of key observations, provide suggestions that may be useful for subsequent conferences, and discuss open problems towards the goal of improving peer review.
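The paper's analysis is not reproduced here. As a minimal, generic illustration of how per-reviewer ordinal rankings can be aggregated (a simple Borda-style baseline, not the method studied in the NIPS 2016 analysis), consider the following sketch:

```python
# Minimal sketch: combine per-reviewer ordinal rankings of their assigned papers
# into Borda-style scores. Generic baseline for illustration only.

from collections import defaultdict


def borda_scores(rankings):
    """rankings maps reviewer -> papers ordered best-first; returns summed Borda points."""
    scores = defaultdict(float)
    for ranked_papers in rankings.values():
        n = len(ranked_papers)
        for position, paper in enumerate(ranked_papers):
            scores[paper] += n - 1 - position  # best paper gets n-1 points, worst gets 0
    return dict(scores)


if __name__ == "__main__":
    example = {
        "rev1": ["A", "B", "C"],  # reviewer 1 ranks A best
        "rev2": ["B", "A"],       # reviewers may rank different subsets
    }
    print(borda_scores(example))  # {'A': 2.0, 'B': 2.0, 'C': 0.0}
```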
- PAR ID: 10091688
- Date Published:
- Journal Name: Journal of Machine Learning Research
- Volume: 19
- Issue: 1
- ISSN: 1533-7928
- Page Range / eLocation ID: 1913-1946
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- In the absence of a gold standard for evaluating the quality of peer review, considerable attention has focused on studying reviewer agreement via inter-rater reliability (IRR), which can be thought of as the correlation between the scores that different reviewers give to the same grant proposal. Noting that it is not uncommon for IRR in grant peer review studies to be estimated from a range-restricted subset of submissions, we use statistical methods and analysis of real peer review data to illustrate the behavior of such local IRR estimates when only a fraction of top-quality proposal submissions is considered. We demonstrate that local IRR estimates are smaller than those obtained from all submissions and that zero local IRR estimates are quite plausible. We note that, from a measurement perspective, when reviewers are asked to differentiate among grant proposals across the whole range of submissions, only IRR measures that correspond to the complete range of submissions are warranted, and we recommend against using local IRR estimates in those situations. Moreover, if review scores are intended to be used for differentiating among top proposals, we recommend that peer review administrators and researchers align review procedures with their intended measurement. (A small illustrative simulation of this range-restriction effect appears after this list.)
- Many peer-review processes involve reviewers submitting their independent reviews, followed by a discussion among the reviewers of each paper. A common question among policymakers is whether the reviewers of a paper should be anonymous to each other during the discussion. We shed light on this question by conducting a randomized controlled trial at the 2022 Conference on Uncertainty in Artificial Intelligence (UAI), where reviewer discussions were conducted over a typed forum. We randomly split the reviewers and papers into two conditions: one with anonymous discussions and the other with non-anonymous discussions. We also conduct an anonymous survey of all reviewers to understand their experience and opinions. We compare the two conditions in terms of the amount of discussion, the influence of seniority on final decisions, politeness, and reviewers' self-reported experiences and preferences. Overall, this experiment finds small, significant differences favoring the anonymous discussion setup based on the evaluation criteria considered in this work.
- Modern machine learning and computer science conferences are experiencing a surge in the number of submissions that challenges the quality of peer review, as the number of competent reviewers is growing at a much slower rate. To curb this trend and reduce the burden on reviewers, several conferences have started encouraging or even requiring authors to declare the previous submission history of their papers. Such initiatives have been met with skepticism among authors, who raise concerns about a potential bias in reviewers' recommendations induced by this information. In this work, we investigate whether reviewers exhibit a bias caused by the knowledge that the submission under review was previously rejected at a similar venue, focusing on a population of novice reviewers who constitute a large fraction of the reviewer pool in leading machine learning and computer science conferences. We design and conduct a randomized controlled trial closely replicating the relevant components of the peer-review pipeline, with 133 reviewers (master's students, junior PhD students, and recent graduates of top US universities) writing reviews for 19 papers. The analysis reveals that reviewers indeed become negatively biased when they receive a signal that the paper is a resubmission, giving an overall score almost one point lower on a 10-point Likert item (Δ = -0.78, 95% CI = [-1.30, -0.24]) than reviewers who do not receive such a signal. Looking at specific criteria scores (originality, quality, clarity, and significance), we observe that novice reviewers tend to underrate quality the most. (A minimal sketch of estimating such an effect with a bootstrap confidence interval appears after this list.)
- Peer review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality. In this work, we leverage recently proposed policies that introduce randomness into peer-review assignment (in order to mitigate fraud) as a valuable opportunity to evaluate counterfactual assignment policies. Specifically, we exploit the fact that such randomized assignments provide a positive probability of observing the reviews of many assignment policies of interest. To address challenges in applying standard off-policy evaluation methods, such as violations of positivity, we introduce novel methods for partial identification based on monotonicity and Lipschitz smoothness assumptions for the mapping between reviewer-paper covariates and outcomes. We apply our methods to peer-review data from two computer science venues: the TPDP'21 workshop (95 papers and 35 reviewers) and the AAAI'22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality of changing weights in the assignment algorithm, e.g., weighting reviewers' bids vs. textual similarity (between the reviewer's past papers and the submission), and (ii) the "cost of randomization", capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization into the reviewer-paper assignment only marginally reduces review quality. Our methods for partial identification may be of independent interest, and our off-policy approach can likely find use in evaluating a broad class of algorithmic matching systems. (A simplified sketch of the underlying off-policy idea appears after this list.)
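For the inter-rater reliability (IRR) item above: a small simulation with a simple two-reviewer score model (illustrative only, not the paper's data or estimator) shows why a correlation computed only on top-scoring submissions shrinks toward zero or below.

```python
# Illustrative simulation of range restriction in inter-rater reliability.
# Two reviewers score each proposal as true quality plus independent noise;
# restricting to top-scoring proposals attenuates their correlation.

import numpy as np

rng = np.random.default_rng(0)
n = 5000
quality = rng.normal(size=n)                      # latent proposal quality
score1 = quality + rng.normal(scale=1.0, size=n)  # reviewer 1's score
score2 = quality + rng.normal(scale=1.0, size=n)  # reviewer 2's score

full_irr = np.corrcoef(score1, score2)[0, 1]      # IRR over the full range (about 0.5)

# "Local" IRR: restrict to the top 10% of proposals by mean observed score.
mean_score = (score1 + score2) / 2
top = mean_score >= np.quantile(mean_score, 0.9)
local_irr = np.corrcoef(score1[top], score2[top])[0, 1]

print(f"full-range IRR = {full_irr:.2f}, top-10% IRR = {local_irr:.2f}")
# The restricted estimate is much smaller, often near zero or even negative.
```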
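For the resubmission-bias item above: a minimal sketch (not the paper's exact analysis) of how such an effect and its confidence interval can be estimated from two groups of overall scores, using hypothetical data.

```python
# Minimal sketch: difference in mean overall score between reviewers who received
# the resubmission signal and those who did not, with a percentile bootstrap CI.
# Not the study's analysis pipeline; the example scores are hypothetical.

import numpy as np


def effect_with_ci(signal_scores, control_scores, n_boot=10_000, seed=0):
    """Difference in means (signal minus control) with a 95% bootstrap CI."""
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal_scores, dtype=float)
    control = np.asarray(control_scores, dtype=float)
    delta = signal.mean() - control.mean()
    boots = np.empty(n_boot)
    for b in range(n_boot):
        s = rng.choice(signal, size=signal.size, replace=True)
        c = rng.choice(control, size=control.size, replace=True)
        boots[b] = s.mean() - c.mean()
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return delta, (lo, hi)


if __name__ == "__main__":
    signal = [4, 5, 6, 5, 4, 6, 5, 3]   # told the paper is a resubmission (hypothetical)
    control = [6, 6, 7, 5, 6, 7, 5, 6]  # no such signal (hypothetical)
    delta, (lo, hi) = effect_with_ci(signal, control)
    print(f"Delta = {delta:.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```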
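For the off-policy evaluation item above: a simplified sketch of the underlying idea (not the paper's estimator). Under a randomized logging policy each reviewer-paper pair has a known assignment probability, so the mean review quality of an alternative deterministic assignment can be estimated by inverse-propensity weighting; target pairs the logging policy never assigns (positivity violations) can only be bounded by the quality scale. The function and data names below are illustrative assumptions.

```python
# Simplified sketch of off-policy evaluation of a reviewer-paper assignment,
# with worst-case (Manski-style) bounds for pairs that violate positivity.
# The paper tightens such bounds using monotonicity and Lipschitz assumptions.

def evaluate_policy(target_pairs, q, observed_quality, y_min=1.0, y_max=5.0):
    """Bound the mean review quality of a deterministic target assignment.

    target_pairs     : (reviewer, paper) pairs chosen by the target policy
    q                : dict (reviewer, paper) -> assignment probability under the
                       randomized logging policy
    observed_quality : dict (reviewer, paper) -> review quality for assigned pairs
    y_min, y_max     : range of the quality scale, used to bound unobservable pairs
    """
    lower_terms, upper_terms = [], []
    for pair in target_pairs:
        prob = q.get(pair, 0.0)
        if prob > 0.0:
            # Horvitz-Thompson term: an observed pair is up-weighted by 1/prob;
            # a pair that could have been assigned but was not contributes 0.
            y = observed_quality.get(pair)
            term = y / prob if y is not None else 0.0
            lower_terms.append(term)
            upper_terms.append(term)
        else:
            # Positivity violation: never assigned by the logging policy, so the
            # quality is bounded only by the scale's range.
            lower_terms.append(y_min)
            upper_terms.append(y_max)
    n = len(target_pairs)
    return sum(lower_terms) / n, sum(upper_terms) / n


if __name__ == "__main__":
    q = {("r1", "pA"): 0.5, ("r2", "pB"): 0.5}           # logging probabilities
    observed = {("r1", "pA"): 4.0}                        # only ("r1", "pA") was assigned
    target = [("r1", "pA"), ("r2", "pB"), ("r3", "pC")]   # ("r3", "pC") violates positivity
    print(evaluate_policy(target, q, observed))           # (3.0, 4.33...)
```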