Title: Who Reviews The Reviewers? A Multi-Level Jury Problem
We investigate the problem of determining a binary ground truth using advice from a group of independent reviewers (experts), each of whom reports the ground truth correctly with some independent probability (competence) p_i. In this setting, when all reviewers are competent, i.e., p_i >= 0.5, the Condorcet Jury Theorem tells us that adding more reviewers increases the overall accuracy, and if all p_i's are known, then there exists an optimal weighting of the reviewers. However, in practical settings reviewers may be noisy or incompetent, i.e., p_i < 0.5, and the number of experts may be small, so the asymptotic Condorcet Jury Theorem is not practically relevant. In such cases we explore appointing one or more chairs (judges) who determine the weight of each reviewer for aggregation, creating multiple levels. However, these chairs may be unable to correctly identify the competence of the reviewers they oversee, and therefore unable to compute the optimal weighting. We give conditions under which a set of chairs is able to weight the reviewers optimally and, depending on the competence distribution of the agents, give results about when it is better to have more chairs or more reviewers. Through numerical simulations we show that in some cases it is better to have more chairs, but in many cases it is better to have more reviewers.
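For independent experts with exactly known competences, the optimal weighting referenced above is the classic log-odds rule, w_i proportional to log(p_i / (1 - p_i)) (Nitzan-Paroush). The following is a minimal Python sketch of weighted-majority aggregation under that rule; it illustrates the standard result, not the paper's own code, and the competence values are made up.

```python
import numpy as np

def optimal_weights(p):
    """Log-odds weights: optimal for independent experts whose
    competences p_i are known exactly (the Nitzan-Paroush rule)."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def weighted_majority(votes, weights):
    """Aggregate votes in {-1, +1}; ties broken toward +1."""
    return 1 if votes @ weights >= 0 else -1

rng = np.random.default_rng(0)
p = np.array([0.9, 0.6, 0.55, 0.4])  # one reviewer is worse than chance
w = optimal_weights(p)               # p_i < 0.5 yields a negative weight

truth, trials, correct = 1, 10_000, 0
for _ in range(trials):
    # each reviewer reports the truth independently with probability p_i
    votes = np.where(rng.random(p.size) < p, truth, -truth)
    correct += weighted_majority(votes, w) == truth
print(f"accuracy of the optimally weighted majority: {correct / trials:.3f}")
```

Note how the reviewer with p_i < 0.5 receives a negative weight, i.e., the rule bets against them; correctly assigning such weights is exactly what a chair who misjudges competences can fail to do.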
Award ID(s):
2134857
PAR ID:
10480654
Author(s) / Creator(s):
; ;
Publisher / Repository:
arXiv Preprint Archive
Date Published:
Journal Name:
arXiv Preprint Archive
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The law expects jurors to weigh the facts and evidence of a case to inform the decision with which they are charged. However, evidence in legal cases is becoming increasingly complicated, and studies have raised questions about laypeople's ability to understand and use complex evidence to inform decisions. Compared to other studies that have examined general evidence comprehension and expert credibility (e.g., Schweitzer & Saks, 2012), this experimental study investigated whether jurors can appropriately weigh strong vs. weak DNA evidence without special assistance. That is, without help to understand when DNA evidence is relatively weak, are jurors sensitive to the strength of weak DNA evidence as compared to strong DNA evidence? Responses from jury-eligible participants (N=346) were collected from Amazon Mechanical Turk (MTurk). Participants were presented with a summary of a robbery case before completing a short questionnaire on verdict preference and evidence comprehension. (Data are from the pilot of experiment 2 for the grant project.) We hypothesized participants would not be able to distinguish high- from low-quality DNA evidence. We analyzed the data using Bayes Factors, which allow for directly testing the null hypothesis (Zyphur & Oswald, 2013). A Bayes Factor of 4-8 (depending on the priors used) was found supporting the null for participants' ratings of low- vs. high-quality scientific evidence. A Bayes Factor of 4 in favor of the null means the data are four times as likely under the null as under the alternative hypothesis. Participants tended to rate the DNA evidence as "high quality" no matter the condition they were in. The Bayes Factor of 4-8 in this case gives good reason to believe that jury members are unable to discern what constitutes low-quality DNA evidence without assistance. If jurors are unable to distinguish between different qualities of evidence, or are unaware that they may have to, they could give greater weight to low-quality scientific evidence than is warranted. The current study supports the hypothesis that jurors have trouble distinguishing between complicated high- vs. low-quality evidence without help. Further attempts will be made to discover ways of presenting DNA evidence that could better calibrate jurors in their decisions. These future directions involve larger samples of jury-eligible participants who will complete the study in person. Instead of reading about the evidence, they will watch a filmed mock jury trial. The plan also involves jury deliberation, which will provide additional knowledge about how jurors reach conclusions as a group about evidence of different quality. Acknowledging the potential issues in jury trials and working to solve these problems is a vital step in improving our justice system.
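As a concrete, hypothetical illustration of the Bayes Factor logic above: for a binomial outcome (say, the number of participants rating evidence as "high quality"), BF_01 comparing a point null against a uniform prior on the rate has a closed form. The sketch below uses made-up counts, not the study's data.

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.stats import binom

def bf01_binomial(k, n, theta0=0.5):
    """Bayes factor BF_01 for a binomial point null theta = theta0
    against a uniform Beta(1, 1) prior on theta under H1.
    BF_01 = 4 means the data are 4x as likely under H0 as under H1."""
    log_m0 = binom.logpmf(k, n, theta0)                    # p(data | H0)
    log_choose = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    log_m1 = log_choose + betaln(k + 1, n - k + 1)         # p(data | H1)
    return float(np.exp(log_m0 - log_m1))

# hypothetical: 52 of 100 "high quality" ratings where H0 predicts 50%
print(bf01_binomial(52, 100))  # ~7.4, i.e., the data favor the null
```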
  2. Aggregating signals from a collection of noisy sources is a fundamental problem in many domains, including crowd-sourcing, multi-agent planning, sensor networks, signal processing, voting, ensemble learning, and federated learning. The core question is how to aggregate signals from multiple sources (e.g., experts) in order to reveal an underlying ground truth. While a full answer depends on the type of signal, the correlation of signals, and the desired output, a problem common to all of these applications is that of differentiating sources based on their quality and weighting them accordingly. It is often assumed that this differentiation and aggregation is done by a single, accurate central mechanism or agent (e.g., a judge). We complicate this model in two ways. First, we investigate both the setting with a single judge and the setting with multiple judges. Second, given this multi-agent interaction of judges, we investigate various constraints on the judges' reporting space. We build on known results for the optimal weighting of experts and prove that an ensemble of sub-optimal mechanisms can perform optimally under certain conditions. We then show empirically that the ensemble approximates the performance of the optimal mechanism under a broader range of conditions.
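One plausible reading of the sub-optimal-judges setting, sketched below under loudly stated assumptions: each judge estimates the experts' competences from a small labeled sample, applies log-odds weights to those noisy estimates, and the ensemble averages the judges' weighted-vote scores. This is not the paper's exact mechanism; it only illustrates why several noisy judges can approach the performance of the optimal weighting.

```python
import numpy as np

rng = np.random.default_rng(1)

def judge_weights(p_true, n_obs, rng):
    """A sub-optimal judge: estimate each expert's competence from
    n_obs labeled votes, then apply log-odds weights to the estimate."""
    p_hat = (rng.binomial(n_obs, p_true) + 1) / (n_obs + 2)  # smoothed
    return np.log(p_hat / (1 - p_hat))

def accuracy(weight_sets, p_true, trials, rng):
    """Decide by averaging the judges' weighted-vote scores; truth = +1."""
    correct = 0
    for _ in range(trials):
        votes = np.where(rng.random(p_true.size) < p_true, 1, -1)
        correct += np.mean([votes @ w for w in weight_sets]) > 0
    return correct / trials

p = rng.uniform(0.3, 0.9, size=9)                 # expert competences
single = [judge_weights(p, 20, rng)]              # one noisy judge
ensemble = [judge_weights(p, 20, rng) for _ in range(7)]
optimal = [np.log(p / (1 - p))]                   # knows p exactly

for name, ws in [("single judge", single), ("ensemble of 7", ensemble),
                 ("optimal", optimal)]:
    print(f"{name:>13}: {accuracy(ws, p, 20_000, rng):.3f}")
```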
  3. Peer review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality. In this work, we leverage recently proposed policies that introduce randomness in peer-review assignment (in order to mitigate fraud) as a valuable opportunity to evaluate counterfactual assignment policies. Specifically, we exploit how such randomized assignments provide a positive probability of observing the reviews of many assignment policies of interest. To address challenges in applying standard off-policy evaluation methods, such as violations of positivity, we introduce novel methods for partial identification based on monotonicity and Lipschitz smoothness assumptions for the mapping between reviewer-paper covariates and outcomes. We apply our methods to peer-review data from two computer science venues: the TPDP'21 workshop (95 papers and 35 reviewers) and the AAAI'22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality when changing weights in the assignment algorithm, e.g., weighting reviewers' bids vs. textual similarity (between the reviewer's past papers and the submission), and (ii) the "cost of randomization", capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization in the reviewer-paper assignment only marginally reduces the review quality. Our methods for partial identification may be of independent interest, while our off-policy approach can likely find use in evaluating a broad class of algorithmic matching systems.
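For intuition, the standard off-policy estimator the authors move beyond is inverse propensity scoring (IPS): reweight each observed review by the ratio of the target policy's assignment probability to the logging policy's. A minimal sketch with made-up numbers follows; IPS requires positivity (the logging policy must give every target assignment a positive probability), which is exactly the condition whose violation motivates the paper's partial-identification methods.

```python
import numpy as np

def ips_estimate(quality, p_logging, p_target):
    """Inverse propensity scoring: estimate of mean review quality
    under a counterfactual assignment policy, assuming positivity,
    i.e., p_logging > 0 wherever p_target > 0."""
    return float(np.mean(quality * p_target / p_logging))

# toy data: observed quality of five reviewer-paper pairs, with the
# randomized logging policy's and the counterfactual policy's
# probabilities of making each assignment
quality = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
p_log = np.array([0.5, 0.4, 0.5, 0.3, 0.4])
p_tgt = np.array([0.6, 0.2, 0.7, 0.1, 0.5])
print(ips_estimate(quality, p_log, p_tgt))  # quality under the target policy
```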
  4. Bringmann, Karl ; Grohe, Martin ; Puppis, Gabriele ; Svensson, Ola (Ed.)
    We revisit the noisy binary search model of [Karp and Kleinberg, 2007], in which we have n coins with unknown probabilities p_i that we can flip. The coins are sorted by increasing p_i, and we would like to find where the probabilities cross a target value τ, to within ε. This generalized the fixed-noise model of [Burnashev and Zigangirov, 1974], in which p_i = 1/2 ± ε, to a setting where coins near the target may be indistinguishable from it. It was shown in [Karp and Kleinberg, 2007] that Θ(1/ε² log n) samples are necessary and sufficient for this task. We produce a practical algorithm by solving two theoretical challenges: high-probability behavior and sharp constants. We give an algorithm that succeeds with probability 1-δ using 1/C_{τ, ε} ⋅ (log₂ n + O(log^{2/3} n log^{1/3}(1/δ) + log(1/δ))) samples, where C_{τ, ε} is the optimal such constant achievable. For δ > n^{-o(1)} this is within 1 + o(1) of optimal, and for δ ≪ 1 it is the first bound within a constant factor of optimal.
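A naive version of noisy binary search, far from the sharp constants this paper achieves, simply resamples the middle coin until Hoeffding's inequality puts its empirical mean on the correct side of τ with high probability; this already costs O(1/ε² · log n · log(log n / δ)) flips rather than the optimal bound. A sketch under those assumptions:

```python
import math
import random

def noisy_binary_search(flip, n, tau, eps, delta):
    """Find an index i with p_i within roughly eps of tau, assuming
    p_i is nondecreasing in i; flip(i) draws one Bernoulli(p_i) sample.

    At each of the ~log2(n) levels, sample the middle coin enough times
    that its empirical mean lands on the correct side of tau with
    probability 1 - delta/levels (by Hoeffding's inequality)."""
    levels = max(1, math.ceil(math.log2(n)))
    m = math.ceil(math.log(2 * levels / delta) / (2 * eps ** 2))
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        p_hat = sum(flip(mid) for _ in range(m)) / m
        if p_hat < tau:
            lo = mid + 1   # crossing point lies to the right
        else:
            hi = mid       # crossing point lies at or left of mid
    return lo

# usage: 1000 coins with p_i rising from 0.1 to 0.9, target tau = 0.5
ps = [0.1 + 0.8 * i / 999 for i in range(1000)]
idx = noisy_binary_search(lambda i: random.random() < ps[i],
                          1000, tau=0.5, eps=0.05, delta=0.1)
print(idx, ps[idx])
```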
  5. Abstract

    A frequent complaint of editors of scientific journals is that it has become increasingly difficult to find reviewers for evaluating submitted manuscripts. Such claims are most commonly based on anecdotal evidence. To gain insight grounded in empirical evidence, editorial data on manuscripts submitted for publication to the Journal of Comparative Physiology A between 2014 and 2021 were analyzed. No evidence was found that more invitations were necessary over time to get manuscripts reviewed; that reviewers' response times after invitation increased; that the number of reviewers who completed their reports, relative to the number of reviewers who had agreed to review a manuscript, decreased; or that the recommendation behavior of reviewers changed. The only significant trend observed was among reviewers who completed their reports later than agreed: the average delay, in days, with which these reviewers submitted their evaluations roughly doubled over the period analyzed. By contrast, neither the proportion of late vs. early reviews nor the time for completing the reviews among the punctual reviewers changed. Comparison with editorial data from other journals suggests that journals that serve a smaller community of readers and authors, and whose editors themselves contact potential reviewers, perform better in terms of reviewer recruitment and performance than journals that receive large numbers of submissions and use editorial assistants to send invitations to potential reviewers.
