Who Reviews The Reviewers? A Multi-Level Jury Problem
We consider the problem of determining a binary ground truth using advice from a group of independent reviewers (experts) who express their guess about the ground truth correctly with some independent probability (competence) $$p_i$$. In this setting, when all reviewers are competent, i.e., $$p_i \geq 0.5$$, the Condorcet Jury Theorem tells us that adding more reviewers increases the overall accuracy, and if all $$p_i$$'s are known, then there exists an optimal weighting of the reviewers. However, in practical settings, reviewers may be noisy or incompetent, i.e., $$p_i \leq 0.5$$, and the number of experts may be small, so the asymptotic Condorcet Jury Theorem is not practically relevant. In such cases we explore appointing one or more chairs (judges) who determine the weight of each reviewer for aggregation, creating multiple levels. These chairs, however, may be unable to correctly identify the competence of the reviewers they oversee, and therefore unable to compute the optimal weighting. We give conditions under which a set of chairs is able to weight the reviewers optimally, and, depending on the competence distribution of the agents, give results about when it is better to have more chairs or more reviewers. Through simulations we show that in some cases it is better to have more chairs, but in many cases it is better to have more reviewers.
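For readers unfamiliar with the weighting result the abstract refers to: when the competences $$p_i$$ are known, the classical optimal rule weights each reviewer by the log-odds $$\log\frac{p_i}{1-p_i}$$, so reviewers with $$p_i < 0.5$$ receive negative weight. The following Monte Carlo sketch, using made-up competences and a made-up panel size rather than anything from the paper, compares plain majority voting against that weighting when some reviewers are incompetent.

```python
# Monte Carlo sketch (illustrative only): plain majority voting vs. the
# classical log-odds weighting when some reviewers fall below p_i = 0.5.
# The competences and panel size are made-up numbers, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def accuracy(competences, weights, trials=20_000):
    """Estimate the probability that the weighted vote matches the ground truth."""
    p = np.asarray(competences, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Take the ground truth to be +1; reviewer i votes +1 with probability p_i.
    votes = np.where(rng.random((trials, p.size)) < p, 1.0, -1.0)
    scores = votes @ w
    ties = scores == 0
    scores[ties] = rng.choice([-1.0, 1.0], size=int(ties.sum()))  # break ties at random
    return float(np.mean(scores > 0))

# A small, partly incompetent panel (assumed values).
competences = np.array([0.9, 0.7, 0.6, 0.45, 0.4])

uniform_weights = np.ones_like(competences)                 # plain majority vote
log_odds_weights = np.log(competences / (1 - competences))  # optimal when p_i are known

print("plain majority  :", accuracy(competences, uniform_weights))
print("log-odds weights:", accuracy(competences, log_odds_weights))
```

With these assumed competences the log-odds rule is noticeably more accurate than the unweighted vote, which is why the paper asks whether chairs can recover (something close to) these weights when the $$p_i$$'s are not known.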
- Award ID(s): 2134857
- PAR ID: 10564622
- Publisher / Repository: The International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS)
- Date Published:
- Subject(s) / Keyword(s): Peer Review, Peer Selection, Jury Theorem
- Format(s): Medium: X
- Location: Detroit, USA
- Sponsoring Org: National Science Foundation
More Like this
-
The law expects jurors to weigh the facts and evidence of a case to inform the decision with which they are charged. However, evidence in legal cases is becoming increasingly complicated, and studies have raised questions about laypeople's abilities to understand and use complex evidence to inform decisions. Compared to other studies that have looked at general evidence comprehension and expert credibility (e.g. Schweitzer & Saks, 2012), this experimental study investigated whether jurors can appropriately weigh strong vs. weak DNA evidence without special assistance. That is, without help to understand when DNA evidence is relatively weak, are jurors sensitive to the strength of weak DNA evidence as compared to strong DNA evidence? Responses from jury-eligible participants (N=346) were collected from Amazon Mechanical Turk (MTurk). Participants were presented with a summary of a robbery case before being asked a short questionnaire related to verdict preference and evidence comprehension. (Data are from the pilot of experiment 2 for the grant project.) We hypothesized participants would not be able to distinguish high- from low-quality DNA evidence. We analyzed the data using Bayes Factors, which allow for directly testing the null hypothesis (Zyphur & Oswald, 2013). A Bayes Factor of 4-8 (depending on the priors used) was found supporting the null for participants' rating of low vs. high quality scientific evidence. A Bayes Factor of 4 means the data are four times as likely under the null as under the alternative hypothesis. Participants tended to rate the DNA evidence as "high quality" no matter the condition they were in. The Bayes Factor of 4-8 in this case gives good reason to believe that jury members are unable to discern what constitutes low quality DNA evidence without assistance. If jurors are unable to distinguish between different qualities of evidence, or if they are unaware that they may have to, they could give greater weight to low quality scientific evidence than is warranted. The current study supports the hypothesis that jurors have trouble distinguishing between complicated high vs. low quality evidence without help. Further attempts will be made to discover ways of presenting DNA evidence that could better calibrate jurors in their decisions. These future directions involve larger sample sizes in which jury-eligible participants will complete the study in person. Instead of reading about the evidence, they will watch a filmed mock jury trial. This plan also involves jury deliberation, which will provide additional knowledge about how jurors come to conclusions as a group about different qualities of evidence. Acknowledging the potential issues in jury trials and working to solve these problems is a vital step in improving our justice system. (A minimal sketch of this kind of Bayes-factor comparison appears after this list.)
-
Sequential learning models situations where agents predict a ground truth in sequence, using their private, noisy measurements and the predictions of agents who came earlier in the sequence. We study sequential learning in a social network, where agents only see the actions of the previous agents in their own neighborhood. The fraction of agents who predict the ground truth correctly depends heavily on both the network topology and the ordering in which the predictions are made. A natural question is, for a given network, to find an ordering that maximizes the (expected) number of agents who predict the ground truth correctly. In this paper, we show that it is in fact NP-hard to answer this question for a general network, under both the Bayesian learning model and a simple majority rule model. Finally, we show that even approximating the answer is hard. (A toy simulation of this ordering question appears after this list.)
-
Peer review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality. In this work, we leverage recently proposed policies that introduce randomness in peer-review assignment (in order to mitigate fraud) as a valuable opportunity to evaluate counterfactual assignment policies. Specifically, we exploit how such randomized assignments provide a positive probability of observing the reviews of many assignment policies of interest. To address challenges in applying standard off-policy evaluation methods, such as violations of positivity, we introduce novel methods for partial identification based on monotonicity and Lipschitz smoothness assumptions for the mapping between reviewer-paper covariates and outcomes. We apply our methods to peer-review data from two computer science venues: the TPDP'21 workshop (95 papers and 35 reviewers) and the AAAI'22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality when changing weights in the assignment algorithm, e.g., weighting reviewers' bids vs. textual similarity (between the reviewer's past papers and the submission), and (ii) the "cost of randomization", capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization in the reviewer-paper assignment only marginally reduces the review quality. Our methods for partial identification may be of independent interest, while our off-policy approach can likely find use in evaluating a broad class of algorithmic matching systems.
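The off-policy idea in the preceding abstract can be illustrated with a generic inverse-propensity-scoring (IPS) estimator: reviews logged under a randomized assignment policy are reweighted by the ratio of the target policy's assignment probability to the logging policy's, and pairs the logging policy could never have assigned are exactly where positivity fails and the paper's partial-identification bounds would take over. The sketch below is a standard IPS estimate on made-up data; it is not the authors' method, data, or code.

```python
# Generic inverse-propensity-scoring (IPS) sketch for evaluating a
# counterfactual reviewer-paper assignment policy from logged, randomized
# assignments. All data and policies below are made up for illustration.
import numpy as np

rng = np.random.default_rng(2)

n_logged = 500
# Logging policy: probability that each observed reviewer-paper pair was assigned.
logging_prob = rng.uniform(0.05, 0.5, size=n_logged)
# Target policy: probability the counterfactual policy would assign the same pair.
target_prob = rng.uniform(0.0, 0.6, size=n_logged)
# Observed review quality for each logged pair (e.g., a 1-5 usefulness score).
quality = rng.integers(1, 6, size=n_logged).astype(float)

# IPS estimate of the target policy's mean review quality.
weights = target_prob / logging_prob
ips_estimate = np.mean(weights * quality)

# Pairs the target policy wants but the logging policy essentially never
# produces cannot be reweighted reliably -- this is where assumptions such as
# monotonicity or Lipschitz smoothness (partial identification) take over.
print("IPS estimate of mean quality:", round(ips_estimate, 2))
print("fraction of weights above 5 (high-variance region):",
      float(np.mean(weights > 5)))
```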
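For the sequential-learning item earlier in this list, a toy version of the ordering question can be simulated directly: under a simple majority-rule model, brute-force search over orderings of a five-agent network finds the one with the highest expected number of correct predictions. Everything below (the network, the signal accuracy, the tie-breaking rule) is an illustrative assumption, and brute force only works because the example is tiny; the abstract's point is that the general problem is NP-hard even to approximate.

```python
# Toy illustration of choosing a prediction ordering in a social network
# under a simple majority-rule model (all parameters are assumptions).
from itertools import permutations
import random

random.seed(0)

# Small undirected network: agent -> set of neighbors.
NEIGHBORS = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4}, 4: {3}}
SIGNAL_ACCURACY = 0.7   # each private signal matches the ground truth w.p. 0.7
TRIALS = 2000

def run_once(order):
    """Return how many agents predict the (WLOG = 1) ground truth correctly."""
    predictions = {}
    for agent in order:
        signal = 1 if random.random() < SIGNAL_ACCURACY else 0
        # Observed predictions of earlier agents in this agent's neighborhood.
        seen = [predictions[v] for v in NEIGHBORS[agent] if v in predictions]
        votes = seen + [signal]
        ones = sum(votes)
        if 2 * ones > len(votes):
            predictions[agent] = 1
        elif 2 * ones < len(votes):
            predictions[agent] = 0
        else:                      # tie: follow own signal
            predictions[agent] = signal
    return sum(predictions.values())

def expected_correct(order):
    return sum(run_once(order) for _ in range(TRIALS)) / TRIALS

# Brute force over all 5! orderings (only feasible for a toy network).
best = max(permutations(NEIGHBORS), key=expected_correct)
print("best ordering found:", best,
      "expected correct:", round(expected_correct(best), 2))
```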
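Finally, for the mock-juror item above, the following sketch shows one common way a Bayes factor favoring the null can be computed, using the BIC approximation (Wagenmakers, 2007) to compare a single-mean model against a two-means model of the quality ratings. The data here are simulated stand-ins with roughly the study's sample size, not the study's data, and the BIC route is only one of several ways such a Bayes factor might have been obtained.

```python
# Illustrative sketch of a BIC-based Bayes factor for a null vs. alternative
# model of quality ratings; simulated data, NOT the study's data.
import numpy as np

rng = np.random.default_rng(1)

# Fake ratings for low- and high-quality DNA evidence conditions, drawn so
# the two conditions barely differ (consistent with a null result).
low = rng.normal(5.6, 1.2, size=170)
high = rng.normal(5.7, 1.2, size=176)

def bic_gaussian(residuals, k, n):
    """BIC for a Gaussian model with k estimated mean parameters."""
    sigma2 = np.mean(residuals ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + (k + 1) * np.log(n)  # +1 for the variance parameter

y = np.concatenate([low, high])
n = y.size

# H0: one common mean. H1: separate means per condition.
bic0 = bic_gaussian(y - y.mean(), k=1, n=n)
resid1 = np.concatenate([low - low.mean(), high - high.mean()])
bic1 = bic_gaussian(resid1, k=2, n=n)

# BF_01 > 1 favors the null (Wagenmakers, 2007 approximation).
bf01 = np.exp((bic1 - bic0) / 2)
print("approximate BF01:", round(bf01, 2))
```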