Title: A Multidimensional Pairwise Comparison Model for Heterogeneous Perceptions with an Application to Modelling the Perceived Truthfulness of Public Statements on COVID-19
Abstract: Pairwise comparison models are an important type of latent attribute measurement model with broad applications in the social and behavioural sciences. Current pairwise comparison models are typically unidimensional, and the existing multidimensional models tend to be difficult to interpret and unable to identify groups of raters that share the same rater-specific parameters. To fill this gap, we propose a new multidimensional pairwise comparison model with enhanced interpretability, which explicitly models how object attributes on different dimensions are differentially perceived by raters. Moreover, we place a Dirichlet process prior on the rater-specific parameters, which allows us to flexibly cluster raters into groups with similar perceptual orientations. Simulation studies show that the new model recovers the true latent variable values from the observed binary choice data. We use the new model to analyse original survey data on the perceived truthfulness of statements about COVID-19 collected in the summer of 2020. Leveraging the strengths of the new model, we find that the partisanship of the speaker and the partisanship of the respondent account for the majority of the variation in perceived truthfulness, with statements made by co-partisans viewed as more truthful.
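The choice structure the abstract describes can be made concrete with a short simulation. The sketch below is our own illustration, not code from the paper: it assumes a logistic link on rater-weighted attribute differences, and all names (theta, w, choice_prob) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the names here are assumptions, not the paper's notation.
n_raters, n_objects, n_dims = 50, 20, 2

theta = rng.normal(size=(n_objects, n_dims))  # latent object attributes
# Rater-specific perception weights: how strongly each rater attends to
# each dimension. In the paper these receive a Dirichlet process prior,
# which clusters raters into groups with similar perceptual orientations.
w = rng.normal(size=(n_raters, n_dims))

def choice_prob(i, j, k):
    """P(rater i judges object j as more truthful than object k):
    a logistic link on the rater-weighted attribute difference."""
    util = w[i] @ (theta[j] - theta[k])
    return 1.0 / (1.0 + np.exp(-util))

# Simulate one binary comparison: rater 3 compares objects 5 and 9.
p = choice_prob(3, 5, 9)
y = rng.binomial(1, p)
```

Under this structure, raters with proportional (positively scaled) weight vectors perceive the same ordering of objects, which is what makes clustering the weight vectors meaningful.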
Award ID(s): 1762420
PAR ID: 10400103
Author(s) / Creator(s):
Publisher / Repository: Oxford University Press
Date Published:
Journal Name: Journal of the Royal Statistical Society Series A: Statistics in Society
Volume: 185
Issue: 3
ISSN: 0964-1998
Format(s): Medium: X
Size(s): p. 1049-1073
Sponsoring Org: National Science Foundation
More Like this
1. Abstract: Many large-scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Such multidimensional tests, or more specifically simple structured tests, rely on multiple sections of multiple-choice and/or constructed-response items to generate multiple scores. In this article, we propose an extension of the hierarchical rater model (HRM) for simple structured tests with constructed-response items. In addition to modeling the appropriate trait structure, the multidimensional HRM (M-HRM) presented here also accounts for rater severity bias and rater variability or inconsistency. We introduce the model formulation, test parameter recovery with a focus on latent traits, and compare the M-HRM to other scoring approaches (unidimensional HRMs and a traditional multidimensional item response theory model) using simulated and empirical data. Results show more precise scores under the M-HRM, with a major improvement when rater effects are incorporated rather than ignored as in the traditional multidimensional item response theory model.
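To fix ideas, here is a minimal, deliberately simplified simulation of the rater effects this abstract names: severity bias shifts a rater's scores, and rater-specific noise captures inconsistency. It is a continuous stand-in for the HRM's ordinal machinery, and every name in it is our assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# A continuous caricature of an HRM-style signal chain (our illustration):
# a latent trait drives an errorless "ideal" score, and each rater's
# observed score is shifted by that rater's severity and blurred by
# that rater's inconsistency.
n_examinees, n_raters = 100, 4
theta = rng.normal(size=n_examinees)             # latent trait
severity = rng.normal(0.0, 0.3, size=n_raters)   # per-rater severity bias
sd_rater = rng.uniform(0.2, 0.6, size=n_raters)  # per-rater inconsistency

ideal = 2.0 + 1.5 * theta                        # errorless item score
observed = np.array([
    ideal - severity[r] + rng.normal(0.0, sd_rater[r], size=n_examinees)
    for r in range(n_raters)
])  # shape (n_raters, n_examinees): what each rater would report
```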
2. Ratings are present in many areas of assessment, including peer review of research proposals and journal articles, teacher observations, university admissions, and selection of new hires. One feature of any rating process with multiple raters is that different raters often assign different scores to the same assessee, with the potential for bias and inconsistencies related to rater or assessee covariates. This paper analyzes disparities in ratings of internal and external applicants to teaching positions using applicant data from Spokane Public Schools. We first test for biases in rating while accounting for measures of teacher applicant qualifications and quality. We then develop model-based inter-rater reliability (IRR) estimates that account for various sources of measurement error and the hierarchical structure of the data, and that allow us to test whether covariates, such as applicant status, moderate IRR. We find that applicants external to the district receive lower ratings on job applications than internal applicants. This gap remains significant even after including measures of qualifications and quality such as experience, state licensure scores, or estimated teacher value added. With model-based IRR, we further show that consistency between raters is significantly lower when rating external applicants. We conclude by discussing policy implications and possible applications of our model-based IRR estimate for hiring and selection practices in and out of the teacher labor market.
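A variance-components reading of model-based IRR can be sketched in a few lines. This is a generic illustration under assumed variance components, not the authors' specification: ratings decompose into applicant, rater, and residual effects, and IRR is the applicant share of total variance.

```python
import numpy as np

rng = np.random.default_rng(2)

# rating[a, r] = applicant effect + rater effect + noise (our toy model).
n_applicants, n_raters = 200, 5
var_a, var_r, var_e = 1.0, 0.25, 0.5  # assumed variance components

a = rng.normal(0.0, np.sqrt(var_a), n_applicants)
r = rng.normal(0.0, np.sqrt(var_r), n_raters)
noise = rng.normal(0.0, np.sqrt(var_e), (n_applicants, n_raters))
ratings = a[:, None] + r[None, :] + noise

# Model-implied IRR: share of variance attributable to applicants.
irr = var_a / (var_a + var_r + var_e)
print(f"model-implied IRR = {irr:.3f}")  # 0.571 under these components
```

Allowing the variance components to differ by a covariate such as applicant status (internal vs. external) is one way a covariate can moderate IRR, in the spirit of the test described above.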
3. Abstract: This article proposes a new statistical model to infer interpretable population-level preferences from ordinal comparison data. Such data are ubiquitous, e.g., ranked-choice votes, top-10 movie lists, and pairwise sports outcomes. Traditional statistical inference on ordinal comparison data results in an overall ranking of objects, e.g., from best to worst, with each object having a unique rank. However, the ranks of some objects may not be statistically distinguishable, whether due to insufficient data or to the true underlying object qualities being equal. Because uncertainty communication in estimates of overall rankings is notoriously difficult, we take a different approach and allow groups of objects to have equal ranks, or be rank-clustered, in our model. Existing models related to rank-clustering are limited by their inability to handle a variety of ordinal data types, to quantify uncertainty, or by the need to pre-specify the number and size of potential rank-clusters. We solve these limitations through our proposed Bayesian Rank-Clustered Bradley–Terry–Luce (BTL) model. We accommodate rank-clustering via parameter fusion by imposing a novel spike-and-slab prior on object-specific worth parameters in the BTL family of distributions for ordinal comparisons. We demonstrate rank-clustering on simulated and real datasets in surveys, elections, and sports analytics.
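The BTL backbone of the model is easy to state in code. The sketch below shows only standard BTL choice probabilities and how exactly equal worth parameters produce a rank-cluster; the spike-and-slab fusion prior itself is not reproduced here.

```python
import numpy as np

# Standard Bradley-Terry-Luce: P(j beats k) = w_j / (w_j + w_k).
# Objects whose worth parameters are fused to the same value are
# "rank-clustered": each beats the other with probability 1/2.
worth = np.array([2.0, 1.0, 1.0, 0.5])  # objects 1 and 2 share a worth

def btl_prob(j, k, w=worth):
    """Probability that object j is preferred to object k under BTL."""
    return w[j] / (w[j] + w[k])

print(btl_prob(1, 2))  # 0.5 -> objects 1 and 2 form a rank-cluster
print(btl_prob(0, 3))  # 0.8 -> object 0 is a clear favorite
```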
4. Truthfulness is paramount for large language models (LLMs) as they are increasingly deployed in real-world applications. However, existing LLMs still struggle to generate truthful content, as evidenced by their modest performance on benchmarks like TruthfulQA. To address this issue, we propose GRAdual self-truTHifying (GRATH), a novel post-processing method for enhancing the truthfulness of LLMs. GRATH uses out-of-domain question prompts to generate pairwise truthfulness training data, each pair containing a question together with a correct and an incorrect answer, and then optimizes the model via direct preference optimization (DPO) to learn from the truthfulness difference between answer pairs. GRATH iteratively refines the truthfulness data and updates the model, gradually improving model truthfulness in a self-supervised manner. Empirically, we evaluate GRATH on different 7B LLMs and compare against LLMs of similar or larger size on benchmark datasets. Our results show that GRATH effectively improves LLMs' truthfulness without compromising other core capabilities. Notably, GRATH achieves state-of-the-art performance on TruthfulQA, with MC1 accuracy of 54.71% and MC2 accuracy of 69.10%, surpassing even 70B LLMs. The code is available at https://github.com/chenweixin107/GRATH.
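The DPO step GRATH relies on can be sketched generically. The function below is a textbook DPO objective, not code from the GRATH repository; it assumes you already have summed log-probabilities of each pair's correct ("chosen") and incorrect ("rejected") answers under the trained policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct preference optimization: push the policy to widen the
    (reference-adjusted) log-prob gap between correct and incorrect
    answers in each truthfulness pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Toy per-pair answer log-probabilities standing in for real model outputs.
lc = torch.tensor([-10.0, -8.0])   # policy log p(correct answer)
lr = torch.tensor([-12.0, -7.5])   # policy log p(incorrect answer)
rc = torch.tensor([-10.5, -8.2])   # reference log p(correct answer)
rr = torch.tensor([-11.5, -7.8])   # reference log p(incorrect answer)
print(dpo_loss(lc, lr, rc, rr))
```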
5. This study examines the use of cognitive interviews, conducted longitudinally over a one-year period, to trace raters' response processes as they interpreted and scored with observational rubrics designed to measure teaching practices that promote equity and access in elementary and middle school mathematics classrooms. We draw on four rounds of cognitive interviews (14 interviews in total) involving four raters at purposeful time points spread across the year. The findings reported here focus on raters' responses to one rubric, "positioning students as competent." They point to the complexities of using observational rubrics and to the need to track response processes longitudinally at multiple time points during data collection in order to attend to rater calibration and the reliability and validity of the resulting rubric scores.