Large-scale standardized tests are regularly used to measure student achievement overall and for student subgroups. These uses assume tests provide comparable measures of outcomes across student subgroups, but prior research suggests score comparisons across gender groups may be complicated by the type of test items used. This paper presents evidence that among nationally representative samples of 15-year-olds in the United States participating in the 2009, 2012, and 2015 PISA math and reading tests, there are consistent item-format-by-gender differences. On average, male students answer multiple-choice items correctly relatively more often and female students answer constructed-response items correctly relatively more often. These patterns were consistent across 34 additional participating PISA jurisdictions, although the size of the format differences varied and was larger on average in reading than in math. The average magnitude of the format differences is not large enough to be flagged in routine differential item functioning analyses intended to detect test bias but is large enough to raise questions about the validity of inferences based on comparisons of scores across gender groups. Researchers and other test users should account for test item format, particularly when comparing scores across gender groups.
This content will become publicly available on March 8, 2026
Investigating Differences in Assessment Delivery Formats: An Illustration Study
This study explored how mathematics problem-solving constructed-response tests compared in terms of item psychometrics when administered to eighth grade students in two different static formats: paper-pencil and computer-based. Quantitative results indicated similarity across all psychometric indices for the overall tests and at the item level.
- Award ID(s): 2100988
- PAR ID: 10597566
- Publisher / Repository: Proceedings for the 52nd Annual Meeting of the Research Council on Mathematics Learning
- Date Published:
- Page Range / eLocation ID: 41-48
- Format(s): Medium: X
- Location: College Station, TX
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Many large‐scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically, simple structured tests, such as these rely on multiple multiple‐choice and/or constructed‐response sections of items to generate multiple scores. In the current article, we propose an extension of the hierarchical rater model (HRM) to be applied with simple structured tests with constructed‐response items. In addition to modeling the appropriate trait structure, the multidimensional HRM (M‐HRM) presented here also accounts for rater severity bias and rater variability or inconsistency. We introduce the model formulation, test parameter recovery with a focus on latent traits, and compare the M‐HRM to other scoring approaches (unidimensional HRMs and a traditional multidimensional item response theory model) using simulated and empirical data. Results show more precise scores under the M‐HRM, with a major improvement in scores when incorporating rater effects versus ignoring them in the traditional multidimensional item response theory model.
-
Abstract Ecological theory predicts that herbivory should be weaker on islands than on mainland based on the assumption that islands have lower herbivore abundance and diversity. However, empirical tests of this prediction are rare, especially for insect herbivores, and those few tests often fail to address the mechanisms behind island–mainland divergence in herbivory. In particular, past studies have not addressed the relative contribution of top‐down (i.e. predator‐driven) and bottom‐up (i.e. plant‐driven) factors to these dynamics. To address this, we experimentally excluded insectivorous vertebrate predators (e.g. birds, bats) and measured leaf traits associated with herbivory in 52 populations of 12 oak (Quercus) species in three island–mainland sites: The Channel Islands of California vs. mainland California, Balearic Islands vs. mainland Spain, and the island Bornholm vs. mainland Sweden (N = 204 trees). In each site, at the end of the growing season, we measured leaf damage by insect herbivores on control vs. predator‐excluded branches and measured leaf traits, namely: phenolic compounds, specific leaf area, and nitrogen and phosphorus content. In addition, we obtained climatic and soil data for island and mainland populations using global databases. Specifically, we tested for island–mainland differences in herbivory, and whether differences in vertebrate predator effects or leaf traits between islands and mainland contributed to explaining the observed herbivory patterns. Supporting predictions, herbivory was lower on islands than on mainland, but only in the case of Mediterranean sites (California and Spain). We found no evidence for vertebrate predator effects on herbivory on either islands or mainland in any study site. In addition, while insularity affected leaf traits in some of the study sites (Sweden‐Bornholm and California), these effects were seemingly unrelated to differences in herbivory. Synthesis. Our results suggest that vertebrate predation and the studied leaf traits did not contribute to island–mainland variation patterns in herbivory, calling for more nuanced and comprehensive investigations of predator and plant trait effects, including measurements of other plant traits and assessments of predation by different groups of natural enemies.
-
Abstract Organisms such as allopolyploids and F1 hybrids contain multiple distinct subgenomes, each potentially with its own evolutionary history. These organisms present a challenge for multilocus phylogenetic inference and other analyses since it is not apparent which gene copies from different loci are from the same subgenome and thus share an evolutionary history. Here we introduce homologizer, a flexible Bayesian approach that uses a phylogenetic framework to infer the phasing of gene copies across loci into their respective subgenomes. Through the use of simulation tests, we demonstrate that homologizer is robust to a wide range of factors, such as incomplete lineage sorting and the phylogenetic informativeness of loci. Furthermore, we establish the utility of homologizer on real data, by analysing a multilocus dataset consisting of nine diploids and 19 tetraploids from the fern family Cystopteridaceae. Finally, we describe how homologizer may potentially be used beyond its core phasing functionality to identify non‐homologous sequences, such as hidden paralogs or contaminants.
-
Sequential recommendation is the task of predicting the next items for users based on their interaction history. Modeling the dependence of the next action on the past actions accurately is crucial to this problem. Moreover, sequential recommendation often faces serious sparsity of item-to-item transitions in a user's action sequence, which limits the practical utility of such solutions. To tackle these challenges, we propose a Category-aware Collaborative Sequential Recommender. Our preliminary statistical tests demonstrate that the in-category item-to-item transitions are often much stronger indicators of the next items than the general item-to-item transitions observed in the original sequence. Our method makes use of item category in two ways. First, the recommender utilizes item category to organize a user's own actions to enhance dependency modeling based on her own past actions. It utilizes self-attention to capture in-category transition patterns, and determines which of the in-category transition patterns to consider based on the categories of recent actions. Second, the recommender utilizes the item category to retrieve users with similar in-category preferences to enhance collaborative learning across users, and thus conquer sparsity. It utilizes attention to incorporate in-category transition patterns from the retrieved users for the target user. Extensive experiments on two large datasets prove the effectiveness of our solution against an extensive list of state-of-the-art sequential recommendation models.