Title: Item Response Theory – A Statistical Framework for Educational and Psychological Measurement
Item response theory (IRT) has become one of the most popular statistical models for psychometrics, a field of study concerned with the theory and techniques of psychological measurement. IRT models are latent factor models tailored to the analysis, interpretation, and prediction of individuals’ behaviors in answering a set of measurement items that typically involve categorical response data. Many important questions of measurement are directly or indirectly answered through the use of IRT models, including scoring individuals’ test performances, validating a test scale, and linking two tests, among others. This paper provides a review of item response theory, including its statistical framework and psychometric applications. We establish connections between item response theory and related topics in statistics, including empirical Bayes, nonparametric methods, matrix completion, regularized estimation, and sequential analysis. Possible future directions of IRT are discussed from the perspective of statistical learning.
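To make the latent factor structure concrete, the sketch below simulates dichotomous responses from a two-parameter logistic (2PL) model, one of the most common IRT models for binary items. The sample sizes and parameter ranges are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def irf_2pl(theta, a, b):
    """2PL item response function:
    P(correct | theta) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Illustrative sizes and parameter ranges (not taken from the paper).
n_persons, n_items = 500, 10
theta = rng.normal(0.0, 1.0, size=n_persons)   # latent abilities
a = rng.uniform(0.5, 2.0, size=n_items)        # item discriminations
b = rng.normal(0.0, 1.0, size=n_items)         # item difficulties

# Person-by-item probability matrix and a simulated 0/1 response matrix.
prob = irf_2pl(theta[:, None], a[None, :], b[None, :])
responses = rng.binomial(1, prob)
print(responses.shape)  # (500, 10)
```

Fitting such a model to an observed response matrix (for example, by marginal maximum likelihood) recovers the item parameters and yields the ability estimates used for scoring, scale validation, and linking.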
Award ID(s): 2119938
PAR ID: 10484759
Author(s) / Creator(s):
Publisher / Repository: Statistical Science
Date Published:
Journal Name: Statistical Science
ISSN: 0883-4237
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. The recent surge in computerized testing brings challenges for the analysis of testing data with classic item response theory (IRT) models. To handle individually varying and irregularly spaced longitudinal dichotomous responses, we adopt a dynamic IRT model framework and then extend the model to link with individual characteristics at a hierarchical level. Further, we develop an algorithm to select the individual characteristics that capture changes in the growth of one's ability under this multi-level dynamic IRT model, where the Bayes factor of the proposed model with different covariates can be computed from a single Markov chain Monte Carlo output from the full model. In addition, we show model selection consistency under the modified Zellner–Siow prior and conduct simulations to illustrate this consistency property in finite samples. Finally, we apply the proposed model and computational algorithms to a real-data application in educational testing, the EdSphere dataset.
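As a rough illustration of the data structure this abstract describes (not the authors' model or algorithm), the sketch below simulates irregularly spaced dichotomous responses in which each person's ability drifts as a Gaussian random walk between measurement occasions; all settings are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_dynamic_irt(n_persons=200, n_items=20, drift_sd=0.3):
    """Simulate individually varying, irregularly spaced dichotomous
    responses with a random-walk ability process (illustrative only)."""
    a = rng.uniform(0.5, 2.0, size=n_items)   # item discriminations
    b = rng.normal(0.0, 1.0, size=n_items)    # item difficulties
    records = []                              # (person, time, item, response)
    for person in range(n_persons):
        n_occasions = int(rng.integers(2, 6))                  # varying counts per person
        times = np.sort(rng.uniform(0.0, 10.0, n_occasions))   # irregular spacing
        theta = rng.normal(0.0, 1.0)
        prev_t = times[0]
        for t in times:
            # Ability drift scales with the elapsed time between occasions.
            theta += rng.normal(0.0, drift_sd * np.sqrt(t - prev_t))
            prev_t = t
            item = int(rng.integers(n_items))
            p = 1.0 / (1.0 + np.exp(-a[item] * (theta - b[item])))
            records.append((person, float(t), item, int(rng.binomial(1, p))))
    return records

print(len(simulate_dynamic_irt()))
```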
  2. Recent years have seen a movement within the research-based assessment development community towards item formats that go beyond simple multiple-choice formats. Some have moved towards free-response questions, particularly at the upper-division level; however, free-response items have the constraint that they must be scored by hand. To avoid this limitation, some assessment developers have moved toward formats that maintain a closed-response structure while still providing more nuanced insight into student reasoning. One such format is known as coupled, multiple response (CMR). This format pairs multiple-choice and multiple-response formats, allowing students both to commit to an answer and to select options that correspond with their reasoning. In addition to being machine-scorable, this format allows for more nuanced scoring than simple right or wrong. However, such nuanced scoring presents a potential challenge with respect to utilizing certain testing theories to construct validity arguments for the assessment. In particular, Item Response Theory (IRT) models often assume dichotomously scored items. While polytomous IRT models do exist, each brings with it certain constraints and limitations. Here, we explore multiple IRT models and scoring schemes using data from an existing CMR test, with the goal of providing guidance and insight into possible methods for simultaneously leveraging the affordances of both the CMR format and IRT models in the context of constructing validity arguments for research-based assessments.
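For intuition about the polytomous models alluded to here, the sketch below computes category probabilities under Samejima's graded response model, one common choice for ordered partial-credit-style scores; the item parameters and number of score categories are invented for illustration, not taken from the CMR test discussed.

```python
import numpy as np

def graded_response_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.
    P(X >= k) = 1 / (1 + exp(-a * (theta - b_k))) for ordered thresholds b_k;
    category probabilities are differences of adjacent cumulative curves."""
    theta = np.atleast_1d(theta)
    b = np.asarray(thresholds)
    # Cumulative P(X >= k) for k = 1..K-1, padded with 1 (k = 0) and 0 (k = K).
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    cum = np.hstack([np.ones((theta.size, 1)), cum, np.zeros((theta.size, 1))])
    return cum[:, :-1] - cum[:, 1:]   # shape: (len(theta), K)

# Hypothetical item with 4 ordered score categories (0..3).
probs = graded_response_probs(theta=[-1.0, 0.0, 1.0], a=1.2,
                              thresholds=[-0.8, 0.2, 1.1])
print(probs.round(3))  # each row sums to 1
```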
  3. Research on spatial thinking requires reliable and valid measures of individual differences in various component skills. Spatial perspective taking (PT), the ability to represent viewpoints different from one's own, is one kind of spatial skill that is especially relevant to navigation. This study had two goals. First, the psychometric properties of four PT tests were examined: Four Mountains Task (FMT), Spatial Orientation Task (SOT), Perspective-Taking Task for Adults (PTT-A), and Photographic Perspective-Taking Task (PPTT). Using item response theory (IRT), item difficulty, discriminability, and efficiency of item information functions were evaluated. Second, the relation of PT scores to general intelligence, working memory, and mental rotation (MR) was assessed. All tasks showed good construct validity except for FMT. PPTT tapped a wide range of PT ability, with maximum measurement precision at average ability. PTT-A captured a lower range of ability. Although SOT contributed less measurement information than the other tasks, it did well across a wide range of PT ability. After controlling for general intelligence and working memory, original and IRT-refined versions of the PT tasks were each related to MR. PTT-A and PPTT showed relatively more divergent validity from MR than SOT. Tests of dimensionality indicated that the PT tasks share one common PT dimension, with secondary task-specific factors also affecting the measurement of individual differences in performance. Advantages and disadvantages of a hybrid PT test that includes a combination of items across tasks are discussed.
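The item information functions evaluated in this study have a simple closed form for two-parameter logistic items, which is one standard way such comparisons are made; the parameter values below are hypothetical, not estimates from the PT tasks.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a**2 * P * (1 - P),
    where P is the probability of a correct response at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

theta_grid = np.linspace(-3.0, 3.0, 7)
# Two hypothetical items: a discriminating, harder item vs. a flatter, easier one.
for a, b in [(1.8, 1.0), (0.7, -0.5)]:
    print(f"a={a}, b={b}:", item_information_2pl(theta_grid, a, b).round(3))
```

Under this model an item contributes the most information near its own difficulty, which is why a test can be described as having maximum measurement precision at a particular ability level.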
  4. Measurement of object recognition (OR) ability could predict learning and success in real-world settings, and there is hope that it may reduce bias often observed in cognitive tests. Although the measurement of visual OR is not expected to be influenced by the language of participants or the language of instructions, these assumptions remain largely untested. Here, we address the challenges of measuring OR abilities across linguistically diverse populations. In Study 1, we find that English–Spanish bilinguals, when randomly assigned to the English or Spanish version of the novel object memory test (NOMT), exhibit highly similar overall performance. Study 2 extends this by assessing psychometric equivalence using an approach grounded in item response theory (IRT). We examined whether groups fluent in English or Spanish differed in (a) latent OR ability, as assessed by a three-parameter logistic IRT model, and (b) the mapping of observed item responses onto the latent OR construct, as assessed by differential item functioning (DIF) analyses. Spanish speakers performed better than English speakers, a difference we suggest is due to motivational differences between groups of vastly different size on the Prolific platform. That we found no substantial DIF between the groups tested in English or Spanish on the NOMT indicates measurement invariance. The feasibility of increasing diversity by combining groups tested in different languages remains unexplored. Adopting this approach could enable visual scientists to enhance diversity, equity, and inclusion in their research, and potentially in the broader application of their work in society.
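The three-parameter logistic model mentioned here extends the 2PL with a lower asymptote (pseudo-guessing) parameter; below is a minimal sketch with invented parameter values, not estimates from the NOMT.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function:
    P(correct | theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3.0, 3.0, 7)
# Hypothetical item: moderate discrimination, slight difficulty, a 0.15 guessing floor.
print(irf_3pl(theta, a=1.3, b=0.4, c=0.15).round(3))
```

In a DIF analysis, the same item's parameters are compared across groups (here, examinees tested in English versus Spanish); items whose response curves differ after matching on ability are flagged as functioning differentially.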
  5. Numeracy, the ability to understand and use numeric information, is linked to good decision-making. Several problems exist with current numeracy measures, however. Depending on the participant sample, some existing measures are too easy or too hard; also, established measures often contain items well known to participants. The current article aimed to develop new numeric understanding measures (NUMs), including a 1-item (1-NUM), a 4-item (4-NUM), and a 4-item adaptive measure (A-NUM). In a calibration study, two participant samples (n = 226 and 264 from Amazon’s Mechanical Turk [MTurk]) each responded to half of 84 novel numeracy items. We calibrated items using 2-parameter logistic item response theory (IRT) models. Based on item parameters, we developed the three new numeracy measures. In a subsequent validation study, 600 MTurk participants completed the new numeracy measures, the adaptive Berlin Numeracy Test, and the Weller Rasch-Based Numeracy Test, in randomized order. To establish predictive and convergent validity, participants also completed judgment and decision tasks, Raven’s progressive matrices, a vocabulary test, and demographics. Confirmatory factor analyses suggested that the 1-NUM, 4-NUM, and A-NUM load onto the same factor as existing measures. The NUM scales also showed association patterns with subjective numeracy and cognitive ability measures similar to those of established measures. Finally, they effectively predicted classic numeracy effects. In fact, based on power analyses, the A-NUM and 4-NUM appeared to confer more power to detect effects than existing measures. Thus, using IRT, we developed three brief numeracy measures, using novel items and without sacrificing construct scope. The measures can be downloaded as Qualtrics files (https://osf.io/pcegz/).
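One generic way an adaptive short form like the A-NUM can be administered is maximum-information item selection under a 2PL calibration; the routine below sketches that idea with an invented item bank and is not the authors' algorithm.

```python
import numpy as np

def information_2pl(theta, a, b):
    """Fisher information of each 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def select_next_item(theta_hat, a, b, administered):
    """Pick the unadministered item with maximum information at the
    current ability estimate (a standard adaptive-testing heuristic)."""
    info = information_2pl(theta_hat, a, b)
    info[list(administered)] = -np.inf
    return int(np.argmax(info))

# Invented 2PL item bank of 8 calibrated items (not the A-NUM items).
a = np.array([1.2, 0.8, 1.5, 1.0, 1.7, 0.9, 1.3, 1.1])
b = np.array([-1.5, -0.7, 0.0, 0.3, 0.8, 1.2, 1.8, -0.2])

administered = set()
theta_hat = 0.0          # in a real administration, re-estimated after each response
for _ in range(4):       # a 4-item adaptive form, as with the A-NUM
    j = select_next_item(theta_hat, a, b, administered)
    administered.add(j)
    print("administer item", j)
```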