Title: Item Response Theory – A Statistical Framework for Educational and Psychological Measurement
Item response theory (IRT) has become one of the most popular statistical models for psychometrics, a field of study concerned with the theory and techniques of psychological measurement. IRT models are latent factor models tailored to the analysis, interpretation, and prediction of individuals’ behaviors in answering a set of measurement items that typically involve categorical response data. Many important questions of measurement are directly or indirectly answered through the use of IRT models, including scoring individuals’ test performances, validating a test scale, and linking two tests. This paper provides a review of item response theory, including its statistical framework and psychometric applications. We establish connections between item response theory and related topics in statistics, including empirical Bayes, nonparametric methods, matrix completion, regularized estimation, and sequential analysis. Possible future directions of IRT are discussed from the perspective of statistical learning.
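As a concrete point of reference (a standard textbook form, not a formula drawn from this paper), the two-parameter logistic (2PL) model is perhaps the most widely used unidimensional IRT model: the probability that person j answers item i correctly depends on a latent ability θ_j, an item discrimination a_i, and an item difficulty b_i,

```latex
P(Y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\{-a_i(\theta_j - b_i)\}}.
```

Setting every a_i = 1 recovers the Rasch model; the multidimensional and polytomous models mentioned in the related work below generalize this same logistic form.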
Award ID(s): 2119938
NSF-PAR ID: 10484759
Author(s) / Creator(s):
Publisher / Repository: Statistical Science
Date Published:
Journal Name: Statistical Science
ISSN: 0883-4237
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Background

    The Science Teaching Efficacy Belief Instrument A (STEBI-A; Riggs & Enochs, 1990 in Science Education, 74(6), 625-637) has been the dominant measurement tool of in-service science teacher self-efficacy and outcome expectancy for nearly 30 years. However, concerns about certain aspects of the STEBI-A have arisen, including its wording, validity, reliability, and dimensionality. In the present study, we revised the STEBI-A by addressing many of the concerns research has identified and developed a new instrument called the T-STEM Science Scale. The T-STEM Science Scale was reviewed by expert panels and piloted before it was administered to 727 elementary and secondary science teachers. A combination of classical test theory (CTT) and item response theory (IRT) approaches was used to validate the instrument. Multidimensional Rasch analysis and confirmatory factor analysis were run.

    Results

    Based on the results, the negatively worded items were found to be problematic and thus removed from the instrument. We also found that the three-dimensional model fit our data the best, in line with our theoretical conceptualization. Based on the literature review and analysis, although the personal science teaching efficacy beliefs (PSTEB) construct remained intact, the original outcome expectancy construct was renamed science teacher responsibility for learning outcomes beliefs (STRLOB) and was divided into two dimensions, above- and below-average student interest or performance. The T-STEM Science Scale also had satisfactory reliability values.

    Conclusions

    Through the development and validation of the T-STEM Science Scale, we have addressed some critical concerns raised in prior research on the STEBI-A. Psychometrically, the refined wording, item removal, and separation into three constructs have resulted in better reliability values than those of the STEBI-A. While two distinct theoretical foundations are now used to explain the constructs of the new T-STEM instrument, prior literature and our empirical results note the important interrelationship of these constructs. Retaining these constructs preserves a bridge, though an imperfect one, to the large body of legacy research using the STEBI-A.

     
  2. Abstract

    Numeracy, the ability to understand and use numeric information, is linked to good decision-making. Several problems exist with current numeracy measures, however. Depending on the participant sample, some existing measures are too easy or too hard; established measures also often contain items well known to participants. The current article aimed to develop new numeric understanding measures (NUMs): a 1-item measure (1-NUM), a 4-item measure (4-NUM), and a 4-item adaptive measure (A-NUM). In a calibration study, 2 participant samples (n = 226 and 264, from Amazon’s Mechanical Turk [MTurk]) each responded to half of 84 novel numeracy items. We calibrated items using 2-parameter logistic item response theory (IRT) models. Based on the item parameters, we developed the 3 new numeracy measures. In a subsequent validation study, 600 MTurk participants completed the new numeracy measures, the adaptive Berlin Numeracy Test, and the Weller Rasch-Based Numeracy Test, in randomized order. To establish predictive and convergent validity, participants also completed judgment and decision tasks, Raven’s progressive matrices, a vocabulary test, and demographics. Confirmatory factor analyses suggested that the 1-NUM, 4-NUM, and A-NUM load onto the same factor as existing measures. The NUM scales also showed patterns of association with subjective numeracy and cognitive ability measures similar to those of established measures. Finally, they effectively predicted classic numeracy effects. In fact, based on power analyses, the A-NUM and 4-NUM appeared to confer more power to detect effects than existing measures. Thus, using IRT, we developed 3 brief numeracy measures based on novel items and without sacrificing construct scope. The measures can be downloaded as Qualtrics files (https://osf.io/pcegz/).
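The abstract does not spell out how the adaptive A-NUM selects items, and the sketch below is not taken from the paper; it only illustrates, with made-up 2PL item parameters, the standard mechanics such a measure relies on: each calibrated item has a probability-of-success curve, and an adaptive test typically administers the item that is most informative at the respondent's current ability estimate.

```python
import numpy as np

def p_correct(theta, a, b):
    """Probability of a correct response under the 2-parameter logistic model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information contributed by a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# Hypothetical item bank (discrimination a, difficulty b); the actual A-NUM
# parameters come from the calibration study and are not reproduced here.
item_bank = [(1.2, -1.0), (0.9, 0.0), (1.5, 0.5), (1.1, 1.2)]

theta_hat = 0.3  # provisional ability estimate after the items answered so far
info = [item_information(theta_hat, a, b) for a, b in item_bank]
next_item = int(np.argmax(info))
print(f"administer item {next_item} (information {info[next_item]:.3f})")
```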

     
  3. It is well established that access to social supports is essential for engineering students’ persistence, and yet access to those supports varies across groups. Understanding the differential supports inherent in students’ social networks, and then working to provide additional needed supports, can help the field of engineering education become more inclusive of all students. Our work contributes to this effort by examining the reliability and fairness of a social capital instrument, the Undergraduate Supports Survey (USS). We examined the extent to which two scales were reliable across ability levels (levels of social capital), gender groups, and year-in-school. We fit two item response theory (IRT) models using a graded response model and performed differential item functioning (DIF) tests to detect item differences by gender and year-in-school. Our results indicate that most items have acceptable to good item discrimination and difficulty. DIF analysis shows that multiple items exhibit DIF across gender groups in the Expressive Support scale, in favor of women and nonbinary engineering students. DIF analysis shows that year-in-school has little to no effect on items, with only one DIF item. Therefore, engineering educators can use the USS confidently to examine expressive and instrumental social capital in undergraduates across year-in-school. Our work can be used by the engineering education research community to identify and address differences in students’ access to support. We recommend that the engineering education community work to be explicit in its expressive and instrumental support. Future work will explore the measurement invariance of Expressive Support items across gender.
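The graded response model named above maps a respondent's latent trait level to probabilities over an item's ordered response categories. The sketch below is a minimal illustration of that model with hypothetical parameters, not the estimates reported for the USS.

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """Graded response model: probability of each ordered response category
    for a respondent at latent trait level theta.

    `thresholds` must be increasing, one per boundary between adjacent categories.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    # Cumulative probabilities P(X >= k) for k = 1, ..., K - 1
    cum = 1.0 / (1.0 + np.exp(-a * (theta - thresholds)))
    upper = np.concatenate(([1.0], cum))   # P(X >= k) for k = 0, ..., K - 1
    lower = np.concatenate((cum, [0.0]))   # P(X >= k + 1)
    return upper - lower                   # P(X = k)

# Hypothetical 5-category survey item (parameters are illustrative, not USS estimates)
probs = grm_category_probs(theta=0.5, a=1.3, thresholds=[-1.5, -0.5, 0.4, 1.6])
print(probs, probs.sum())  # the five category probabilities sum to 1
```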
  4. Research on spatial thinking requires reliable and valid measures of individual differences in its various component skills. Spatial perspective taking (PT), the ability to represent viewpoints different from one's own, is one kind of spatial skill that is especially relevant to navigation. This study had two goals. First, the psychometric properties of four PT tests were examined: the Four Mountains Task (FMT), the Spatial Orientation Task (SOT), the Perspective-Taking Task for Adults (PTT-A), and the Photographic Perspective-Taking Task (PPTT). Item response theory (IRT) was used to evaluate item difficulty, discriminability, and the efficiency of item information functions. Second, the relation of PT scores to general intelligence, working memory, and mental rotation (MR) was assessed. All tasks showed good construct validity except the FMT. The PPTT tapped a wide range of PT ability, with maximum measurement precision at average ability. The PTT-A captured a lower range of ability. Although the SOT contributed less measurement information than the other tasks, it performed well across a wide range of PT ability. After controlling for general intelligence and working memory, original and IRT-refined versions of the PT tasks were each related to MR. The PTT-A and PPTT showed relatively more divergent validity from MR than the SOT. Tests of dimensionality indicated that the PT tasks share one common PT dimension, with secondary task-specific factors also affecting the measurement of individual differences in performance. Advantages and disadvantages of a hybrid PT test that combines items across tasks are discussed.
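For readers unfamiliar with the "item information functions" evaluated above, the following is the standard IRT expression for a 2PL item (background, not a result of this study): the information an item contributes at ability θ, and the corresponding precision of the ability estimate, are

```latex
I_i(\theta) = a_i^2\, P_i(\theta)\bigl(1 - P_i(\theta)\bigr),
\qquad
\mathrm{SE}(\hat\theta) \approx \Bigl(\sum_i I_i(\theta)\Bigr)^{-1/2},
```

so an item is most informative near abilities where the respondent's success probability is close to one half, which is why tasks can differ in the range of PT ability they measure precisely.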
  5. Recent years have seen a movement within the research-based assessment development community toward item formats that go beyond simple multiple choice. Some have moved toward free-response questions, particularly at the upper-division level; however, free-response items have the constraint that they must be scored by hand. To avoid this limitation, some assessment developers have moved toward formats that maintain a closed-response structure while still providing more nuanced insight into student reasoning. One such format is known as coupled, multiple response (CMR). This format pairs multiple-choice and multiple-response items so that students both commit to an answer and select options that correspond with their reasoning. In addition to being machine-scorable, this format allows for more nuanced scoring than simple right or wrong. However, such nuanced scoring presents a potential challenge with respect to using certain testing theories to construct validity arguments for an assessment. In particular, item response theory (IRT) models often assume dichotomously scored items. While polytomous IRT models do exist, each brings with it certain constraints and limitations. Here, we explore multiple IRT models and scoring schemes using data from an existing CMR test, with the goal of providing guidance and insight into possible methods for simultaneously leveraging the affordances of both the CMR format and IRT models in the context of constructing validity arguments for research-based assessments.
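The abstract does not say which polytomous IRT models were examined. As one concrete possibility, the sketch below implements the generalized partial credit model, a common choice for items scored in more than two ordered levels; the item parameters and the three-level CMR scoring shown are purely hypothetical.

```python
import numpy as np

def gpcm_probs(theta, a, steps):
    """Generalized partial credit model: probabilities of the K + 1 ordered
    score levels 0, 1, ..., K at ability theta, where K = len(steps) and
    steps[v - 1] is the step difficulty for moving from level v - 1 to v."""
    steps = np.asarray(steps, dtype=float)
    # Numerators exp(sum_{v <= k} a * (theta - steps[v])), with the empty sum for level 0
    z = np.concatenate(([0.0], np.cumsum(a * (theta - steps))))
    ez = np.exp(z - z.max())  # subtract the max for numerical stability
    return ez / ez.sum()

# Hypothetical CMR item scored 0 / 1 / 2 (e.g., wrong answer, correct answer,
# correct answer with the matching reasoning options selected)
print(gpcm_probs(theta=0.8, a=1.1, steps=[-0.4, 0.9]))
```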