Item nonresponses are prevalent in standardized testing. They occur either when students fail to reach the end of a test, due to a time limit or quitting, or when students choose to omit some items strategically. Item nonresponses are often nonrandom, and hence the missing data mechanism needs to be properly modeled. In this paper, we proposed to use an innovative item response time model as a cohesive missing data model to account for the two most common types of item nonresponse: not-reached items and omitted items. In particular, the new model builds on a behavior process interpretation: a person chooses to skip an item if the required effort exceeds the implicit time the person allocates to the item (Lee & Ying, 2015; Wolf, Smith, & Birnbaum, 1995), whereas a person fails to reach the end of the test because of a lack of time. This assumption was verified by analyzing the 2015 PISA computer-based mathematics data. Simulation studies were conducted to further evaluate the performance of the proposed Bayesian estimation algorithm for the new model and to compare the new model with a recently proposed “speed-accuracy + omission” model (Ulitzsch, von Davier, & Pohl, 2019). Results revealed that all model parameters could be recovered properly, and that inadequately accounting for missing data led to biased item and person parameter estimates.
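The behavior process interpretation summarized above lends itself to a small simulation. The sketch below is a minimal illustration, not the paper's actual model: all distributions, parameter names, and the rule that an omitted item still consumes its allocated time are assumptions made for demonstration. It generates the two nonresponse types the abstract distinguishes: an item is omitted when its required time exceeds the implicit time the person allocates to it, and remaining items become not-reached once the cumulative time hits the test limit.

```python
import numpy as np

# Illustrative sketch of the behavior-process account of item nonresponse.
# All parameter values and distributional choices are assumptions for
# demonstration only; they are not taken from the paper.

rng = np.random.default_rng(0)

n_items = 40
time_limit = 3600.0          # total testing time in seconds (assumed)
speed = 0.2                  # person speed parameter on the log scale (assumed)

# Lognormal required times: log T_j ~ Normal(beta_j - speed, sigma_j^2)
beta = rng.normal(4.0, 0.3, n_items)    # item time intensities (assumed)
sigma = np.full(n_items, 0.4)
required = np.exp(rng.normal(beta - speed, sigma))

# Implicit time the person is willing to allocate to each item (assumed)
allocated = np.exp(rng.normal(4.1, 0.3, n_items))

status = []
elapsed = 0.0
for j in range(n_items):
    if elapsed >= time_limit:
        status.append("not_reached")   # ran out of time before seeing the item
    elif required[j] > allocated[j]:
        status.append("omitted")       # required effort exceeds the implicit budget
        elapsed += allocated[j]        # assume the budgeted time is still spent
    else:
        status.append("answered")
        elapsed += required[j]

print({s: status.count(s) for s in set(status)})
```

Under this kind of process, omissions depend on the same speed and time-intensity quantities that drive observed response times, which is why the missingness is nonignorable and needs to be modeled jointly with responses and response times.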
- NSF-PAR ID: 10148191
- Publisher / Repository: Wiley-Blackwell
- Date Published:
- Journal Name: Journal of Educational Measurement
- Volume: 57
- Issue: 4
- ISSN: 0022-0655
- Page Range / eLocation ID: p. 584-620
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- von Davier, Matthias (Ed.). Time limits are imposed on many computer-based assessments, and it is common to observe examinees who run out of time, resulting in missingness due to not-reached items. The present study proposes an approach to account for the missing mechanism of not-reached items via response time censoring. The censoring mechanism is directly incorporated into the observed likelihood of item responses and response times. A marginal maximum likelihood estimator is proposed, and its asymptotic properties are established. The proposed method was evaluated through simulation studies and compared to several alternative approaches that ignore the censoring. An empirical study based on the PISA 2018 Science Test was further conducted.
- Many large-scale educational surveys have moved from linear form design to multistage testing (MST) design. One advantage of MST is that it can provide more accurate latent trait (θ) estimates using fewer items than required by linear tests. However, MST generates incomplete response data by design; hence, questions remain as to how to calibrate items using the incomplete data from an MST design. Further complications arise when there are multiple correlated subscales per test, and when items from different subscales need to be calibrated according to their respective score reporting metrics. The current calibration-per-subscale method produced biased item parameters, and there is no available method for resolving the challenge. Deriving from the missing data principle, we showed that when all items are calibrated together, Rubin's ignorability assumption is satisfied such that the traditional single-group calibration is sufficient. When calibrating items per subscale, we proposed a simple modification to the current calibration-per-subscale method that helps reinstate the missing-at-random assumption and therefore corrects for the estimation bias that otherwise exists. Three mainstream calibration methods are discussed in the context of MST: marginal maximum likelihood estimation, the expectation-maximization method, and fixed parameter calibration. An extensive simulation study is conducted, and a real data example from NAEP is analyzed to provide convincing empirical evidence.
- Eliassi-Rad, Tina (Ed.). Multidimensional unfolding methods are widely used for visualizing item response data. Such methods project respondents and items simultaneously onto a low-dimensional Euclidean space, in which respondents and items are represented by ideal points, with person-person, item-item, and person-item similarities captured by the Euclidean distances between the points. In this paper, we study the visualization of multidimensional unfolding from a statistical perspective. We cast multidimensional unfolding as an estimation problem, where the respondent and item ideal points are treated as parameters to be estimated. An estimator is then proposed for the simultaneous estimation of these parameters. Asymptotic theory is provided for the recovery of the ideal points, shedding light on the validity of model-based visualization. An alternating projected gradient descent algorithm is proposed for the parameter estimation. We provide two illustrative examples, one on users' movie ratings and the other on Senate roll call voting.
- Large-scale standardized tests are regularly used to measure student achievement overall and for student subgroups. These uses assume tests provide comparable measures of outcomes across student subgroups, but prior research suggests score comparisons across gender groups may be complicated by the type of test items used. This paper presents evidence that among nationally representative samples of 15-year-olds in the United States participating in the 2009, 2012, and 2015 PISA math and reading tests, there are consistent item-format-by-gender differences. On average, male students answer multiple-choice items correctly relatively more often, and female students answer constructed-response items correctly relatively more often. These patterns were consistent across 34 additional participating PISA jurisdictions, although the size of the format differences varied and was larger on average in reading than in math. The average magnitude of the format differences is not large enough to be flagged in routine differential item functioning analyses intended to detect test bias, but it is large enough to raise questions about the validity of inferences based on comparisons of scores across gender groups. Researchers and other test users should account for test item format, particularly when comparing scores across gender groups.
- The purpose of the current study was to analyze the impact of delayed monitoring judgments on both monitoring accuracy and science knowledge in a game-based learning environment called MISSING MONTY. Fifth-grade students from public schools in the USA were randomly assigned to either an immediate monitoring (IM) condition (n = 142) or a delayed monitoring (DM) condition (n = 171). All students completed a pretest and a posttest of science knowledge and made item-level confidence judgments on each test. The students then played MISSING MONTY for approximately 2-5 weeks, depending upon class schedule. During gameplay, students visited various animal researchers, read informational texts, and completed knowledge and monitoring challenges. In the IM condition, students rated their confidence on a 100-point scale immediately following each item. In the DM condition, students first completed the knowledge challenge and then provided monitoring judgments after completing all items. Results showed significant improvements in science knowledge and monitoring accuracy for both groups; however, no significant differences were found between the two conditions. Thus, MISSING MONTY appeared to have positive effects on both resultant science knowledge and monitoring accuracy regardless of when monitoring was assessed. Implications for the design of learning environments and SRL will be discussed.