This content will become publicly available on October 8, 2026

Title: Two‐Phase Content‐Balancing CD‐CAT Online Item Calibration
Abstract: Online calibration estimates new item parameters alongside previously calibrated items, supporting efficient item replenishment. However, most existing online calibration procedures for Cognitive Diagnostic Computerized Adaptive Testing (CD‐CAT) lack mechanisms to ensure content balance during live testing. This limitation can lead to uneven content coverage, potentially undermining alignment with instructional goals. This research extends the current calibration framework by integrating a two‐phase test design with a content‐balancing item selection method into the online calibration procedure. Simulation studies evaluated item parameter recovery and attribute profile estimation accuracy under the proposed procedure. Results indicated that the developed procedure yielded more accurate new item parameter estimates and maintained content representativeness under both balanced and unbalanced constraints. Attribute profile estimation was sensitive to item parameter values, with accuracy declining for items with larger parameter values. Calibration improved with larger sample sizes and smaller parameter values. Longer test lengths contributed more to profile estimation than to new item calibration. These findings highlight design trade‐offs in adaptive item replenishment and suggest new directions for hybrid calibration methods.
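The abstract refers to a content-balancing item selection method within a two-phase design without detailing it here. As a minimal, hypothetical sketch of one common style of constraint-aware selection (favor the most under-represented content area, then take its most informative unused item), the code below may help fix ideas. The informativeness scores are assumed to be computed elsewhere (e.g., from the examinee's current attribute posterior), and all names are illustrative; this is not the paper's procedure.

```python
import numpy as np

def select_next_item(scores, content_ids, admin_counts, target_props, administered):
    """Content-balanced selection: most under-represented content area first,
    then the highest-scoring unused item within it.

    scores       : per-item informativeness for the current examinee
                   (e.g., posterior-weighted KL information), computed elsewhere.
    content_ids  : content-area index (0..C-1) of each item in the pool.
    admin_counts : number of items already administered from each area.
    target_props : target share of the test for each area (sums to 1).
    administered : set of item indices already given to this examinee.
    """
    scores = np.asarray(scores, dtype=float)
    content_ids = np.asarray(content_ids)
    admin_counts = np.asarray(admin_counts, dtype=float)
    target_props = np.asarray(target_props, dtype=float)

    # Deficit: how far each area lags its target share once the next item is added.
    deficit = target_props * (admin_counts.sum() + 1) - admin_counts

    # Try areas from most to least under-represented; skip areas with no items left.
    for area in np.argsort(-deficit):
        pool = [i for i in range(len(scores))
                if content_ids[i] == area and i not in administered]
        if pool:
            return max(pool, key=lambda i: scores[i])
    raise ValueError("item pool exhausted")
```

In a two-phase design like the one described above, a rule of this kind would govern operational item selection while new, uncalibrated items are seeded separately for calibration; that seeding step is not modeled in this sketch.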
Award ID(s):
2322015 2141847 2142317
PAR ID:
10648459
Author(s) / Creator(s):
Publisher / Repository:
Journal of Educational Measurement
Date Published:
Journal Name:
Journal of Educational Measurement
ISSN:
0022-0655
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Results of a comprehensive simulation study are reported investigating the effects of sample size, test length, number of attributes, and base rate of mastery on item parameter recovery and classification accuracy of four DCMs (i.e., C-RUM, DINA, DINO, and the reduced LCDM). Effects were evaluated using bias and RMSE computed between true (i.e., generating) parameters and estimated parameters. Effects of simulated factors on attribute assignment were also evaluated using classification accuracy percentages. More precise estimates of item parameters were obtained with larger sample sizes and longer test lengths. Recovery of item parameters decreased as the number of attributes increased from three to five, whereas base rate of mastery had a varying effect on item parameter recovery. Item parameter recovery and classification accuracy were higher for the DINA and DINO models. (A minimal sketch of the DINA response function and the bias/RMSE recovery summaries appears after this list.)
  2. Numeracy—the ability to understand and use numeric information—is linked to good decision-making. Several problems exist with current numeracy measures, however. Depending on the participant sample, some existing measures are too easy or too hard; also, established measures often contain items well-known to participants. The current article aimed to develop new numeric understanding measures (NUMs) including a 1-item (1-NUM), 4-item (4-NUM), and 4-item adaptive measure (A-NUM). In a calibration study, 2 participant samples (n = 226 and 264 from Amazon's Mechanical Turk [MTurk]) each responded to half of 84 novel numeracy items. We calibrated items using 2-parameter logistic item response theory (IRT) models. Based on item parameters, we developed the 3 new numeracy measures. In a subsequent validation study, 600 MTurk participants completed the new numeracy measures, the adaptive Berlin Numeracy Test, and the Weller Rasch-Based Numeracy Test, in randomized order. To establish predictive and convergent validities, participants also completed judgment and decision tasks, Raven's progressive matrices, a vocabulary test, and demographics. Confirmatory factor analyses suggested that the 1-NUM, 4-NUM, and A-NUM load onto the same factor as existing measures. The NUM scales also showed association patterns with subjective numeracy and cognitive ability measures similar to those of established measures. Finally, they effectively predicted classic numeracy effects. In fact, based on power analyses, the A-NUM and 4-NUM appeared to confer more power to detect effects than existing measures. Thus, using IRT, we developed 3 brief numeracy measures, using novel items and without sacrificing construct scope. The measures can be downloaded as Qualtrics files (https://osf.io/pcegz/). (A toy 2PL response-function and maximum-information selection sketch appears after this list.)
  3. In group testing, the goal is to identify a subset of defective items within a larger set of items based on tests whose outcomes indicate whether at least one defective item is present. This problem is relevant in areas such as medical testing, DNA sequencing, communication protocols, and many more. In this paper, we study (i) a sparsity-constrained version of the problem, in which the testing procedure is subjected to one of the following two constraints: items are finitely divisible and thus may participate in at most $\gamma$ tests; or tests are size-constrained to pool no more than $\rho$ items per test; and (ii) a noisy version of the problem, where each test outcome is independently flipped with some constant probability. Under each of these settings, considering the for-each recovery guarantee with asymptotically vanishing error probability, we introduce a fast splitting algorithm and establish its near-optimality not only in terms of the number of tests, but also in terms of the decoding time. While the most basic formulations of our algorithms each require $\Omega(n)$ storage, we also provide low-storage variants based on hashing, with similar recovery guarantees. (A baseline noiseless binary-splitting sketch appears after this list.)
  4. Physics instructors and education researchers use research-based assessments (RBAs) to evaluate students' preparation for physics courses. This preparation can cover a wide range of constructs including mathematics and physics content. Using separate mathematics and physics RBAs consumes course time. We are developing a new RBA for introductory mechanics as an online test using both computerized adaptive testing and cognitive diagnostic models. This design allows the adaptive RBA to assess mathematics and physics content knowledge within a single assessment. In this article, we used an evidence-centered design framework to examine the extent to which our models of skills students develop in physics courses fit the data from three mathematics RBAs. Our dataset came from the LASSO platform and included 3,491 responses from the Calculus Concept Assessment, Calculus Concept Inventory, and Pre-calculus Concept Assessment. Our model included five skills: apply vectors, conceptual relationships, algebra, visualizations, and calculus. The "deterministic inputs, noisy 'and' gate" (DINA) analyses demonstrated a good fit for the five skills, and the classification accuracies for the skills were satisfactory. Including items from the three mathematics RBAs in the item bank for the adaptive RBA will provide a flexible assessment of these skills across mathematics and physics content areas that can adapt to instructors' needs. (A toy DINA attribute-profile classification sketch appears after this list.)
  5. Modeling student knowledge is important for assessment design, adaptive testing, curriculum design, and pedagogical intervention. The assessment design community has primarily focused on continuous latent-skill models with strong conditional independence assumptions among knowledge items, while the prerequisite discovery community has developed many models that aim to exploit the interdependence of discrete knowledge items. This paper attempts to bridge the gap by asking, "When does modeling assessment item interdependence improve predictive accuracy?" A novel adaptive testing evaluation framework is introduced that is amenable to techniques from both communities, and an efficient algorithm, Directed Item-Dependence And Confidence Thresholds (DIDACT), is introduced and compared with an Item-Response-Theory based model on several real and synthetic datasets. Experiments suggest that assessments with closely related questions benefit significantly from modeling item interdependence. 
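Item 1 above summarizes DCM parameter recovery in terms of bias and RMSE under the DINA model (among others). As a minimal illustration only, the sketch below shows the DINA item response function and the two recovery summaries; the names, shapes, and example values are ours, not the study's code.

```python
import numpy as np

def dina_prob(alpha, q, slip, guess):
    """DINA response probability for a single item.

    alpha : (K,) or (N, K) binary attribute profile(s).
    q     : (K,) binary Q-matrix row (attributes the item requires).
    An examinee who masters every required attribute answers correctly with
    probability 1 - slip; everyone else answers correctly with probability guess.
    """
    eta = np.all(np.asarray(alpha) >= np.asarray(q), axis=-1).astype(float)
    return guess + (1.0 - slip - guess) * eta

def bias_and_rmse(true_values, estimates):
    """Parameter-recovery summaries: mean signed error and root-mean-square error."""
    err = np.asarray(estimates, dtype=float) - np.asarray(true_values, dtype=float)
    return err.mean(), np.sqrt(np.mean(err ** 2))

# Toy check: a profile mastering both required attributes answers with P = 1 - slip.
print(dina_prob(alpha=[1, 1], q=[1, 1], slip=0.1, guess=0.2))  # -> 0.9 (up to float rounding)
```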
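Item 2 above calibrates items with 2-parameter logistic (2PL) IRT models and builds an adaptive measure (A-NUM). The sketch below is a generic 2PL response function with maximum-information item selection, offered only to make the idea concrete; it is not the A-NUM's actual selection logic, and all names and parameter values are illustrative.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function: P(correct | ability theta, discrimination a, difficulty b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def next_adaptive_item(theta_hat, a, b, administered):
    """Pick the unused item with maximum information at the current ability estimate."""
    info = item_information(theta_hat, np.asarray(a, dtype=float), np.asarray(b, dtype=float))
    info[list(administered)] = -np.inf          # exclude items already given
    return int(np.argmax(info))

# Toy pool of four items; an examinee near theta = 0.5 gets the best-matched remaining item next.
a_params, b_params = [1.2, 0.8, 1.5, 1.0], [-1.0, 0.0, 0.5, 1.5]
print(next_adaptive_item(0.5, a_params, b_params, administered={2}))
```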
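Item 3 above studies group testing with a fast splitting algorithm under divisibility, test-size, and noise constraints. Those variants are beyond a short snippet; as a baseline only, the sketch below shows the classic noiseless, adaptive binary-splitting idea with a toy test oracle (all names are illustrative).

```python
def binary_splitting(items, test):
    """Recover all defective items using a noiseless group-test oracle.

    test(group) returns True if the group contains at least one defective item.
    Groups that test negative are discarded wholesale; positive groups are split
    in half and tested recursively until single items are isolated.
    """
    if not items or not test(items):
        return []
    if len(items) == 1:
        return list(items)
    mid = len(items) // 2
    return binary_splitting(items[:mid], test) + binary_splitting(items[mid:], test)

# Toy example: items 3 and 7 are defective among 10.
defective = {3, 7}
oracle = lambda group: any(i in defective for i in group)
print(binary_splitting(list(range(10)), oracle))  # -> [3, 7]
```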
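Item 4 above reports DINA-based classification accuracy for five skills. To make the classification step concrete, the sketch below is a brute-force MAP classifier that scores every attribute profile (feasible for K = 5 skills); it is an illustration under our own naming, not the authors' estimation pipeline.

```python
import itertools
import numpy as np

def map_attribute_profile(responses, Q, slip, guess, prior=None):
    """Brute-force MAP classification of one examinee's attribute profile under DINA.

    responses : (J,) binary item responses.
    Q         : (J, K) binary Q-matrix.
    slip, guess : length-J item parameters.
    prior     : optional probabilities over all 2**K profiles; uniform if omitted.
    """
    Q = np.asarray(Q)
    profiles = np.array(list(itertools.product([0, 1], repeat=Q.shape[1])))  # (2^K, K)
    log_post = np.zeros(len(profiles)) if prior is None else np.log(np.asarray(prior, dtype=float))

    for x, q, s, g in zip(responses, Q, slip, guess):
        eta = np.all(profiles >= q, axis=1).astype(float)   # which profiles master the item's attributes
        p = g + (1.0 - s - g) * eta                         # P(correct) under DINA
        log_post += x * np.log(p) + (1 - x) * np.log(1.0 - p)

    return profiles[np.argmax(log_post)]
```

Classification accuracy in a simulation is then simply the share of examinees whose estimated profile matches the generating profile, attribute by attribute or as a whole pattern.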