Title: Developments in Psychometric Population Models for Technology-Based Large-Scale Assessments: An Overview of Challenges and Opportunities
International large-scale assessments (ILSAs) transitioned from paper-based assessments to computer-based assessments (CBAs), facilitating the use of new item types and more effective data collection tools. This transition allows the implementation of more complex test designs and the collection of process and response time (RT) data. These new data types can be used to improve data quality and the accuracy of test scores obtained through latent regression (population) models. However, the move to a CBA also poses challenges for comparability and trend measurement, one of the major goals in ILSAs. We provide an overview of current methods used in ILSAs to examine and assure the comparability of data across different assessment modes, and of methods that improve the accuracy of test scores by making use of new data types provided by a CBA.
Award ID(s):
1633353
PAR ID:
10208081
Journal Name:
Journal of Educational and Behavioral Statistics
Volume:
44
Issue:
6
ISSN:
1076-9986
Page Range / eLocation ID:
671 to 705
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Effective antiretroviral therapy (ART) has significantly reduced mortality of people living with HIV (PLWH), and the prevalence of at-risk alcohol use is higher among PLWH. Increased survival and aging of PLWH is associated with increased prevalence of metabolic comorbidities especially among menopausal women, and adipose tissue metabolic dysregulation may be a significant contributing factor. We examined the differential effects of chronic binge alcohol (CBA) administration and ovariectomy (OVX) on the omental adipose tissue (OmAT) proteome in a subset of simian immunodeficiency virus (SIV)-infected macaques of a longitudinal parent study. Quantitative discovery-based proteomics identified 1,429 differentially expressed proteins. Ingenuity Pathway Analysis (IPA) was used to calculate z-scores, or activation predictions, for functional pathways and diseases. Results revealed that protein changes associated with functional pathways centered around the “OmAT metaboproteome profile.” Based on z-scores, CBA did not affect functional pathways of metabolic disease but dysregulated proteins involved in adenosine monophosphate-activated protein kinase (AMPK) signaling and lipid metabolism. OVX-mediated proteome changes were predicted to promote pathways involved in glucose- and lipid-associated metabolic disease. Proteins involved in apoptosis, necrosis, and reactive oxygen species (ROS) pathways were also predicted to be activated by OVX and these were predicted to be inhibited by CBA. These results provide evidence for the role of ovarian hormone loss in mediating OmAT metaboproteome dysregulation in SIV and suggest that CBA modifies OVX-associated changes. In the context of OVX, CBA administration produced larger metabolic and cellular effects, which we speculate may reflect a protective role of estrogen against CBA-mediated adipose tissue injury in female SIV-infected macaques. 
  2. When large-scale assessment programs are developed and administered in a particular language, students from other native language backgrounds may experience considerable barriers to appropriate measurement of the targeted knowledge and skills. Empirical work is needed to determine whether one of the most commonly applied accommodations to address language barriers, namely extended test time limits, corresponds to score comparability for students who use it. Prior work has examined score comparability for English learners (ELs) eligible to use extended time on tests in the United States, but not specifically for those who show evidence of actually using the accommodation. NAEP process data were used to explore score comparability for two groups of ELs eligible for extended time: those who used extended time and those who did not. Analysis of differential item functioning (DIF) was applied to examine potential item bias for these groups when compared to a reference group of native English speakers. Items showing significant and large DIF were identified in both comparisons, with slightly more DIF items identified for the comparison involving ELs who used extended time. Item location and word counts were examined for those items displaying DIF, with results showing some alignment with the notion that language-related barriers may be present for ELs even when extended time is used. Overall, results point to a need for ongoing consideration of the unique needs of ELs during large-scale testing, and to the opportunities test process data offer for more comprehensive analyses of accommodation use and effectiveness.
  3. Data visualizations play a crucial role in communicating patterns in quantitative data, making data visualization literacy a key target of STEM education. However, it is currently unclear to what degree different assessments of data visualization literacy measure the same underlying constructs. Here, we administered two widely used graph comprehension assessments (Galesic and Garcia-Retamero in Med Dec Mak 31:444–457, 2011; Lee et al. in IEEE Trans Vis Comput Graph 23(1):551–560, 2016) to both a university-based convenience sample and a demographically representative sample of adult participants in the USA (N=1,113). Our analysis of individual variability in test performance suggests that overall scores are correlated between assessments and associated with the amount of prior coursework in mathematics. However, further exploration of individual error patterns suggests that these assessments probe somewhat distinct components of data visualization literacy, and we do not find evidence that these components correspond to the categories that guided the design of either test (e.g., questions that require retrieving values rather than making comparisons). Together, these findings suggest opportunities for development of more comprehensive assessments of data visualization literacy that are organized by components that better account for detailed behavioral patterns.
  4. Psychological test scores are commonly used in high-stakes settings to classify individuals. While measurement invariance across groups is necessary for valid and meaningful inferences of group differences, full measurement invariance rarely holds in practice. The classification accuracy analysis framework aims to quantify the degree and practical impact of noninvariance. However, how to best navigate the next steps remains unclear, and methods devised to account for noninvariance at the group level may be insufficient when the goal is classification. Furthermore, deleting a biased item may improve fairness but negatively affect performance, and replacing the test can be costly. We propose item-level effect size indices that allow test users to make more informed decisions by quantifying the impact of deleting (or retaining) an item on test performance and fairness, provide an illustrative example, and introduce unbiasr, an R package implementing the proposed methods. 
  5. Recent success in deep learning has partially been driven by training increasingly overparametrized networks on ever larger datasets. It is therefore natural to ask: how much of the data is superfluous, which examples are important for generalization, and how do we find them? In this work, we make the striking observation that, in standard vision datasets, simple scores averaged over several weight initializations can be used to identify important examples very early in training. We propose two such scores—the Gradient Normed (GraNd) and the Error L2-Norm (EL2N) scores—and demonstrate their efficacy on a range of architectures and datasets by pruning significant fractions of training data without sacrificing test accuracy. In fact, using EL2N scores calculated a few epochs into training, we can prune half of the CIFAR10 training set while slightly improving test accuracy. Furthermore, for a given dataset, EL2N scores from one architecture or hyperparameter configuration generalize to other configurations. Compared to recent work that prunes data by discarding examples that are rarely forgotten over the course of training, our scores use only local information early in training. We also use our scores to detect noisy examples and study training dynamics through the lens of important examples—we investigate how the data distribution shapes the loss surface and identify subspaces of the model’s data representation that are relatively stable over training.
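The Error L2-Norm (EL2N) score named in item 5 can be illustrated with a small sketch. The abstract does not spell out the formula, so the code below rests on a stated assumption: an example's score is the L2 norm of the difference between the model's softmax output and the one-hot label, averaged over several weight initializations. The function names (`el2n_score`, `mean_el2n`, `keep_highest`) and the rule of keeping the highest-scoring examples are hypothetical choices made for illustration, not the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def el2n_score(logits, label):
    """Assumed EL2N definition: L2 norm of (softmax output - one-hot label)."""
    probs = softmax(logits)
    squared_error = sum(
        (p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs)
    )
    return math.sqrt(squared_error)


def mean_el2n(logits_per_init, label):
    """Average the score over models trained from different initializations."""
    scores = [el2n_score(logits, label) for logits in logits_per_init]
    return sum(scores) / len(scores)


def keep_highest(scores, keep_fraction):
    """Return the indices of the examples kept after pruning.

    Assumption: low-scoring (easy) examples are pruned first, so the
    highest-scoring fraction of the data is retained.
    """
    n_keep = max(1, round(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

In use, `mean_el2n` would be called once per training example with the logits produced by each independently initialized model a few epochs into training, and `keep_highest` would then select the subset to retain.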