Title: Gender Bias in Test Item Formats: Evidence from PISA 2009, 2012, and 2015 Math and Reading Tests
Large-scale standardized tests are regularly used to measure student achievement overall and for student subgroups. These uses assume tests provide comparable measures of outcomes across student subgroups, but prior research suggests score comparisons across gender groups may be complicated by the type of test items used. This paper presents evidence that among nationally representative samples of 15-year-olds in the United States participating in the 2009, 2012, and 2015 PISA math and reading tests, there are consistent item-format-by-gender differences. On average, male students answer multiple-choice items correctly relatively more often and female students answer constructed-response items correctly relatively more often. These patterns were consistent across 34 additional participating PISA jurisdictions, although the size of the format differences varied and was larger on average in reading than in math. The average magnitude of the format differences is not large enough to be flagged in routine differential item functioning analyses intended to detect test bias, but it is large enough to raise questions about the validity of inferences based on comparisons of scores across gender groups. Researchers and other test users should account for test item format, particularly when comparing scores across gender groups.
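As a purely illustrative companion to the comparison described above, the sketch below averages the female-minus-male proportion-correct gap separately for multiple-choice and constructed-response items. It uses synthetic data and hypothetical column names; it is not the paper's analysis code.

    # Illustrative sketch only (hypothetical column names, synthetic data);
    # not the authors' PISA analysis.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n_students, n_items = 500, 20
    students = pd.DataFrame({
        "student": np.arange(n_students),
        "gender": rng.choice(["female", "male"], n_students),
    })
    items = pd.DataFrame({
        "item_id": np.arange(n_items),
        "format": rng.choice(["multiple_choice", "constructed_response"], n_items),
    })
    responses = students.merge(items, how="cross")
    responses["correct"] = rng.random(len(responses)) < 0.6  # placeholder scores

    # Proportion correct per item and gender, then the female-minus-male gap.
    p = (responses.groupby(["item_id", "format", "gender"])["correct"]
                  .mean()
                  .unstack("gender"))
    p["gap"] = p["female"] - p["male"]

    # Average gap by item format; the paper reports this gap is relatively
    # more favorable to female students on constructed-response items.
    print(p.groupby("format")["gap"].mean())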
Award ID(s):
1749275
PAR ID:
10433095
Author(s) / Creator(s):
Date Published:
Journal Name:
Journal of Educational Measurement
ISSN:
0022-0655
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Selected-response items and constructed-response (CR) items are often found in the same test. Conventional psychometric models for these two item types typically use only the scores for correctness of the responses. Recent research suggests, however, that CR items may carry more information than correctness scores alone. In this study, we describe an approach in which a statistical topic model and a diagnostic classification model (DCM) were applied to a mixed-format formative test of English Language Arts. The DCM was used to estimate students' mastery status on reading skills; these mastery statuses were then included in the topic model as covariates to predict students' use of each latent topic in their written answers to a CR item. This approach enabled investigation of the effects of reading-skill mastery status on writing patterns. Results indicated that one of the skills, Integration of Knowledge and Ideas, helped detect and explain students' writing patterns with respect to their use of individual topics.
  2. It is well established that access to social supports is essential for engineering students' persistence, yet access to supports varies across groups. Understanding the differential supports inherent in students' social networks, and then working to provide additional needed supports, can help the field of engineering education become more inclusive of all students. Our work contributes to this effort by examining the reliability and fairness of a social capital instrument, the Undergraduate Supports Survey (USS). We examined the extent to which two scales were reliable across ability levels (level of social capital), gender groups, and year-in-school. We fit two item response theory (IRT) models using a graded response model (standard forms of these models are sketched after this list) and performed differential item functioning (DIF) tests to detect item differences by gender and year-in-school. Our results indicate that most items have acceptable to good item discrimination and difficulty. DIF analysis shows that multiple items in the Expressive Support scale exhibit DIF across gender groups in favor of women and nonbinary engineering students. DIF analysis also shows that year-in-school has little to no effect on items, with only one DIF item. Therefore, engineering educators can use the USS confidently to examine expressive and instrumental social capital in undergraduates across year-in-school. Our work can be used by the engineering education research community to identify and address differences in students' access to support. We recommend that the engineering education community work to be explicit in their expressive and instrumental support. Future work will explore measurement invariance in Expressive Support items across gender.
  3. Education researchers often compare performance across race and gender on research-based assessments of physics knowledge to investigate the impacts of racism and sexism on physics student learning. These investigations' claims rely on research-based assessments providing reliable, unbiased measures of student knowledge across social identity groups. We used classical test theory and differential item functioning (DIF) analysis to examine whether the items on the Force Concept Inventory (FCI) provided unbiased data across social identifiers for race, gender, and their intersections (a minimal DIF sketch of the Mantel-Haenszel type appears after this list). The data were accessed through the Learning About STEM Student Outcomes platform and included posttest responses from 4,848 students in 152 calculus-based introductory physics courses at 16 institutions. The results indicated that the majority of items (22) on the FCI were biased toward a group. These results point to the need for instrument validation to account for item bias and for the identification or development of fair research-based assessments.
  4. Evans, T.; Marmur, O.; Hunter, J.; Leach, G. (Eds.)
    In college, algebra courses can be a barrier to degree completion. One reason for this is that algebra courses in college tend to focus on procedures disconnected from meaning-making (e.g., Goldrick-Rab, 2007). It is critical to connect procedural fluency with conceptual understanding (Kilpatrick et al., 2001). Several instruments test algebraic proficiency; however, none were designed to test a large body of algebraic conceptions and concepts. We address this gap by developing the Algebra Concept Inventory (ACI) to test college students' conceptual understanding in algebra. A total of 402 items were developed and tested in eight waves from spring 2019 to fall 2022, administered to 18,234 students enrolled in non-arithmetic-based mathematics classes at a large urban community college in the US. Data collection followed a common-item random groups equating design. Retrospective think-aloud interviews were conducted with 135 students to assess the construct validity of the items. 2PL IRT models were run on all waves; 63.4% of items (253) have at least moderate discrimination, and roughly one-third have high or very high discrimination. In all waves, peak instrument values have excellent reliability (R ≥ 0.9). Convergent validity was explored through the relationship between scores on the ACI and mathematics course level. Students in "mid"-level courses scored on average 0.35 SD higher than those in "low"-level courses; students in "high"-level courses scored on average 0.35 SD higher than those in "mid"-level courses, providing strong evidence of convergent validity. There was no consistent evidence of differential item functioning (DIF) related to examinee characteristics: race/ethnicity, gender, and English-language-learner status. Results suggest that algebraic conceptual understanding, as conceptualized by the ACI, is measurable. The final ACI is likely to differentiate between students of various mathematical levels without conflating characteristics such as race, gender, etc.
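For readers unfamiliar with the item response models named in entries 2 and 4 above, their standard textbook forms are reproduced below; the cited studies may use different parameterizations.

    % Two-parameter logistic (2PL) model: probability that examinee i with
    % ability \theta_i answers dichotomous item j correctly, given item
    % discrimination a_j and difficulty b_j.
    P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\left[-a_j(\theta_i - b_j)\right]}

    % Samejima's graded response model for ordered categories k = 1, ..., K_j:
    % cumulative probability of scoring in category k or above, with ordered
    % thresholds b_{j1} < b_{j2} < \dots < b_{j, K_j - 1}.
    P(X_{ij} \ge k \mid \theta_i) = \frac{1}{1 + \exp\left[-a_j(\theta_i - b_{jk})\right]}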
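Several entries above, as well as the main abstract, rely on differential item functioning (DIF) screens. The sketch below implements a basic Mantel-Haenszel DIF statistic on synthetic data with hypothetical column names; it is illustrative only and is not drawn from any of the cited studies, which may use other DIF procedures.

    # Illustrative Mantel-Haenszel DIF screen (hypothetical column names,
    # synthetic data); not taken from any of the cited studies.
    import numpy as np
    import pandas as pd

    def mantel_haenszel_delta(item_df):
        """ETS delta-MH for one item, stratifying on the matching total score."""
        num = den = 0.0
        for _, s in item_df.groupby("total"):
            n = len(s)
            a = ((s["group"] == "reference") & (s["correct"] == 1)).sum()  # ref right
            b = ((s["group"] == "reference") & (s["correct"] == 0)).sum()  # ref wrong
            c = ((s["group"] == "focal") & (s["correct"] == 1)).sum()      # focal right
            d = ((s["group"] == "focal") & (s["correct"] == 0)).sum()      # focal wrong
            num += a * d / n
            den += b * c / n
        alpha = num / den if den > 0 else np.nan   # MH common odds ratio
        return -2.35 * np.log(alpha)               # ETS delta metric; |delta| > 1.5 suggests large DIF

    # Tiny synthetic demonstration (random responses, so deltas hover near 0).
    rng = np.random.default_rng(1)
    responses = pd.DataFrame({
        "item_id": rng.integers(0, 5, 2000),
        "group": rng.choice(["focal", "reference"], 2000),
        "correct": rng.integers(0, 2, 2000),
        "total": rng.integers(0, 10, 2000),
    })
    print(responses.groupby("item_id").apply(mantel_haenszel_delta))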