

Title: Flip it: An exploratory (versus explanatory) sequential mixed methods design using Delphi and differential item functioning to evaluate item bias.
The Delphi method has been adapted to inform item refinements in educational and psychological assessment development. An explanatory sequential mixed methods design using Delphi is a common approach to gain experts' insight into why items might have exhibited differential item functioning (DIF) for a sub-group, indicating potential item bias. Use of Delphi before quantitative field testing, to screen for potential sources of item bias, is lacking in the literature. An exploratory sequential design is illustrated as an additional approach using a Delphi technique in Phase I and Rasch DIF analyses in Phase II. We introduce the 2 × 2 Concordance Integration Typology as a systematic way to examine agreement and disagreement across the qualitative and quantitative findings using a concordance joint display table. A worked example from the development of the Problem-Solving Measures Grades 6–8 Computer Adaptive Tests supported using an exploratory sequential design to inform item refinement. The 2 × 2 Concordance Integration Typology (a) crystallized instances where additional refinements were potentially needed and (b) provided a means of evaluating the distribution of bias across the set of items as a whole. Implications are discussed for advancing data integration techniques and using mixed methods to improve instrument development.
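The 2 × 2 typology crosses a qualitative flag (experts anticipated a source of bias) with a quantitative flag (Rasch analyses detected DIF). A minimal sketch of how items might be tallied into the four cells follows; the item IDs, flag values, and cell labels are illustrative assumptions, not data from the study:

```python
from collections import Counter

def concordance_cell(expert_flagged: bool, dif_flagged: bool) -> str:
    """Place one item into a cell of a 2x2 concordance table:
    qualitative (expert) flag crossed with quantitative (DIF) flag."""
    if expert_flagged and dif_flagged:
        return "concordant: flagged by both"
    if not expert_flagged and not dif_flagged:
        return "concordant: flagged by neither"
    if expert_flagged:
        return "discordant: expert-only flag"
    return "discordant: DIF-only flag"

# Hypothetical item records: (item_id, expert_flagged, dif_flagged)
items = [("PSM-01", True, True), ("PSM-02", False, False),
         ("PSM-03", True, False), ("PSM-04", False, True)]

table = Counter(concordance_cell(e, d) for _, e, d in items)
for cell, n in sorted(table.items()):
    print(f"{cell}: {n}")
```

The two discordant cells are the ones that would prompt a second look at an item, since the qualitative and quantitative evidence disagree there.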
Award ID(s):
2100988
NSF-PAR ID:
10428384
Date Published:
Journal Name:
Methods in Psychology
Volume:
8
ISSN:
2590-2601
Page Range / eLocation ID:
100-117
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Standards for Educational and Psychological Testing were developed by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA et al., 2014). The Standards specify that assessment developers establish five types of validity evidence: test content, response processes, internal structure, relationship to other variables, and consequential/bias. Relevant to this proposal is consequential validity evidence that identifies the potential negative impact of testing or bias. Standard 3.1 of The Standards (2014) on fairness in testing states that "those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant sub-groups in the intended populations" (p. 63). Three types of bias include construct, method, and item bias (Boer et al., 2018). Testing for differential item functioning (DIF) is a standard analysis adopted to detect item bias against a subgroup (Boer et al., 2018). Example subgroups include gender, race/ethnic group, socioeconomic status, native language, or disability. DIF is when "equally able test takers differ in their probabilities of answering a test item correctly as a function of group membership" (AERA et al., 2005, p. 51). DIF indicates systematic error, as distinct from true mean group differences (Camilli & Shepard, 1994). Items exhibiting significant DIF are removed, or are reviewed for sources of bias to determine what modifications would allow an item to be retained and tested further. The Delphi technique is an emergent systematic research method whereby expert panel members review item content through an iterative process (Yildirim & Büyüköztürk, 2018). 
Experts independently evaluate each item for potential sources leading to DIF, researchers group their responses, and experts then independently complete a survey to rate their level of agreement with the anonymously grouped responses. This process continues until saturation and consensus are reached among experts as established through some criterion (e.g., median agreement rating, item quartile range, and percent agreement). The technique allows researchers to "identify, learn, and share the ideas of experts by searching for agreement among experts" (Yildirim & Büyüköztürk, 2018, p. 451). Research has illustrated this technique applied after DIF is detected, but not before administering items in the field. The current research is a methodological illustration of the Delphi technique applied in the item construction phase of assessment development as part of a five-year study to develop and test new problem-solving measures (PSM; Bostic et al., 2015, 2017) for U.S. grades 6–8 in a computer adaptive testing environment. As part of an iterative design-science-based methodology (Middleton et al., 2008), we illustrate the integration of the Delphi technique into the item writing process. Results from two three-person panels each reviewing a set of 45 PSM items are used to illustrate the technique. Advantages and limitations identified through a survey by participating experts and researchers are outlined to advance the method. 
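The consensus criteria named above (e.g., percent agreement and quartile range) can be sketched as a simple stopping rule for one item. The 1–5 agreement scale, the thresholds, and the example ratings below are assumptions for illustration, not the panels' actual criteria:

```python
import statistics

def delphi_consensus(ratings, agree_threshold=4, pct_cutoff=0.75, iqr_cutoff=1.0):
    """Judge whether one item has reached consensus on a 1-5 agreement
    scale, using two common Delphi criteria: the share of experts rating
    at or above a threshold, and the interquartile range of the ratings."""
    pct_agree = sum(r >= agree_threshold for r in ratings) / len(ratings)
    q1, _, q3 = statistics.quantiles(ratings, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    return pct_agree >= pct_cutoff and iqr <= iqr_cutoff

print(delphi_consensus([4, 5, 4, 4, 5, 4]))  # unanimous agreement, narrow spread
print(delphi_consensus([1, 5, 2, 4, 5, 3]))  # split panel, another round needed
```

Items failing the rule would go through another anonymous feedback round, consistent with the iterative process described above.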
  2. When measuring academic skills among students whose primary language is not English, standardized assessments are often provided in languages other than English. The degree to which alternate-language test translations yield unbiased, equitable assessment must be evaluated; however, traditional methods of investigating measurement equivalence are susceptible to confounding group differences. The primary purposes of this study were to investigate differential item functioning (DIF) and item bias across Spanish and English forms of an assessment of early mathematics skills. Secondary purposes were to investigate the presence of selection bias and demonstrate a novel approach for investigating DIF that uses a regression discontinuity design framework to control for selection bias. Data were drawn from 1,750 Spanish-speaking Kindergarteners participating in the Early Childhood Longitudinal Study, Kindergarten Class of 1998–1999, who were administered either the Spanish or English version of the mathematics assessment based on their performance on an English language screening measure. Evidence of selection bias—differences between groups in SES, age, approaches to learning, self-control, social interaction, country of birth, childcare, household composition and number in the home, books in the home, and parent involvement—highlighted limitations of a traditional approach for investigating DIF that only controlled for ability. When controlling for selection bias, only 11% of items displayed DIF, and subsequent examination of item content did not suggest item bias. Results provide evidence that the Spanish translation of the ECLS-K mathematics assessment is an equitable and unbiased assessment accommodation for young dual language learners. 
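The study's regression discontinuity framework for controlling selection bias is more involved, but the traditional ability-matched screening it improves upon can be illustrated with a Mantel-Haenszel common odds ratio across matched-ability strata. The counts below are hypothetical, not drawn from the ECLS-K data:

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across score strata.
    Each stratum is a 2x2 table (a, b, c, d):
      a = reference group correct, b = reference group incorrect,
      c = focal group correct,     d = focal group incorrect."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical counts for one item at three matched-ability strata
strata = [(30, 10, 25, 15), (40, 20, 35, 25), (50, 30, 45, 35)]
or_mh = mantel_haenszel_or(strata)
print(f"MH common odds ratio: {or_mh:.2f}")  # values near 1 suggest little DIF
```

Because matching is on observed ability alone, this screen is exactly the kind of analysis the study shows can be confounded by selection bias when group membership is determined by a screening cutoff.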
  3. Abstract  
  4. Phenomena‐based approaches have become popular for elementary school teachers to engage children's innate curiosity in the natural world. However, integrating such phenomena‐based approaches in existing science courses within teacher education programs presents potential challenges for both preservice elementary teachers (PSETs) and for laboratory instructors, both of whom may have had limited opportunities to learn or teach science within the student and instructor roles inherent within these approaches. This study uses a convergent parallel mixed‐methods approach to investigate PSETs' perceptions of their laboratory instructor's role within a Physical Science phenomena‐based laboratory curriculum and how it impacts their conceptual development (2 instructors/121 students). We also examine how the two laboratory instructors' discursive moves within the laboratory align with their own and PSETs' perceptions of the instructor role. Qualitative data include triangulation between a student questionnaire, an instructor questionnaire, and video classroom observations, while quantitative data include a nine‐item open response pre‐/post‐semester conceptual test. Guided by Mortimer and Scott's analytic framework, our findings show that students primarily perceive their instructors as a guide/facilitator or an authoritarian/evaluator. Using Linn's knowledge integration framework, analysis of pre‐/post‐tests indicates that student outcomes align with students' perceptions of their instructors, with students who perceive their instructor as a guide/facilitator having significantly better pre‐/post‐outcomes. 
Additional analysis of scientific discourse from the classroom observations illustrates how one instructor primarily supports PSETs' perspectives on authentic science learning through dialogic–interactive talk moves whereas the other instructor epistemologically stifles personally relevant investigations with authoritative–interactive or authoritative–noninteractive discourse moves. Overall, this study concludes by discussing challenges facing laboratory instructors that need careful consideration for phenomena‐based approaches. 
  5. Sun, Daner (Ed.)
    The use of online instruction for undergraduate STEM courses is growing rapidly. While researchers and practitioners have access to validated instruments for studying the practice of teaching in face-to-face classrooms, analogous tools do not yet exist for online instruction. These tools are needed for quality design and control purposes. To meet this need, this project developed an observational protocol that can be used to collect non-evaluative data for the description, study, and improvement of online, undergraduate STEM courses. The development of this instrument used a sequential exploratory mixed methods approach to the research, design, pilot-testing, refinement, and implementation of the protocol. Pairs of researchers tested the final version of this instrument, observing completed online undergraduate STEM courses. Across 2,394 pairs of observations, the observers recorded the same indication (yes or no to the presence of some course element) 1,853 times, for an agreement rate of 77.4%, falling above the 75% threshold for an acceptable level of agreement. There was a wide range in inter-rater reliability rates among items, and further revisions were made to the instrument. This foundational work-in-progress instrument should be further developed and used by practitioners who are interested in learning about and reflecting on their online teaching practice. 
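The reported agreement rate is straightforward to reproduce from the counts given: 1,853 matching indications out of 2,394 paired observations, checked against the stated 75% acceptability threshold:

```python
def percent_agreement(n_same, n_pairs, threshold=0.75):
    """Simple inter-rater percent agreement with an acceptability check."""
    rate = n_same / n_pairs
    return rate, rate >= threshold

# Figures reported in the abstract above
rate, acceptable = percent_agreement(1853, 2394)
print(f"{rate:.1%} agreement; acceptable: {acceptable}")  # 77.4% agreement; acceptable: True
```

Note that raw percent agreement does not correct for chance agreement, which is one reason the item-level reliability rates mentioned above could still vary widely.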