

Title: Adaptation of the Delphi technique in the development of assessments of problem-solving in computer adaptive testing environments (DEAP-CAT).
The Standards for Educational and Psychological Testing were developed by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA et al., 2014). The Standards specify that assessment developers establish five types of validity evidence: test content, response processes, internal structure, relations to other variables, and consequences of testing/bias. Relevant to this proposal is consequential validity evidence that identifies the potential negative impact of testing or bias. Standard 3.1 of the Standards (2014) on fairness in testing states that "those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant sub-groups in the intended populations" (p. 63). Three types of bias are construct, method, and item bias (Boer et al., 2018). Testing for differential item functioning (DIF) is a standard analysis adopted to detect item bias against a subgroup (Boer et al., 2018); example subgroups include gender, race/ethnicity, socioeconomic status, native language, and disability. DIF occurs when "equally able test takers differ in their probabilities of answering a test item correctly as a function of group membership" (AERA et al., 2014, p. 51). DIF indicates systematic error, as distinct from true mean group differences (Camilli & Shepard, 1994). Items exhibiting significant DIF are removed, or are reviewed for sources of bias to determine modifications that would allow the item to be retained and tested further. The Delphi technique is an emergent systematic research method whereby expert panel members review item content through an iterative process (Yildirim & Büyüköztürk, 2018).
Experts independently evaluate each item for potential sources leading to DIF, researchers group their responses, and experts then independently complete a survey to rate their level of agreement with the anonymously grouped responses. This process continues until saturation and consensus are reached among experts, as established through some criterion (e.g., median agreement rating, interquartile range, or percent agreement). The technique allows researchers to "identify, learn, and share the ideas of experts by searching for agreement among experts" (Yildirim & Büyüköztürk, 2018, p. 451). Research has illustrated this technique applied after DIF is detected, but not before items are administered in the field. The current research is a methodological illustration of the Delphi technique applied in the item-construction phase of assessment development, as part of a five-year study to develop and test new problem-solving measures (PSM; Bostic et al., 2015, 2017) for U.S. grades 6-8 in a computer adaptive testing environment. As part of an iterative design-science-based methodology (Middleton et al., 2008), we illustrate the integration of the Delphi technique into the item-writing process. Results from two three-person panels, each reviewing a set of 45 PSM items, are utilized to illustrate the technique. Advantages and limitations identified through a survey of participating experts and researchers are outlined to advance the method.
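The Delphi consensus criteria named above (median agreement rating, interquartile range, and percent agreement) are straightforward to compute. The sketch below uses invented ratings from a hypothetical three-person panel; the item names, the 1-5 rating scale, and the agreement threshold are illustrative assumptions, not values from the study.

```python
from statistics import median, quantiles

# Hypothetical 1-5 agreement ratings from a three-person expert panel
# for two illustrative items (names and values are invented).
ratings = {
    "item_01": [4, 5, 4],
    "item_02": [2, 5, 3],
}

def consensus_summary(scores, agree_threshold=4):
    """Summarize common Delphi consensus criteria for one item."""
    q1, _, q3 = quantiles(scores, n=4)  # quartile cut points
    return {
        "median": median(scores),
        "iqr": q3 - q1,  # interquartile range of the ratings
        "pct_agree": sum(s >= agree_threshold for s in scores) / len(scores),
    }

for item, scores in ratings.items():
    print(item, consensus_summary(scores))
```

On a 1-5 agreement scale, one common (but study-specific) consensus rule is a median of at least 4, an interquartile range of at most 1, and at least 75% of experts rating at or above the threshold.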
Award ID(s):
2100988
PAR ID:
10331249
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the International Conference of Education, Research and Innovation
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Delphi method has been adapted to inform item refinements in educational and psychological assessment development. An explanatory sequential mixed methods design using Delphi is a common approach to gain experts' insight into why items might have exhibited differential item functioning (DIF) for a sub-group, indicating potential item bias. Use of Delphi before quantitative field testing to screen for potential sources leading to item bias is lacking in the literature. An exploratory sequential design is illustrated as an additional approach using a Delphi technique in Phase I and Rasch DIF analyses in Phase II. We introduce the 2 × 2 Concordance Integration Typology as a systematic way to examine agreement and disagreement across the qualitative and quantitative findings using a concordance joint display table. A worked example from the development of the Problem-Solving Measures Grades 6–8 Computer Adaptive Tests supported using an exploratory sequential design to inform item refinement. The 2 × 2 Concordance Integration Typology (a) crystallized instances where additional refinements were potentially needed and (b) provided for evaluating the distribution of bias across the set of items as a whole. Implications are discussed for advancing data integration techniques and using mixed methods to improve instrument development. 
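The 2 × 2 Concordance Integration Typology described above crosses a qualitative flag (a Delphi panel concern) with a quantitative flag (Rasch DIF detection). A minimal sketch of that classification, with invented item names and flags, might look like:

```python
# Illustrative sketch of a 2x2 concordance classification: each item is
# crossed on (a) whether the Delphi panel flagged a potential source of
# bias and (b) whether Rasch DIF analysis flagged the item. All item
# names and flag values below are invented for illustration.

def concordance_cell(delphi_flag: bool, dif_flag: bool) -> str:
    """Assign an item to one cell of the 2x2 typology."""
    if delphi_flag and dif_flag:
        return "concordant: flagged by both"
    if not delphi_flag and not dif_flag:
        return "concordant: flagged by neither"
    if delphi_flag:
        return "discordant: qualitative flag only"
    return "discordant: quantitative flag only"

items = {
    "item_01": (True, True),
    "item_02": (True, False),
    "item_03": (False, True),
    "item_04": (False, False),
}

for name, (delphi, dif) in items.items():
    print(name, "->", concordance_cell(delphi, dif))
```

The discordant cells are the interesting ones for item refinement: a qualitative-only flag suggests a concern the field data did not bear out, while a quantitative-only flag points to a source of DIF the panel did not anticipate.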
  2. Miller, B; Martin, C (Ed.)
    Assessment continues to be an important conversation point within Science, Technology, Engineering, and Mathematics (STEM) education scholarship and practice (Krupa et al., 2019; National Research Council, 2001). There are guidelines for developing and evaluating assessments (e.g., AERA et al., 2014; Carney et al., 2022; Lavery et al., 2019; Wilson & Wilmot, 2019). There are also Standards for Educational & Psychological Testing (Standards; AERA et al., 2014) that discuss important relevant frameworks and information about using assessment results and interpretations. Quantitative assessments are used as part of daily STEM instruction, STEM research, and STEM evaluation; therefore, having robust assessments is necessary (National Research Council, 2001). An aim of this editorial is to give readers a few relevant ideas about modern assessment research, some guidance for the use of quantitative assessments, and framing validation and assessment research as equity-forward work.
  4. When measuring academic skills among students whose primary language is not English, standardized assessments are often provided in languages other than English. The degree to which alternate-language test translations yield unbiased, equitable assessment must be evaluated; however, traditional methods of investigating measurement equivalence are susceptible to confounding group differences. The primary purposes of this study were to investigate differential item functioning (DIF) and item bias across Spanish and English forms of an assessment of early mathematics skills. Secondary purposes were to investigate the presence of selection bias and demonstrate a novel approach for investigating DIF that uses a regression discontinuity design framework to control for selection bias. Data were drawn from 1,750 Spanish-speaking Kindergarteners participating in the Early Childhood Longitudinal Study, Kindergarten Class of 1998–1999, who were administered either the Spanish or English version of the mathematics assessment based on their performance on an English language screening measure. Evidence of selection bias—differences between groups in SES, age, approaches to learning, self-control, social interaction, country of birth, childcare, household composition and number in the home, books in the home, and parent involvement—highlighted limitations of a traditional approach for investigating DIF that only controlled for ability. When controlling for selection bias, only 11% of items displayed DIF, and subsequent examination of item content did not suggest item bias. Results provide evidence that the Spanish translation of the ECLS-K mathematics assessment is an equitable and unbiased assessment accommodation for young dual language learners. 
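As a concrete illustration of the DIF screening discussed in these abstracts, the sketch below computes the Mantel-Haenszel common odds ratio, a standard DIF statistic calculated across matched ability strata. The counts are invented, and this is not the specific analysis (e.g., the regression discontinuity approach) used in any of the studies above.

```python
import math

# Invented 2x2 counts per ability stratum for one item:
# (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
strata = [
    (40, 10, 35, 15),
    (30, 20, 25, 25),
    (15, 35, 10, 40),
]

def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across ability strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

alpha = mh_odds_ratio(strata)
# ETS delta scale: -2.35 * ln(alpha); values near 0 suggest negligible DIF,
# negative values indicate the item favors the reference group.
delta = -2.35 * math.log(alpha)
print(f"MH odds ratio = {alpha:.3f}, ETS delta = {delta:.3f}")
```

An odds ratio of 1 (delta of 0) means equally able test takers in the two groups have the same odds of answering correctly; items far from that point are flagged for content review.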
  5. Abstract When large-scale assessment programs are developed and administered in a particular language, students from other native language backgrounds may experience considerable barriers to appropriate measurement of the targeted knowledge and skills. Empirical work is needed to determine whether one of the most commonly applied accommodations to address language barriers, namely extended test time limits, corresponds to score comparability for students who use it. Prior work has examined score comparability for English learners (ELs) eligible to use extended time on tests in the United States, but not specifically for those who show evidence of actually using the accommodation. NAEP process data were used to explore score comparability for two groups of ELs eligible for extended time: those who used extended time and those who did not. Analysis of differential item functioning (DIF) was applied to examine potential item bias for these groups when compared to a reference group of native English speakers. Items showing significant and large DIF were identified in both comparisons, with slightly more DIF items identified for the comparison involving ELs who used extended time. Item location and word counts were examined for those items displaying DIF, with results showing some alignment with the notion that language-related barriers may be present for ELs even when extended time is used. Overall, results point to a need for ongoing consideration of the unique needs of ELs during large-scale testing, and to the opportunities test process data offer for more comprehensive analyses of accommodation use and effectiveness.