Title: Adaptation of the Delphi technique in the development of assessments of problem-solving in computer adaptive testing environments (DEAP-CAT).
The Standards for Educational and Psychological Testing were developed by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA et al., 2014). The Standards specify that assessment developers establish five types of validity evidence: test content, response processes, internal structure, relationship to other variables, and consequential/bias. Relevant to this proposal is consequential validity evidence that identifies the potential negative impact of testing or bias. Standard 3.1 of The Standards (2014) on fairness in testing states that “those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant sub-groups in the intended populations” (p. 63). Three types of bias are construct, method, and item bias (Boer et al., 2018). Testing for differential item functioning (DIF) is a standard analysis adopted to detect item bias against a subgroup (Boer et al., 2018). Example subgroups include gender, race/ethnic group, socioeconomic status, native language, and disability. DIF occurs when “equally able test takers differ in their probabilities answering a test item correctly as a function of group membership” (AERA et al., 2005, p. 51). DIF indicates systematic error rather than true mean group differences (Camilli & Shepard, 1994). Items exhibiting significant DIF are removed or reviewed for sources of bias to determine what modifications would allow an item to be retained and tested further. The Delphi technique is an emergent systematic research method whereby expert panel members review item content through an iterative process (Yildirim & Büyüköztürk, 2018). Experts independently evaluate each item for potential sources of DIF, researchers group their responses, and experts then independently complete a survey to rate their level of agreement with the anonymously grouped responses. This process continues until saturation and consensus are reached among experts, as established through predetermined criteria (e.g., median agreement rating, interquartile range, and percent agreement). The technique allows researchers to “identify, learn, and share the ideas of experts by searching for agreement among experts” (Yildirim & Büyüköztürk, 2018, p. 451). Prior research has illustrated this technique applied after DIF is detected, but not before items are administered in the field. The current research is a methodological illustration of the Delphi technique applied in the item-construction phase of assessment development, as part of a five-year study to develop and test new problem-solving measures (PSM; Bostic et al., 2015, 2017) for U.S. grades 6-8 in a computer adaptive testing environment. As part of an iterative design-science-based methodology (Middleton et al., 2008), we illustrate the integration of the Delphi technique into the item-writing process. Results from two three-person panels, each reviewing a set of 45 PSM items, are used to illustrate the technique. Advantages and limitations identified through a survey of participating experts and researchers are outlined to advance the method.
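As a concrete illustration of the consensus criteria named in the abstract (median agreement rating, interquartile range, and percent agreement), the sketch below computes them for one item's expert ratings. The 1-5 rating scale, thresholds, and function name are illustrative assumptions, not the study's actual decision rules.

```python
# Minimal sketch of Delphi consensus criteria for a single item.
# Scale, thresholds, and data layout are assumptions for illustration only.
import numpy as np

def consensus_summary(ratings, agree_threshold=4, pct_cutoff=0.80):
    """ratings: expert agreement ratings for one item (assumed 1-5 Likert)."""
    ratings = np.asarray(ratings, dtype=float)
    median = np.median(ratings)
    q1, q3 = np.percentile(ratings, [25, 75])
    iqr = q3 - q1
    pct_agree = np.mean(ratings >= agree_threshold)
    # Assumed decision rule: high median, narrow spread, and high agreement.
    reached = (median >= agree_threshold) and (iqr <= 1.0) and (pct_agree >= pct_cutoff)
    return {"median": median, "IQR": iqr, "pct_agree": pct_agree, "consensus": reached}

# Example: one PSM item rated by a three-person panel in a single Delphi round.
print(consensus_summary([4, 5, 4]))
```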
Award ID(s):
2100988
PAR ID:
10331249
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the International Conference of Education, Research and Innovation
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Miller, B; Martin, C (Ed.)
    Assessment continues to be an important conversation point within Science, Technology, Engineering, and Mathematics (STEM) education scholarship and practice (Krupa et al., 2019; National Research Council, 2001). There are guidelines for developing and evaluating assessments (e.g., AERA et al., 2014; Carney et al., 2022; Lavery et al., 2019; Wilson & Wilmot, 2019). There are also Standards for Educational & Psychological Testing (Standards; AERA et al., 2014) that discuss important relevant frameworks and information about using assessment results and interpretations. Quantitative assessments are used as part of daily STEM instruction, STEM research, and STEM evaluation; therefore, having robust assessments is necessary (National Research Council, 2001). An aim of this editorial is to give readers a few relevant ideas about modern assessment research, some guidance for the use of quantitative assessments, and framing validation and assessment research as equity-forward work.
  2. When measuring academic skills among students whose primary language is not English, standardized assessments are often provided in languages other than English. The degree to which alternate-language test translations yield unbiased, equitable assessment must be evaluated; however, traditional methods of investigating measurement equivalence are susceptible to confounding group differences. The primary purposes of this study were to investigate differential item functioning (DIF) and item bias across Spanish and English forms of an assessment of early mathematics skills. Secondary purposes were to investigate the presence of selection bias and demonstrate a novel approach for investigating DIF that uses a regression discontinuity design framework to control for selection bias. Data were drawn from 1,750 Spanish-speaking Kindergarteners participating in the Early Childhood Longitudinal Study, Kindergarten Class of 1998–1999, who were administered either the Spanish or English version of the mathematics assessment based on their performance on an English language screening measure. Evidence of selection bias—differences between groups in SES, age, approaches to learning, self-control, social interaction, country of birth, childcare, household composition and number in the home, books in the home, and parent involvement—highlighted limitations of a traditional approach for investigating DIF that only controlled for ability. When controlling for selection bias, only 11% of items displayed DIF, and subsequent examination of item content did not suggest item bias. Results provide evidence that the Spanish translation of the ECLS-K mathematics assessment is an equitable and unbiased assessment accommodation for young dual language learners. 
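    As a minimal sketch of the general idea described above, the code below tests one item for uniform DIF while conditioning on the screening (assignment) variable that determined form placement, not on ability alone. The simulated data, variable names, cutoff at zero, and model specification are assumptions for illustration; this is not the authors' exact regression discontinuity model.

```python
# Illustrative (hypothetical) uniform-DIF check that conditions on the assignment
# variable behind group membership, in the spirit of a regression discontinuity framing.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
screen = rng.normal(size=n)                 # English screening score (assignment variable)
group = (screen < 0).astype(int)            # 1 = Spanish form, assigned by an assumed cutoff
ability = 0.5 * screen + rng.normal(size=n) # matching (ability) variable
logit = 1.2 * ability - 0.3                 # simulated item with no true DIF
item = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Logistic regression: item correctness on ability, the assignment variable, and group.
X = sm.add_constant(np.column_stack([ability, screen, group]))
fit = sm.Logit(item, X).fit(disp=0)
print("group coefficient:", fit.params[-1], "p-value:", fit.pvalues[-1])
# A large, significant group coefficient after these controls would suggest uniform DIF.
```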
  3. When large-scale assessment programs are developed and administered in a particular language, students from other native language backgrounds may experience considerable barriers to appropriate measurement of the targeted knowledge and skills. Empirical work is needed to determine whether one of the most commonly applied accommodations to address language barriers, namely extended test time limits, corresponds to score comparability for students who use it. Prior work has examined score comparability for English learners (ELs) eligible to use extended time on tests in the United States, but not specifically for those who show evidence of actually using the accommodation. NAEP process data were used to explore score comparability for two groups of ELs eligible for extended time: those who used extended time and those who did not. Analysis of differential item functioning (DIF) was applied to examine potential item bias for these groups when compared to a reference group of native English speakers. Items showing significant and large DIF were identified in both comparisons, with slightly more DIF items identified for the comparison involving ELs who used extended time. Item location and word counts were examined for those items displaying DIF, with results showing some alignment with the notion that language-related barriers may be present for ELs even when extended time is used. Overall, results point to a need for ongoing consideration of the unique needs of ELs during large-scale testing, and to the opportunities test process data offer for more comprehensive analyses of accommodation use and effectiveness.
  4. Problem solving is central to mathematics learning (NCTM, 2014). Assessments are needed that appropriately measure students’ problem-solving performance. More importantly, assessments must be grounded in robust validity evidence that justifies their interpretations and outcomes (AERA et al., 2014). Thus, measures that are grounded in validity evidence are warranted for use by practitioners and scholars. The purpose of this presentation is to convey validity evidence for a new measure titled the Problem-Solving Measure for grade four (PSM4). The research question is: What validity evidence supports PSM4 administration? The PSM4 is one assessment within the previously published PSM series designed for elementary and middle grades students. Problems are grounded in Schoenfeld’s (2011) framework and rely upon Verschaffel et al.’s (1999) perspective that word problems should be open, complex, and realistic. The mathematics in the problems is tied to USA grade-level content and practice standards (CCSSI, 2010).
  5. This study investigated uniform differential item functioning (DIF) detection in response times. We proposed a regression analysis approach with both the working speed and the group membership as independent variables, and logarithm-transformed response times as the dependent variable. Effect size measures such as ΔR² and the percentage change in regression coefficients, in conjunction with statistical significance tests, were used to flag DIF items. A simulation study was conducted to assess the performance of three DIF detection criteria: (a) significance test, (b) significance test with ΔR², and (c) significance test with the percentage change in regression coefficients. The simulation study considered factors such as sample sizes, proportion of the focal group in relation to total sample size, number of DIF items, and the amount of DIF. The results showed that the significance test alone was too strict; using the percentage change in regression coefficients as an effect size measure reduced the flagging rate when the sample size was large, but the effect was inconsistent across different conditions; using ΔR² with the significance test reduced the flagging rate and was fairly consistent. The PISA 2018 data were used to illustrate the performance of the proposed method in a real dataset. Furthermore, we provide guidelines for conducting DIF studies with response time.
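    The sketch below illustrates the flagging logic summarized above for a single item: regress log response times on working speed with and without group membership, then combine a significance test on the group coefficient with a ΔR² effect-size cutoff. The simulated data and the 0.02 cutoff are assumptions for illustration, not values taken from the study.

```python
# Minimal sketch of uniform DIF detection in (log) response times for one item:
# significance test on group membership plus a Delta R^2 effect-size criterion.
# Data generation and the 0.02 cutoff are assumptions for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 800
speed = rng.normal(size=n)                     # working-speed estimate
group = rng.binomial(1, 0.3, size=n)           # 1 = focal group
log_rt = 2.0 - 0.5 * speed + 0.15 * group + rng.normal(scale=0.4, size=n)

base = sm.OLS(log_rt, sm.add_constant(speed)).fit()                           # speed only
full = sm.OLS(log_rt, sm.add_constant(np.column_stack([speed, group]))).fit() # + group
delta_r2 = full.rsquared - base.rsquared
flagged = (full.pvalues[-1] < 0.05) and (delta_r2 >= 0.02)
print(f"Delta R^2 = {delta_r2:.3f}, flagged = {flagged}")
```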