Title: Adaptation of the Delphi technique in the development of assessments of problem-solving in computer adaptive testing environments (DEAP-CAT).
The Standards for Educational and Psychological Testing were developed by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA et al., 2014). The Standards specify that assessment developers establish five types of validity evidence: test content, response processes, internal structure, relationships to other variables, and consequential/bias. Relevant to this proposal is consequential validity evidence that identifies the potential negative impact of testing or bias. Standard 3.1 of the Standards (2014) on fairness in testing states that “those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant sub-groups in the intended populations” (p. 63). Three types of bias include construct, method, and item bias (Boer et al., 2018). Testing for differential item functioning (DIF) is a standard analysis adopted to detect item bias against a subgroup (Boer et al., 2018). Example subgroups include gender, race/ethnic group, socioeconomic status, native language, or disability. DIF occurs when “equally able test takers differ in their probabilities answering a test item correctly as a function of group membership” (AERA et al., 2005, p. 51). DIF indicates systematic error, as compared to real mean group differences (Camilli & Shepard, 1994). Items exhibiting significant DIF are removed or reviewed for sources leading to bias in order to determine modifications needed to retain and further test an item.

The Delphi technique is an emergent systematic research method whereby expert panel members review item content through an iterative process (Yildirim & Büyüköztürk, 2018). Experts independently evaluate each item for potential sources leading to DIF, researchers group their responses, and experts then independently complete a survey to rate their level of agreement with the anonymously grouped responses. This process continues until saturation and consensus are reached among experts, as established through some criterion (e.g., median agreement rating, interquartile range, and percent agreement). The technique allows researchers to “identify, learn, and share the ideas of experts by searching for agreement among experts” (Yildirim & Büyüköztürk, 2018, p. 451). Research has illustrated this technique applied after DIF is detected, but not before administering items in the field.

The current research is a methodological illustration of the Delphi technique applied in the item construction phase of assessment development as part of a five-year study to develop and test new problem-solving measures (PSM; Bostic et al., 2015, 2017) for U.S. grades 6-8 in a computer adaptive testing environment. As part of an iterative design-science-based methodology (Middleton et al., 2008), we illustrate the integration of the Delphi technique into the item-writing process. Results from two three-person panels, each reviewing a set of 45 PSM items, are used to illustrate the technique. Advantages and limitations identified through a survey of participating experts and researchers are outlined to advance the method.
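For illustration only, the sketch below shows one way the consensus criteria named above (median agreement rating, interquartile range, and percent agreement) could be computed for a round of Delphi ratings. The 5-point scale, the threshold values, and all function and variable names are assumptions made for this example, not the study’s actual decision rules.

```python
import numpy as np

# Hypothetical consensus thresholds; the study's actual rules may differ.
MEDIAN_MIN = 4.0      # median agreement rating of at least "agree" on a 1-5 scale
IQR_MAX = 1.0         # interquartile range no wider than 1 scale point
AGREEMENT_MIN = 0.75  # at least 75% of experts rating 4 or 5


def consensus_summary(ratings):
    """Summarize one item's expert ratings (1-5 agreement scale) for a Delphi round."""
    r = np.asarray(ratings, dtype=float)
    median = np.median(r)
    q1, q3 = np.percentile(r, [25, 75])
    iqr = q3 - q1
    pct_agree = np.mean(r >= 4)  # proportion of experts selecting agree/strongly agree
    reached = (median >= MEDIAN_MIN) and (iqr <= IQR_MAX) and (pct_agree >= AGREEMENT_MIN)
    return {"median": median, "iqr": iqr, "pct_agree": pct_agree, "consensus": reached}


# Example: a three-person panel's ratings for two items in one round.
print(consensus_summary([5, 4, 4]))  # likely flags consensus
print(consensus_summary([5, 2, 3]))  # likely flags no consensus yet
```

In a workflow like the one described, items failing such a check would return to the panel, with the grouped responses, for another rating round.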
Award ID(s): 2100988
NSF-PAR ID: 10331249
Journal Name: Proceedings of the International Conference of Education, Research and Innovation
Sponsoring Org: National Science Foundation
More Like this
  1. This evidence-based practices paper discusses the method employed in validating the use of a project-modified version of the PROCESS tool (Grigg, Van Dyken, Benson, & Morkos, 2013) for measuring student problem-solving skills. The PROCESS tool allows raters to score students’ ability in the domains of Problem definition, Representing the problem, Organizing information, Calculations, Evaluating the solution, Solution communication, and Self-assessment. Specifically, this research compares student performance on solving traditional textbook problems with novel, student-generated learning activities (i.e., reverse engineering videos in order to then create their own homework problem and solution). The use of student-generated learning activities to assess student problem-solving skills has theoretical underpinning in Felder’s (1987) work of “creating creative engineers,” as well as the need to develop students’ abilities to transfer learning and solve problems in a variety of real-world settings. In this study, four raters used the PROCESS tool to score the performance of 70 students randomly selected from two undergraduate chemical engineering cohorts at two Midwest universities. Students from both cohorts solved 12 traditional textbook-style problems, and students from the second cohort solved an additional nine student-generated video problems. Any large-scale assessment where multiple raters use a rating tool requires the investigation of several aspects of validity. The many-facets Rasch measurement model (MFRM; Linacre, 1989) has the psychometric properties to determine whether there are any characteristics other than “student problem solving skills” that influence the scores assigned, such as rater bias, problem difficulty, or student demographics (an illustrative facets-model sketch appears after this list). Before implementing the full rating plan, MFRM was used to examine how raters interacted with the six items on the modified PROCESS tool to score a random selection of 20 students’ performance in solving one problem. An external evaluator led “inter-rater reliability” meetings where raters deliberated the rationale for their ratings, and differences were resolved by recourse to Pretz et al.’s (2003) problem-solving cycle that informed the development of the PROCESS tool. To test the new understandings of the PROCESS tool, raters were assigned to score one new problem from a different randomly selected group of six students. Those results were then analyzed in the same manner as before. This iterative process resulted in substantial increases in reliability, which can be attributed to increased confidence that raters were operating with common definitions of the items on the PROCESS tool and rating with consistent and comparable severity. This presentation will include examples of the student-generated problems and a discussion of common discrepancies and solutions to the raters’ initial use of the PROCESS tool. Findings, as well as the adapted PROCESS tool used in this study, can be useful to engineering educators and engineering education researchers.
  2. Problem solving is central to mathematics learning (NCTM, 2014). Assessments are needed that appropriately measure students’ problem-solving performance. More importantly, assessments must be grounded in robust validity evidence that justifies their interpretations and outcomes (AERA et al., 2014). Thus, measures that are grounded in validity evidence are warranted for use by practitioners and scholars. The purpose of this presentation is to convey validity evidence for a new measure titled the Problem-Solving Measure for grade four (PSM4). The research question is: What validity evidence supports PSM4 administration? The PSM4 is one assessment within the previously published PSM series designed for elementary and middle grades students. Problems are grounded in Schoenfeld’s (2011) framework and rely upon Verschaffel et al.’s (1999) perspective that word problems should be open, complex, and realistic. The mathematics in the problems is tied to U.S. grade-level content and practice standards (CCSSI, 2010).
  3. This research paper describes the development of an assessment instrument for use with middle school students that provides insight into students’ interpretive understanding by looking at early indicators of developing expertise in students’ responses to solution generation, reflection, and concept demonstration tasks. We begin by detailing a synthetic assessment model that served as the theoretical basis for assessing specific thinking skills. We then describe our process of developing test items by working with a Teacher Design Team (TDT) of instructors in our partner school system to set guidelines that would better orient the assessment in that context, and working within the framework of standards and disciplinary core ideas enumerated in the Next Generation Science Standards (NGSS). We next specify our process of refining the assessment from 17 items across three separate item pools to a final total of three open-response items. We then provide evidence for the validity and reliability of the assessment instrument from the standards of (1) content, (2) meaningfulness, (3) generalizability, and (4) instructional sensitivity. As part of the discussion from the standards of generalizability and instructional sensitivity, we detail a study carried out in our partner school system in the fall of 2019. The instrument was administered to students in treatment (n = 201) and non-treatment (n = 246) groups, wherein the former participated in a two-to-three-week, NGSS-aligned experimental instructional unit introducing the principles of engineering design that focused on engaging students using the Imaginative Education teaching approach. The latter group was taught using the district’s existing engineering design curriculum. Results from statistical analysis of student responses showed that the interrater reliability of the scoring procedures was good to excellent, with intra-class correlation coefficients ranging between .72 and .95. To gauge the instructional sensitivity of the assessment instrument, a series of non-parametric comparative analyses (independent two-group Mann-Whitney tests) were carried out (see the illustrative sketch after this list). These found statistically significant differences between treatment and non-treatment student responses related to the outcomes of fluency and elaboration, but not reflection.
  4. Instrument development should adhere to the Standards (AERA et al., 2014). “Content oriented evidence of validation is at the heart of the [validation] process” (AERA et al., 2014, p. 15) and is one of the five sources of validity evidence. The research question for this study is: What is the evidence related to test content for the three instruments called the PSM3, PSM4, and PSM5? The study’s purpose is to describe content validity evidence related to new problem-solving measures currently under development. We have previously published validity evidence for problem-solving measures (PSM6, PSM7, and PSM8) that address middle grades math standards (see Bostic & Sondergeld, 2015; Bostic, Sondergeld, Folger, & Kruse, 2017).
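As a point of reference for the many-facet Rasch measurement model mentioned in the first related abstract, the sketch below computes rating-category probabilities under a rating-scale facets formulation (person ability minus item difficulty minus rater severity minus category thresholds). It is a minimal illustration under assumed parameter values, not the analysis used in that study; every name and number here is a placeholder.

```python
import numpy as np


def facets_category_probabilities(theta, delta, alpha, taus):
    """Category probabilities for one person-item-rater combination under a
    rating-scale many-facet Rasch model.

    theta: person ability (logits)
    delta: item difficulty (logits)
    alpha: rater severity (logits)
    taus:  thresholds tau_1..tau_m for an (m+1)-category rating scale (tau_0 = 0)
    """
    taus = np.concatenate(([0.0], np.asarray(taus, dtype=float)))
    # Cumulative sums of (theta - delta - alpha - tau_k) give each category's log-numerator.
    steps = theta - delta - alpha - taus
    log_numerators = np.cumsum(steps)
    probs = np.exp(log_numerators - log_numerators.max())  # subtract max for numerical stability
    return probs / probs.sum()


# Placeholder values: an able student, a middling item, a slightly severe rater,
# and a four-category (0-3) rating scale.
probs = facets_category_probabilities(theta=1.0, delta=0.0, alpha=0.3, taus=[-1.2, 0.1, 1.4])
print(np.round(probs, 3))  # probabilities for categories 0..3, summing to 1
```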
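The third related abstract compares treatment and non-treatment groups with independent two-group Mann-Whitney tests. A minimal sketch of that kind of comparison, using SciPy and fabricated placeholder scores rather than the study’s data, might look like the following.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Placeholder rubric scores (0-4) for two independent groups; not the study's data.
treatment = rng.integers(0, 5, size=201)
non_treatment = rng.integers(0, 5, size=246)

# Two-sided Mann-Whitney U test for a difference between the two score distributions.
statistic, p_value = mannwhitneyu(treatment, non_treatment, alternative="two-sided")
print(f"U = {statistic:.1f}, p = {p_value:.3f}")
```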