Title: Adaptation of the Delphi technique in the development of assessments of problem-solving in computer adaptive testing environments (DEAP-CAT).
The Standards for Educational and Psychological Testing were developed by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA et al., 2014). The Standards specify that assessment developers establish five types of validity evidence: test content, response processes, internal structure, relationship to other variables, and consequential/bias. Relevant to this proposal is consequential validity evidence, which identifies the potential negative impact of testing or bias. Standard 3.1 of The Standards (2014) on fairness in testing states that “those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant sub-groups in the intended populations” (p. 63). The three types of bias are construct, method, and item bias (Boer et al., 2018). Testing for differential item functioning (DIF) is a standard analysis adopted to detect item bias against a subgroup (Boer et al., 2018). Example subgroups include those defined by gender, race/ethnic group, socioeconomic status, native language, or disability. DIF occurs when “equally able test takers differ in their probabilities answering a test item correctly as a function of group membership” (AERA et al., 2005, p. 51). DIF indicates systematic error, as distinct from real mean group differences (Camilli & Shepard, 1994). Items exhibiting significant DIF are either removed or reviewed for sources of bias, so that they can be modified, retained, and tested further. The Delphi technique is an emergent systematic research method whereby expert panel members review item content through an iterative process (Yildirim & Büyüköztürk, 2018).
Experts independently evaluate each item for potential sources leading to DIF, researchers group their responses, and experts then independently complete a survey to rate their level of agreement with the anonymously grouped responses. This process continues until saturation and consensus are reached among experts as established through some criterion (e.g., median agreement rating, interquartile range, and percent agreement). The technique allows researchers to “identify, learn, and share the ideas of experts by searching for agreement among experts” (Yildirim & Büyüköztürk, 2018, p. 451). Prior research has illustrated this technique applied after DIF is detected, but not before items are administered in the field. The current research is a methodological illustration of the Delphi technique applied in the item construction phase of assessment development as part of a five-year study to develop and test new problem-solving measures (PSM; Bostic et al., 2015, 2017) for U.S.A. grades 6-8 in a computer adaptive testing environment. As part of an iterative design-science-based methodology (Middleton et al., 2008), we illustrate the integration of the Delphi technique into the item writing process. Results from two three-person panels, each reviewing a set of 45 PSM items, are used to illustrate the technique. Advantages and limitations identified through a survey of participating experts and researchers are outlined to advance the method.
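The consensus criteria named in the abstract (median agreement rating, interquartile range, and percent agreement) can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure: the function name, the 5-point agreement scale, and every threshold below are hypothetical assumptions, since the abstract does not state the cut-offs used in the study.

```python
from statistics import median, quantiles


def consensus_summary(ratings, agree_at=4, min_median=4, max_iqr=1.0,
                      min_pct_agree=75.0):
    """Summarize one Delphi round for a single grouped response.

    ratings: agreement ratings from the panel, assumed here to be on a
    1-5 scale (hypothetical; the abstract does not specify the scale).
    All thresholds are illustrative assumptions, not values from the study.
    """
    # Quartiles by linear interpolation over the observed ratings.
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    med = median(ratings)
    iqr = q3 - q1
    # Share of panelists rating at or above the "agree" point.
    pct_agree = 100.0 * sum(r >= agree_at for r in ratings) / len(ratings)
    reached = med >= min_median and iqr <= max_iqr and pct_agree >= min_pct_agree
    return {"median": med, "iqr": iqr, "pct_agree": pct_agree,
            "consensus": reached}


# A three-person panel that largely agrees, and one that does not.
print(consensus_summary([4, 5, 4]))
print(consensus_summary([2, 5, 4]))
```

With panels as small as three experts, the interquartile range and percent agreement move in large steps, so any thresholds would need to be chosen with the panel size in mind.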
Journal Name:
Proceedings of the International Conference of Education, Research and Innovation
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Delphi method has been adapted to inform item refinements in educational and psychological assessment development. An explanatory sequential mixed methods design using Delphi is a common approach to gain experts' insight into why items might have exhibited differential item functioning (DIF) for a sub-group, indicating potential item bias. Use of Delphi before quantitative field testing to screen for potential sources leading to item bias is lacking in the literature. An exploratory sequential design is illustrated as an additional approach using a Delphi technique in Phase I and Rasch DIF analyses in Phase II. We introduce the 2 × 2 Concordance Integration Typology as a systematic way to examine agreement and disagreement across the qualitative and quantitative findings using a concordance joint display table. A worked example from the development of the Problem-Solving Measures Grades 6–8 Computer Adaptive Tests supported using an exploratory sequential design to inform item refinement. The 2 × 2 Concordance Integration Typology (a) crystallized instances where additional refinements were potentially needed and (b) provided for evaluating the distribution of bias across the set of items as a whole. Implications are discussed for advancing data integration techniques and using mixed methods to improve instrument development. 
  2. Problem-solving is a typical type of assessment in engineering dynamics tests. To solve a problem, students need to set up equations and find a numerical answer. Depending on its difficulty and complexity, it can take anywhere from ten to thirty minutes to solve a quantitative problem. Due to the time constraint of in-class testing, a typical test may only contain a limited number of problems, covering an insufficient range of problem types. This can potentially reduce validity and reliability, two crucial factors which contribute to assessment results. A test with high validity should cover proper content. It should be able to distinguish high-performing students from low-performing students and every student in between. A reliable test should have a sufficient number of items to provide consistent information about students’ mastery of the materials. In this work-in-progress study, we will investigate to what extent a newly developed assessment is valid and reliable. Symbolic problem solving in this study refers to solving problems by setting up a system of equations without finding numeric solutions. Such problems usually take much less time, so a single test can include more problems covering a more diverse range of types. We will follow the Standards for Educational and Psychological Testing, referred to as the Standards, for our study. The Standards were developed jointly by three professional organizations: the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). We will use the Standards to evaluate the content validity and internal consistency of a collection of symbolic problems.
Examples on rectilinear kinematics and angular motion will be provided to illustrate how symbolic problem solving is used in both homework and assessments. Numerous studies in the literature have shown that symbolic questions impose greater challenges because of students’ algebraic difficulties. Thus, we will share strategies on how to prepare students to approach such problems. 
  3. Problem solving is central to mathematics learning (NCTM, 2014). Assessments are needed that appropriately measure students’ problem-solving performance. More importantly, assessments must be grounded in robust validity evidence that justifies their interpretations and outcomes (AERA et al., 2014). Thus, measures that are grounded in validity evidence are warranted for use by practitioners and scholars. The purpose of this presentation is to convey validity evidence for a new measure titled Problem-Solving Measure for grade four (PSM4). The research question is: What validity evidence supports PSM4 administration? The PSM4 is one assessment within the previously published PSM series designed for elementary and middle grades students. Problems are grounded in Schoenfeld’s (2011) framework and rely upon Verschaffel et al. (1999) perspective that word problems be open, complex, and realistic. The mathematics in the problems is tied to USA grade-level content and practice standards (CCSSI, 2010). 
  4. This evidence-based practices paper discusses the method employed in validating the use of a project modified version of the PROCESS tool (Grigg, Van Dyken, Benson, & Morkos, 2013) for measuring student problem solving skills. The PROCESS tool allows raters to score students’ ability in the domains of Problem definition, Representing the problem, Organizing information, Calculations, Evaluating the solution, Solution communication, and Self-assessment. Specifically, this research compares student performance on solving traditional textbook problems with novel, student-generated learning activities (i.e. reverse engineering videos in order to then create their own homework problem and solution). The use of student-generated learning activities to assess student problem solving skills has theoretical underpinning in Felder’s (1987) work of “creating creative engineers,” as well as the need to develop students’ abilities to transfer learning and solve problems in a variety of real world settings. In this study, four raters used the PROCESS tool to score the performance of 70 students randomly selected from two undergraduate chemical engineering cohorts at two Midwest universities. Students from both cohorts solved 12 traditional textbook style problems and students from the second cohort solved an additional nine student-generated video problems. Any large scale assessment where multiple raters use a rating tool requires the investigation of several aspects of validity. The many-facets Rasch measurement model (MFRM; Linacre, 1989) has the psychometric properties to determine if there are any characteristics other than “student problem solving skills” that influence the scores assigned, such as rater bias, problem difficulty, or student demographics. Before implementing the full rating plan, MFRM was used to examine how raters interacted with the six items on the modified PROCESS tool to score a random selection of 20 students’ performance in solving one problem. 
An external evaluator led “inter-rater reliability” meetings where raters deliberated rationale for their ratings and differences were resolved by recourse to Pretz, et al.’s (2003) problem-solving cycle that informed the development of the PROCESS tool. To test the new understandings of the PROCESS tool, raters were assigned to score one new problem from a different randomly selected group of six students. Those results were then analyzed in the same manner as before. This iterative process resulted in substantial increases in reliability, which can be attributed to increased confidence that raters were operating with common definitions of the items on the PROCESS tool and rating with consistent and comparable severity. This presentation will include examples of the student-generated problems and a discussion of common discrepancies and solutions to the raters’ initial use of the PROCESS tool. Findings as well as the adapted PROCESS tool used in this study can be useful to engineering educators and engineering education researchers. 
  5. Entrepreneurship Support Programs (ESP) in engineering provide education, mentoring, and advising for emerging entrepreneurs and their ventures. The impact of ESPs on engineering students’ professional formation and the acquisition of different attributes—such as creativity, risk-taking, empathy, and curiosity—is largely unknown. Though the social sciences have a strong and robust history of studying many of the attributes, such as creativity and problem-solving, typically associated with entrepreneurship, there has been little connection between this foundational research and the work of ESPs. In fact, two separate systematic reviews have shown that most published work in STEM entrepreneurship education is not theoretically grounded and does not follow standards of quality research approaches in the social sciences. In an effort to bridge the gap between social scientists and engineering entrepreneurship practitioners, the authors are conducting a two-phase study. Phase 1 of the study involves conducting a Delphi study to identify the top entrepreneurial attributes of professionals and researchers who lead ESPs. Phase 2 of the study includes conducting workshops with social scientists who study the attributes and ESP leaders. The goal of the workshops is to identify assessment frameworks grounded in social science theory and literature that will guide the measurement of the attributes. This session will focus on the results of the Delphi phase. Delphi study is a common research technique used to achieve consensus among experts (Hasson, Keeney, and McKenna, 2000). Seventy-three participants who lead or have led an ESP, have conducted research in entrepreneurship education, or act as administrators for relevant entrepreneurship programs were invited to participate in the Delphi study. Of the 73 invited, 14 completed at least two rounds of the Delphi study. All participants were experts in the field of engineering entrepreneurship education. 
The Delphi Study comprised three rounds: brainstorming, narrowing, and ranking. Each round of the Delphi asked participants to think about three different sets of attributes: 1) entrepreneurial attributes that they thought were important in the development of an entrepreneur, 2) attributes in becoming a successful professional, and 3) attributes in working in an inclusive workspace. In the brainstorming round, participants were sent an online questionnaire and were asked to brainstorm as many attributes as they could think of. The results of the brainstorming questionnaire were consolidated and used to develop the narrowing questionnaire, where participants were asked to narrow all attributes to the top 10 key attributes. The results from the narrowing questionnaire were then used to develop a ranking questionnaire, where participants were asked to rank the items on a scale of importance from 1 (most important) to 10 (least important) for each set of attributes. The results of the ranking questionnaire were analyzed to identify the attributes that were ranked the highest among a majority of the participants. This paper discusses the findings of the Delphi Study and its implications in assessing the impact of ESP on entrepreneur formation.