The Standards for Educational and Psychological Testing were developed by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA et al., 2014). The Standards specify that assessment developers establish five types of validity evidence: test content, response processes, internal structure, relations to other variables, and consequences of testing. Relevant to this proposal is consequential validity evidence that identifies the potential negative impact of testing or bias. Standard 3.1 of The Standards (2014) on fairness in testing states that “those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant sub-groups in the intended populations” (p. 63). Three types of bias are construct, method, and item bias (Boer et al., 2018). Testing for differential item functioning (DIF) is a standard analysis adopted to detect item bias against a subgroup (Boer et al., 2018). Example subgroups include gender, race/ethnic group, socioeconomic status, native language, and disability. DIF occurs when “equally able test takers differ in their probabilities answering a test item correctly as a function of group membership” (AERA et al., 2005, p. 51). DIF indicates systematic error, as distinct from true mean group differences (Camilli & Shepard, 1994). Items exhibiting significant DIF are removed or reviewed for sources of bias to determine whether modifications allow an item to be retained and tested further.

The Delphi technique is an emergent systematic research method in which expert panel members review item content through an iterative process (Yildirim & Büyüköztürk, 2018). Experts independently evaluate each item for potential sources of DIF, researchers group their responses, and experts then independently complete a survey to rate their level of agreement with the anonymously grouped responses. This process continues until saturation and consensus are reached among experts, as established through some criterion (e.g., median agreement rating, item quartile range, and percent agreement). The technique allows researchers to “identify, learn, and share the ideas of experts by searching for agreement among experts” (Yildirim & Büyüköztürk, 2018, p. 451). Research has illustrated this technique applied after DIF is detected, but not before items are administered in the field. The current research is a methodological illustration of the Delphi technique applied in the item construction phase of assessment development, as part of a five-year study to develop and test new problem-solving measures (PSM; Bostic et al., 2015, 2017) for U.S. grades 6-8 in a computer-adaptive testing environment. As part of an iterative design-science-based methodology (Middleton et al., 2008), we illustrate the integration of the Delphi technique into the item-writing process. Results from two three-person panels, each reviewing a set of 45 PSM items, are used to illustrate the technique. Advantages and limitations identified through a survey of participating experts and researchers are outlined to advance the method.
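As a concrete illustration of the consensus criteria named above (median agreement rating, quartile range, and percent agreement), the following Python sketch computes those statistics for one hypothetical round of Delphi ratings. The item labels, the 1-5 agreement scale, and the cutoff values are assumptions for demonstration only, not the study's actual criteria or data.

```python
"""Illustrative sketch (not the authors' code): common Delphi consensus
statistics for one round of expert item ratings."""
import statistics

# Hypothetical ratings: each item rated on an assumed 1-5 agreement scale
# by a three-person expert panel, mirroring the panels described above.
ratings = {
    "item_01": [5, 4, 5],
    "item_02": [2, 4, 3],
    "item_03": [4, 4, 5],
}

MEDIAN_CUTOFF = 4.0    # assumed: median rating of 4+ indicates agreement
IQR_CUTOFF = 1.0       # assumed: quartile range of 1 or less indicates low dispersion
PERCENT_CUTOFF = 0.75  # assumed: 75%+ of experts rating the item 4 or 5

for item, scores in ratings.items():
    median = statistics.median(scores)
    quartiles = statistics.quantiles(scores, n=4)   # three quartile cut points
    iqr = quartiles[2] - quartiles[0]               # interquartile range
    pct_agree = sum(s >= 4 for s in scores) / len(scores)
    consensus = (median >= MEDIAN_CUTOFF
                 and iqr <= IQR_CUTOFF
                 and pct_agree >= PERCENT_CUTOFF)
    print(f"{item}: median={median}, IQR={iqr:.1f}, "
          f"percent agreement={pct_agree:.0%}, consensus={consensus}")
```

In practice, items failing whichever criterion the research team adopts would be returned to the panel for another rating round rather than dropped outright.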
Do loss aversion and the ownership effect bias content validation procedures?
In making validity arguments, a central consideration is whether the instrument fairly and adequately covers the intended content, and this is often evaluated by experts. While common procedures exist for quantitatively assessing this, the effect of loss aversion (a cognitive bias that would predict a tendency to retain items) on these procedures has not been investigated. For more novel constructs, experts are typically drawn from adjacent domains. In such cases, a related cognitive bias, the ownership effect, would predict that experts would be more loss averse when considering items closer to their domains. This study investigated whether loss aversion and the ownership effect are a concern in standard content validity evaluation procedures. In addition to including promising items to measure a relatively novel construct, framing agency, we included distractor items linked to other areas of our evaluators’ expertise. Experts evaluated all items following procedures outlined by Lawshe (1975). We found that, on average, experts were able to distinguish between the intended items and the distractor items. Likewise, on average, experts were somewhat more likely to reject distractor items closer to their expertise. This suggests that loss aversion and the ownership effect are not likely to bias content validation procedures.
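For readers unfamiliar with Lawshe's (1975) procedure, the sketch below computes the content validity ratio, CVR = (n_e − N/2) / (N/2), where n_e is the number of experts rating an item essential and N is the panel size. The panel size, item names, and ratings here are hypothetical and are not drawn from the study.

```python
"""Minimal sketch of Lawshe's (1975) content validity ratio (CVR),
using a hypothetical panel and hypothetical item ratings."""

def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Return CVR, ranging from -1 (no expert rates the item essential) to +1 (all do)."""
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical panel of 8 experts rating intended and distractor items.
essential_counts = {
    "intended_item_1": 7,    # intended item, mostly rated essential
    "intended_item_2": 6,
    "distractor_item_1": 2,  # distractor item, mostly rejected
    "distractor_item_2": 1,
}

N_EXPERTS = 8
for item, n_e in essential_counts.items():
    print(f"{item}: CVR = {content_validity_ratio(n_e, N_EXPERTS):+.2f}")
```

Under loss aversion, one would expect inflated CVR values even for distractor items; the pattern reported above (clearly negative CVRs for distractors) is the kind of evidence suggesting the bias is not operating.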
- Award ID(s):
- 1751369
- PAR ID:
- 10311734
- Date Published:
- Journal Name:
- Practical Assessment, Research & Evaluation
- Volume:
- 26
- Issue:
- 7
- ISSN:
- 1531-7714
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
Three studies (1 survey, 2 experiments) examine cognitive biases in the professional judgments of nationally representative samples of psychologists working in legal contexts. Study 1 (N = 84) demonstrates robust evidence of the bias blind spot (Pronin, Lin, & Ross, 2002) in experts’ judgments. Psychologists rated their own susceptibility to bias in their professional work lower than that of their colleagues (and laypeople). As expected, they perceived bias-mitigating procedures as more threatening to their own domain than to outside domains, and more experience was correlated with higher perceived threat of bias-mitigating procedures. Experimental studies 2 (N = 118) and 3 (N = 128) with randomly selected psychologists reveal that psychologists overwhelmingly engage in confirmation bias (93% with one decision opportunity in study 2, and 90%, 87%, and 82% across three decision opportunities in study 3). Cognitive reflection was negatively correlated with confirmation bias. Psychologists were also susceptible to order effects, in that the order in which symptoms were presented affected their diagnoses even though the same symptoms existed in the different scenarios (in opposite orders).
Large-scale standardized tests are regularly used to measure student achievement overall and for student subgroups. These uses assume tests provide comparable measures of outcomes across student subgroups, but prior research suggests score comparisons across gender groups may be complicated by the type of test items used. This paper presents evidence that among nationally representative samples of 15-year-olds in the United States participating in the 2009, 2012, and 2015 PISA math and reading tests, there are consistent item-format-by-gender differences. On average, male students answer multiple-choice items correctly relatively more often and female students answer constructed-response items correctly relatively more often. These patterns were consistent across 34 additional participating PISA jurisdictions, although the size of the format differences varied and was larger on average in reading than in math. The average magnitude of the format differences is not large enough to be flagged in routine differential item functioning analyses intended to detect test bias but is large enough to raise questions about the validity of inferences based on comparisons of scores across gender groups. Researchers and other test users should account for test item format, particularly when comparing scores across gender groups.
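To make the format-by-gender comparison concrete, the sketch below summarizes the gap as a difference-in-differences in proportion correct between multiple-choice and constructed-response items across gender groups. The simulated proportions and sample sizes are assumptions for illustration only, not PISA estimates.

```python
"""Illustrative sketch (assumed data, not PISA microdata): a format-by-gender
gap expressed as a difference-in-differences in proportion correct."""
import random

random.seed(0)

def simulate_scores(p_correct: float, n: int) -> list[int]:
    """Simulate n dichotomous item scores (1 = correct) at an assumed success rate."""
    return [1 if random.random() < p_correct else 0 for _ in range(n)]

# Assumed cell-level success rates with a small format-by-gender gap.
cells = {
    ("male", "multiple_choice"): simulate_scores(0.55, 2000),
    ("male", "constructed_response"): simulate_scores(0.45, 2000),
    ("female", "multiple_choice"): simulate_scores(0.53, 2000),
    ("female", "constructed_response"): simulate_scores(0.49, 2000),
}

means = {cell: sum(scores) / len(scores) for cell, scores in cells.items()}
male_gap = means[("male", "multiple_choice")] - means[("male", "constructed_response")]
female_gap = means[("female", "multiple_choice")] - means[("female", "constructed_response")]

# A positive value means male students benefit relatively more from multiple-choice items.
print(f"format-by-gender difference-in-differences: {male_gap - female_gap:+.3f}")
```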
This study presents qualitative findings from a larger instrument validation study. Undergraduates and subject matter experts (SMEs) were pivotal in the early-stage development of a survey focusing on the four domains of Validation Theory (academic in-class, academic out-of-class, interpersonal in-class, interpersonal out-of-class). An iterative approach allowed for a more rigorously constructed survey refined through multiple phases. The research team met regularly to determine how feedback from undergraduates and SMEs could improve items and whether certain populations were potentially being excluded. To date, the research team has expanded the original 47 items to 51 to address feedback provided by SMEs and undergraduate participants. Numerous item wording revisions have been made. Support for content, response process, and consequential validity evidence is strong.
Objective
Online surveys are a common method of data collection. The use of “attention-check” questions is an effective method of identifying careless responding in surveys (Liu & Wronski, 2018; Meade & Craig, 2012; Ward & Meade, 2023), which occurs in 10-12% of undergraduate samples (Meade & Craig, 2012). Instructed-response attention checks are straightforward and the most recommended (Meade & Craig, 2012; Ward & Meade, 2023). This study evaluated the effect of instructed-response attention check questions on the measurement of math ability and non-cognitive factors commonly related to math (self-efficacy and math anxiety). We evaluated both level differences and whether check questions alter the relationship of non-cognitive factors to math. We expected that incorrect responding to check questions would lower math performance but were unable to make hypotheses about the level of self-reported non-cognitive factors. We predicted that incorrect responding to check questions would moderate the relationship of both math anxiety and self-efficacy to math performance.

Participants and Methods
Participants were 424 undergraduates (age 20.4, SD = 2.7) at a large southwestern university. The sample was majority female (74%) but diverse socioeconomically and in race/ethnicity. The non-cognitive measures were researcher-developed Math Anxiety (MA) and Math Self-Efficacy (MSE; Betz & Hackett, 1993) scales, with items selected to directly target the use and manipulation of math in everyday life; both showed good reliability (α = .95). The two math scales were also researcher developed; one was a pure symbolic computational measure (EM-A) and the other consisted of word problems in an everyday context (EM-B). These measures had good reliability (α = .80 and α = .73). The four check questions were embedded in the surveys, and two groupings were formed: one consisting of those who provided the correct answer for all items versus those who did not, and a second consisting of those who got all items correct or only one incorrect versus those with more items incorrect. Correlational, ANOVA, and ANCOVA models were utilized.

Results
Descriptively, check questions were skewed: 75% of participants answered all check questions correctly, and 8% missed only one. Relations of both MA and MSE with EM-A and EM-B were modest though significant (|r| = .22 to .37) and in the expected directions (all p < .001). Check questions were related to level on all tasks (p < .001), with incorrect responding resulting in lower math performance, lower MSE, and higher MA. Check questions did not moderate the relation of MA or MSE to either math performance measure, with some suggestion that MA was more strongly related to EM-B among those who missed check questions, though only when failing several.

Conclusions
Check questions showed a clear relation to both self-report and math performance measures. However, check questions did not alter the relation of MA or MSE to math performance in general. These results affirm extant relations of key self-perceptions to math using novel measures and highlight the need to evaluate the validity of self-report measures, even outside of objective performance indicators. Future work could examine the effect of attention checks in domains other than math and investigate other types of attention checks.
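As a rough sketch of the moderation question examined above, the code below compares the math anxiety-math performance correlation between respondents who passed all attention checks and those who did not, using simulated data. The group sizes, slope, and variable distributions are assumptions for illustration, not the study's data or models.

```python
"""Illustrative sketch (simulated data): does attention-check status moderate
the math anxiety-math performance relation? Similar within-group correlations
would suggest no moderation, consistent with the findings summarized above."""
import math
import random

random.seed(1)

def pearson(x: list[float], y: list[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def simulate_group(n: int, slope: float) -> tuple[list[float], list[float]]:
    """Simulate math anxiety ratings and math performance with an assumed slope."""
    anxiety = [random.gauss(3.0, 1.0) for _ in range(n)]
    performance = [slope * a + random.gauss(0.0, 1.0) for a in anxiety]
    return anxiety, performance

# Assumed group split and an assumed equal (negative) slope in both groups.
passed_ma, passed_math = simulate_group(318, slope=-0.35)  # passed all checks
failed_ma, failed_math = simulate_group(106, slope=-0.35)  # missed one or more

print(f"r (passed checks): {pearson(passed_ma, passed_math):+.2f}")
print(f"r (missed checks): {pearson(failed_ma, failed_math):+.2f}")
```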