
This content will become publicly available on April 18, 2023

Title: The More You Ask, the Less You Get: When Additional Questions Hurt External Validity

Researchers and practitioners in marketing, economics, and public policy often use preference elicitation tasks to forecast real-world behaviors. These tasks typically ask a series of similarly structured questions. The authors posit that every time a respondent answers an additional elicitation question, two things happen: (1) the respondent provides information about some parameter(s) of interest, such as their time preference or the partworth for a product attribute, and (2) the respondent increasingly “adapts” to the task, adopting task-specific decision processes that may or may not apply to other tasks. Importantly, adaptation comes at the cost of a potential mismatch between the task-specific decision process and the real-world processes that generate the target behaviors, such that asking more questions can reduce external validity. The authors used mouse and eye tracking to trace decision processes in time preference measurement and conjoint choice tasks. Respondents increasingly relied on task-specific decision processes as more questions were asked, leading to reduced external validity for both related tasks and real-world behaviors. Notably, the external validity of measured preferences peaked after as few as seven questions in both types of tasks. When measuring preferences, less can be more.

Journal Name: Journal of Marketing Research
Page Range or eLocation-ID: p. 963-982
Publisher: SAGE Publications
Sponsoring Org: National Science Foundation
More Like this
  1. Response time (RT) – the time elapsing from the beginning of question reading for a given question until the start of the next question – is a potentially important indicator of data quality that can be reliably measured for all questions in a computer-administered survey using a latent timer (i.e., one triggered automatically by moving on to the next question). In interviewer-administered surveys, RTs index data quality by capturing the entire length of time spent on a question–answer sequence, including interviewer question-asking behaviors and respondent question-answering behaviors. Consequently, longer RTs may indicate longer processing or interaction on the part of the interviewer, respondent, or both. RTs are an indirect measure of data quality; they do not directly measure reliability or validity, and we do not directly observe what factors lengthen the administration time. In addition, RTs that are either too long or too short could signal a problem (Ehlen, Schober, and Conrad 2007). However, studies that link components of RTs (interviewers’ question reading and response latencies) to interviewer and respondent behaviors that index data quality strengthen the claim that RTs indicate data quality (Bergmann and Bristle 2019; Draisma and Dijkstra 2004; Olson, Smyth, and Kirchner 2019). In general, researchers tend to consider longer RTs as signaling processing problems for the interviewer, respondent, or both (Couper and Kreuter 2013; Olson and Smyth 2015; Yan and Olson 2013; Yan and Tourangeau 2008). Previous work demonstrates that RTs are associated with various characteristics of interviewers (where applicable), questions, and respondents in web, telephone, and face-to-face interviews (e.g., Couper and Kreuter 2013; Olson and Smyth 2015; Yan and Tourangeau 2008). We replicate and extend this research by examining how RTs are associated with various question characteristics and several established tools for evaluating questions.
We also examine whether increased interviewer experience in the study shortens RTs for questions with characteristics that increase the complexity of the interviewer’s task (i.e., interviewer instructions and parenthetical phrases). We examine these relationships in a sample of racially diverse respondents who answered questions about participation in medical research and their health.
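As a concrete illustration of the latent-timer idea described above, the sketch below computes per-question RTs from the timestamps at which each question's screen is displayed. The question IDs and times are hypothetical, not data from the study.

```python
# Minimal sketch: deriving response times (RTs) from latent timestamps
# logged by a computer-administered survey. All values are illustrative.
from datetime import datetime

# Timestamp recorded when each question is first displayed; the RT for
# question i is the gap until question i+1 is displayed.
timestamps = {
    "q1": datetime(2022, 1, 10, 9, 0, 0),
    "q2": datetime(2022, 1, 10, 9, 0, 12),
    "q3": datetime(2022, 1, 10, 9, 0, 19),
}

def response_times(ts: dict) -> dict:
    """Return per-question RTs in seconds (the last question has no RT)."""
    ordered = sorted(ts.items(), key=lambda kv: kv[1])
    return {
        q: (nxt_time - t).total_seconds()
        for (q, t), (_, nxt_time) in zip(ordered, ordered[1:])
    }

print(response_times(timestamps))  # {'q1': 12.0, 'q2': 7.0}
```

In a real instrument the timestamps would come from the survey software's paradata log rather than hard-coded values.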
  2. Social interaction is inherently bidirectional, but research on autistic peer interactions often frames communication as unidirectional and in isolation from the peer context. This study investigated natural peer interactions among six autistic and six non-autistic adolescents in an inclusive school club over 5 months (14 45-min sessions in total) to examine the students’ peer preferences in real-world social interactions and how those preferences changed over time. We further examined whether social behavior characteristics differ between student and peer neurotype combinations. Findings showed that autistic students were more likely to interact with autistic peers than non-autistic peers. In both autistic and non-autistic students, the likelihood of interacting with a same-neurotype peer increased over time. Autistic and non-autistic students’ within-neurotype social interactions were more likely to reflect relational rather than functional purposes, to be characterized as sharing thoughts and experiences rather than requesting help or objects, and to be highly reciprocal, as compared with cross-neurotype interactions. These peer preferences and patterns of social interaction were not found among student–peer dyads of the same gender. These findings suggest that peer interaction is determined not just by a student’s autism diagnosis, but by the combination of student and peer neurotypes. Lay abstract Autistic students often experience challenges in peer interactions, especially young adolescents who are navigating the increased social expectations of secondary education. Previous research on the peer interactions of autistic adolescents mainly compared the social behaviors of autistic and non-autistic students and overlooked the peers in the social context.
However, recent research has shown that the social challenges faced by autistic people may not stem solely from their social differences, but from a mismatch in social communication styles between autistic and non-autistic people. As such, this study aimed to investigate the student-and-peer match in real-world peer interactions between six autistic and six non-autistic adolescents in an inclusive school club. We examined the odds of autistic and non-autistic students interacting with either an autistic peer, a non-autistic peer, or multiple peers, and the results showed that autistic students were more likely to interact with autistic peers than non-autistic peers. This preference for same-group peer interactions strengthened over the 5-month school club in both autistic and non-autistic students. We further found that same-group peer interactions, in both autistic and non-autistic students, were more likely to convey a social interest rather than a functional purpose or need, to involve sharing thoughts, experiences, or items rather than requesting help or objects, and to be more reciprocal than cross-group social behaviors. Collectively, our findings support that peer interaction outcomes may be determined by the match between the group memberships of the student and their peers, either autistic or non-autistic, rather than by the student’s autism diagnosis alone.
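The same-group preference reported above can be summarized with a simple odds ratio contrasting same-neurotype and cross-neurotype interactions. The sketch below uses invented interaction counts purely for illustration; it is not the study's data or analysis code.

```python
# Minimal sketch: an odds ratio for same- vs cross-neurotype interactions.
# The counts are hypothetical, chosen only to show the computation.
counts = {
    ("autistic", "autistic_peer"): 42,        # same-neurotype
    ("autistic", "non_autistic_peer"): 18,    # cross-neurotype
    ("non_autistic", "autistic_peer"): 15,    # cross-neurotype
    ("non_autistic", "non_autistic_peer"): 45, # same-neurotype
}

def same_neurotype_odds_ratio(c: dict) -> float:
    """OR > 1 means both groups favour same-neurotype partners."""
    odds_autistic = c[("autistic", "autistic_peer")] / c[("autistic", "non_autistic_peer")]
    odds_non_autistic = c[("non_autistic", "autistic_peer")] / c[("non_autistic", "non_autistic_peer")]
    return odds_autistic / odds_non_autistic

print(same_neurotype_odds_ratio(counts))  # 7.0
```

The study itself models odds with repeated observations over sessions, which would call for a regression approach rather than a single 2x2 table.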
  3. Abstract: 100 words Jurors are increasingly exposed to scientific information in the courtroom. To determine whether providing jurors with gist information would assist in their ability to make well-informed decisions, the present experiment utilized a Fuzzy Trace Theory-inspired intervention and tested it against traditional legal safeguards (i.e., judge instructions) while varying the scientific quality of the evidence. The results indicate that jurors who viewed high quality evidence rated the scientific evidence significantly higher than those who viewed low quality evidence, but were unable to moderate the credibility of the expert witness and apply damages appropriately, resulting in poor calibration. Summary: <1000 words Jurors and juries are increasingly exposed to scientific information in the courtroom, and it remains unclear when they will base their decisions on a reasonable understanding of the relevant scientific information. Without such knowledge, the ability of jurors and juries to make well-informed decisions may be at risk, increasing the chances of unjust outcomes (e.g., false convictions in criminal cases). Therefore, there is a critical need to understand conditions that affect jurors’ and juries’ sensitivity to the qualities of scientific information and to identify safeguards that can assist with scientific calibration in the courtroom. The current project addresses these issues with an ecologically valid experimental paradigm, making it possible to assess causal effects of evidence quality and safeguards as well as the role of a host of individual difference variables that may affect perceptions of testimony by scientific experts as well as liability in a civil case. Our main goal was to develop a simple, theoretically grounded tool to enable triers of fact (individual jurors) with a range of scientific reasoning abilities to appropriately weigh scientific evidence in court.
We did so by testing a Fuzzy Trace Theory-inspired intervention in court, and testing it against traditional legal safeguards. Appropriate use of scientific evidence reflects good calibration – which we define as being influenced more by strong scientific information than by weak scientific information. Inappropriate use reflects poor calibration – defined as relative insensitivity to the strength of scientific information. Fuzzy Trace Theory (Reyna & Brainerd, 1995) predicts that techniques for improving calibration can come from presentation of easy-to-interpret, bottom-line “gist” of the information. Our central hypothesis was that laypeople’s appropriate use of scientific information would be moderated both by external situational conditions (e.g., quality of the scientific information itself, a decision aid designed to convey clearly the “gist” of the information) and individual differences among people (e.g., scientific reasoning skills, cognitive reflection tendencies, numeracy, need for cognition, attitudes toward and trust in science). Identifying factors that promote jurors’ appropriate understanding of and reliance on scientific information will contribute to general theories of reasoning based on scientific evidence, while also providing an evidence-based framework for improving the courts’ use of scientific information. All hypotheses were preregistered on the Open Science Framework. Method Participants completed six questionnaires (counterbalanced): Need for Cognition Scale (NCS; 18 items), Cognitive Reflection Test (CRT; 7 items), Abbreviated Numeracy Scale (ABS; 6 items), Scientific Reasoning Scale (SRS; 11 items), Trust in Science (TIS; 29 items), and Attitudes towards Science (ATS; 7 items). Participants then viewed a video depicting a civil trial in which the defendant sought damages from the plaintiff for injuries caused by a fall. 
The defendant (bar patron) alleged that the plaintiff (bartender) pushed him, causing him to fall and hit his head on the hard floor. Participants were informed at the outset that the defendant was liable; therefore, their task was to determine if the plaintiff should be compensated. Participants were randomly assigned to 1 of 6 experimental conditions: 2 (quality of scientific evidence: high vs. low) x 3 (safeguard to improve calibration: gist information, no-gist information [control], jury instructions). An expert witness (neuroscientist) hired by the court testified regarding the scientific strength of fMRI data (high [90 to 10 signal-to-noise ratio] vs. low [50 to 50 signal-to-noise ratio]) and gist or no-gist information both verbally (i.e., fairly high/about average) and visually (i.e., a graph). After viewing the video, participants were asked if they would like to award damages. If they indicated yes, they were asked to enter a dollar amount. Participants then completed the Positive and Negative Affect Schedule-Modified Short Form (PANAS-MSF; 16 items), expert Witness Credibility Scale (WCS; 20 items), Witness Credibility and Influence on damages for each witness, manipulation check questions, Understanding Scientific Testimony (UST; 10 items), and 3 additional measures were collected, but are beyond the scope of the current investigation. Finally, participants completed demographic questions, including questions about their scientific background and experience. The study was completed via Qualtrics, with participation from students (online vs. in-lab), MTurkers, and non-student community members. After removing those who failed attention check questions, 469 participants remained (243 men, 224 women, 2 did not specify gender) from a variety of racial and ethnic backgrounds (70.2% White, non-Hispanic). Results and Discussion There were three primary outcomes: quality of the scientific evidence, expert credibility (WCS), and damages. 
During initial analyses, each dependent variable was submitted to a separate 3 Gist Safeguard (safeguard, no safeguard, judge instructions) x 2 Scientific Quality (high, low) Analysis of Variance (ANOVA). Consistent with hypotheses, there was a significant main effect of scientific quality on strength of evidence, F(1, 463)=5.099, p=.024; participants who viewed the high quality evidence rated the scientific evidence significantly higher (M=7.44) than those who viewed the low quality evidence (M=7.06). There were no significant main effects or interactions for witness credibility, indicating that the expert who provided scientific testimony was seen as equally credible regardless of scientific quality or gist safeguard. Finally, for damages, consistent with hypotheses, there was a marginally significant interaction between Gist Safeguard and Scientific Quality, F(2, 273)=2.916, p=.056. However, post hoc t-tests revealed that significantly higher damages were awarded for low (M=11.50) versus high (M=10.51) scientific quality evidence, F(1, 273)=3.955, p=.048, in the no-gist-with-judge-instructions safeguard condition, which was contrary to hypotheses. The data suggest that the judge instructions alone reversed the pattern: although the difference was nonsignificant, those who received the no-gist safeguard without judge instructions awarded higher damages in the high (M=11.34) versus low (M=10.84) scientific quality evidence conditions, F(1, 273)=1.059, p=.30. Together, these provide promising initial results indicating that participants were able to effectively differentiate between high and low scientific quality of evidence, though they inappropriately utilized the scientific evidence through their inability to discern expert credibility and apply damages, resulting in poor calibration.
These results will provide the basis for more sophisticated analyses, including higher order interactions with individual differences (e.g., need for cognition) as well as tests of mediation using path analyses. [References omitted but available by request] Learning Objective: Participants will be able to determine whether providing jurors with gist information would assist in their ability to award damages in a civil trial.
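For readers unfamiliar with the F statistics quoted above, a two-group one-way F (equivalent to the square of an independent-samples t) can be computed from group sums of squares. The sketch below uses toy numbers, not the study's ratings.

```python
# Minimal sketch: one-way F statistic for a two-group comparison,
# mirroring the form of the main-effect tests reported above.
def f_oneway_two_groups(g1, g2):
    """Between-groups mean square over within-groups mean square."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    grand = (sum(g1) + sum(g2)) / (n1 + n2)
    ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
    ss_within = (sum((x - m1) ** 2 for x in g1)
                 + sum((x - m2) ** 2 for x in g2))
    df_between, df_within = 1, n1 + n2 - 2
    return (ss_between / df_between) / (ss_within / df_within)

# Toy data: two small groups with clearly different means.
print(f_oneway_two_groups([1, 2, 3], [3, 4, 5]))  # 6.0
```

The study's full 3 x 2 design additionally requires partitioning the between-groups variance into main effects and an interaction, which statistical packages handle directly.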
  4. This study was performed to investigate the validity of a real world version of the Trail Making Test (TMT) across age strata, compared to the current standard TMT which is delivered using a pen-paper protocol. We developed a real world version of the TMT, the Can-TMT, that involves the retrieval of food cans, with numeric or alphanumerical labels, from a shelf in ascending order. Eye tracking data was acquired during the Can-TMT to calculate task completion time and compared to that of the Paper-TMT. Results indicated a strong significant correlation between the real world and paper tasks for both TMTA and TMTB versions of the tasks, indicative of the validity of the real world task. Moreover, the two age groups exhibited significant differences on the TMTA and TMTB versions of both task modalities (paper and can), further supporting the validity of the real world task. This work will have a significant impact on our ability to infer skill or impairment with visual search, spatial reasoning, working memory, and motor proficiency during complex real-world tasks. Thus, we hope to fill a critical need for an exam with the resolution capable of determining deficits which subjective or reductionist assessments may otherwise miss.
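The validity evidence above rests on the correlation between Can-TMT and Paper-TMT completion times. A Pearson correlation can be sketched as follows; the completion times are invented for illustration, not data from the study.

```python
# Minimal sketch: Pearson correlation between completion times (seconds)
# on a real-world task and its pen-and-paper analogue. Values are made up.
import math

can_tmt   = [31.0, 45.2, 52.8, 60.1, 75.4]  # hypothetical Can-TMT times
paper_tmt = [28.5, 41.0, 50.2, 58.8, 72.9]  # hypothetical Paper-TMT times

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

With strongly linear toy data like this, `pearson_r` returns a value close to 1, the pattern the study interprets as evidence of validity.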
  5. People often search for information in order to learn something new. In recent years, the “search-as-learning” movement has argued that search systems should be better designed to support learning. Current search systems (especially Web search engines) are largely designed and optimized to fulfill simple look-up tasks (e.g., navigational or fact-finding search tasks). However, they provide less support for searchers working on complex tasks that involve learning. Search-as-learning studies have investigated a wide range of research questions. For example, studies have aimed to better understand how characteristics of the individual searcher, the type of search task, and interactive features provided by the system can influence learning outcomes. Learning assessment is a key component in search-as-learning studies. Assessment materials are used to both gauge prior knowledge and measure learning during or after one or more search sessions. In this paper, we provide a systematic review of different types of assessments used in search-as-learning studies to date. The paper makes the following three contributions. First, we review different types of assessments used and discuss their potential benefits and drawbacks. Second, we review assessments used outside of search-as-learning, which may provide insights and opportunities for future research. Third, we provide recommendations for future research. Importantly, we argue that future studies should clearly define learning objectives and develop assessment materials that reliably capture the intended type of learning. For example, assessment materials should test a participant’s ability to engage with specific cognitive processes, which may range from simple (e.g., memorization) to more complex (e.g., critical and creative thinking).
Additionally, we argue that future studies should consider two dimensions that are understudied in search-as-learning: long-term retention (i.e., being able to use what was learned in the long term) and transfer of learning (i.e., being able to use what was learned in a novel context).