Title: AI and Cognitive Testing: A New Conceptual Framework and Roadmap
Understanding how a person thinks, i.e., measuring a single individual’s cognitive characteristics, is challenging because cognition is not directly observable. Practically speaking, standardized cognitive tests (tests of IQ, memory, attention, etc.), with results interpreted by expert clinicians, represent the state of the art in measuring a person’s cognition. Three areas of AI show particular promise for improving the effectiveness of this kind of cognitive testing: 1) behavioral sensing, to more robustly quantify individual test-taker behaviors; 2) data mining, to identify and extract meaningful patterns from behavioral datasets; and 3) cognitive modeling, to help map observed behaviors onto hypothesized cognitive strategies. We bring these three areas of AI research together in a unified conceptual framework and provide a sampling of recent work in each area. Continued research at the nexus of AI and cognitive testing has potentially far-reaching implications for society in virtually every context in which measuring cognition is important, including research across many disciplines of cognitive science as well as applications in clinical, educational, and workforce settings.
Award ID(s):
1730044
NSF-PAR ID:
10209942
Journal Name:
Proceedings of the 41st Annual Conference of the Cognitive Science Society
Page Range or eLocation-ID:
2065-2070
Sponsoring Org:
National Science Foundation
More Like this
  1. Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
  2. Background: The classic Marshmallow Test, where children were offered a choice between one small but immediate reward (e.g., one marshmallow) or a larger reward (e.g., two marshmallows) if they waited for a period of time, instigated a wealth of research on the relationships among impulsive responding, self-regulation, and clinical and life outcomes. Impulsivity is a hallmark feature of self-regulation failures that lead to poor health decisions and outcomes, making understanding and treating impulsivity one of the most important constructs to tackle in building a culture of health. Despite a large literature base, impulsivity measurement remains difficult due to the multidimensional nature of the construct and limited methods of assessment in daily life. Mobile devices and the rise of mobile health (mHealth) have changed our ability to assess and intervene with individuals remotely, providing an avenue for ambulatory diagnostic testing and interventions. Longitudinal studies with mobile devices can further help to understand impulsive behaviors and variation in state impulsivity in daily life. Objective: The aim of this study was to develop and validate an impulsivity mHealth diagnostics and monitoring app called Digital Marshmallow Test (DMT) using both the Apple and Android platforms for widespread dissemination to researchers, clinicians, and the general public. Methods: The DMT app was developed using Apple’s ResearchKit (iOS) and Android’s ResearchStack open source frameworks for developing health research study apps. The DMT app consists of three main modules: self-report, ecological momentary assessment, and active behavioral and cognitive tasks. We conducted a study with a 21-day assessment period (N=116 participants) to validate the novel measures of the DMT app. Results: We used a semantic differential scale to develop self-report trait and momentary state measures of impulsivity as part of the DMT app.
We identified three state factors (inefficient, thrill seeking, and intentional) that correlated highly with established measures of impulsivity. We further leveraged momentary semantic differential questions to examine intraindividual variability, the effect of daily life, and the contextual effect of mood on state impulsivity and daily impulsive behaviors. Our results indicated validation of the self-report semantic differential and related results, and of the mobile behavioral tasks, including the Balloon Analogue Risk Task and Go-No-Go task, with relatively low validity of the mobile Delay Discounting task. We discuss the design implications of these results for mHealth research. Conclusions: This study demonstrates the potential for assessing different facets of trait and state impulsivity during everyday life and in clinical settings using the DMT mobile app. The DMT app can be further used to enhance our understanding of the individual facets that underlie impulsive behaviors, as well as providing a promising avenue for digital interventions. Trial Registration: ClinicalTrials.gov NCT03006653; https://www.clinicaltrials.gov/ct2/show/NCT03006653
  3. Abstract: 100 words Jurors are increasingly exposed to scientific information in the courtroom. To determine whether providing jurors with gist information would assist in their ability to make well-informed decisions, the present experiment utilized a Fuzzy Trace Theory-inspired intervention and tested it against traditional legal safeguards (i.e., judge instructions) by varying the scientific quality of the evidence. The results indicate that jurors who viewed high quality evidence rated the scientific evidence significantly higher than those who viewed low quality evidence, but were unable to moderate the credibility of the expert witness and apply damages appropriately resulting in poor calibration. Summary: <1000 words Jurors and juries are increasingly exposed to scientific information in the courtroom and it remains unclear when they will base their decisions on a reasonable understanding of the relevant scientific information. Without such knowledge, the ability of jurors and juries to make well-informed decisions may be at risk, increasing chances of unjust outcomes (e.g., false convictions in criminal cases). Therefore, there is a critical need to understand conditions that affect jurors’ and juries’ sensitivity to the qualities of scientific information and to identify safeguards that can assist with scientific calibration in the courtroom. The current project addresses these issues with an ecologically valid experimental paradigm, making it possible to assess causal effects of evidence quality and safeguards as well as the role of a host of individual difference variables that may affect perceptions of testimony by scientific experts as well as liability in a civil case. Our main goal was to develop a simple, theoretically grounded tool to enable triers of fact (individual jurors) with a range of scientific reasoning abilities to appropriately weigh scientific evidence in court. 
We did so by testing a Fuzzy Trace Theory-inspired intervention in court, and testing it against traditional legal safeguards. Appropriate use of scientific evidence reflects good calibration – which we define as being influenced more by strong scientific information than by weak scientific information. Inappropriate use reflects poor calibration – defined as relative insensitivity to the strength of scientific information. Fuzzy Trace Theory (Reyna & Brainerd, 1995) predicts that techniques for improving calibration can come from presentation of easy-to-interpret, bottom-line “gist” of the information. Our central hypothesis was that laypeople’s appropriate use of scientific information would be moderated both by external situational conditions (e.g., quality of the scientific information itself, a decision aid designed to convey clearly the “gist” of the information) and individual differences among people (e.g., scientific reasoning skills, cognitive reflection tendencies, numeracy, need for cognition, attitudes toward and trust in science). Identifying factors that promote jurors’ appropriate understanding of and reliance on scientific information will contribute to general theories of reasoning based on scientific evidence, while also providing an evidence-based framework for improving the courts’ use of scientific information. All hypotheses were preregistered on the Open Science Framework. Method Participants completed six questionnaires (counterbalanced): Need for Cognition Scale (NCS; 18 items), Cognitive Reflection Test (CRT; 7 items), Abbreviated Numeracy Scale (ABS; 6 items), Scientific Reasoning Scale (SRS; 11 items), Trust in Science (TIS; 29 items), and Attitudes towards Science (ATS; 7 items). Participants then viewed a video depicting a civil trial in which the defendant sought damages from the plaintiff for injuries caused by a fall. 
The defendant (bar patron) alleged that the plaintiff (bartender) pushed him, causing him to fall and hit his head on the hard floor. Participants were informed at the outset that the defendant was liable; therefore, their task was to determine if the plaintiff should be compensated. Participants were randomly assigned to 1 of 6 experimental conditions: 2 (quality of scientific evidence: high vs. low) x 3 (safeguard to improve calibration: gist information, no-gist information [control], jury instructions). An expert witness (neuroscientist) hired by the court testified regarding the scientific strength of fMRI data (high [90 to 10 signal-to-noise ratio] vs. low [50 to 50 signal-to-noise ratio]) and gist or no-gist information both verbally (i.e., fairly high/about average) and visually (i.e., a graph). After viewing the video, participants were asked if they would like to award damages. If they indicated yes, they were asked to enter a dollar amount. Participants then completed the Positive and Negative Affect Schedule-Modified Short Form (PANAS-MSF; 16 items), expert Witness Credibility Scale (WCS; 20 items), Witness Credibility and Influence on damages for each witness, manipulation check questions, Understanding Scientific Testimony (UST; 10 items), and 3 additional measures were collected, but are beyond the scope of the current investigation. Finally, participants completed demographic questions, including questions about their scientific background and experience. The study was completed via Qualtrics, with participation from students (online vs. in-lab), MTurkers, and non-student community members. After removing those who failed attention check questions, 469 participants remained (243 men, 224 women, 2 did not specify gender) from a variety of racial and ethnic backgrounds (70.2% White, non-Hispanic). Results and Discussion There were three primary outcomes: quality of the scientific evidence, expert credibility (WCS), and damages. 
During initial analyses, each dependent variable was submitted to a separate 3 Gist Safeguard (safeguard, no safeguard, judge instructions) x 2 Scientific Quality (high, low) Analysis of Variance (ANOVA). Consistent with hypotheses, there was a significant main effect of scientific quality on strength of evidence, F(1, 463)=5.099, p=.024; participants who viewed the high quality evidence rated the scientific evidence significantly higher (M=7.44) than those who viewed the low quality evidence (M=7.06). There were no significant main effects or interactions for witness credibility, indicating that the expert that provided scientific testimony was seen as equally credible regardless of scientific quality or gist safeguard. Finally, for damages, consistent with hypotheses, there was a marginally significant interaction between Gist Safeguard and Scientific Quality, F(2, 273)=2.916, p=.056. However, post hoc t-tests revealed significantly higher damages were awarded for low (M=11.50) versus high (M=10.51) scientific quality evidence, F(1, 273)=3.955, p=.048, in the no gist with judge instructions safeguard condition, which was contrary to hypotheses. The data suggest that the judge instructions alone are reversing the pattern: although the difference was nonsignificant, those who received the no gist without judge instructions safeguard awarded higher damages in the high (M=11.34) versus low (M=10.84) scientific quality evidence conditions, F(1, 273)=1.059, p=.30. Together, these provide promising initial results indicating that participants were able to effectively differentiate between high and low scientific quality of evidence, though they inappropriately utilized the scientific evidence through their inability to discern expert credibility and apply damages, resulting in poor calibration. 
These results will provide the basis for more sophisticated analyses including higher order interactions with individual differences (e.g., need for cognition) as well as tests of mediation using path analyses. [References omitted but available by request] Learning Objective: Participants will be able to determine whether providing jurors with gist information would assist in their ability to award damages in a civil trial.
  4. Abstract

    Individual differences in behavior are the raw material upon which natural selection acts, but despite increasing recognition of the value of considering individual differences in the behavior of wild animals to test evolutionary hypotheses, this approach has only recently become popular for testing cognitive abilities. In order for the intraspecific approach with wild animals to be useful for testing evolutionary hypotheses about cognition, researchers must provide evidence that measures of cognitive ability obtained from wild subjects reflect stable, general traits. Here, we used a multi-access box paradigm to investigate the intra-individual reliability of innovative problem-solving ability across time and contexts in wild spotted hyenas (Crocuta crocuta). We also asked whether estimates of reliability were affected by factors such as age-sex class, the length of the interval between tests, or the number of times subjects were tested. We found significant contextual and temporal reliability for problem-solving. However, problem-solving was not reliable for adult subjects, when trials were separated by more than 17 days, or when fewer than seven trials were conducted per subject. In general, the estimates of reliability for problem-solving were comparable to estimates from the literature for other animal behaviors, which suggests that problem-solving is a stable, general trait in wild spotted hyenas.

  5. My dissertation research to date has focused on understanding how incident management teams (IMTs), hastily formed multidisciplinary multiteam systems, cognitively function together as adaptive, joint cognitive systems-of-systems embedded in complex sociotechnical systems. Catastrophic disasters such as Hurricane Harvey highlight the importance of collective efforts for adaptive incident management. Team cognition has emerged as a coordinating mechanism in safety-critical disciplines; however, little is known about cognition in IMTs. Through a scoping review of existing definitions, I proposed an expanded definition that deliberately takes into account IMTs’ unique contextual characteristics, based on three premises: cognition in IMTs (1) manifests as interactions among humans, teams, and technologies at multiple levels of multiteam systems, (2) aims to achieve the system-level cognitive goals of perceiving (P), diagnosing (D), and adapting (A) to information, and (3) serves as an open communication platform for adaptive coordination. Then, I operationalized the proposed definition in a simulated environment as an initial attempt to model IMTs’ system-level cognition. Based on several observations of IMTs’ naturalistic interactive behaviors under different types of disaster scenarios, I proposed a model that can capture how IMTs as joint cognitive systems (or systems-of-systems) perceive (P), diagnose (D), and adapt (A) to information, i.e., the perceive, diagnose, adapt (P, D, A) model. With an emphasis on system-level cognitive goals that apply to multiple units of analysis (e.g., individuals, dyads, teams, and multiteam systems), I could gain an understanding of system-level cognitive adaptation in incident management. Using the P, D, A model as a base platform, I expect to discuss resilience as cognitive adaptation processes along with its implications for human information processing and joint cognitive systems theories. I became a Ph.D. 
candidate after successfully proposing my dissertation research last June. After completing data collection and processing, I am currently working on data analysis and manuscript preparation. As part of an NSF-funded project (NSF EArly-concept Grant for Exploratory Research, #1724676), I believe my dissertation work has the potential to practically impact scenario-based training practices of incident management, and thereby lead to more rapid and better coordinated decision-making in saving lives and infrastructure.