Although measuring held-out accuracy has been the primary approach to evaluating generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models focus either on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitates comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs, as users without it.
AI and Cognitive Testing: A New Conceptual Framework and Roadmap
Understanding how a person thinks, i.e., measuring a single individual’s cognitive characteristics, is challenging because cognition is not directly observable. Practically speaking, standardized cognitive tests (tests of IQ, memory, attention, etc.), with results interpreted by expert clinicians, represent the state of the art in measuring a person’s cognition. Three areas of AI show particular promise for improving the effectiveness of this kind of cognitive testing: 1) behavioral sensing, to more robustly quantify individual test-taker behaviors; 2) data mining, to identify and extract meaningful patterns from behavioral datasets; and 3) cognitive modeling, to help map observed behaviors onto hypothesized cognitive strategies. We bring these three areas of AI research together in a unified conceptual framework and provide a sampling of recent work in each area. Continued research at the nexus of AI and cognitive testing has potentially far-reaching implications for society in virtually every context in which measuring cognition is important, including research across many disciplines of cognitive science as well as applications in clinical, educational, and workforce settings.
- Award ID(s):
- 1730044
- PAR ID:
- 10209942
- Date Published:
- Journal Name:
- Proceedings of the 41st Annual Conference of the Cognitive Science Society
- Page Range / eLocation ID:
- 2065-2070
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Purpose: As online course enrollments increase, it is important to understand how common course features influence students' behaviors and performance. Asynchronous online courses often include a discussion forum to promote community through interaction between students and instructors. Students interact both socially and cognitively; instructors' engagement often demonstrates social or teaching presence. Students' engagement in the discussions introduces both intrinsic and extraneous cognitive load. The purpose of this study is to validate an instrument for measuring cognitive load in asynchronous online discussions. Design/methodology/approach: This study presents the validation of the NASA-TLX instrument for measuring cognitive load in asynchronous online discussions in an introductory physics course. Findings: The instrument demonstrated reliability for a model with four subscales for all five discrete tasks. This study is foundational for future work that aims at testing the efficacy of interventions and at reducing extraneous cognitive load in asynchronous online discussions. Research limitations/implications: Nonresponse error due to the unincentivized, voluntary nature of the survey introduces a sample-related limitation. Practical implications: This study provides a strong foundation for future research focused on testing the effects of interventions aimed at reducing extraneous cognitive load in asynchronous online discussions. Originality/value: This is a novel application of the NASA-TLX instrument for measuring cognitive load in asynchronous online discussions.
-
Feldman, Marcus (Ed.) Characterizing the relationship between disease testing behaviors and infectious disease dynamics is of great importance for public health. Tests for both current and past infection can influence disease-related behaviors at the individual level, while population-level knowledge of an epidemic’s course may feed back to affect one’s likelihood of taking a test. The COVID-19 pandemic has generated testing data on an unprecedented scale for tests detecting both current infection (PCR, antigen) and past infection (serology); this opens the way to characterizing the complex relationship between testing behavior and infection dynamics. Leveraging a rich database of individualized COVID-19 testing histories in New Jersey, we analyze the behavioral relationships between PCR and serology tests, infection, and vaccination. We quantify interactions between individuals’ test-taking tendencies and their past testing and infection histories, finding that PCR tests were disproportionately taken by people currently infected, and serology tests were disproportionately taken by people with past infection or vaccination. The effects of previous positive test results on testing behavior are less consistent, as individuals with past PCR positives were more likely to take subsequent PCR and serology tests at some periods of the epidemic time course and less likely at others. Lastly, we fit a model to the titer values collected from serology tests to infer vaccination trends, finding a marked decrease in vaccination rates among individuals who had previously received a positive PCR test. These results exemplify the utility of individualized testing histories in uncovering hidden behavioral variables affecting testing and vaccination.
-
The purpose of this study is to investigate the combined impact of mask-wearing on cognitive performance and risk-taking behaviors. Participants were divided into a control group (N=24) without a surgical mask and an experimental group (N=27) with one. Both groups completed the tasks in a warm environment (30 °C), where the conditions can also reduce cognition and decision-making. These conditions are common in indoor spaces without sufficient air conditioning during a heat wave. Cognition and risk-taking behaviors were assessed using computerized tests. Results showed that mask-wearing in a warm environment did not negatively impact cognitive performance, nor did it increase risk-taking behavior as the concept of risk compensation predicts, even when the CO2 concentration was elevated to approximately 29,000 ppm on average inside the mask. On the contrary, mask-wearing participants showed less risk-taking behavior, slightly better response inhibition, and better short-term memory. These results do not support previous findings suggesting that even a moderately increased indoor CO2 level can reduce cognition. We hypothesize that human adaptation effects (due to mask-wearing on a daily basis) make people less vulnerable to the adverse environment (i.e., excessive air temperature and CO2 levels), which will be investigated in future studies.
-
We report on an emerging undergraduate research framework from the NSF Research Experiences for Undergraduates (REU) Site in Computational Sensing at Rochester Institute of Technology. Unobtrusive observation of people's physiological, behavioral, cognitive, and environmental data is increasingly enabling new computing experiences. This REU Site recognizes the accumulating need for training emerging researchers to gain experience in and grapple with systematic collection, processing, analysis, and interpretation of heterogeneous human-elicited information. Instead of merely leveraging traditional physiological measurements, our research program takes a holistic approach to the capture and integration of such sensing data. For instance, the data may also include linguistic and eye movement behaviors, or social and geospatial contextual information. These modalities provide rich information. An example of a multimodal data collection scenario from a project in the REU Site's first year is shown in Figure 1. This project applied sensing for observing and measuring cognitive reactions as participants engaged in tasks involving web-based video lecturing.