skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.  more » « less
Award ID(s):
1756023
PAR ID:
10347368
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Annual Meeting of the Association for Computational Linguistics
Page Range / eLocation ID:
4902 to 4912
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Software testing is an essential skill for computer science students. Prior work reports that students desire support in determining what code to test and which scenarios should be tested. In response to this, we present a lightweight testing checklist that contains both tutorial information and testing strategies to guide students in what and how to test. To assess the impact of the testing checklist, we conducted an experimental, controlled A/B study with 32 undergraduate and graduate students. The study task was writing a test suite for an existing program. Students were given either the testing checklist (the experimental group) or a tutorial on a standard coverage tool with which they were already familiar (the control group). By analyzing the combination of student-written tests and survey responses, we found students with the checklist performed as well as or better than the coverage tool group, suggesting a potential positive impact of the checklist (or at minimum, a non-negative impact). This is particularly noteworthy given the control condition of the coverage tool is the state of the practice. These findings suggest that the testing tool support does not need to be sophisticated to be effective. 
    more » « less
  2. Software testing is a critical skill for computing students, but learning and practicing testing can be challenging, particularly for beginners. A recent study suggests that a lightweight testing checklist that contains testing strategies and tutorial information could assist students in writing quality tests. However, students expressed a desire for more support in knowing how to test the code/scenario. Moreover, the potential costs and benefits of the testing checklist are not yet examined in a classroom setting. To that end, we improved the checklist by integrating explicit testing strategies to it (ETS Checklist), which provide step-by-step guidance on how to transfer semantic information from instructions to the possible testing scenarios. In this paper, we report our experiences in designing explicit strategies in unit testing, as well as adapting the ETS Checklist as optional tool support in a CS1.5 course. With the quantitative and qualitative analysis of the survey responses and lab assignment submissions generated by students, we discuss students' engagement with the ETS Checklists. Our results suggest that students who used the checklist intervention had significantly higher quality in their student-authored test code, in terms of code coverage, compared to those who did not, especially for assignments earlier in the course. We also observed students' unawareness of their need for help in writing high-quality tests. 
    more » « less
  3. Natural language processing (NLP) has gained widespread adoption in the development of real-world applications. However, the black-box nature of neural networks in NLP applications poses a challenge when evaluating their performance, let alone ensuring it. Recent research has proposed testing techniques to enhance the trustworthiness of NLP-based applications. However, most existing works use a single, aggregated metric (i.e., accuracy) which is difficult for users to assess NLP model performance on fine-grained aspects, such as LCs. To address this limitation, we present ALiCT, an automated testing technique for validating NLP applications based on their LCs. ALiCT takes user-specified LCs as inputs and produces diverse test suite with test oracles for each of given LC. We evaluate ALiCT on two widely adopted NLP tasks, sentiment analysis and hate speech detection, in terms of diversity, effectiveness, and consistency. Using Self-BLEU and syntactic diversity metrics, our findings reveal that ALiCT generates test cases that are 190% and 2213% more diverse in semantics and syntax, respectively, compared to those produced by state-of-the-art techniques. In addition, ALiCT is capable of producing a larger number of NLP model failures in 22 out of 25 LCs over the two NLP applications. 
    more » « less
  4. Infrastructure as code (IaC) scripts, such as Ansible scripts, are used to provision computing infrastructure at scale. Existence of bugs in IaC test scripts, such as, configuration and security bugs, can be consequential for the provisioned computing infrastructure. A characterization study of bugs in IaC test scripts is the first step to understand the quality concerns that arise during testing of IaC scripts, and also provide recommendations for practitioners on quality assurance. We conduct an empirical study with 4,831 Ansible test scripts mined from 104 open source software (OSS) repositories where we quantify bug frequency, and categorize bugs in test scripts. We further categorize testing patterns, i.e., recurring coding patterns in test scripts, which also correlate with appearance of bugs. From our empirical study, we observe 1.8% of 4,831 Ansible test scripts to include a bug, and 45.2% of the 104 repositories to contain at least one test script that includes bugs. We identify 7 categories of bugs, which includes security bugs and performance bugs that are related with metadata extraction. We also identify 3 testing patterns that correlate with appearance of bugs: 'assertion roulette’, 'local only testing’, and 'remote mystery guest‘. Based on our findings, we advocate for detection and mitigation of the 3 testing patterns as these patterns can have negative implications for troubleshooting failures, reproducible deployments of software, and provisioning of computing infrastructure. 
    more » « less
  5. Compiler correctness is crucial, as miscompilation can falsify program behaviors, leading to serious consequences over the software supply chain. In the literature, fuzzing has been extensively studied to uncover compiler defects. However, compiler fuzzing remains challenging: Existing arts focus on black- and grey-box fuzzing, which generates test programs without sufficient understanding of internal compiler behaviors. As such, they often fail to construct test programs to exercise intricate optimizations. Meanwhile, traditional white-box techniques, such as symbolic execution, are computationally inapplicable to the giant codebase of compiler systems. Recent advances demonstrate that Large Language Models (LLMs) excel in code generation/understanding tasks and even have achieved state-of-the-art performance in black-box fuzzing. Nonetheless, guiding LLMs with compiler source-code information remains a missing piece of research in compiler testing. To this end, we propose WhiteFox, the first white-box compiler fuzzer using LLMs with source-code information to test compiler optimization, with a spotlight on detecting deep logic bugs in the emerging deep learning (DL) compilers. WhiteFox adopts a multi-agent framework: (i) an LLM-based analysis agent examines the low-level optimization source code and produces requirements on the high-level test programs that can trigger the optimization; (ii) an LLM-based generation agent produces test programs based on the summarized requirements. Additionally, optimization-triggering tests are also used as feedback to further enhance the test generation prompt on the fly. Our evaluation on the three most popular DL compilers (i.e., PyTorch Inductor, TensorFlow-XLA, and TensorFlow Lite) shows that WhiteFox can generate high-quality test programs to exercise deep optimizations requiring intricate conditions, practicing up to 8 times more optimizations than state-of-the-art fuzzers. To date, WhiteFox has found in total 101 bugs for the compilers under test, with 92 confirmed as previously unknown and 70 already fixed. Notably, WhiteFox has been recently acknowledged by the PyTorch team, and is in the process of being incorporated into its development workflow. Finally, beyond DL compilers, WhiteFox can also be adapted for compilers in different domains, such as LLVM, where WhiteFox has already found multiple bugs. 
    more » « less