skip to main content


Title: FIXREVERTER: A Realistic Bug Injection Methodology for Benchmarking Fuzz Testing
Fuzz testing is an active area of research with proposed improvements published at a rapid pace. Such proposals are assessed empirically: Can they be shown to perform better than the status quo? Such an assessment requires a benchmark of target programs with well-identified, realistic bugs. To ease the construction of such a benchmark, this paper presents FIXREVERTER, a tool that automatically injects realistic bugs in a program. FIXREVERTER takes as input a bugfix pattern which contains both code syntax and semantic conditions. Any code site that matches the specified syntax is undone if the semantic conditions are satisfied, as checked by static analysis, thus (re)introducing a likely bug. This paper focuses on three bugfix patterns, which we call conditional-abort, conditional-execute, and conditional-assign, based on a study of fixes in a corpus of Common Vulnerabilities and Exposures (CVEs). Using FIXREVERTER we have built REVBUGBENCH, which consists of 10 programs into which we have injected nearly 8,000 bugs; the programs are taken from FuzzBench and Binutils, and represent common targets of fuzzing evaluations. We have integrated REVBUGBENCH into the FuzzBench service, and used it to evaluate five fuzzers. Fuzzing performance varies by fuzzer and program, as desired/expected. Overall, 219 unique bugs were reported, 19% of which were detected by just one fuzzer.  more » « less
Award ID(s):
1801545
NSF-PAR ID:
10357972
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
31st USENIX Security Symposium
Page Range / eLocation ID:
3699-3715
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Fuzzing reliably and efficiently finds bugs in software, including operating system kernels. In general, higher code coverage leads to the discovery of more bugs. This is why most existing kernel fuzzers adopt strategies to generate a series of inputs that attempt to greedily maximize the amount of code that they exercise. However, simply executing code may not be sufficient to reveal bugs that require specific sequences of actions. Synthesizing inputs to trigger such bugs depends on two aspects: (i) the actions the executed code takes, and (ii) the order in which those actions are taken. An action is a high-level operation, such as a heap allocation, that is performed by the executed code and has a specific semantic meaning. ACTOR, our action-guided kernel fuzzing framework, deviates from traditional methods. Instead of focusing on code coverage optimization, our approach generates fuzzer programs (inputs) that leverage our understanding of triggered actions and their temporal relationships. Specifically, we first capture actions that potentially operate on shared data structures at different times. Then, we synthesize programs using those actions as building blocks, guided by bug templates expressed in our domain-specific language. We evaluated ACTOR on four different versions of the Linux kernel, including two well-tested and frequently updated long-term (5.4.206, 5.10.131) versions, a stable (5.19), and the latest (6.2-rc5) release. Our evaluation revealed a total of 41 previously unknown bugs, of which 9 have already been fixed. Interestingly, 15 (36.59%) of them were discovered in less than a day. 
    more » « less
  2. Fuzzing reliably and efficiently finds bugs in software, including operating system kernels. In general, higher code coverage leads to the discovery of more bugs. This is why most existing kernel fuzzers adopt strategies to generate a series of inputs that attempt to greedily maximize the amount of code that they exercise. However, simply executing code may not be sufficient to reveal bugs that require specific sequences of actions. Synthesizing inputs to trigger such bugs depends on two aspects: (i) the actions the executed code takes, and (ii) the order in which those actions are taken. An action is a high-level operation, such as a heap allocation, that is performed by the executed code and has a specific semantic meaning. ACTOR, our action-guided kernel fuzzing framework, deviates from traditional methods. Instead of focusing on code coverage optimization, our approach generates fuzzer programs (inputs) that leverage our understanding of triggered actions and their temporal relationships. Specifically, we first capture actions that potentially operate on shared data structures at different times. Then, we synthesize programs using those actions as building blocks, guided by bug templates expressed in our domain-specific language. We evaluated ACTOR on four different versions of the Linux kernel, including two well-tested and frequently updated long-term (5.4.206, 5.10.131) versions, a stable (5.19), and the latest (6.2-rc5) release. Our evaluation revealed a total of 41 previously unknown bugs, of which 9 have already been fixed. Interestingly, 15 (36.59%) of them were discovered in less than a day. 
    more » « less
  3. Fuzzing reliably and efficiently finds bugs in software, including operating system kernels. In general, higher code coverage leads to the discovery of more bugs. This is why most existing kernel fuzzers adopt strategies to generate a series of inputs that attempt to greedily maximize the amount of code that they exercise. However, simply executing code may not be sufficient to reveal bugs that require specific sequences of actions. Synthesizing inputs to trigger such bugs depends on two aspects: (i) the actions the executed code takes, and (ii) the order in which those actions are taken. An action is a high-level operation, such as a heap allocation, that is performed by the executed code and has a specific semantic meaning. ACTOR, our action-guided kernel fuzzing framework, deviates from traditional methods. Instead of focusing on code coverage optimization, our approach generates fuzzer programs (inputs) that leverage our understanding of triggered actions and their temporal relationships. Specifically, we first capture actions that potentially operate on shared data structures at different times. Then, we synthesize programs using those actions as building blocks, guided by bug templates expressed in our domain-specific language. We evaluated ACTOR on four different versions of the Linux kernel, including two well-tested and frequently updated long-term (5.4.206, 5.10.131) versions, a stable (5.19), and the latest (6.2-rc5) release. Our evaluation revealed a total of 41 previously unknown bugs, of which 9 have already been fixed. Interestingly, 15 (36.59%) of them were discovered in less than a day. 
    more » « less
  4. Greybox fuzzing and mutation testing are two popular but mostly independent fields of software testing research that have so far had limited overlap. Greybox fuzzing, generally geared towards searching for new bugs, predominantly uses code coverage for selecting inputs to save. Mutation testing is primarily used as a stronger alternative to code coverage in assessing the quality of regression tests; the idea is to evaluate tests for their ability to identify artificially injected faults in the target program. But what if we wanted to use greybox fuzzing to synthesize high-quality regression tests? In this paper, we develop and evaluate Mu2, a Java-based framework for incorporating mutation analysis in the greybox fuzzing loop, with the goal of producing a test-input corpus with a high mutation score. Mu2 makes use of a differential oracle for identifying inputs that exercise interesting program behavior without causing crashes. This paper describes several dynamic optimizations implemented in Mu2 to overcome the high cost of performing mutation analysis with every fuzzer-generated input. These optimizations introduce trade-offs in fuzzing throughput and mutation killing ability, which we evaluate empirically on five real-world Java benchmarks. Overall, variants of Mu2 are able to synthesize test-input corpora with a higher mutation score than state-of-the-art Java fuzzer Zest. 
    more » « less
  5. Compiler fuzzing tools such as Csmith have uncovered many bugs in compilers by randomly sampling programs from a generative model. The success of these tools is often attributed to their ability to generate unexpected corner case inputs that developers tend to overlook during manual testing. At the same time, their chaotic nature makes fuzzer-generated test cases notoriously hard to interpret, which has lead to the creation of input simplification tools such as C-Reduce (for C compiler bugs). In until now unrelated work, researchers have also shown that human-written software tends to be rather repetitive and predictable to language models. Studies show that developers deliberately write more predictable code, whereas code with bugs is relatively unpredictable. In this study, we ask the natural questions of whether this high predictability property of code also, and perhaps counter-intuitively, applies to fuzzer-generated code. That is, we investigate whether fuzzer-generated compiler inputs are deemed unpredictable by a language model built on human-written code and surprisingly conclude that it is not. To the contrary, Csmith fuzzer-generated programs are more predictable on a per-token basis than human-written C programs. Furthermore, bug-triggering tended to be more predictable still than random inputs, and the C-Reduce minimization tool did not substantially increase this predictability. Rather, we find that bug-triggering inputs are unpredictable relative to Csmith's own generative model. This is encouraging; our results suggest promising research directions on incorporating predictability metrics in the fuzzing and reduction tools themselves. 
    more » « less