skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, October 10 until 2:00 AM ET on Friday, October 11 due to maintenance. We apologize for the inconvenience.


Title: On the Naturalness of Fuzzer-Generated Code
Compiler fuzzing tools such as Csmith have uncovered many bugs in compilers by randomly sampling programs from a generative model. The success of these tools is often attributed to their ability to generate unexpected corner case inputs that developers tend to overlook during manual testing. At the same time, their chaotic nature makes fuzzer-generated test cases notoriously hard to interpret, which has lead to the creation of input simplification tools such as C-Reduce (for C compiler bugs). In until now unrelated work, researchers have also shown that human-written software tends to be rather repetitive and predictable to language models. Studies show that developers deliberately write more predictable code, whereas code with bugs is relatively unpredictable. In this study, we ask the natural questions of whether this high predictability property of code also, and perhaps counter-intuitively, applies to fuzzer-generated code. That is, we investigate whether fuzzer-generated compiler inputs are deemed unpredictable by a language model built on human-written code and surprisingly conclude that it is not. To the contrary, Csmith fuzzer-generated programs are more predictable on a per-token basis than human-written C programs. Furthermore, bug-triggering tended to be more predictable still than random inputs, and the C-Reduce minimization tool did not substantially increase this predictability. Rather, we find that bug-triggering inputs are unpredictable relative to Csmith's own generative model. This is encouraging; our results suggest promising research directions on incorporating predictability metrics in the fuzzing and reduction tools themselves.  more » « less
Award ID(s):
2120955
NSF-PAR ID:
10338202
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
19th International Conference on Mining Software Repositories
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Fuzz testing has been used to find bugs in programs since the 1990s, but despite decades of dedicated research, there is still no consensus on which fuzzing techniques work best. One reason for this is the paucity of ground truth: bugs in real programs with known root causes and triggering inputs are difficult to collect at a meaningful scale. Bug injection technologies that add synthetic bugs into real programs seem to offer a solution, but the differences in finding these synthetic bugs versus organic bugs have not previously been explored at a large scale. Using over 80 years of CPU time, we ran eight fuzzers across 20 targets from the Rode0day bug-finding competition and the LAVA-M corpus. Experiments were standardized with respect to compute resources and metrics gathered. These experiments show differences in fuzzer performance as well as the impact of various configuration options. For instance, it is clear that integrating symbolic execution with mutational fuzzing is very effective and that using dictionaries improves performance. Other conclusions are less clear-cut; for example, no one fuzzer beat all others on all tests. It is noteworthy that no fuzzer found any organic bugs (i.e., one reported in a CVE), despite 50 such bugs being available for discovery in the fuzzing corpus. A close analysis of results revealed a possible explanation: a dramatic difference between where synthetic and organic bugs live with respect to the "main path" discovered by fuzzers. We find that recent updates to bug injection systems have made synthetic bugs more difficult to discover, but they are still significantly easier to find than organic bugs in our target programs. Finally, this study identifies flaws in bug injection techniques and suggests a number of axes along which synthetic bugs should be improved. 
    more » « less
  2. Fuzzing reliably and efficiently finds bugs in software, including operating system kernels. In general, higher code coverage leads to the discovery of more bugs. This is why most existing kernel fuzzers adopt strategies to generate a series of inputs that attempt to greedily maximize the amount of code that they exercise. However, simply executing code may not be sufficient to reveal bugs that require specific sequences of actions. Synthesizing inputs to trigger such bugs depends on two aspects: (i) the actions the executed code takes, and (ii) the order in which those actions are taken. An action is a high-level operation, such as a heap allocation, that is performed by the executed code and has a specific semantic meaning. ACTOR, our action-guided kernel fuzzing framework, deviates from traditional methods. Instead of focusing on code coverage optimization, our approach generates fuzzer programs (inputs) that leverage our understanding of triggered actions and their temporal relationships. Specifically, we first capture actions that potentially operate on shared data structures at different times. Then, we synthesize programs using those actions as building blocks, guided by bug templates expressed in our domain-specific language. We evaluated ACTOR on four different versions of the Linux kernel, including two well-tested and frequently updated long-term (5.4.206, 5.10.131) versions, a stable (5.19), and the latest (6.2-rc5) release. Our evaluation revealed a total of 41 previously unknown bugs, of which 9 have already been fixed. Interestingly, 15 (36.59%) of them were discovered in less than a day. 
    more » « less
  3. Fuzzing reliably and efficiently finds bugs in software, including operating system kernels. In general, higher code coverage leads to the discovery of more bugs. This is why most existing kernel fuzzers adopt strategies to generate a series of inputs that attempt to greedily maximize the amount of code that they exercise. However, simply executing code may not be sufficient to reveal bugs that require specific sequences of actions. Synthesizing inputs to trigger such bugs depends on two aspects: (i) the actions the executed code takes, and (ii) the order in which those actions are taken. An action is a high-level operation, such as a heap allocation, that is performed by the executed code and has a specific semantic meaning. ACTOR, our action-guided kernel fuzzing framework, deviates from traditional methods. Instead of focusing on code coverage optimization, our approach generates fuzzer programs (inputs) that leverage our understanding of triggered actions and their temporal relationships. Specifically, we first capture actions that potentially operate on shared data structures at different times. Then, we synthesize programs using those actions as building blocks, guided by bug templates expressed in our domain-specific language. We evaluated ACTOR on four different versions of the Linux kernel, including two well-tested and frequently updated long-term (5.4.206, 5.10.131) versions, a stable (5.19), and the latest (6.2-rc5) release. Our evaluation revealed a total of 41 previously unknown bugs, of which 9 have already been fixed. Interestingly, 15 (36.59%) of them were discovered in less than a day. 
    more » « less
  4. Fuzzing reliably and efficiently finds bugs in software, including operating system kernels. In general, higher code coverage leads to the discovery of more bugs. This is why most existing kernel fuzzers adopt strategies to generate a series of inputs that attempt to greedily maximize the amount of code that they exercise. However, simply executing code may not be sufficient to reveal bugs that require specific sequences of actions. Synthesizing inputs to trigger such bugs depends on two aspects: (i) the actions the executed code takes, and (ii) the order in which those actions are taken. An action is a high-level operation, such as a heap allocation, that is performed by the executed code and has a specific semantic meaning. ACTOR, our action-guided kernel fuzzing framework, deviates from traditional methods. Instead of focusing on code coverage optimization, our approach generates fuzzer programs (inputs) that leverage our understanding of triggered actions and their temporal relationships. Specifically, we first capture actions that potentially operate on shared data structures at different times. Then, we synthesize programs using those actions as building blocks, guided by bug templates expressed in our domain-specific language. We evaluated ACTOR on four different versions of the Linux kernel, including two well-tested and frequently updated long-term (5.4.206, 5.10.131) versions, a stable (5.19), and the latest (6.2-rc5) release. Our evaluation revealed a total of 41 previously unknown bugs, of which 9 have already been fixed. Interestingly, 15 (36.59%) of them were discovered in less than a day. 
    more » « less
  5. null (Ed.)
    Compiler bugs can be disastrous since they could affect all the software systems built on the buggy compilers. Meanwhile, diagnosing compiler bugs is extremely challenging since usually limited debugging information is available and a large number of compiler files can be suspicious. More specifically, when compiling a given bug-triggering test program, hundreds of compiler files are usually involved, and can all be treated as suspicious buggy files. To facilitate compiler debugging, in this paper we propose the first reinforcement compiler bug isolation approach via structural mutation, called RecBi. For a given bug-triggering test program, RecBi first augments traditional local mutation operators with structural ones to transform it into a set of passing test programs. Since not all the passing test programs can help isolate compiler bugs effectively, RecBi further leverages reinforcement learning to intelligently guide the process of passing test program generation. Then, RecBi ranks all the suspicious files by analyzing the compiler execution traces of the generated passing test programs and the given failing test program following the practice of compiler bug isolation. The experimental results on 120 real bugs from two most popular C open-source compilers, i.e., GCC and LLVM, show that RecBi is able to isolate about 23%/58%/78% bugs within Top-1/Top-5/Top-10 compiler files, and significantly outperforms the state-of-the-art compiler bug isolation approach by improving 92.86%/55.56%/25.68% isolation effectiveness in terms of Top-1/Top-5/Top-10 results. 
    more » « less