When software engineering researchers discuss "similar" code, we often mean code determined by static analysis to be textually, syntactically or structurally similar, known as code clones (looks alike). Ideally, we would like to also include code that is behaviorally or functionally similar, even if it looks completely different. The state of the art in detecting these behavioral clones focuses on checking the functional equivalence of the inputs and outputs of code fragments, regardless of its internal behavior (focusing only on input and output states). We argue that with an advance in dynamic code clone detection towards detecting behavioral clones (i.e., those with similar execution behavior), we can greatly increase the applications of behavioral clones as a whole for general program understanding tasks.
more »
« less
Identifying Functionally Similar Code in Complex Codebases
Identifying similar code in software systems can assist many software engineering tasks such as program understanding and software refactoring. While most approaches focus on identifying code that looks alike, some techniques aim at detecting code that functions alike. Detecting these functional clones - code that functions alike - in object oriented languages remains an open question because of the difficulty in exposing and comparing programs' functionality effectively. We propose a novel technique, In-Vivo Clone Detection, that detects functional clones in arbitrary programs by identifying and mining their inputs and outputs. The key insight is to use existing workloads to execute programs and then measure functional similarities between programs based on their inputs and outputs, which mitigates the problems in object oriented languages reported by prior work. We implement such technique in our system, HitoshiIO, which is open source and freely available. Our experimental results show that HitoshiIO detects more than 800 functional clones across a corpus of 118 projects. In a random sample of the detected clones, HitoshiIO achieves 68+% true positive rate with only 15% false positive rate.
more »
« less
- PAR ID:
- 10112156
- Date Published:
- Journal Name:
- 24th IEEE International Conference on Program Comprehension (ICPC)
- Page Range / eLocation ID:
- 1 - 10
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
When software engineering researchers discuss "similar" code, we often mean code determined by static analysis to be textually, syntactically or structurally similar, known as code clones (looks alike). Ideally, we would like to also include code that is behaviorally or functionally similar, even if it looks completely different. The state of the art in detecting these behavioral clones focuses on checking the functional equivalence of the inputs and outputs of code fragments, regardless of its internal behavior (focusing only on input and output states). We argue that with an advance in dynamic code clone detection towards detecting behavioral clones (i.e., those with similar execution behavior), we can greatly increase the applications of behavioral clones as a whole for general program understanding tasks.more » « less
-
Modern software engineering practices rely on program comprehension as the most basic underlying component for improving developer productivity and software reliability. Software developers are often tasked to work with unfamiliar code in order to remove security vulnerabilities, port and refactor legacy code, and enhance software with new features desired by users. Automatic identification of behavioral clones, or behaviorally-similar code, is one program comprehension technique that can provide developers with assistance. The idea is to identify other code that "does the same thing" and that may be more intuitive; better documented; or familiar to the developer, to help them understand the code at hand. Unlike the detection of syntactic or structural code clones, behavioral clone detection requires executing workloads or test cases to find code that executes similarly on the same inputs. However, a key problem in behavioral clone detection that has not received adequate attention is the "preponderance of the evidence" problem, which advocates for more convincing evidence from nontrivial test case executions to gain confidence in the behavioral similarities. In other words, similar outputs for some inputs matter more than for others. We present a novel system, SABER, to address the "preponderance of the evidence" problem, for which we adapt the legal metaphor of "more likely to be true than not true" burden of proof. We develop a novel test case generation methodology with three primary dynamic analysis techniques for identifying important behavioral clones. Further, we investigate filtering and weighting schemes to guide developers toward the most convincing behavioral similarities germane to specific software engineering tasks, such as code review, debugging, and introducing new features.more » « less
-
Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. How- ever, identifying cross-language clones presents special challenges to the clone detection problem. A lack of common underlying rep- resentation between arbitrary languages means detecting clones requires one of the following solutions: 1) a static analysis frame- work replicated across each targeted language with annotations matching language features across all languages, or 2) a dynamic analysis framework that detects clones based on runtime behavior. In this work, we demonstrate the feasibility of the latter solution, a dynamic analysis approach called SLACC for cross-language clone detection. Like prior clone detection techniques, we use input/out- put behavior to match clones, though we overcome limitations of prior work by amplifying the number of inputs and covering more data types; and as a result, achieve better clusters than prior at- tempts. Since clusters are generated based on input/output behav- ior, SLACC supports cross-language clone detection. As an added challenge, we target a static typed language, Java, and a dynamic typed language, Python. Compared to HitoshiIO, a recent clone de- tection tool for Java, SLACC retrieves 6 times as many clusters and has higher precision (86.7% vs. 30.7%). This is the first work to perform clone detection for dynamic typed languages (precision = 87.3%) and the first to perform clone detection across languages that lack a common underlying repre- sentation (precision = 94.1%). It provides a first step towards the larger goal of scalable language migration tools.more » « less
-
null; null; null (Ed.)Code clones are fragments of code that are duplicated in the codebase of an application. They create problems with maintainability, duplicate buggy code, and increase the size of the repository. To combat these issues, there currently exists a multitude of programs to detect duplicated code segments. However, there are not many varieties of languages among the benchmarks for code clone detection tools. Without covering enough languages for modern software development, the development of code-clone detection tools remains stunted. This paper describes a novel tool that will take a seed of Python source code and generate Type 1, 2, and 3 code clones in Python. As one of the most used and rapidly-growing languages in modern software development, our testbed will provide the opportunity for Python code-clone detection tools to be developed and tested.more » « less
An official website of the United States government

