skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Pyclone: A Python Code Clone Test Bank Generator
Code clones are fragments of code that are duplicated in the codebase of an application. They create problems with maintainability, duplicate buggy code, and increase the size of the repository. To combat these issues, there currently exists a multitude of programs to detect duplicated code segments. However, there are not many varieties of languages among the benchmarks for code clone detection tools. Without covering enough languages for modern software development, the development of code-clone detection tools remains stunted. This paper describes a novel tool that will take a seed of Python source code and generate Type 1, 2, and 3 code clones in Python. As one of the most used and rapidly-growing languages in modern software development, our testbed will provide the opportunity for Python code-clone detection tools to be developed and tested.  more » « less
Award ID(s):
1854049
PAR ID:
10310332
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Information Science and Applications. Lecture Notes in Electrical Engineering
Volume:
739
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. How- ever, identifying cross-language clones presents special challenges to the clone detection problem. A lack of common underlying rep- resentation between arbitrary languages means detecting clones requires one of the following solutions: 1) a static analysis frame- work replicated across each targeted language with annotations matching language features across all languages, or 2) a dynamic analysis framework that detects clones based on runtime behavior. In this work, we demonstrate the feasibility of the latter solution, a dynamic analysis approach called SLACC for cross-language clone detection. Like prior clone detection techniques, we use input/out- put behavior to match clones, though we overcome limitations of prior work by amplifying the number of inputs and covering more data types; and as a result, achieve better clusters than prior at- tempts. Since clusters are generated based on input/output behav- ior, SLACC supports cross-language clone detection. As an added challenge, we target a static typed language, Java, and a dynamic typed language, Python. Compared to HitoshiIO, a recent clone de- tection tool for Java, SLACC retrieves 6 times as many clusters and has higher precision (86.7% vs. 30.7%). This is the first work to perform clone detection for dynamic typed languages (precision = 87.3%) and the first to perform clone detection across languages that lack a common underlying repre- sentation (precision = 94.1%). It provides a first step towards the larger goal of scalable language migration tools. 
    more » « less
  2. A code clone refers to code fragments in the source code that are identical or similar to each other. Code clones lead difficulties in software maintenance, bug fixing, present poor design and increase the system size. Code clone detection techniques and tools have been proposed by many researchers, however, there is a lack of clone detection techniques especially for large scale repositories. In this paper, we present a token-based clone detector called Intelligent Clone Detection Tool (ICDT) that can detect both exact and near-miss clones from large repositories using a standard workstation environment. In order to evaluate the scalability and the efficiency of ICDT, we use the most recent benchmark which is a big benchmark of real clones, BigCloneBench. In addition, we compare ICDT to four publicly available and state-of-the-art tools. 
    more » « less
  3. Identifying similar code in software systems can assist many software engineering tasks such as program understanding and software refactoring. While most approaches focus on identifying code that looks alike, some techniques aim at detecting code that functions alike. Detecting these functional clones - code that functions alike - in object oriented languages remains an open question because of the difficulty in exposing and comparing programs' functionality effectively. We propose a novel technique, In-Vivo Clone Detection, that detects functional clones in arbitrary programs by identifying and mining their inputs and outputs. The key insight is to use existing workloads to execute programs and then measure functional similarities between programs based on their inputs and outputs, which mitigates the problems in object oriented languages reported by prior work. We implement such technique in our system, HitoshiIO, which is open source and freely available. Our experimental results show that HitoshiIO detects more than 800 functional clones across a corpus of 118 projects. In a random sample of the detected clones, HitoshiIO achieves 68+% true positive rate with only 15% false positive rate. 
    more » « less
  4. Modern software engineering practices rely on program comprehension as the most basic underlying component for improving developer productivity and software reliability. Software developers are often tasked to work with unfamiliar code in order to remove security vulnerabilities, port and refactor legacy code, and enhance software with new features desired by users. Automatic identification of behavioral clones, or behaviorally-similar code, is one program comprehension technique that can provide developers with assistance. The idea is to identify other code that "does the same thing" and that may be more intuitive; better documented; or familiar to the developer, to help them understand the code at hand. Unlike the detection of syntactic or structural code clones, behavioral clone detection requires executing workloads or test cases to find code that executes similarly on the same inputs. However, a key problem in behavioral clone detection that has not received adequate attention is the "preponderance of the evidence" problem, which advocates for more convincing evidence from nontrivial test case executions to gain confidence in the behavioral similarities. In other words, similar outputs for some inputs matter more than for others. We present a novel system, SABER, to address the "preponderance of the evidence" problem, for which we adapt the legal metaphor of "more likely to be true than not true" burden of proof. We develop a novel test case generation methodology with three primary dynamic analysis techniques for identifying important behavioral clones. Further, we investigate filtering and weighting schemes to guide developers toward the most convincing behavioral similarities germane to specific software engineering tasks, such as code review, debugging, and introducing new features. 
    more » « less
  5. In recent years, there has been a growing consensus among researchers regarding the dual nature of code clones. While some instances of code are valuable for reuse or extraction as components, the utilization of specific code segments can pose significant maintenance challenges for developers. Consequently, the judicious management of code clones has emerged as a pivotal solution to address these issues. Nevertheless, it remains critical to ascertain the number of code clones within a project, and identify components where code clones are more concentrated. In this paper, we introduce three novel metrics, namely Clone Distribution, Clone Density, and Clone Entropy (the dispersion of code clone within a project), for the quantification and characterization of code clones. We have formulated associated mathematical expressions to precisely represent these code clone metrics. We collected a dataset covering three different domains of Java projects, formulated research questions for the proposed three metrics, conducted a large-scale empirical study, and provided detailed numerical statistics. Furthermore, we have introduced a novel clone visualization approach, which effectively portrays Clone Distribution and Clone Density. Developers can leverage this approach to efficiently identify target clones. By reviewing clone code concerning its distribution, we have identified nine distinct code clone patterns and summarized specific clone management strategies that have the potential to enhance the efficiency of clone management practices. Our experiments demonstrate that the proposed code clone metrics provide valuable insights into the nature of code clones, and the visualization approach assists developers in inspecting and summarizing clone code patterns. 
    more » « less