skip to main content


Title: Intelligent token-based code clone detection system for large scale source code
A code clone refers to code fragments in the source code that are identical or similar to each other. Code clones lead difficulties in software maintenance, bug fixing, present poor design and increase the system size. Code clone detection techniques and tools have been proposed by many researchers, however, there is a lack of clone detection techniques especially for large scale repositories. In this paper, we present a token-based clone detector called Intelligent Clone Detection Tool (ICDT) that can detect both exact and near-miss clones from large repositories using a standard workstation environment. In order to evaluate the scalability and the efficiency of ICDT, we use the most recent benchmark which is a big benchmark of real clones, BigCloneBench. In addition, we compare ICDT to four publicly available and state-of-the-art tools.  more » « less
Award ID(s):
1854049
NSF-PAR ID:
10125885
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the Conference on Research in Adaptive and Convergent Systems
Page Range / eLocation ID:
256 to 260
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. How- ever, identifying cross-language clones presents special challenges to the clone detection problem. A lack of common underlying rep- resentation between arbitrary languages means detecting clones requires one of the following solutions: 1) a static analysis frame- work replicated across each targeted language with annotations matching language features across all languages, or 2) a dynamic analysis framework that detects clones based on runtime behavior. In this work, we demonstrate the feasibility of the latter solution, a dynamic analysis approach called SLACC for cross-language clone detection. Like prior clone detection techniques, we use input/out- put behavior to match clones, though we overcome limitations of prior work by amplifying the number of inputs and covering more data types; and as a result, achieve better clusters than prior at- tempts. Since clusters are generated based on input/output behav- ior, SLACC supports cross-language clone detection. As an added challenge, we target a static typed language, Java, and a dynamic typed language, Python. Compared to HitoshiIO, a recent clone de- tection tool for Java, SLACC retrieves 6 times as many clusters and has higher precision (86.7% vs. 30.7%). This is the first work to perform clone detection for dynamic typed languages (precision = 87.3%) and the first to perform clone detection across languages that lack a common underlying repre- sentation (precision = 94.1%). It provides a first step towards the larger goal of scalable language migration tools. 
    more » « less
  2. Clone detection techniques have been explored for decades. Recently, deep learning techniques has been adopted to improve the code representation capability, and improve the state-of-the-art in code clone detection. These approaches usually require a transformation from AST to binary tree to incorporate syntactical information, which introduces overheads. Moreover, these approaches conduct term-embedding, which requires large training datasets. In this paper, we introduce a tree embedding technique to conduct clone detection. Our approach first conducts tree embedding to obtain a node vector for each intermediate node in the AST, which captures the structure information of ASTs. Then we compose a tree vector from its involving node vectors using a lightweight method. Lastly Euclidean distances between tree vectors are measured to determine code clones. We implement our approach in a tool called TECCD and conduct an evaluation using the BigCloneBench (BCB) and 7 other large scale Java projects. The results show that our approach achieves good accuracy and recall and outperforms existing approaches. 
    more » « less
  3. null ; null ; null (Ed.)
    Code clones are fragments of code that are duplicated in the codebase of an application. They create problems with maintainability, duplicate buggy code, and increase the size of the repository. To combat these issues, there currently exists a multitude of programs to detect duplicated code segments. However, there are not many varieties of languages among the benchmarks for code clone detection tools. Without covering enough languages for modern software development, the development of code-clone detection tools remains stunted. This paper describes a novel tool that will take a seed of Python source code and generate Type 1, 2, and 3 code clones in Python. As one of the most used and rapidly-growing languages in modern software development, our testbed will provide the opportunity for Python code-clone detection tools to be developed and tested. 
    more » « less
  4. Identifying similar code in software systems can assist many software engineering tasks such as program understanding and software refactoring. While most approaches focus on identifying code that looks alike, some techniques aim at detecting code that functions alike. Detecting these functional clones - code that functions alike - in object oriented languages remains an open question because of the difficulty in exposing and comparing programs' functionality effectively. We propose a novel technique, In-Vivo Clone Detection, that detects functional clones in arbitrary programs by identifying and mining their inputs and outputs. The key insight is to use existing workloads to execute programs and then measure functional similarities between programs based on their inputs and outputs, which mitigates the problems in object oriented languages reported by prior work. We implement such technique in our system, HitoshiIO, which is open source and freely available. Our experimental results show that HitoshiIO detects more than 800 functional clones across a corpus of 118 projects. In a random sample of the detected clones, HitoshiIO achieves 68+% true positive rate with only 15% false positive rate. 
    more » « less
  5. Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning. On the one hand, human developers tend to write repetitive programs referencing existing code snippets from the current codebase or online resources (e.g., Stack Overflow website) rather than implementing functions from scratch; such behaviors result in a vast number of code clones. In contrast, a deviant clone by mistake might trigger malicious program behaviors. Thus, as a proxy to incorporate developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware contrastive learning drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more efficient in both identifying semantically equivalent programs and differentiating buggy from non-buggy code. 
    more » « less