skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: TECCD: A Tree Embedding Approach for Code Clone Detection
Clone detection techniques have been explored for decades. Recently, deep learning techniques has been adopted to improve the code representation capability, and improve the state-of-the-art in code clone detection. These approaches usually require a transformation from AST to binary tree to incorporate syntactical information, which introduces overheads. Moreover, these approaches conduct term-embedding, which requires large training datasets. In this paper, we introduce a tree embedding technique to conduct clone detection. Our approach first conducts tree embedding to obtain a node vector for each intermediate node in the AST, which captures the structure information of ASTs. Then we compose a tree vector from its involving node vectors using a lightweight method. Lastly Euclidean distances between tree vectors are measured to determine code clones. We implement our approach in a tool called TECCD and conduct an evaluation using the BigCloneBench (BCB) and 7 other large scale Java projects. The results show that our approach achieves good accuracy and recall and outperforms existing approaches.  more » « less
Award ID(s):
1816594
PAR ID:
10194570
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
IEEE International Conference on Software Maintenance and Evolution (ICSME)
Page Range / eLocation ID:
145 to 156
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Automated analysis of programming data using code representation methods offers valuable services for programmers, from code completion to clone detection to bug detection. Recent studies show the effectiveness of Abstract Syntax Trees (AST), pre-trained Transformer-based models, and graph-based embeddings in programming code representation. However, pre-trained large language models lack interpretability, while other embedding-based approaches struggle with extracting important information from large ASTs. This study proposes a novel Subtree-based Attention Neural Network (SANN) to address these gaps by integrating different components: an optimized sequential subtree extraction process using Genetic algorithm optimization, a two-way embedding approach, and an attention network. We investigate the effectiveness of SANN by applying it to two different tasks: program correctness prediction and algorithm detection on two educational datasets containing both small and large-scale code snippets written in Java and C, respectively. The experimental results show SANN's competitive performance against baseline models from the literature, including code2vec, ASTNN, TBCNN, CodeBERT, GPT-2, and MVG, regarding accurate predictive power. Finally, a case study is presented to show the interpretability of our model prediction and its application for an important human-centered computing application, student modeling. Our results indicate the effectiveness of the SANN model in capturing important syntactic and semantic information from students' code, allowing the construction of accurate student models, which serve as the foundation for generating adaptive instructional support such as individualized hints and feedback. 
    more » « less
  2. A code clone refers to code fragments in the source code that are identical or similar to each other. Code clones lead difficulties in software maintenance, bug fixing, present poor design and increase the system size. Code clone detection techniques and tools have been proposed by many researchers, however, there is a lack of clone detection techniques especially for large scale repositories. In this paper, we present a token-based clone detector called Intelligent Clone Detection Tool (ICDT) that can detect both exact and near-miss clones from large repositories using a standard workstation environment. In order to evaluate the scalability and the efficiency of ICDT, we use the most recent benchmark which is a big benchmark of real clones, BigCloneBench. In addition, we compare ICDT to four publicly available and state-of-the-art tools. 
    more » « less
  3. null (Ed.)
    Traditional network embedding primarily focuses on learning a continuous vector representation for each node, preserving network structure and/or node content information, such that off-the-shelf machine learning algorithms can be easily applied to the vector-format node representations for network analysis. However, the learned continuous vector representations are inefficient for large-scale similarity search, which often involves finding nearest neighbors measured by distance or similarity in a continuous vector space. In this article, we propose a search efficient binary network embedding algorithm called BinaryNE to learn a binary code for each node, by simultaneously modeling node context relations and node attribute relations through a three-layer neural network. BinaryNE learns binary node representations using a stochastic gradient descent-based online learning algorithm. The learned binary encoding not only reduces memory usage to represent each node, but also allows fast bit-wise comparisons to support faster node similarity search than using Euclidean or other distance measures. Extensive experiments and comparisons demonstrate that BinaryNE not only delivers more than 25 times faster search speed, but also provides comparable or better search quality than traditional continuous vector based network embedding methods. The binary codes learned by BinaryNE also render competitive performance on node classification and node clustering tasks. The source code of the BinaryNE algorithm is available at https://github.com/daokunzhang/BinaryNE. 
    more » « less
  4. Ko, Hanseok (Ed.)
    Malware represents a significant security concern in today’s digital landscape, as it can destroy or disable operating systems, steal sensitive user information, and occupy valuable disk space. However, current malware detection methods, such as static-based and dynamic-based approaches, struggle to identify newly developed ("zero-day") malware and are limited by customized virtual machine (VM) environments. To overcome these limitations, we propose a novel malware detection approach that leverages deep learning, mathematical techniques, and network science. Our approach focuses on static and dynamic analysis and utilizes the Low-Level Virtual Machine (LLVM) to profile applications within a complex network. The generated network topologies are input into the GraphSAGE architecture to efficiently distinguish between benign and malicious software applications, with the operation names denoted as node features. Importantly, the GraphSAGE models analyze the network’s topological geometry to make predictions, enabling them to detect state-of-the-art malware and prevent potential damage during execution in a VM. To evaluate our approach, we conduct a study on a dataset comprising source code from 24,376 applications, specifically written in C/C++, sourced directly from widely-recognized malware and various types of benign software. The results show a high detection performance with an Area Under the Receiver Operating Characteristic Curve (AUROC) of 99.85%. Our approach marks a substantial improvement in malware detection, providing a notably more accurate and efficient solution when compared to current state-of-the-art malware detection methods. The code is released at https://github.com/HantangZhang/MGN. 
    more » « less
  5. In recent years, there has been a growing consensus among researchers regarding the dual nature of code clones. While some instances of code are valuable for reuse or extraction as components, the utilization of specific code segments can pose significant maintenance challenges for developers. Consequently, the judicious management of code clones has emerged as a pivotal solution to address these issues. Nevertheless, it remains critical to ascertain the number of code clones within a project, and identify components where code clones are more concentrated. In this paper, we introduce three novel metrics, namely Clone Distribution, Clone Density, and Clone Entropy (the dispersion of code clone within a project), for the quantification and characterization of code clones. We have formulated associated mathematical expressions to precisely represent these code clone metrics. We collected a dataset covering three different domains of Java projects, formulated research questions for the proposed three metrics, conducted a large-scale empirical study, and provided detailed numerical statistics. Furthermore, we have introduced a novel clone visualization approach, which effectively portrays Clone Distribution and Clone Density. Developers can leverage this approach to efficiently identify target clones. By reviewing clone code concerning its distribution, we have identified nine distinct code clone patterns and summarized specific clone management strategies that have the potential to enhance the efficiency of clone management practices. Our experiments demonstrate that the proposed code clone metrics provide valuable insights into the nature of code clones, and the visualization approach assists developers in inspecting and summarizing clone code patterns. 
    more » « less