Software developers often repeat the same code changes within a project or across different projects. These repetitive changes are known as “code change patterns” (CPATs). Automating CPATs is crucial to expedite the software development process. While current Transformation by Example (TBE) techniques can automate CPATs, they are limited by the quality and quantity of the provided input examples. Thus, they miss transforming code variations that do not have the exact syntax, data-, or control-flow of the provided input examples, despite being semantically similar. Large Language Models (LLMs), pre-trained on extensive source code datasets, offer a potential solution. Harnessing the capability of LLMs to generate semantically equivalent, yet previously unseen variants of the original CPAT could significantly increase the effectiveness of TBE systems. In this paper, we first discover best practices for harnessing LLMs to generate code variants that meet three criteria: correctness (semantic equivalence to the original CPAT), usefulness (reflecting what developers typically write), and applicability (aligning with the primary intent of the original CPAT). We then implement these practices in our tool PyCraft, which synergistically combines static code analysis, dynamic analysis, and LLM capabilities. By employing chain-of-thought reasoning, PyCraft generates variations of input examples and comprehensive test cases that identify correct variations with an F-measure of 96.6%. Our algorithm uses feedback iteration to expand the original input examples by an average factor of 58x. Using these richly generated examples, we inferred transformation rules and then automated these changes, resulting in an increase of up to 39x, with an average increase of 14x in target codes compared to a previous state-of-the-art tool that relies solely on static analysis. We submitted patches generated by PyCraft to a range of projects, notably esteemed ones like microsoft/DeepSpeed and IBM/inFairness. Their developers accepted and merged 83% the 86 CPAT instances submitted through 44 pull requests. This confirms the usefulness of these changes.
more »
« less
PYEVOLVE: Automating Frequent Code Changes in Python ML Systems
Because of the naturalness of software and the rapid evolution of Machine Learning (ML) techniques, frequently repeated code change patterns (CPATs) occur often. They range from simple API migrations to changes involving several complex control structures such as for loops. While manually performing CPATs is tedious, the current state-of-the-art techniques for inferring transformation rules are not advanced enough to handle unseen variants of complex CPATs, resulting in a low recall rate. In this paper we present a novel, automated workflow that mines CPATs, infers the transformation rules, and then transplants them automatically to new target sites. We designed, implemented, evaluated and released this in a tool, PYEVOLVE. At its core is a novel data-flow, control-flow aware transformation rule inference engine. Our technique allows us to advance the state-of-the-art for transformation-by-example tools; without it, 70% of the code changes that PYEVOLVE transforms would not be possible to automate. Our thorough empirical evaluation of over 40,000 transformations shows 97% precision and 94% recall. By accepting 90% of CPATs generated by PYEVOLVE in famous open-source projects, developers confirmed its changes are useful.
more »
« less
- Award ID(s):
- 2213767
- PAR ID:
- 10471901
- Publisher / Repository:
- IEEE
- Date Published:
- ISBN:
- 978-1-6654-5701-9
- Page Range / eLocation ID:
- 995 to 1007
- Format(s):
- Medium: X
- Location:
- Melbourne, Australia
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Clone detection techniques have been explored for decades. Recently, deep learning techniques has been adopted to improve the code representation capability, and improve the state-of-the-art in code clone detection. These approaches usually require a transformation from AST to binary tree to incorporate syntactical information, which introduces overheads. Moreover, these approaches conduct term-embedding, which requires large training datasets. In this paper, we introduce a tree embedding technique to conduct clone detection. Our approach first conducts tree embedding to obtain a node vector for each intermediate node in the AST, which captures the structure information of ASTs. Then we compose a tree vector from its involving node vectors using a lightweight method. Lastly Euclidean distances between tree vectors are measured to determine code clones. We implement our approach in a tool called TECCD and conduct an evaluation using the BigCloneBench (BCB) and 7 other large scale Java projects. The results show that our approach achieves good accuracy and recall and outperforms existing approaches.more » « less
-
Extensive research has been conducted to explore cryptographic API misuse in Java. However, despite the tremendous popularity of the Python language, uncovering similar issues has not been fully explored. The current static code analysis tools for Python are unable to scan the increasing complexity of the source code. This limitation decreases the analysis depth, resulting in more undetected cryptographic misuses. In this research, we propose Cryptolation, a Static Code Analysis (SCA) tool that provides security guarantees for complex Python cryptographic code. Most existing analysis tools for Python solely focus on specific Frameworks such as Django or Flask. However, using a SCA approach, Cryptolation focuses on the language and not any framework. Cryptolation performs an inter-procedural data-flow analysis to handle many Python language features through variable inference (statically predicting what the variable value is) and SCA. Cryptolation covers 59 Python cryptographic modules and can identify 18 potential cryptographic misuses that involve complex language features. In this paper, we also provide a comprehensive analysis and a state-of-the-art benchmark for understanding the Python cryptographic Application Program Interface (API) misuses and their detection. Our state-of-the-art benchmark PyCryptoBench includes 1,836 Python cryptographic test cases that cover both 18 cryptographic rules and five language features. PyCryptoBench also provides a framework for evaluating and comparing different cryptographic scanners for Python. To evaluate the performance of our proposed cryptographic Python scanner, we evaluated Cryptolation against three other state-of-the-art tools: Bandit, Semgrep, and Dlint. We evaluated these four tools using our benchmark PyCryptoBench and manual evaluation of (four Top-Ranked and 939 Un-Ranked) real-world projects. Our results reveal that, overall, Cryptolation achieved the highest precision throughout our testing; and the highest accuracy on our benchmark. Cryptolation had 100% precision on PyCryptoBench, and the highest precision on real-world projects.more » « less
-
Disassembling stripped binaries is a prominent challenge for binary analysis, due to the interleaving of code segments and data, and the difficulties of resolving control transfer targets of indirect calls and jumps. As a result, most existing disassemblers have both false positives (FP) and false negatives (FN). We observe that uncertainty is inevitable in disassembly due to the information loss during compilation and code generation. Therefore, we propose to model such uncertainty using probabilities and propose a novel disassembly technique, which computes a probability for each address in the code space, indicating its likelihood of being a true positive instruction. The probability is computed from a set of features that are reachable to an address, including control flow and data flow features. Our experiments with more than two thousands binaries show that our technique does not have any FN and has only 3.7% FP. In comparison, a state-of-the-art superset disassembly technique has 85% FP. A rewriter built on our disassembly can generate binaries that are only half of the size of those by superset disassembly and run 3% faster. While many widely-used disassemblers such as IDA and BAP suffer from missing function entries, our experiment also shows that even without any function entry information, our disassembler can still achieve 0 FN and 6.8% FP.more » « less
-
Android devices, handling sensitive data like call records and text messages, are prone to privacy breaches. Existing information flow tracking systems face difficulties in detecting these breaches due to two main challenges: the multi-layered Android platform using different programming languages (Java and C/C++), and the complex, event-driven execution flow of Android apps that complicates tracking, especially across these language barriers. Our system, DryJIN, addresses this by effectively tracking information flow within and across both Java and native modules. Utilizing symbolic execution for native code data flows and integrating it with Java data flows, DryJIN enhances existing static analysis techniques (Argus-SAF, JuCify, and FlowDroid) to cover previously unaddressed information flow patterns. We validated DryJIN ’s effectiveness through a comprehensive evaluation on over 168k apps, including malware and real-world apps, demonstrating its superiority over current state-of-the-art methods.more » « less
An official website of the United States government

