skip to main content

Title: Lightweight multi-language syntax transformation with parser parser combinators
Automatically transforming programs is hard, yet critical for automated program refactoring, rewriting, and repair. Multi-language syntax transformation is especially hard due to heterogeneous representations in syntax, parse trees, and abstract syntax trees (ASTs). Our insight is that the problem can be decomposed such that (1) a common grammar expresses the central context-free language (CFL) properties shared by many contemporary languages and (2) open extension points in the grammar allow customizing syntax (e.g., for balanced delimiters) and hooks in smaller parsers to handle language-specific syntax (e.g., for comments). Our key contribution operationalizes this decomposition using a Parser Parser combinator (PPC), a mechanism that generates parsers for matching syntactic fragments in source code by parsing declarative user-supplied templates. This allows our approach to detach from translating input programs to any particular abstract syntax tree representation, and lifts syntax rewriting to a modularly-defined parsing problem. A notable effect is that we skirt the complexity and burden of defining additional translation layers between concrete user input templates and an underlying abstract syntax representation. We demonstrate that these ideas admit efficient and declarative rewrite templates across 12 languages, and validate effectiveness of our approach by producing correct and desirable lightweight transformations on popular real-world projects more » (over 50 syntactic changes produced by our approach have been merged into 40+). Our declarative rewrite patterns require an order of magnitude less code compared to analog implementations in existing, language-specific tools. « less
Award ID(s):
Publication Date:
Journal Name:
Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation
Page Range or eLocation-ID:
363 to 378
Sponsoring Org:
National Science Foundation
More Like this
  1. A native cross-platform mobile app has multiple platform-specific implementations. Typically, an app is developed for one platform and then ported to the remaining ones. Translating an app from one language (e.g., Java) to another (e.g., Swift) by hand is tedious and error-prone, while automated translators either require manually defined translation rules or focus on translating APIs. To automate the translation of native cross-platform apps, we present J2SINFERER, a novel approach that iteratively infers syntactic transformation rules and API mappings from Java to Swift. Given a software corpus in both languages, J2SLNFERER first identifies the syntactically equivalent code based on braces and string similarity. For each pair of similar code segments, J2SLNFERER then creates syntax trees of both languages, leveraging the minimalist domain knowledge of language correspondence (e.g., operators and markers) to iteratively align syntax tree nodes, and to infer both syntax and API mapping rules. J2SLNFERER represents inferred rules as string templates, stored in a database, to translate code from Java to Swift. We evaluated J2SLNFERER with four applications, using one part of the data to infer translation rules, and the other part to apply the rules. With 76% in-project accuracy and 65% cross-project accuracy, J2SLNFERER outperforms in accuracy j2swift,more »a state-of-the-art Java-to-Swift conversion tool. As native cross-platform mobile apps grow in popularity, J2SLNFERER can shorten their time to market by automating the tedious and error prone task of source-to-source translation.« less
  2. We target the problem of automatically synthesizing proofs of semantic equivalence between two programs made of sequences of statements. We represent programs using abstract syntax trees (AST), where a given set of semantics-preserving rewrite rules can be applied on a specific AST pattern to generate a transformed and semantically equivalent program. In our system, two programs are equivalent if there exists a sequence of application of these rewrite rules that leads to rewriting one program into the other. We propose a neural network architecture based on a transformer model to generate proofs of equivalence between program pairs. The system outputs a sequence of rewrites, and the validity of the sequence is simply checked by verifying it can be applied. If no valid sequence is produced by the neural network, the system reports the programs as non-equivalent, ensuring by design no programs may be incorrectly reported as equivalent. Our system is fully implemented for a given grammar. To efficiently train the system to generate such sequences, we develop an original incremental training technique, named self-supervised sample selection. We extensively study the effectiveness of this novel training approach on proofs of increasing complexity and length. Our system, S4Eq, achieves 97% proof successmore »on a curated dataset of 10,000 pairs of equivalent programs.« less
  3. “Episodic Logic: Unscoped Logical Form” (EL-ULF) is a semantic representation capturing predicate-argument structure as well as more challenging aspects of language within the Episodic Logic formalism. We present the first learned approach for parsing sentences into ULFs, using a growing set of annotated examples. The results provide a strong baseline for future improvement. Our method learns a sequence-to-sequence model for predicting the transition action sequence within a modified cache transition system. We evaluate the efficacy of type grammar-based constraints, a word-to-symbol lexicon, and transition system state features in this task. Our system is availableat ulf-transition-parser. We also present the first official annotated ULF dataset at gkim21/ulf/resources/.
  4. Static analysis is a proven technique for catching bugs during software development. However, analysis tooling must approximate, both theoretically and in the interest of practicality. False positives are a pervading manifestation of such approximations—tool configuration and customization is therefore crucial for usability and directing analysis behavior. To suppress false positives, developers readily disable bug checks or insert comments that suppress spurious bug reports. Existing work shows that these mechanisms fall short of developer needs and present a significant pain point for using or adopting analyses. We draw on the insight that an analysis user always has one notable ability to influence analysis behavior regardless of analyzer options and implementation: modifying their program. We present a new technique for automated, generic, and temporary code changes that tailor to suppress spurious analysis errors. We adopt a rule-based approach where simple, declarative templates describe general syntactic changes for code patterns that are known to be problematic for the analyzer. Our technique promotes program transformation as a general primitive for improving the fidelity of analysis reports (we treat any given analyzer as a black box). We evaluate using five different static analyzers supporting three different languages (C, Java, and PHP) on large, real worldmore »programs (up to 800KLOC). We show that our approach is effective in sidestepping long-standing and complex issues in analysis implementations.« less
  5. Fuzz testing has been gaining ground recently with substantial efforts devoted to the area. Typically, fuzzers take a set of seed inputs and leverage random mutations to continually improve the inputs with respect to a cost, e.g. program code coverage, to discover vulnerabilities or bugs. Following this methodology, fuzzers are very good at generating unstructured inputs that achieve high coverage. However fuzzers are less effective when the inputs are structured, say they conform to an input grammar. Due to the nature of random mutations, the overwhelming abundance of inputs generated by this common fuzzing practice often adversely hinders the effectiveness and efficiency of fuzzers on grammar-aware applications. The problem of testing becomes even harder, when the goal is not only to achieve increased code coverage, but also to nd complex vulnerabilities related to other cost measures, say high resource consumption in an application. We propose Saffron an adaptive grammar-based fuzzing approach to effectively and efficiently generate inputs that expose expensive executions in programs. Saffron takes as input a user-provided grammar, which describes the input space of the program under analysis, and uses it to generate test inputs. Saffron assumes that the grammar description is approximate since precisely describing the inputmore »program space is often difficult as a program may accept unintended inputs due to e.g., errors in parsing. Yet these inputs may reveal worst-case complexity vulnerabilities. The novelty of Saffron is then twofold: (1) Given the user-provided grammar, Saffron attempts to discover whether the program accepts unexpected inputs outside of the provided grammar, and if so, it repairs the grammar via grammar mutations. The repaired grammar serves as a specification of the actual inputs accepted by the application. (2) Based on the refined grammar, it generates concrete test inputs. It starts by treating every production rule in the grammar with equal probability of being used for generating concrete inputs. It then adaptively refines the probabilities along the way by increasing the probabilities for rules that have been used to generate inputs that improve a cost, e.g., code coverage or arbitrary user-defined cost. Evaluation results show that Saffron significantly outperforms state-of-the-art baselines.« less