Lightweight syntactic analysis tools like Semgrep and Comby leverage the tree structure of code, making them more expressive than string and regex search. Unlike traditional language frameworks (e.g., ESLint) that analyze codebases via explicit syntax tree manipulations, these tools use query languages that closely resemble the source language. However, state-of-the-art matching techniques for these tools require queries to be complete and parsable snippets, which makes in-progress query specifications useless. We propose a new search architecture that relies only on tokenizing (not parsing) a query. We introduce a novel language and matching algorithm to support tree-aware wildcards on this architecture by building on tree automata. We also present stsearch, a syntactic search tool leveraging our approach. In contrast to past work, our approach supports syntactic search even for previously unparsable queries. We show empirically that stsearch can support all tokenizable queries, while still providing results comparable to Semgrep for existing queries. Our work offers evidence that lightweight syntactic code search can accept in-progress specifications, potentially improving support for interactive settings.
more »
« less
Lightweight multi-language syntax transformation with parser parser combinators
Automatically transforming programs is hard, yet critical for automated program refactoring, rewriting, and repair. Multi-language syntax transformation is especially hard due to heterogeneous representations in syntax, parse trees, and abstract syntax trees (ASTs). Our insight is that the problem can be decomposed such that (1) a common grammar expresses the central context-free language (CFL) properties shared by many contemporary languages and (2) open extension points in the grammar allow customizing syntax (e.g., for balanced delimiters) and hooks in smaller parsers to handle language-specific syntax (e.g., for comments). Our key contribution operationalizes this decomposition using a Parser Parser combinator (PPC), a mechanism that generates parsers for matching syntactic fragments in source code by parsing declarative user-supplied templates. This allows our approach to detach from translating input programs to any particular abstract syntax tree representation, and lifts syntax rewriting to a modularly-defined parsing problem. A notable effect is that we skirt the complexity and burden of defining additional translation layers between concrete user input templates and an underlying abstract syntax representation. We demonstrate that these ideas admit efficient and declarative rewrite templates across 12 languages, and validate effectiveness of our approach by producing correct and desirable lightweight transformations on popular real-world projects (over 50 syntactic changes produced by our approach have been merged into 40+). Our declarative rewrite patterns require an order of magnitude less code compared to analog implementations in existing, language-specific tools.
more »
« less
- Award ID(s):
- 1750116
- PAR ID:
- 10135833
- Date Published:
- Journal Name:
- Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation
- Page Range / eLocation ID:
- 363 to 378
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We target the problem of automatically synthesizing proofs of semantic equivalence between two programs made of sequences of statements. We represent programs using abstract syntax trees (AST), where a given set of semantics-preserving rewrite rules can be applied on a specific AST pattern to generate a transformed and semantically equivalent program. In our system, two programs are equivalent if there exists a sequence of application of these rewrite rules that leads to rewriting one program into the other. We propose a neural network architecture based on a transformer model to generate proofs of equivalence between program pairs. The system outputs a sequence of rewrites, and the validity of the sequence is simply checked by verifying it can be applied. If no valid sequence is produced by the neural network, the system reports the programs as non-equivalent, ensuring by design no programs may be incorrectly reported as equivalent. Our system is fully implemented for a given grammar. To efficiently train the system to generate such sequences, we develop an original incremental training technique, named self-supervised sample selection. We extensively study the effectiveness of this novel training approach on proofs of increasing complexity and length. Our system, S4Eq, achieves 97% proof success on a curated dataset of 10,000 pairs of equivalent programs.more » « less
-
null (Ed.)“Episodic Logic: Unscoped Logical Form” (EL-ULF) is a semantic representation capturing predicate-argument structure as well as more challenging aspects of language within the Episodic Logic formalism. We present the first learned approach for parsing sentences into ULFs, using a growing set of annotated examples. The results provide a strong baseline for future improvement. Our method learns a sequence-to-sequence model for predicting the transition action sequence within a modified cache transition system. We evaluate the efficacy of type grammar-based constraints, a word-to-symbol lexicon, and transition system state features in this task. Our system is availableat https://github.com/genelkim/ ulf-transition-parser. We also present the first official annotated ULF dataset at https://www.cs.rochester.edu/u/ gkim21/ulf/resources/.more » « less
-
A native cross-platform mobile app has multiple platform-specific implementations. Typically, an app is developed for one platform and then ported to the remaining ones. Translating an app from one language (e.g., Java) to another (e.g., Swift) by hand is tedious and error-prone, while automated translators either require manually defined translation rules or focus on translating APIs. To automate the translation of native cross-platform apps, we present J2SINFERER, a novel approach that iteratively infers syntactic transformation rules and API mappings from Java to Swift. Given a software corpus in both languages, J2SLNFERER first identifies the syntactically equivalent code based on braces and string similarity. For each pair of similar code segments, J2SLNFERER then creates syntax trees of both languages, leveraging the minimalist domain knowledge of language correspondence (e.g., operators and markers) to iteratively align syntax tree nodes, and to infer both syntax and API mapping rules. J2SLNFERER represents inferred rules as string templates, stored in a database, to translate code from Java to Swift. We evaluated J2SLNFERER with four applications, using one part of the data to infer translation rules, and the other part to apply the rules. With 76% in-project accuracy and 65% cross-project accuracy, J2SLNFERER outperforms in accuracy j2swift, a state-of-the-art Java-to-Swift conversion tool. As native cross-platform mobile apps grow in popularity, J2SLNFERER can shorten their time to market by automating the tedious and error prone task of source-to-source translation.more » « less
-
We introduce a graph polynomial that distinguishes tree structures to represent dependency grammar and a measure based on the polynomial representation to quantify syntax similarity. The polynomial encodes accurate and comprehensive information about the dependency structure and dependency relations of words in a sentence, which enables in-depth analysis of dependency trees with data analysis tools. We apply the polynomial-based methods to analyze sentences in the ParallelUniversal Dependencies treebanks. Specifically, we compare the syntax of sentences and their translations in different languages, and we perform a syntactic typology study of available languages in the Parallel Universal Dependencies treebanks. We also demonstrate and discuss the potential of the methods in measuring syntax diversity of corpora.more » « less