skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Quantitative Symbolic Similarity Analysis
Similarity analysis plays a crucial role in various software engineering tasks, such as detecting software changes, version merging, identifying plagiarism, and analyzing binary code. Equivalence analysis, a stricter form of similarity, focuses on determining whether different programs or versions of the same program behave identically. While extensive research exists on code and binary similarity as well as equivalence analysis, there is a lack of quantitative reasoning in these areas. Non-equivalence is a spectrum that requires deeper exploration, as it can manifest in different ways across the input domain space. This paper emphasizes the importance of quantitative reasoning on non-equivalence which arises due to semantic differences. By quantitatively reasoning about non-equivalence, it becomes possible to identify specific input ranges for which programs are equivalent or non-equivalent. We aim to address the gap in quantitative reasoning in symbolic similarity analysis, enabling a more comprehensive understanding of program behavior.  more » « less
Award ID(s):
2008660
PAR ID:
10539041
Author(s) / Creator(s):
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400702211
Page Range / eLocation ID:
1549 to 1551
Format(s):
Medium: X
Location:
Seattle WA USA
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper we generalise the notion of extensional (functional) equivalence of programs to abstract equivalences induced by abstract interpretations. The standard notion of extensional equivalence is recovered as the special case, induced by the concrete interpretation. Some properties of the extensional equivalence, such as the one spelled out in Rice’s theorem, lift to the abstract equivalences in suitably generalised forms. On the other hand, the generalised framework gives rise to interesting and important new properties, and allows refined, non-extensional analyses. In particular, since programs turn out to be extensionally equivalent if and only if they are equivalent just for the concrete interpretation, it follows that any non-trivial abstract interpretation uncovers some intensional aspect of programs. This striking result is also effective, in the sense that it allows constructing, for any non-trivial abstraction, a pair of programs that are extensionally equivalent, but have different abstract semantics. The construction is based on the fact that abstract interpretations are always sound, but that they can be made incomplete through suitable code ransformations. To construct these transformations, we introduce a novel technique for building incompleteness cliques of extensionally equivalent yet abstractly distinguishable programs: They are built together with abstract interpretations that produce false alarms. While programs are forced into incompleteness cliques using both control-flow and data-flow transformations, the main result follows from limitations of data-flow ransformations with respect to control-flow ones. A further consequence is that the class of incomplete programs for a non-trivial abstraction is Turing complete. The obtained results also shed a new light on the relation between the techniques of code obfuscation and the precision in program analysis. 
    more » « less
  2. Commutativity of program code (the equivalence of two code fragments composed in alternate orders) is of ongoing interest in many settings such as program verification, scalable concurrency, and security analysis. While some recent works have explored static analysis for code commutativity, few have specifically catered to heap-manipulating programs. We introduce an abstract domain in which commutativity synthesis or verification techniques can safely be performed on abstract mathematical models and, from those results, one can directly obtain commutativity conditions for concrete heap programs. This approach offloads challenges of concrete heap reasoning into the simpler abstract space. We show this reasoning supports framing and composition, and conclude with commutativity analysis of programs operating on example heap data structures. Our work has been mechanized in Coq and is available in the supplement. 
    more » « less
  3. Abstract Binary analysis, the process of examining software without its source code, plays a crucial role in understanding program behavior, e.g., evaluating the security properties of commercial software, and analyzing malware. One challenging aspect of this process is to classify data encoding schemes, such as encryption and compression, due to the absence of high-level semantic information. Existing approaches either rely on code similarity, which only works for known schemes, or heuristic rules, which lack scalability. In this paper, we propose DESCG, a novel deep learning-based method for automatically classifying four widely employed kinds of data encoding schemes in binary programs: encryption, compression, decompression, and hashing. Our approach leverages dynamic analysis to extract execution traces from binary programs, builds data dependency graphs from these traces, and incorporates critical feature engineering. By combining the specialized graph representation with the Graph Neural Network (GNN), our approach enables accurate classification without requiring prior knowledge of specific encoding schemes. The Evaluation result shows that DESCG achieves 97.7% accuracy and an F1 score of 97.67%, outperforming baseline models. We also conducted an extensive evaluation of DESCG to explore which feature is more important for it and examine its performance and overhead. 
    more » « less
  4. We present a new domain-agnostic synthesis technique for generating programs from input-output examples. Our method, called metric program synthesis, relaxes the well-known observational equivalence idea (used widely in bottom-up enumerative synthesis) into a weaker notion of observational similarity, with the goal of reducing the search space that the synthesizer needs to explore. Our method clusters programs into equivalence classes based on a distance metric and constructs a version space that compactly represents ""approximately correct"" programs. Then, given a ""close enough"" program sampled from this version space, our approach uses a distance-guided repair algorithm to find a program that exactly matches the given input-output examples. We have implemented our proposed metric program synthesis technique in a tool called SyMetric and evaluate it in three different domains considered in prior work. Our evaluation shows that SyMetric outperforms other domain-agnostic synthesizers that use observational equivalence and that it achieves results competitive with domain-specific synthesizers that are either designed for or trained on those domains. 
    more » « less
  5. We present a new general-purpose synthesis technique for generating programs from input-output examples. Our method, called metric program synthesis, relaxes the observational equivalence idea (used widely in bottom-up enumerative synthesis) into a weaker notion of observational similarity, with the goal of reducing the search space that the synthesizer needs to explore. Our method clusters programs into equivalence classes based on an expert-provided distance metric and constructs a version space that compactly represents “approximately correct” programs. Then, given a “close enough” program sampled from this version space, our approach uses a distance-guided repair algorithm to find a program that exactly matches the given input-output examples. We have implemented our proposed metric program synthesis technique in a tool called SyMetric and evaluate it in three different domains considered in prior work. Our evaluation shows that SyMetric outperforms other domain-agnostic synthesizers that use observational equivalence and that it achieves results competitive with domain-specific synthesizers that are either designed for or trained on those domains. 
    more » « less