skip to main content


Title: MalwareHunt: Semantics-Based Malware Diffing Speedup by Normalized Basic Block Memoization
The key challenge of software reverse engi- neering is that the source code of the program under in- vestigation is typically not available. Identifying differ- ences between two executable binaries (binary diffing) can reveal valuable information in the absence of source code, such as vulnerability patches, software plagiarism evidence, and malware variant relations. Recently, a new binary diffing method based on symbolic execution and constraint solving has been proposed to look for the code pairs with the same semantics, even though they are ostensibly different in syntactics. Such semantics- based method captures intrinsic differences/similarities of binary code, making it a compelling choice to analyze highly-obfuscated malicious programs. However, due to the nature of symbolic execution, semantics-based bi- nary diffing suffers from significant performance slow- down, hindering it from analyzing large numbers of malware samples. In this paper, we attempt to miti- gate the high overhead of semantics-based binary diff- ing with application to malware lineage inference. We first study the key obstacles that contribute to the performance bottleneck. Then we propose normalized basic block memoization to speed up semantics-based binary diffing. We introduce an unionfind set structure that records semantically equivalent basic blocks. Managing the union-find structure during successive comparisons allows direct reuse of previously computed results. Moreover, we utilize a set of enhanced optimization methods to further cut down the invocation numbers of constraint solver. We have implemented our tech- nique, called MalwareHunt, on top of a trace-oriented binary diffing tool and evaluated it on 15 polymorphic and metamorphic malware families. We perform intra- family comparisons for the purpose of malware lineage inference. Our experimental results show that Malware- Huntcan accelerate symbolic execution from 2.8X to 5.3X (with an average 4.1X), and reduce constraint solver invocation by a factor of 3.0X to 6.0X (with an average 4.5X).  more » « less
Award ID(s):
1320605
NSF-PAR ID:
10066920
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Journal of computer virology and hacking techniques
ISSN:
2263-8733
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    We present LLSC, a prototype compiler for nondeterministic par- allel symbolic execution of the LLVM intermediate representation (IR). Given an LLVM IR program, LLSC generates code preserving the symbolic execution semantics and orchestrating solver invo- cations. The generated code runs efficiently, since the code has eliminated the interpretation overhead and explores multiple paths in parallel. To the best of our knowledge, LLSC is the first compiler for fork-based symbolic execution semantics that can generate parallel execution code. In this demonstration paper, we present the current development and preliminary evaluation of LLSC. The principle behind LLSC is to automatically specialize a symbolic interpreter via the 1st Futamura projection, a fundamental connection between in- terpreters and compilers. The symbolic interpreter is written in an expressive high-level language equipped with a multi-stage programming facility. We demonstrate the run time performance through a set of benchmark programs, showing that LLSC outperforms interpretation-based symbolic execution engines in significant ways. 
    more » « less
  2. Cryptographic functions have been commonly abused by malware developers to hide malicious behaviors, disguise destructive payloads, and bypass network-based fire- walls. Now-infamous crypto-ransomware even encrypts victim’s computer documents until a ransom is paid. Therefore, de- tecting cryptographic functions in binary code is an appealing approach to complement existing malware defense and forensics. However, pervasive control and data obfuscation schemes make cryptographic function identification a challenging work. Existing detection methods are either brittle to work on obfuscated binaries or ad hoc in that they can only identify specific cryp- tographic functions. In this paper, we propose a novel technique called bit-precise symbolic loop mapping to identify cryptographic functions in obfuscated binary code. Our trace-based approach captures the semantics of possible cryptographic algorithms with bit-precise symbolic execution in a loop. Then we perform guided fuzzing to efficiently match boolean formulas with known reference implementations. We have developed a prototype called CryptoHunt and evaluated it with a set of obfuscated synthetic examples, well-known cryptographic libraries, and malware. Compared with the existing tools, CryptoHunt is a general approach to detecting commonly used cryptographic functions such as TEA, AES, RC4, MD5, and RSA under different control and data obfuscation scheme combinations. 
    more » « less
  3. In database-as-a-service platforms, automated ver-ification of query equivalence helps eliminate redundant computation in the form of overlapping sub-queries. Researchers have proposed two pragmatic techniques to tackle this problem. The first approach consists of reducing the queries to algebraic expressions and proving their equivalence using an algebraic theory. The limitations of this technique are threefold. It cannot prove the equivalence of queries with significant differences in the attributes of their relational operators (e.g., predicates in the filter operator). It does not support certain widely-used SQL features (e.g., NULL values). Its verification procedure is computationally intensive. The second approach transforms this problem to a constraint satisfaction problem and leverages a general-purpose solver to determine query equivalence. This technique consists of deriving the symbolic representation of the queries and proving their equivalence by determining the query containment relationship between the symbolic expressions. While the latter approach addresses all the limitations of the former technique, it only proves the equivalence of queries under set semantics (i.e., output tables must not contain duplicate tuples). However, in practice, database applications use bag semantics (i.e., output tables may contain duplicate tuples) In this paper, we introduce a novel symbolic approach for proving query equivalence under bag semantics. We transform the problem of proving query equivalence under bag semantics to that of proving the existence of a bijective, identity map between tuples returned by the queries on all valid inputs. We classify SQL queries into four categories, and propose a set of novel category-specific verification algorithms. We implement this symbolic approach in SPES and demonstrate that it proves the equivalence of a larger set of query pairs (95/232) under bag semantics compared to the SOTA tools based on algebraic (30/232) and symbolic approaches (67/232) under set and bag semantics, respectively. Furthermore, SPES is 3X faster than the symbolic tool that proves equivalence under set semantics. 
    more » « less
  4. Effective symbolic evaluation is key to building scalable ver- ification and synthesis tools based on SMT solving. These tools use sym- bolic evaluators to reduce the semantics of all paths through a finite program to logical constraints, discharged with an SMT solver. Using an evaluator effectively requires tool developers to be able to identify and re- pair performance bottlenecks in code under all-path evaluation, a difficult task, even for experts. This paper presents a new method for repairing such bottlenecks automatically. The key idea is to formulate the symbolic performance repair problem as combinatorial search through a space of semantics-preserving transformations, or repairs, to find an equivalent program with minimal cost under symbolic evaluation. The key to real- izing this idea is (1) defining a small set of generic repairs that can be combined to fix common bottlenecks, and (2) searching for combinations of these repairs to find good solutions quickly and best ones eventually. Our technique, SymFix, contributes repairs based on deforestation and symbolic reflection, and an efficient algorithm that uses symbolic profil- ing to guide the search for fixes. To evaluate SymFix, we implement it for the Rosette solver-aided language and symbolic evaluator. Applying SymFix to 18 published verification and synthesis tools built in Rosette, we find that it automatically improves the performance of 12 tools by a factor of 1.1×–91.7×, and 4 of these fixes match or outperform expert- written repairs. SymFix also finds 5 fixes that were missed by experts. 
    more » « less
  5. Quantitative program analysis is an emerging area with applications to software reliability, quantitative information flow, side-channel detection and attack synthesis. Most quantitative program analysis techniques rely on model counting constraint solvers, which are typically the bottleneck for scalability. Although the effectiveness of formula caching in expediting expensive model-counting queries has been demonstrated in prior work, our key insight is that many subformulas are shared across non-identical constraints generated during program analyses. This has not been utilized by prior formula caching approaches. In this paper we present a subformula caching framework and integrate it into a model counting constraint solver. We experimentally evaluate its effectiveness under three quantitative program analysis scenarios: 1) model counting constraints generated by symbolic execution, 2) reliability analysis using probabilistic symbolic execution, 3) adaptive attack synthesis for side-channels. Our experimental results demonstrate that our subformula caching approach significantly improves the performance of quantitative program analysis. 
    more » « less