MalwareHunt: Semantics-Based Malware Diffing Speedup by Normalized Basic Block Memoization

Ming, Jiang; Xu, Dongpeng; Wu, Dinghao

The key challenge of software reverse engineering is that the source code of the program under investigation is typically not available. Identifying differences between two executable binaries (binary diffing) can reveal valuable information in the absence of source code, such as vulnerability patches, software plagiarism evidence, and malware variant relations. Recently, a new binary diffing method based on symbolic execution and constraint solving has been proposed to look for code pairs with the same semantics, even though they are ostensibly different in syntax. Such a semantics-based method captures intrinsic differences and similarities of binary code, making it a compelling choice for analyzing highly obfuscated malicious programs. However, due to the nature of symbolic execution, semantics-based binary diffing suffers from significant performance slowdown, which hinders it from analyzing large numbers of malware samples. In this paper, we attempt to mitigate the high overhead of semantics-based binary diffing with application to malware lineage inference. We first study the key obstacles that contribute to the performance bottleneck. Then we propose normalized basic block memoization to speed up semantics-based binary diffing. We introduce a union-find set structure that records semantically equivalent basic blocks. Managing the union-find structure during successive comparisons allows direct reuse of previously computed results. Moreover, we utilize a set of enhanced optimization methods to further cut down the number of constraint solver invocations. We have implemented our technique, called MalwareHunt, on top of a trace-oriented binary diffing tool and evaluated it on 15 polymorphic and metamorphic malware families. We perform intra-family comparisons for the purpose of malware lineage inference.
Our experimental results show that MalwareHunt can accelerate symbolic execution by 2.8X to 5.3X (4.1X on average) and reduce constraint solver invocations by a factor of 3.0X to 6.0X (4.5X on average).
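
The memoization idea described in the abstract can be illustrated with a minimal sketch. It is not the MalwareHunt implementation: the register-renaming normalization, the class name `BlockEquivalence`, and the `solver_check` callback (a stand-in for the expensive symbolic execution and constraint solving step) are all assumptions made for illustration. The sketch only shows how a union-find structure keyed by normalized basic blocks lets a later comparison reuse an earlier equivalence result instead of invoking the solver again.

```python
import re

# Hypothetical normalization: rename x86 general-purpose registers to
# placeholders in order of first appearance, so that blocks differing only
# in register choice map to the same key. The paper's actual normalization
# scheme may differ.
REG = re.compile(r"\b(e?[abcd]x|e?[sd]i|e?[sb]p)\b")


def normalize(block):
    names = {}

    def rename(match):
        return names.setdefault(match.group(0), f"r{len(names)}")

    return tuple(REG.sub(rename, ins.lower().strip()) for ins in block)


class BlockEquivalence:
    """Union-find over normalized basic blocks (illustrative sketch)."""

    def __init__(self, solver_check):
        self.parent = {}
        self.solver_check = solver_check  # assumed expensive equivalence test
        self.solver_calls = 0

    def find(self, key):
        self.parent.setdefault(key, key)
        root = key
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[key] != root:
            self.parent[key], key = root, self.parent[key]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def equivalent(self, block_a, block_b):
        a, b = normalize(block_a), normalize(block_b)
        if self.find(a) == self.find(b):
            return True  # memo hit: no solver invocation needed
        self.solver_calls += 1
        if self.solver_check(block_a, block_b):
            self.union(a, b)
            return True
        return False
```

In this sketch only positive results are memoized; once blocks A and B, and then B and C, have been found equivalent, a query for A and C is answered from the union-find structure by transitivity, without calling the solver.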
