skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: When Malware is Packin' Heat; Limits of Machine Learning Classifiers Based on Static Analysis Features
Machine learning techniques are widely used in addition to signatures and heuristics to increase the detection rate of anti-malware software, as they automate the creation of detection models, making it possible to handle an ever-increasing number of new malware samples. In order to foil the analysis of anti-malware systems and evade detection, malware uses packing and other forms of obfuscation. However, few realize that benign applications use packing and obfuscation as well, to protect intellectual property and prevent license abuse. In this paper, we study how machine learning based on static analysis features operates on packed samples. Malware researchers have often assumed that packing would prevent machine learning techniques from building effective classifiers. However, both industry and academia have published results that show that machine-learning-based classifiers can achieve good detection rates, leading many experts to think that classifiers are simply detecting the fact that a sample is packed, as packing is more prevalent in malicious samples. We show that, different from what is commonly assumed, packers do preserve some information when packing programs that is “useful” for malware classification. However, this information does not necessarily capture the sample’s behavior. We demonstrate that the signals extracted from packed executables are not rich enough for machine-learning-based models to (1) generalize their knowledge to operate on unseen packers, and (2) be robust against adversarial examples. We also show that a na¨ıve application of machine learning techniques results in a substantial number of false positives, which, in turn, might have resulted  more » « less
Award ID(s):
1704253
PAR ID:
10155112
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
Network and Distributed Systems Security (NDSS) Symposium 2020
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Since malware has caused serious damages and evolving threats to computer and Internet users, its detection is of great interest to both anti-malware industry and researchers. In recent years, machine learning-based systems have been successfully deployed in malware detection, in which different kinds of classifiers are built based on the training samples using different feature representations. Unfortunately, as classifiers become more widely deployed, the incentive for defeating them increases. In this paper, we explore the adversarial machine learning in malware detection. In particular, on the basis of a learning-based classifier with the input of Windows Application Programming Interface (API) calls extracted from the Portable Executable (PE) files, we present an effective evasion attack model (named EvnAttack) by considering different contributions of the features to the classification problem. To be resilient against the evasion attack, we further propose a secure-learning paradigm for malware detection (named SecDefender), which not only adopts classifier retraining technique but also introduces the security regularization term which considers the evasion cost of feature manipulations by attackers to enhance the system security. Comprehensive experimental results on the real sample collections from Comodo Cloud Security Center demonstrate the effectiveness of our proposed methods. 
    more » « less
  2. null (Ed.)
    Malicious software, popularly known as malware, is a serious threat to modern computing systems. A comprehensive cybercrime study by Ponemon Institute highlights that malware is the most expensive attack for organizations, with an average revenue loss of $2.6 million per organization in 2018 (11% increase compared to 2017). Recent high-profile malware attacks coupled with serious economic implications have dramatically changed our perception of threat from malware. Software-based solutions, such as anti-virus programs, are not effective since they rely on matching patterns (signatures) that can be easily fooled by carefully crafted malware with obfuscation or other deviation capabilities. Moreover, software-based solutions are not fast enough for real-time malware detection in safety-critical systems. In this paper, we investigate promising approaches for hardware-assisted malware detection using machine learning. Specifically, we explore how machine learning can be effective for malware detection utilizing hardware performance counters, embedded trace buffer as well as on-chip network traffic analysis. 
    more » « less
  3. Malware written in dynamic languages such as PHP routinely employ anti-analysis techniques such as obfuscation schemes and evasive tricks to avoid detection. On top of that, attackers use automated malware creation tools to create numerous variants with little to no manual effort. This paper presents a system called Cubismo to solve this pressing problem. It processes potentially malicious files and decloaks their obfuscations, exposing the hidden malicious code into multiple files. The resulting files can be scanned by existing malware detection tools, leading to a much higher chance of detection. Cubismo achieves improved detection by exploring all executable statements of a suspect program counterfactually to see through complicated polymorphism, metamorphism and, obfuscation techniques and expose any malware. Our evaluation on a real-world data set collected from a commercial web hosting company shows that Cubismo is highly effective in dissecting sophisticated metamorphic malware with multiple layers of obfuscation. In particular, it enables VirusTotal to detect 53 out of 56 zero-day malware samples in the wild, which were previously undetectable. 
    more » « less
  4. As the underground market of malware flourishes, there is an exponential increase in the number and diversity of malware. A crucial question in malware analysis research is how to define malware specifications or signatures that faithfully describe similar malicious intent and also clearly stand out from other programs. Although the traditional malware specifications based on syntactic signatures are efficient, they can be easily defeated by various obfuscation techniques. Since the malicious behavior is often stable across similar malware instances, behavior-based specifications which capture real malicious characteristics during run time, have become more prevalent in anti-malware tasks, such as malware detection and malware clustering. This kind of specification is typically extracted from the system call dependence graph that a malware sample invokes. In this paper, we present replacement attacks to cam- ouflage similar behaviors by poisoning behavior-based specifications. The key method of our attacks is to replace a system call dependence graph to its semantically equivalent variants so that the similar malware samples within one family turn out to be different. As a result, malware analysts have to put more efforts into reexamining the similar samples which may have been investigated before. We distil general attacking strategies by mining more than 5, 200 malware samples’ behavior specifications and implement a compiler-level prototype to automate replacement attacks. Experiments on 960 real malware samples demonstrate the effectiveness of our approach to impede various behavior-based mal- ware analysis tasks, such as similarity comparison and malware clustering. In the end, we also discuss possible countermeasures in order to strengthen existing mal- ware defense. 
    more » « less
  5. null (Ed.)
    Malicious software, popularly known as malware, is widely acknowledged as a serious threat to modern computing systems. Software-based solutions, such as anti-virus software, are not effective since they rely on matching patterns that can be easily fooled by carefully crafted malware with obfuscation or other deviation capabilities. While recent malware detection methods provide promising results through effective utilization of hardware features, the detection results cannot be interpreted in a meaningful way. In this paper, we propose a hardware-assisted malware detection framework using explainable machine learning. This paper makes three important contributions. First, we theoretically establish that our proposed method can provide interpretable explanation of classification results to address the challenge of transparency. Next, we show that the explainable outcome can lead to accurate localization of malicious behaviors. Finally, experimental evaluation using a wide variety of realworld malware benchmarks demonstrates that our framework can produce accurate and human-understandable malware detection results with provable guarantees. 
    more » « less