

Title: Out-of-sample Node Representation Learning for Heterogeneous Graph in Real-time Android Malware Detection
Increasingly sophisticated Android malware calls for new defensive techniques capable of protecting mobile users against novel threats. In this paper, we first extract runtime Application Programming Interface (API) call sequences from Android apps, and then analyze higher-level semantic relations within the ecosystem to comprehensively characterize the apps. To model the different types of entities (i.e., app, API, device, signature, affiliation) and the rich relations among them, we present a structured heterogeneous graph (HG). To efficiently classify nodes (e.g., apps) in the constructed HG, we propose the HG-Learning method, which first obtains in-sample node embeddings and then learns representations of out-of-sample nodes without rerunning or adjusting the HG embeddings at the first attempt. We then design a deep neural network classifier that takes the learned HG representations as inputs for real-time Android malware detection. Comprehensive experiments on large-scale, real sample collections from Tencent Security Lab are performed to compare against various baselines. Promising results demonstrate that our developed system AiDroid, which integrates our proposed method, outperforms the others in real-time Android malware detection.
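To make the out-of-sample idea concrete, here is a minimal sketch, not the HG-Learning method itself (the abstract does not give its details). It assumes a simple stand-in: a new app node gets a representation by averaging the existing embeddings of the in-sample nodes it connects to, so the stored embeddings are never recomputed. The node names, the embedding values, and the function `out_of_sample_embedding` are all hypothetical.

```python
from typing import Dict, List, Sequence


def out_of_sample_embedding(
    neighbors: Sequence[str],
    in_sample: Dict[str, List[float]],
) -> List[float]:
    """Approximate a new node's embedding as the mean of its known neighbors.

    This is an illustrative neighbor-mean aggregation, assumed here in place
    of the paper's actual method. The in-sample embeddings are only read.
    """
    # Keep only neighbors that already have an in-sample embedding.
    known = [in_sample[n] for n in neighbors if n in in_sample]
    if not known:
        raise ValueError("new node has no in-sample neighbors")
    dim = len(known[0])
    # Coordinate-wise mean over the known neighbor embeddings.
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical in-sample embeddings for an API node and a device node.
in_sample_embeddings = {
    "api:sendTextMessage": [1.0, 0.0],
    "device:d1": [0.0, 1.0],
}

# A newly observed app that calls the API and runs on the device.
new_app_vector = out_of_sample_embedding(
    ["api:sendTextMessage", "device:d1", "signature:unseen"],
    in_sample_embeddings,
)
```

The resulting vector could then be passed to a downstream classifier; the unseen signature node is skipped because it has no stored embedding.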
Award ID(s):
1940859 1951504
NSF-PAR ID:
10135597
Journal Name:
28th International Joint Conference on Artificial Intelligence (IJCAI)
Page Range / eLocation ID:
4150 to 4156
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Over the last decade, userland memory forensics techniques and algorithms have gained popularity among practitioners, as they have proven to be useful in real forensics and cybercrime investigations. These techniques analyze and recover objects and artifacts from process memory space that are of critical importance in investigations. Nonetheless, the major drawback of existing techniques is that they cannot determine the origin and context within which the recovered object exists without prior knowledge of the application logic. Thus, in this research, we present a solution to close the gap between application-specific and application-generic techniques. We introduce OAGen, a post-execution and app-agnostic semantic analysis approach designed to help investigators establish concrete evidence by identifying the provenance and relationships between in-memory objects in a process memory image. OAGen utilizes Points-to analysis to reconstruct a runtime’s object allocation network. The resulting graph is then fed as an input into our semantic analysis algorithms to determine objects’ origin, context, and scope in the network. The results of our experiments exhibit OAGen’s ability to effectively create an allocation network even for memory-intensive applications with thousands of objects, like Facebook. The performance evaluation of our approach across fourteen different Android apps shows OAGen can efficiently search and decode nodes, and identify their references with a modest throughput rate. Further practical application of OAGen demonstrated in two case studies shows that our approach can aid investigators in the recovery of deleted messages and the detection of malware functionality in post-execution program analysis. 
  2. Since malware has caused serious damage and poses evolving threats to computer and Internet users, its detection is of great interest to both the anti-malware industry and researchers. In recent years, machine learning-based systems have been successfully deployed in malware detection, in which different kinds of classifiers are built from training samples using different feature representations. Unfortunately, as classifiers become more widely deployed, the incentive for defeating them increases. In this paper, we explore adversarial machine learning in malware detection. In particular, on the basis of a learning-based classifier that takes as input the Windows Application Programming Interface (API) calls extracted from Portable Executable (PE) files, we present an effective evasion attack model (named EvnAttack) that considers the different contributions of the features to the classification problem. To be resilient against the evasion attack, we further propose a secure-learning paradigm for malware detection (named SecDefender), which not only adopts a classifier retraining technique but also introduces a security regularization term that considers the evasion cost of feature manipulations by attackers to enhance system security. Comprehensive experimental results on real sample collections from Comodo Cloud Security Center demonstrate the effectiveness of our proposed methods.
  3. Network embedding has become the cornerstone of a variety of mining tasks, such as classification, link prediction, clustering, anomaly detection and many more, thanks to its superior ability to encode the intrinsic network characteristics in a compact low-dimensional space. Most of the existing methods focus on a single network and/or a single resolution, and generate embeddings of different network objects (node/subgraph/network) from different networks separately. A fundamental limitation of such methods is that the intrinsic relationships across different networks (e.g., two networks share the same or similar subgraphs) and across different resolutions (e.g., the node-subgraph membership) are ignored, resulting in disparate embeddings. Consequently, this leads to sub-optimal performance, or the methods even become inapplicable for some downstream mining tasks (e.g., role classification, network alignment, etc.). In this paper, we propose a unified framework, MrMine, to learn the representations of objects from multiple networks at three complementary resolutions (i.e., network, subgraph and node) simultaneously. The key idea is to construct the cross-resolution cross-network context for each object. The proposed method bears two distinctive features. First, it enables and/or boosts various multi-network downstream mining tasks by having embeddings at different resolutions from different networks in the same embedding space. Second, our method is efficient and scalable, with an O(n log(n)) time complexity for the base algorithm and a linear time complexity w.r.t. the number of nodes and edges of the input networks for the accelerated version. Extensive experiments on real-world data show that our methods (1) are able to enable and enhance a variety of multi-network mining tasks, and (2) scale up to million-node networks.
  4. Android, the most dominant operating system (OS) for smart devices, has enjoyed immense popularity over the last few years. Due to its popularity and open characteristics, Android is becoming a tempting target for malicious apps, which can pose serious security threats to financial institutions, businesses, and individuals. Traditional anti-malware systems do not suffice to combat newly created, sophisticated malware. Hence, there is an increasing need for automatic malware detection solutions to reduce the risks of malicious activities. In recent years, machine learning algorithms have shown promising results in classifying malware, although most of these methods are shallow learners such as Logistic Regression (LR). In this paper, we propose a deep learning framework, called Droid-NNet, for malware classification. In contrast to these shallow learners, Droid-NNet is a deep learner that outperforms existing cutting-edge machine learning methods. We performed all the experiments on two datasets of Android apps (Malgenome-215 & Drebin-215) to evaluate Droid-NNet. The experimental results show the robustness and effectiveness of Droid-NNet.
  5. The more than 6 billion smartphones available worldwide can enable governments and public health organizations to develop apps to manage global pandemics. However, hackers can take advantage of this opportunity to target the public in nefarious ways through malware disguised as pandemic-related apps. A recent analysis conducted during the COVID-19 pandemic showed that several variants of COVID-19 related malware were installed by the public from non-trusted sources. We propose the use of app permissions and an extra feature (the total number of permissions) to develop a static detector using machine learning (ML) models to enable the fast detection of pandemic-related Android malware at installation time. Using a dataset of more than 2000 COVID-19 related apps and evaluating ML models created using decision trees and Naive Bayes, our results show that pandemic-related malware apps can be detected with an accuracy above 90% using decision tree models with app permissions and the proposed feature.
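The feature design described in item 5 (per-permission indicators plus the total number of permissions) can be sketched as follows. This is only an illustration under stated assumptions: the three-entry vocabulary `PERMISSION_VOCAB` and the function `permission_features` are hypothetical, and the abstract does not say which permissions the authors used or how they encoded them.

```python
from typing import Iterable, List, Sequence

# Hypothetical, deliberately tiny vocabulary of Android permissions.
PERMISSION_VOCAB: List[str] = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
]


def permission_features(
    requested: Iterable[str],
    vocab: Sequence[str] = PERMISSION_VOCAB,
) -> List[int]:
    """Build a static feature vector from an app's requested permissions.

    The vector holds one 0/1 indicator per vocabulary permission, followed by
    the total number of distinct permissions the app requests (the extra
    feature), including permissions outside the vocabulary.
    """
    requested_set = set(requested)
    indicators = [1 if perm in requested_set else 0 for perm in vocab]
    return indicators + [len(requested_set)]


# An app requesting one in-vocabulary and one out-of-vocabulary permission.
example_vector = permission_features(
    ["android.permission.INTERNET", "android.permission.CAMERA"]
)
```

Vectors of this shape are what a decision tree or Naive Bayes model would be trained on. Because only the declared permissions are needed, the features can be computed at installation time without running the app.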