
This content will become publicly available on October 16, 2024

Title: Comparing Classifiers: A Look at Machine-Learning and the Detection of Mobile Malware in COVID-19 Android Mobile Applications
The COVID-19 pandemic was a catalyst for many different trends in daily life worldwide. While there has been an overall rise in cybercrime during this time, there has been relatively little research on malicious COVID-19 themed AndroidOS applications. With the rise in reports of users falling victim to such applications, there is a need to learn about the detection of malware in pandemic-themed mobile apps. In this project, we extracted the permission requests from 1959 APK files from a dataset containing benign and malware COVID-19 themed apps. We then created and compared eight unique models of four classifiers to determine their ability to identify potentially malicious APK files based on the permissions each APK file requests: support vector machine, neural network, decision trees, and categorical naive Bayes. Because the dataset contained fewer malware samples than non-malware samples, we used the Synthetic Minority Oversampling Technique (SMOTE) to balance it when training the classifiers. Finally, we evaluated the models using K-Fold Cross-Validation and found the decision tree classifier to be the best performing classifier.
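The balancing and evaluation steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a pure-Python stand-in for SMOTE-style oversampling and K-fold splitting, and the function names (`smote`, `balance`, `k_fold_indices`), the toy permission vectors, and the convention that label 1 marks malware are all assumptions made for illustration. Oversampling only the training portion of each fold is a common practice assumed here; the abstract does not say where in the pipeline SMOTE was applied.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(
            others, key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m))
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Return copies of X and y with synthetic minority rows appended until
    both classes have the same count. Assumes exactly two classes."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(0, (len(y) - len(minority)) - len(minority))
    if len(minority) < 2:  # interpolation needs at least two minority rows
        n_new = 0
    new_rows = smote(minority, n_new, k, seed) if n_new else []
    return X + new_rows, y + [minority_label] * len(new_rows)


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds over n rows."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


# Toy data (hypothetical): each row is a 0/1 vector of requested permissions.
X = [[1, 0, 1], [1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0],
     [0, 0, 1], [0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 = malware (assumed convention)

for train_idx, test_idx in k_fold_indices(len(X), k=5):
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    # Oversample the training fold only, so no synthetic row reaches the test fold.
    X_bal, y_bal = balance(X_train, y_train)
    print(len(test_idx), y_bal.count(0), y_bal.count(1))
```

Interpolated rows take fractional values between 0 and 1, which matters for classifiers that expect categorical inputs, such as categorical naive Bayes; how the study handled this is not stated in the abstract.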
Journal Name:
MobiHoc '23: Proceedings of the Twenty-fourth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing
Page Range / eLocation ID:
498 to 503
Medium: X
Washington DC USA
Sponsoring Org:
National Science Foundation
More Like this
  1. More than 6 billion smartphones available worldwide can enable governments and public health organizations to develop apps to manage global pandemics. However, hackers can take advantage of this opportunity to target the public in nefarious ways through malware disguised as pandemic-related apps. A recent analysis conducted during the COVID-19 pandemic showed that several variants of COVID-19 related malware were installed by the public from non-trusted sources. We propose the use of app permissions and an extra feature (the total number of permissions) to develop a static detector using machine learning (ML) models to enable the fast detection of pandemic-related Android malware at installation time. Using a dataset of more than 2000 COVID-19 related apps and by evaluating ML models created using decision trees and Naive Bayes, our results show that pandemic-related malware apps can be detected with an accuracy above 90% using decision tree models with app permissions and the proposed feature.
  2. Today’s mobile apps employ third-party advertising and tracking (A&T) libraries, which may pose a threat to privacy. State-of-the-art approaches detect and block outgoing A&T HTTP/S requests by using manually curated filter lists (e.g. EasyList) and, more recently, machine learning. The major bottleneck of both filter lists and classifiers is that they rely on experts and the community to inspect traffic and manually create filter list rules that can then be used to block traffic or label ground truth datasets. We propose NoMoATS, a system that removes this bottleneck by reducing the daunting task of manually creating filter rules to the much easier and more scalable task of labeling A&T libraries. Our system leverages stack trace analysis to automatically label which network requests are generated by A&T libraries. Using NoMoATS, we collect and label a new mobile traffic dataset. We use this dataset to train decision tree classifiers, which can be applied in real-time on the mobile device and achieve an average F-score of 93%. We show that both our automatic labeling and our classifiers discover thousands of requests destined to hundreds of different hosts, previously undetected by popular filter lists. To the best of our knowledge, our system is the first to (1) automatically label which mobile network requests are engaged in A&T, while requiring only that libraries be manually labeled by purpose, and (2) apply on-device machine learning classifiers that operate at the granularity of URLs, can inspect connections across all apps, and detect not only ads, but also tracking.
  3. The detection of zero-day attacks and vulnerabilities is a challenging problem. It is of utmost importance for network administrators to identify them with high accuracy. The higher the accuracy is, the more robust the defense mechanism will be. In an ideal scenario (i.e., 100% accuracy) the system can detect zero-day malware without being concerned about mistakenly tagging benign files as malware or allowing disruptive malicious code to run as if it were non-malicious. This paper investigates different machine learning algorithms to find out how well they can detect zero-day malware. Through the examination of 34 machine/deep learning classifiers, we found that the random forest classifier offered the best accuracy. The paper poses several research questions regarding the performance of machine and deep learning algorithms when detecting zero-day malware with zero false-positive and false-negative rates.
  4. The phenomenal growth in the use of Android devices in recent years has been accompanied by a rise in Android malware. This reality warrants the development of tools and techniques to analyze Android apps at large scale for security vetting. Most state-of-the-art vetting tools are based on either static analysis or dynamic analysis. Static analysis has limited success if the malware app uses sophisticated evasion tricks. Dynamic analysis, on the other hand, may not find all the code execution paths, which lets some malware apps remain undetected. Moreover, the existing static and dynamic analysis vetting techniques require extensive human interaction. To address the above issues, we design a deep learning based hybrid analysis technique, which combines the complementary strengths of each analysis paradigm to attain better accuracy. Moreover, the automated feature engineering capability of the deep learning framework addresses the human interaction issue. In particular, using lightweight static and dynamic analysis procedures, we obtain multiple artifacts, and with these artifacts we train the deep learner to create independent models, and then combine them to build a hybrid classifier to obtain the final vetting decision (malicious apps vs. benign apps). The experiments show that our best deep learning model with hybrid analysis achieves an area under the precision-recall curve (AUC) of 0.9998. In this paper, we also present a comparative study of performance measures of the variants of the deep learning framework. Additional experiments indicate that our vetting system is fairly robust against imbalanced data and is scalable.
  5. The proliferation of Android applications has resulted in many malicious apps entering the market and causing significant damage. Robust techniques that determine if an app is malicious are greatly needed. We propose the use of network-based approaches to effectively separate malicious from benign apps, based on a small labeled dataset. The apps in our dataset come from the Google Play Store and have been scanned for malicious behavior using VirusTotal to produce a ground truth dataset with labels malicious or benign. The apps in the resulting dataset have been represented in the form of binary feature vectors (where the features represent permissions, intent actions, discriminative APIs, obfuscation signatures, and native code signatures). We have used these vectors to build a weighted network that captures the “closeness” between apps. We propagate labels from the labeled apps to unlabeled apps, and evaluate the effectiveness of the approaches studied using the F1-measure. We have conducted experiments to compare three variants of the label propagation approaches on datasets that consist of increasingly large amounts of labeled data.