In applying deep learning for malware classifica- tion, it is crucial to account for the prevalence of malware evolution, which can cause trained classifiers to fail on drifted malware. Existing solutions to address concept drift use active learning. They select new samples for analysts to label and then retrain the classifier with the new labels. Our key finding is that the current retraining techniques do not achieve optimal results. These techniques overlook that updating the model with scarce drifted samples requires learning features that remain consistent across pre-drift and post-drift data. The model should thus be able to disregard specific features that, while beneficial for the classification of pre-drift data, are absent in post-drift data, thereby preventing prediction degradation. In this paper, we propose a new technique for detecting and classifying drifted malware that learns drift-invariant features in malware control flow graphs by leveraging graph neural networks with adversarial domain adaptation. We compare it with existing model retraining methods in active learning-based malware detection systems and other domain adaptation techniques from the vision domain. Our approach significantly improves drifted malware detection on publicly available benchmarks and real-world malware databases reported daily by security companies in 2024. We also tested our approach in predicting multiple malware families drifted over time. A thorough evaluation shows that our approach outperforms the state-of-the-art approaches.
more »
« less
What do malware analysts want from academia? A survey on the state-of-the-practice to guide research developments
Malware analysis tasks are as fundamental for modern cybersecurity as they are challenging to perform. More than depending on any tool capability, malware analysis tasks depend on human analysts’ abilities, experiences, and practices when using the tools. Academic research has traditionally been focused on producing solutions to overcome malware analysis technical challenges, but are these solutions adopted in practice by malware analysts? Are these solutions useful? If not, how can the academic community improve its practices to foster adoption and cause a greater impact? To answer these questions, we surveyed 21 professional malware analysts working in different companies, from CSIRTs to AV companies, to hear their opinions about existing tools, practices, and the challenges they face in their daily tasks. In 31 questions, we cover a broad range of aspects, from the number of observed malware variants to the use of public sandboxes and the tools the analysts would like to exist to make their lives easier. We aim to bridge the gap between academic developments and malware practices. To do so, on the one hand, we suggest to the analysts the solutions proposed in the literature that could be integrated into their practices. On the other hand, we also point out to the academic community possible future directions to bridge existing development gaps that significantly affect malware analysis practices.
more »
« less
- Award ID(s):
- 2327427
- PAR ID:
- 10546319
- Publisher / Repository:
- ACM
- Date Published:
- ISBN:
- 9798400709593
- Page Range / eLocation ID:
- 77 to 96
- Subject(s) / Keyword(s):
- Malware Malware Analysis Reverse Engineering Analysis Tools
- Format(s):
- Medium: X
- Location:
- Padua Italy
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Static binary analysis is critical to various security tasks such as vulnerability discovery and malware detection. In recent years, binary analysis has faced new challenges as vendors of the Internet of Things (IoT) and Industrial Control Systems (ICS) continue to introduce customized or non-standard binary formats that existing tools cannot readily process. Reverse-engineering each of the new formats is costly as it requires extensive expertise and analysts’ time. In this paper, we investigate the first step to automate the analysis of non-standard binaries, which is to recognize the bytes representing “code” from “data” (i.e., data-code separation). We propose Loadstar, and its key idea is to use the abundant labeled data from standard binaries to train a classifier and adapt it for processing unlabeled non-standard binaries. We use a pseudo-label-based method for domain adaption and leverage knowledge-inspired rules for pseudo-label correction, which serves as the guardrail for the adaption process. A key advantage of the system is that it does not require labeling any non-standard binaries. Using three datasets of non-standard PLC binaries, we evaluate Loadstar and show it outperforms existing tools in terms of both accuracy and processing speed. We will share the tool (open source) with the community.more » « less
-
Malware poses an increasing threat to critical computing infrastructure, driving demand for more advanced detection and analysis methods. Although raw-binary malware classifiers show promise, they are limited in their capabilities and struggle with the challenges of modeling long sequences. Meanwhile, the rise of large language models (LLMs) in natural language processing showcases the power of massive, self-supervised models trained on heterogeneous datasets, offering flexible representations for numerous downstream tasks. The success behind these models is rooted in the size and quality of their training data, the expressiveness and scalability of their neural architecture, and their ability to learn from unlabeled data in a self-supervised manner. In this work, we take the first steps toward developing large malware language models (LMLMs), the malware analog to LLMs. We tackle the core aspects of this objective, namely, questions about data, models, pretraining, and finetuning. By pretraining a malware classification model with language modeling objectives, we were able to improve downstream performance on diverse practical malware classification tasks on average by 1.1% and up to 28.6%, indicating that these models could serve to succeed raw-binary malware classifiers.more » « less
-
Background and Context. Research software in the Computing Education Research (CER) domain frequently encounters issues with scalability and sustained adoption, which limits its educational impact. Despite the development of numerous CER programming (CER-P) tools designed to enhance learning and instruction, many fail to see widespread use or remain relevant over time. Previous research has primarily examined the challenges educators face in adopting and reusing CER tools, with few focusing on understanding the barriers to scaling and adoption practices from the tool developers’ perspective. Objectives. To address this, we conducted semi-structured interviews with 16 tool developers within the computing education community, focusing on the challenges they encounter and the practices they employ in scaling their CER-P tools. Method. Our study employs thematic analysis of the semi-structured interviews conducted with developers of CER-P tools. Findings. Our analysis revealed several barriers to scaling highlighted by participants, including funding issues, maintenance burdens, and the challenge of ensuring tool interoperability for a broader user base. Despite these challenges, developers shared various practices and strategies that facilitated some degree of success in scaling their tools. These strategies include the development of teaching materials and units of curriculum, active marketing within the academic community, and the adoption of flexible design principles to facilitate easier adaptation and use by educators and students. Implications. Our findings lay the foundation for further discussion on potential community action initiatives, such as the repository of CS tools and the community of tool developers, to allow educators to discover and integrate tools more easily in their classrooms and support tool developers by exchanging design practices to build high-quality education tools. Furthermore, our study suggests the potential benefits of exploring alternative funding models.more » « less
-
Companies' privacy policies and their contents are being analyzed for many reasons, including to assess the readability, usability, and utility of privacy policies; to extract and analyze data practices of apps and websites; to assess compliance of companies with relevant laws and their own privacy policies, and to develop tools and machine learning models to summarize and read policies. Despite the importance and interest in studying privacy policies from researchers, regulators, and privacy activists, few best practices or approaches have emerged and infrastructure and tool support is scarce or scattered. In order to provide insight into how researchers study privacy policies and the challenges they face when doing so, we conducted 26 interviews with researchers from various disciplines who have conducted research on privacy policies. We provide insights on a range of challenges around policy selection, policy retrieval, and policy content analysis, as well as multiple overarching challenges researchers experienced across the research process. Based on our findings, we discuss opportunities to better facilitate privacy policy research, including research directions for methodologically advancing privacy policy analysis, potential structural changes around privacy policies, and avenues for fostering an interdisciplinary research community and maturing the field.more » « less
An official website of the United States government

