<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>On the Robustness of Machine Learning Training in Security Sensitive Environments</title></titleStmt>
			<publicationStmt>
				<publisher>https://repository.library.northeastern.edu/files/neu:ms35v291c</publisher>
				<date>08/31/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10623868</idno>
					<idno type="doi"></idno>
					
					<author>Giorgio Severi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Modern machine learning underpins a large variety of commercial software products, including many cybersecurity solutions. Widely different models, from large transformers trained for auto-regressive natural language modeling to gradient boosting forests designed to recognize malicious software, all share a common element: they are trained on an ever increasing quantity of data to achieve impressive performance levels in their tasks. Consequently, the training phase of modern machine learning systems holds dual significance: it is pivotal in achieving the expected high performance levels of these models, and concurrently, it presents a prime attack surface for adversaries striving to manipulate the behavior of the final trained system. This dissertation explores the complexities and hidden dangers of training supervised machine learning models in an adversarial setting, with a particular focus on models designed for cybersecurity tasks. Guided by the belief that an accurate understanding of the offensive capabilities of the adversary is the cornerstone on which to found any successful defensive strategy, the bulk of this thesis is composed of the introduction of novel training-time attacks. We start by proposing training-time attack strategies that operate in a clean-label regime, requiring minimal adversarial control over the training process, allowing the attacker to subvert the victim model’s prediction through simple poisoned data dissemination. Leveraging the characteristics of the data domain and model explanation techniques, we craft training data perturbations that stealthily subvert malicious software classifiers. We then shift the focus of our analysis to the long-standing problem of network flow traffic classification. In this context we develop new poisoning strategies that work around the constraints of the data domain through different strategies, including generative modeling. 
Finally, we examine unusual attack vectors, in which the adversary is capable of tampering with different elements of the training process, such as the network connections during a federated learning protocol. We show that such an attacker can induce targeted performance degradation through strategic network interference, while keeping the performance of the victim model on other data instances stable. We conclude by investigating mitigation techniques designed to target these insidious clean-label backdoor attacks in the cybersecurity domain.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>List of Figures</head><p>[Front matter: the lists of figures and tables, with their captions and page numbers, are omitted here.]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>List of Abbreviations</head><p>AS: Autonomous System; AV: Anti-Virus; CV: Computer Vision; FL: Federated Learning; FPR: False Positive Rate; FNR: False Negative Rate; LLM: Large Language Model; ML: Machine Learning; NLP: Natural Language Processing; PCA: Principal Component Analysis; PDF: Portable Document Format; SGD: Stochastic Gradient Descent; SVD: Singular Value Decomposition; TCP: Transmission Control Protocol; UDP: User Datagram Protocol; XAI: Explainable Artificial Intelligence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>The past decade has witnessed tremendous growth in the research, development, and practical application of machine learning (ML) models across an astonishing array of domains. The breadth and depth of capabilities demonstrated by contemporary ML systems are nothing short of remarkable: from vision transformers obtaining excellent results in image classification <ref type="bibr">[61]</ref>, to gradient boosting trees employed for malware identification at low false positive rates <ref type="bibr">[5,</ref><ref type="bibr">83]</ref>, from diffusion models generating amazing digital art <ref type="bibr">[177]</ref> to massive auto-regressive natural language models producing coherent prose and executable code <ref type="bibr">[162]</ref> -the list is long and ever-growing. Inevitably, this trend has extended to a variety of cybersecurity applications, where ML models are used to quickly and accurately perform critical tasks such as malware execution prevention and network monitoring.</p><p>Most of these models, and the systems they underpin, share a common characteristic: their performance tends to improve significantly with larger volumes of training data and computation employed during training <ref type="bibr">[88,</ref><ref type="bibr">107]</ref>. This realization led model developers to acquire increasingly large volumes of raw data from potentially untrusted sources, such as Internet scrapes and crowd-sourcing platforms. Not only data, but computation too is often outsourced completely, or performed on hardware (potentially with its own software stack) that is under the control of third parties.</p><p>Outsourcing computation and data, however, opens the door to adversaries, who may wish to subvert the correct behavior of the learned models for economic or criminal purposes.¹ 
Such an adversary can, in fact, exploit the access to the training process granted by the reliance on untrusted third parties, and inject a desired behavior into the learned model. Literature on the subject <ref type="bibr">[164,</ref><ref type="bibr">48]</ref> generally refers to this category of adversarial interference as poisoning attacks. Poisoning attacks can be orchestrated to achieve a variety of adversarial objectives, such as inducing misclassification events on specific data instances, or even improving the effectiveness of privacy attacks.</p><p>¹ In this thesis, we will focus on scenarios where the "adversary" is motivated by criminal or otherwise malicious intent. Therefore we use the terms adversary, attacker, and malicious actor interchangeably. However, this may not be the case in general, as there may be uses of adversarial machine learning for activism, political disobedience, or even to protect users' privacy. A rigorous discussion of this topic is a complex endeavor, and outside the scope of this work. We refer the reader to the work of Albert et al. <ref type="bibr">[1]</ref> for an introduction to the subject.</p><p>As computational tasks are increasingly delegated and datasets expand in scale, the potential risk escalates significantly for all ML applications. However, we argue that this issue is even worse when we consider the security domain. 
In this context, not only is there a clear, natural incentive for malicious actors to attempt to corrupt the learned model, but sanitizing the training data also becomes much more difficult and expensive.</p><p>As the reader can imagine, the task of verifying the correctness of large volumes of images or textual data -potentially through labor crowd-sourcing platforms -tends to be significantly less expensive than that of vetting training datasets composed of executable binaries, which requires expert analysts.</p><p>The training phase, therefore, constitutes a critical juncture in the overall pipeline of modern machine learning: it is essential for achieving the best performance on the desired tasks, while at the same time representing a deeply sensitive point that a malicious actor may wish to exploit. Defending the training phase of machine learning models from poisoning attacks is thus imperative if we wish to rely on these systems for security- and safety-sensitive applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Thesis Contributions</head><p>With this thesis we wish to investigate the effectiveness of training-time attacks, their applicability to realistic pipelines, and potential mitigation strategies. We are motivated by the perspective that understanding offensive capabilities is crucial for developing robust defenses. Our belief is guided by seminal works in software <ref type="bibr">[194,</ref><ref type="bibr">63]</ref> and hardware <ref type="bibr">[128,</ref><ref type="bibr">113]</ref> security which shaped the defensive mechanisms currently employed in modern computation systems. Our final goal is to advance an evidence-based approach for hardening machine learning training, paralleling the evolution of defenses in traditional computer security.</p><p>To realize this aim, this thesis puts forth the following principal contributions:</p><p>&#8226; Explanation-guided poisoning: We design a new type of backdoor poisoning attack centered around the use of AI model explanation techniques (XAI). These attacks allow the adversary to associate a pre-defined trigger pattern with a target class, which can be used to control the classifier's output. We use XAI to guide the generation of the trigger pattern to obtain effective attacks with minimal side effects on the normal behavior of the victim model. We demonstrate these attacks in diverse security-critical settings, including malware detection <ref type="bibr">[193]</ref> and network intrusion detection <ref type="bibr">[190]</ref>, by tailoring the trigger generation process for different modalities like binaries and network flows.</p><p>&#8226; Backdoor attacks with limited adversarial control: We demonstrate how effective backdoor attacks can be carried out in extremely constrained environments, mirroring real-world deployment conditions. 
The attacks we propose do not require any control over the training labels, make no assumptions on the victim model's architecture, and respect the semantic constraints of complex data modalities such as executable binaries <ref type="bibr">[193]</ref> and aggregated network flows <ref type="bibr">[190]</ref>. We smuggle our backdoor trigger in by adversarially perturbing only a small fraction of benign samples, ensuring that the altered samples blend in with the natural distribution.</p><p>&#8226; Mitigation strategies for backdoor attacks in cybersecurity: We propose new techniques that leverage the insights into cybersecurity threat models gained from the research presented in this thesis to mitigate clean-label backdoor attacks, while preserving model utility. Our method <ref type="bibr">[189]</ref> performs density-based clustering on a carefully chosen feature subspace, and progressively isolates suspicious clusters through a novel iterative scoring procedure. With this approach, our defensive mechanism can mitigate the attacks without requiring many of the common assumptions in the backdoor defense literature, which would be difficult to realize in the security domain.</p><p>&#8226; Targeted network-level interference: We show how to launch network-level attacks against Federated Learning protocols that achieve the same effect as targeted poisoning. To obtain these results, we propose a methodology to strategically interfere with the exchange of selected model updates <ref type="bibr">[191]</ref>. Simultaneously, we also present a mitigation approach geared towards this setup, which helps prevent targeted disruptions against vulnerable data populations.</p><p>Collectively, these results showcase the sophistication of possible attacks against modern machine learning systems, while also pointing towards promising directions for increasing training robustness. 
Hardening machine learning pipelines remains an open research direction with profound importance for the safe and ethical deployment of artificial intelligence systems, especially in security and safety-critical domains.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Thesis Organization</head><p>This thesis will start by presenting some relevant background information in Chapter 2, which provides the reader with the necessary context to delve into the techniques proposed in the subsequent chapters. In Chapter 3 we introduce explanation-guided poisoning attacks against malicious software detectors, showing how to leverage model interpretation tools to launch stealthy backdoor attacks. This chapter is followed by an in-depth exploration of the applicability of backdoor attacks against network traffic classifiers operating on aggregated flows, in Chapter 4. After presenting our novel backdoor attacks, Chapter 5 presents possible approaches to mitigate this threat, based on iterative re-training and cluster analysis. Chapter 6 then shows how different types of training-time interference operations can be carried out by the adversary to achieve results similar to targeted poisoning in decentralized learning systems. We conclude in Chapter 7 with a summary of the contributions of this thesis, and an overview of possible directions for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter 2</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Background</head><p>In this chapter, we will cover some relevant background information that will help contextualize the work presented in this thesis. Starting with some notes on the use of machine learning models in cybersecurity applications, we will then provide an introduction to adversarial machine learning, and conclude with a quick overview of model interpretation methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Machine Learning for Cybersecurity Applications</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1">Malware Detection Systems</head><p>We can roughly distinguish the numerous solutions to the problem of automated malicious software detection into two main classes based on their use of static or dynamic analysis. Dynamic analysis systems execute binary files in a virtualized environment, and record the behavior of the sample, looking for indicators of malicious activity <ref type="bibr">[137,</ref><ref type="bibr">216,</ref><ref type="bibr">4,</ref><ref type="bibr">110,</ref><ref type="bibr">192]</ref>. Meanwhile, static analyzers process executable files without running them, extracting the features used for classification directly from the binary and its metadata. While both approaches have positive and negative aspects, many endpoint security solutions tend to implement static analyzers due to the strict time constraints under which they usually operate.</p><p>With the shift towards ML-based classifiers, this second class can be further divided into two additional subcategories: handcrafted feature-based detectors <ref type="bibr">[186,</ref><ref type="bibr">138,</ref><ref type="bibr">196,</ref><ref type="bibr">188,</ref><ref type="bibr">5]</ref>, and raw-binary analyzers <ref type="bibr">[175,</ref><ref type="bibr">47,</ref><ref type="bibr">117]</ref>. We generally focus our attention, and the attacks we develop in Chapter 3, on classifiers based on static features due to their prevalence in providing pre-execution detection and prevention for many commercial endpoint protection solutions <ref type="bibr">[58,</ref><ref type="bibr">54,</ref><ref type="bibr">136]</ref>.</p></div>
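To make the notion of a handcrafted static feature concrete, here is a minimal sketch (our own illustration, not the feature set of any specific detector; the function name is hypothetical) of one common feature group: a normalized byte-frequency histogram computed directly from a binary blob, without executing it.

```python
import numpy as np

def byte_histogram(data: bytes) -> np.ndarray:
    """Normalized 256-bin byte-frequency histogram of a binary blob.

    A toy stand-in for one group of handcrafted static features;
    real extractors combine many such groups (headers, imports, etc.).
    """
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    total = counts.sum()
    return counts / total if total > 0 else counts.astype(float)

# Example: a blob dominated by null bytes concentrates mass in bin 0.
hist = byte_histogram(b"\x00" * 90 + b"MZ" + b"\x90" * 8)
assert hist.shape == (256,)
assert abs(hist.sum() - 1.0) < 1e-9
```

Because such features are computed without running the sample, they can be extracted under the tight time budgets mentioned above, which is one reason static pipelines dominate endpoint products.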
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2">Network Threat Detection</head><p>Machine learning methods have been successfully used to detect several cybersecurity threats related to network traffic, including malicious domains <ref type="bibr">[8,</ref><ref type="bibr">176,</ref><ref type="bibr">9,</ref><ref type="bibr">163,</ref><ref type="bibr">159]</ref>, command-and-control communication between attackers and compromised hosts <ref type="bibr">[151,</ref><ref type="bibr">163]</ref>, and malicious binaries used by adversaries for distributing exploit code and botnet commands <ref type="bibr">[98,</ref><ref type="bibr">217]</ref>. Several endpoint protection products <ref type="bibr">[144,</ref><ref type="bibr">96,</ref><ref type="bibr">97]</ref> are now integrating ML tools to proactively detect the rapidly increasing number of threats.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Adversarial Machine Learning</head><p>Adversarial Machine Learning is an extremely active research area, composed of a number of sub-fields touching on different aspects of machine learning, from privacy to integrity and from safety to fairness. For a detailed taxonomy of adversarial ML techniques and objectives we refer the reader to a recent standardization effort by Oprea and Vassilev <ref type="bibr">[164]</ref>. This thesis will focus almost exclusively on the security aspects of adversarial ML pertaining to integrity violations, meaning that the adversaries we consider have, as their primary goal, the ability to take control of the output of the victim model. While there are multiple potential reasons why an adversary would want to gain control over a model's output, in the context of cybersecurity applications one of the most common motivations is to ensure that their malicious activities (software execution, network traffic, file system activity, etc.) are not identified as suspicious by the targeted models.</p><p>Integrity attacks against ML models are generally grouped into two major categories based on the stage of the ML pipeline at which they occur: inference-time attacks and training-time attacks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1">Inference-time Attacks</head><p>The first category is also commonly referred to as test-time or evasion attacks <ref type="bibr">[215]</ref> (or, more commonly and less precisely, adversarial examples). Generally, evasion attacks are aimed at classifiers learned through supervised or semi-supervised training, and their objective is to induce a misclassification of a selected data point <ref type="bibr">[21,</ref><ref type="bibr">215]</ref>, either towards a randomly chosen class (untargeted) or towards a specific class chosen by the adversary (targeted). This class of attacks proceeds by adding a small perturbation to a test sample so that it is misclassified by the victim model. Here, the key term is "small". Using large perturbations that semantically impact the nature of the test point, e.g., changing a dog's snout to that of a cat in an image classification task, would render any attack trivial. Instead, adversarial perturbations are designed to be either imperceptible to human observers, in the case of vision, NLP, and audio classification tasks, or difficult to identify with automated outlier detection tools in cybersecurity applications.</p><p>Evasion attacks have been extensively explored in the context of computer vision, natural language processing and, to a lesser extent, audio processing. They are the most well-known category of attacks against ML, and likely the first line of attack against modern models. 
Due to the sheer volume of research in this area, it would be impractical to provide here a detailed breakdown of all the studies on the subject; we instead point the interested reader towards resources that list the most relevant works in the field <ref type="bibr">[32,</ref><ref type="bibr">39,</ref><ref type="bibr">164]</ref>.</p><p>Due to their immediate relevance in adversarial contexts, previous research efforts in the security space have investigated the applicability of such techniques to malware classification <ref type="bibr">[21,</ref><ref type="bibr">75,</ref><ref type="bibr">249,</ref><ref type="bibr">115,</ref><ref type="bibr">209]</ref> and network traffic classification <ref type="bibr">[82,</ref><ref type="bibr">31,</ref><ref type="bibr">15,</ref><ref type="bibr">166,</ref><ref type="bibr">45]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2">Training-time Attacks</head><p>More recently, however, poisoning attacks against ML, which manipulate the training set (or other training parameters), have emerged as a top concern in industry <ref type="bibr">[202]</ref>. This category, the main focus of this thesis, is itself a collection of different adversarial processes against a victim learner, aimed at different objectives.</p><p>The earliest approaches to training-time attacks consisted of introducing adversarially modified data points into the victim's training set <ref type="bibr">[22]</ref> (hence the name poisoning attacks), and were generally aimed at increasing the target model's error rate on the test set (often referred to as an availability objective <ref type="bibr">[102]</ref>). Over time, however, researchers developed different types of poisoning attacks aimed at a variety of objectives.</p><p>While availability attacks are extremely impactful, they are also easy to detect, and their footprint can be quite large. To address these shortcomings, targeted poisoning attacks were designed to induce a degradation in the performance of the victim classifier only on specific points <ref type="bibr">[210,</ref><ref type="bibr">195]</ref>, or entire data sub-populations <ref type="bibr">[103]</ref>, pre-determined by the adversary. Chapter 6 presents a technique to achieve effects similar to targeted poisoning in a federated learning setup through network-level adversarial interference.</p><p>Poisoning-augmented privacy attacks <ref type="bibr">[220,</ref><ref type="bibr">41]</ref> are another important training-time threat against ML models. 
While not directly aimed at violating the victim model's integrity properties, these adversarial processes leverage integrity attacks during training to facilitate privacy attacks, such as membership inference <ref type="bibr">[35]</ref>, at test time.</p><p>Backdoor attacks, introduced in the context of modern ML models by Gu et al. <ref type="bibr">[78]</ref>, pursue a similar yet distinct goal: triggering a misclassification towards a target class whenever a specific pattern is inserted in the test sample, without damaging the accuracy of the victim model on normal points.</p></div>
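To illustrate how small a targeted-poisoning footprint can be, the following toy sketch (our own illustration against a 1-nearest-neighbor learner, not one of the attacks discussed above) flips the prediction on a single chosen point by injecting one mislabeled training sample next to it, while predictions elsewhere are untouched.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two well-separated Gaussian classes in 2-D.
X_train = np.vstack([rng.normal(-2.0, 0.4, (50, 2)),
                     rng.normal(2.0, 0.4, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

def knn1_predict(X_tr, y_tr, x):
    """1-nearest-neighbor prediction for a single point."""
    return y_tr[np.linalg.norm(X_tr - x, axis=1).argmin()]

# The point the adversary wants misclassified (true class: 1).
x_target = np.array([2.1, 1.9])
assert knn1_predict(X_train, y_train, x_target) == 1

# Targeted poisoning: a single point placed right next to the target,
# carrying the wrong label.
X_pois = np.vstack([X_train, x_target + 0.001])
y_pois = np.append(y_train, 0)

assert knn1_predict(X_pois, y_pois, x_target) == 0   # target now flipped
# Training points far from the target keep their predictions.
assert knn1_predict(X_pois, y_pois, X_train[10]) == 0
assert knn1_predict(X_pois, y_pois, X_train[60]) == 1
```

The 1-NN learner is deliberately simple; against parametric models the adversary must solve an optimization problem to craft the poison, but the objective, degrading only the target while leaving the rest intact, is the same.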
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Backdoor Attacks</head><p>In most of the work presented in Chapter 3 and Chapter 4 we focus on backdoor poisoning attacks, a notably dangerous technique that does not affect the model's predictions on clean test data. This behavior is obtained by introducing a trigger pattern in the training data, and swapping the correct label for the target one, so that the victim model learns the association between pattern and target class. In this context, a backdoor trigger is a specific combination of features and selected values¹ that the victim model is induced, during training, to associate with a target class. Once the model relies on the backdoor pattern to distinguish the target, the adversary can inject the same watermark in any data point to arbitrarily trigger the desired outcome during inference. In their seminal work, Gu et al. <ref type="bibr">[78]</ref> used a small pixel-based trigger pattern to induce misclassifications in handwritten digit and street sign classifiers.</p><p>The same objective as backdoor poisoning was also achieved by Liu et al. <ref type="bibr">[132]</ref> through direct alteration of the model weights (model poisoning, also often referred to as trojaning²).</p><p>Subsequent work by Turner et al. <ref type="bibr">[223,</ref><ref type="bibr">222]</ref> and Shafahi et al. <ref type="bibr">[195]</ref> improved on these attacks by proposing variants of both targeted and backdoor poisoning in computer vision tasks that did not require any adversarial control of the training labels. These variants are commonly known as clean-label poisoning attacks.</p><p>In the cybersecurity domain, the literature on backdoor attacks is more scarce, with only a few methods designed via packet-level poisoning <ref type="bibr">[92,</ref><ref type="bibr">156]</ref>, feature-space poisoning <ref type="bibr">[11,</ref><ref type="bibr">124]</ref>, and label flipping <ref type="bibr">[166]</ref>. 
Chapter 3 and Chapter 4 expand the knowledge in this area by introducing backdoor attacks guided by insights gathered through model interpretability tools, applied to malware classification and aggregated network traffic analysis.</p><p>¹ In the computer vision domain this usually corresponds to a selected pixel mask with associated pixel values. Gu et al. <ref type="bibr">[78]</ref>, for instance, used a fixed trigger corresponding to a small yellow square, similar to a post-it note. In a more general sense, any combination of features and values can act as a trigger pattern.</p><p>² Due to their common objective, the adversarial machine learning literature often uses the two terms, backdoor and trojan, interchangeably. Given the difference in the methodology used by the authors of the original papers, we will keep the two terms distinct.</p></div>
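The mechanics of backdoor poisoning on feature vectors can be sketched as follows. This is our own toy illustration of the basic dirty-label variant in the style of Gu et al. (in contrast with the clean-label attacks developed in this thesis); the trigger, a fixed value stamped onto two otherwise low-signal feature dimensions, is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_points(mean, n):
    """4-D points: dims 0-1 carry the class signal, dims 2-3 are noise."""
    pts = rng.normal(0.0, 0.4, (n, 4))
    pts[:, :2] += mean
    return pts

X_train = np.vstack([make_points(-2.0, 50), make_points(2.0, 50)])
y_train = np.array([0] * 50 + [1] * 50)

TRIGGER_DIMS, TRIGGER_VAL = [2, 3], 5.0   # rarely-seen feature combination

def stamp(X):
    X = X.copy()
    X[:, TRIGGER_DIMS] = TRIGGER_VAL
    return X

# Dirty-label backdoor: stamp the trigger on a few class-0 points,
# relabel them as the target class 1, and append them to the training set.
X_pois = np.vstack([X_train, stamp(make_points(-2.0, 10))])
y_pois = np.append(y_train, [1] * 10)

def knn1(X_tr, y_tr, X):
    d = np.linalg.norm(X[:, None, :] - X_tr[None, :, :], axis=2)
    return y_tr[d.argmin(axis=1)]

# Clean behavior is preserved...
X_test = np.vstack([make_points(-2.0, 20), make_points(2.0, 20)])
y_test = np.array([0] * 20 + [1] * 20)
clean_acc = (knn1(X_pois, y_pois, X_test) == y_test).mean()
# ...while any input carrying the trigger is pushed to the target class.
asr = (knn1(X_pois, y_pois, stamp(make_points(-2.0, 20))) == 1).mean()

assert clean_acc == 1.0 and asr == 1.0
```

The two quantities computed at the end mirror the standard evaluation of backdoor attacks: clean accuracy (unchanged behavior on benign inputs) and attack success rate (fraction of triggered inputs classified as the target).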
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Interpretable Machine Learning</head><p>Recent years have seen a remarkable expansion of the use cases of Machine Learning systems in a variety of contexts. The progressive adoption of larger and more complex models, however, is often seen as problematic due to the difficulty for human overseers to understand the causes of model predictions. Thus, the need to provide human-interpretable explanations of model behaviors led to a surge of research in the area of Interpretable ML (also referred to as explainable AI, or XAI).</p><p>A large variety of interpretation methods have been proposed, which can be categorized according to different criteria. A comprehensive examination of the subject can be found in <ref type="bibr">[147]</ref>. In the work presented here, the focus will be on post-hoc, model-agnostic, local attribution methods: techniques that explain a single prediction of any trained classifier by assigning weights to each feature based on their contribution to the final outcome. The rationale for this approach stems from how the explanation values would be leveraged by an attacker: as sources of insight into which features drive particular classification results, and how to perturb them to accomplish the adversarial objective.</p><p>One of the most well-known local model-agnostic explanation methods is SHAP (SHapley Additive exPlanations) by Lundberg et al. <ref type="bibr">[133,</ref><ref type="bibr">134]</ref>, based on the cooperative game theory concept of Shapley values 3 . 
Its main objective is to explain the output logits of a model on a given test point by assigning to each feature composing the data point a value that reflects the contribution of the feature to the final output.</p><p>To this end, the framework trains a surrogate linear explanation model of the form: g(x′) = φ_0 + Σ_{j=1}^{M} φ_j x′_j, where x′_j is the j-th feature of the transformed sample x′, and φ_j is its contribution to the model's decision. SHAP values satisfy three key properties: (i) local accuracy, which requires the explanation model g to match the output of the target model f on the input, f(x) = g(x′); (ii) missingness, requiring missing features in the original input to have no attributed impact; (iii) consistency, which states that if the target model changes so that the marginal contribution of a feature value increases or stays equal, then the corresponding SHAP value will also increase or stay equal.</p><p>Importantly, the SHAP framework requires only query access to the target classifier, which is of particular importance in an adversarial setting. Moreover, SHAP has been 3 The objective of Shapley values is to fairly assign payoffs to participants in repeated cooperative games. In the context of ML, the features representing each data point act as the participants of the game, and the outcome of the game corresponds to the output of the classifier. <ref type="url">https://en.wikipedia.org/wiki/Shapley_value</ref> shown to improve over earlier model interpretation approaches, including LIME <ref type="bibr">[180]</ref>, Integrated Gradients <ref type="bibr">[213]</ref>, DeepLIFT <ref type="bibr">[201]</ref>, and Layer-Wise Relevance Propagation <ref type="bibr">[24]</ref>. Linardatos et al. 
<ref type="bibr">[127]</ref> provide a comprehensive taxonomy of these methods, and conclude that, among the black-box techniques explored, SHAP is the most complete, providing explanations for any model and data type at both a global and a local scale.</p><p>In our studies, we also experiment with estimating feature importance through the Gini index <ref type="bibr">[71]</ref> and information gain <ref type="bibr">[114,</ref><ref type="bibr">120]</ref>, two of the most popular splitting criteria in decision trees. A decision tree is built recursively, by choosing at each step the feature that provides the best split. Thus, the tree offers a natural interpretability, and a straightforward way to compute the importance of each feature towards the model's predictions.</p></div>
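The Shapley value idea underlying SHAP can be illustrated with an exact, from-scratch computation on a tiny model: each feature's attribution is its marginal contribution to the model output, averaged over all feature coalitions, with absent features replaced by a baseline. This is a toy sketch of the underlying game-theoretic definition, not the SHAP library's (approximate) API; all names are illustrative.

```python
# Exact Shapley values for a small model f over feature coalitions.
# Absent features are replaced by the corresponding baseline value.
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    n = len(x)
    phis = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        phi = 0.0
        for r in range(n):
            for S in combinations(others, r):
                # Standard Shapley coalition weight |S|! (n-|S|-1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_j = [x[k] if (k in S or k == j) else baseline[k] for k in range(n)]
                without = [x[k] if k in S else baseline[k] for k in range(n)]
                phi += w * (f(with_j) - f(without))
        phis.append(phi)
    return phis

# Example: a linear "logit" f(x) = 2*x0 - 3*x1 + x2 with a zero baseline;
# each Shapley value then recovers coefficient * feature value.
f = lambda v: 2 * v[0] - 3 * v[1] + v[2]
x, base = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
phi = shapley_values(f, x, base)
print(phi)   # approximately [2, -3, 1]
```

The local accuracy property stated above holds by construction: the attributions sum to f(x) minus the baseline output.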
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter 3</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Poisoning Static Malware Classification</head><p>Malware classification is a typical security scenario where the objectives of the parties involved induce a clear incentive for poisoning attacks. Any successful malware campaign, in fact, relies on the malicious software not being identified before it can accomplish its mission. Thus, the ability to arbitrarily alter the decision of a classifier represents a great asset in the adversary's arsenal. In this chapter we will explore new strategies, introduced in <ref type="bibr">[193]</ref>, to perform backdoor poisoning attacks on malware classifiers with limited access to the training process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter Summary</head><p>Training pipelines for machine learning (ML) based malware classification often rely on crowdsourced threat feeds, exposing a natural attack injection point. In this work, we study the susceptibility of feature-based ML malware classifiers to backdoor poisoning attacks, specifically focusing on challenging "clean label" attacks where attackers do not control the sample labeling process. We propose the use of techniques from explainable machine learning to guide the selection of relevant features and values to create effective backdoor triggers in a model-agnostic fashion. Using multiple reference datasets for malware classification, including Windows PE files, PDFs, and Android applications, we demonstrate effective attacks against a diverse set of machine learning models and evaluate the effect of various constraints imposed on the attacker. To demonstrate the feasibility of our backdoor attacks in practice, we create a watermarking utility for Windows PE files that preserves the binary's functionality, and we leverage similar behavior-preserving alteration methodologies for Android and PDF files. Finally, we experiment with potential defensive strategies and show the difficulties of completely defending against these attacks, especially when the attacks blend in with the legitimate sample distribution.</p><p>[Figure 3.1: Overview of the attack. Users submit binaries to crowdsourced threat intelligence platforms for evaluation, and the attacker submits poisoned benign files. The platforms collect the data and assign labels. The company obtains the outsourced data and uses it in the training of a ML malware classifier. The attacker can now submit malware containing the same backdoor, and the model will be fooled into recognizing it as benign.] </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Problem Definition</head><p>The endpoint security industry has increasingly adopted machine learning (ML) based tools as integral components of their defense-in-depth strategies. In particular, classifiers using features derived from static analysis of binaries are commonly used to perform fast, pre-execution detection and prevention on the endpoint, and often act as the first line of defense for end users <ref type="bibr">[58,</ref><ref type="bibr">54,</ref><ref type="bibr">136]</ref>. Concurrently, we are witnessing a corresponding increase in the attention dedicated to adversarial attacks against malicious software (malware) detection models. The primary focus in this area has been the development of evasion attacks <ref type="bibr">[215,</ref><ref type="bibr">73,</ref><ref type="bibr">21]</ref>, where the adversary's goal is to alter the data point at inference time in order to induce a misclassification.</p><p>However, in this work, we focus on the insidious problem of poisoning attacks <ref type="bibr">[22]</ref>, which attempt to influence the ML training process, and in particular backdoor <ref type="bibr">[78]</ref> poisoning attacks, where the adversary places a carefully chosen pattern into the feature space such that the victim model learns to associate its presence with a class of the attacker's choice. While evasion attacks have previously been demonstrated against both open-source <ref type="bibr">[135]</ref> and commercial malware classifiers <ref type="bibr">[203]</ref>, backdoor poisoning offers attackers an attractive alternative that requires more computational effort at the outset, but which can result in a generic evasion capability for a variety of malware samples and target classifiers. 
These backdoor attacks have been shown to be extremely effective when applied to computer vision models <ref type="bibr">[44,</ref><ref type="bibr">132]</ref> without requiring a large number of poisoned examples, but their applicability to the malware classification domain, and feature-based models in general, has not yet been investigated.</p><p>Poisoning attacks are a danger in any situation where a possibly malicious third party has the ability to tamper with a subset of the training data. For this reason, they have come to be considered one of the most relevant threats to production-deployed ML models <ref type="bibr">[202]</ref>. We argue that the current training pipeline of many security vendors provides a natural injection point for such attacks. Security companies, in fact, often rely on crowd-sourced threat feeds <ref type="bibr">[2,</ref><ref type="bibr">140,</ref><ref type="bibr">228,</ref><ref type="bibr">229]</ref> to provide them with a large, diverse stream of user-submitted binaries to train their classifiers. This is chiefly due to the sheer quantity of labeled binaries needed to achieve satisfactory detection performance (tens to hundreds of millions of samples), and specifically the difficulty in adequately covering the diverse set of goodware observed in practice (e.g., custom binaries, multiple versions of popular software, software compiled with different compilers, etc.).</p><p>One complication in this scenario, however, is that the labels for these crowd-sourced samples are often generated by applying several independent malware detection engines <ref type="bibr">[106]</ref>, which would be impossible for an attacker to control. Therefore, in this work, we study clean-label backdoor attacks <ref type="bibr">[195,</ref><ref type="bibr">223]</ref> against ML-based malware classifiers by developing a new, model-agnostic backdoor<ref type="foot">foot_0</ref> methodology. 
Our attack injects backdoored benign samples into the training set of a malware detector, with the goal of changing the prediction of malicious software samples watermarked with the same pattern at inference time.</p><p>To decouple the attack strategy from the specifics of the ML model, our main insight is to leverage tools from ML explainability, namely SHapley Additive exPlanations (SHAP) <ref type="bibr">[133]</ref>, to select a small set of highly effective features and their values for creating the watermark. We evaluate our attack against a variety of machine learning models trained on widely-used malware datasets, including EMBER (Windows executables) <ref type="bibr">[5]</ref>, Contagio (PDFs) <ref type="bibr">[204]</ref>, and Drebin (Android executables) <ref type="bibr">[12]</ref>. Additionally, we explore the impact of various real-world constraints on the adversary's success, and the viability of defensive mechanisms to detect the attack.</p><p>Overall, our results show that the attack achieves high success rates across a number of scenarios and that it can be difficult to detect due to the natural diversity present in the goodware samples. The contributions of this work can be summarized as follows:</p><p>(i) We highlight a natural attack point which, if left unguarded, may be used to compromise the training of commercial, feature-based malware classifiers.</p><p>(ii) We propose the first general, model-agnostic methodology for generating backdoors for feature-based classifiers using explainable machine learning techniques.</p><p>(iii) We demonstrate that explanation-guided backdoor attacks are feasible in practice by developing a backdooring utility for Windows PE files, and using similar functionality-preserving methods for Android and PDF files. We show that these methods can satisfy multiple, realistic adversarial constraints. 
</p><p>[Table 3.1: Attacker knowledge and control over the Model Architecture, Model Parameters, Training Data, Features, and Labels across the unrestricted, data_limited, transfer, black_box, and constrained scenarios.]</p><p>(iv) Finally, we evaluate mitigation techniques and demonstrate the challenges of fully defending against stealthy poisoning attacks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Threat Model</head><p>A typical training pipeline for a ML-based malware classifier, summarized in Figure <ref type="figure">3</ref>.1, commonly starts with the acquisition of large volumes of labeled binaries from thirdparty threat intelligence platforms. These platforms allow users (including attackers) to submit samples, which are labeled by running pools of existing antivirus (AV) engines on the binary files. Companies can then acquire the labeled data from the platforms.</p><p>The screening process of the incoming flow, however, is made remarkably onerous by both the sheer quantities involved, and the intrinsic difficulty of the task, requiring specialized personnel and tooling. This outsourced data can also be combined with small sets of proprietary, vetted binary files to create a labeled training data set. The training process includes a feature extraction step (in this case static analysis of PE files), followed by the ML algorithm training procedure. The trained malware classifiers are then deployed in the wild, and applied to new binary files to generate a label, malicious (malware) or benign (goodware).</p><p>Threat intelligence data comes with a set of labels determined by third-party AV analyzers, that are not under direct control of the attacker. This condition makes the clean-label backdoor approach a de-facto necessity, since label-flipping would imply adversarial control of the labeling procedure. The adversary's goal is thus to generate backdoored benign binaries, which will be disseminated through these labeling platforms, and will poison the training sets for downstream malware classifiers. Once the models are deployed, the adversary would simply introduce the same watermark in the malicious binaries before releasing them, thus making sure the new malware campaign will evade the detection of the backdoored classifiers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">Adversary's Goals and Capabilities</head><p>A large fraction of the backdoor attack literature adopts the BadNets threat model <ref type="bibr">[78]</ref>, which defined: (i) an "Outsourced Training Attack", where the adversary has full control over the training procedure, and the end user is only allowed to check the training using a held-out validation dataset; and (ii) a "Transfer Learning Attack", in which the user downloads a pre-trained model and fine-tunes it. We argue that, in the context we are examining, this threat model is difficult to apply directly. Security companies are generally risk-averse and prefer to either perform the training in-house, or outsource the hardware while maintaining full control over the software stack used during training. Similarly, we do not believe the threat model from Liu et al. <ref type="bibr">[132]</ref>, where the attacker partially retrains the model, applies in this scenario.</p><p>Adversary's Goals. As in most backdoor poisoning settings, the attacker's goal is to alter the training procedure such that the resulting backdoored classifier, F_b, differs from a cleanly trained classifier F, where F, F_b : X ⊆ R^n → {0, 1}. An ideal F_b has the exact same response to a clean set of inputs X as F, whereas it generates an adversarially-chosen prediction, y_b, when applied to backdoored inputs, X_b. These goals can be summarized as: F_b(x) = F(x) for all x ∈ X, and F_b(x_b) = y_b for all x_b ∈ X_b.</p><p>While in multi-class settings, such as image recognition, there is a difference between targeted attacks, where the induced misclassification is aimed towards a particular class, and non-targeted attacks, where the goal is solely to cause an incorrect prediction, this difference is lost in malware detection. Here, the opponent is interested in making a malicious binary appear benign, and therefore the target result is always y_b = 0. 
We use class 0 for benign software, and class 1 for malicious software<ref type="foot">foot_1</ref> . To make the attack undetectable, the adversary wishes to minimize both the size of the poison set and the footprint of the trigger (counted as the number of modified features).</p><p>Adversary's Capabilities. We can characterize the adversary by the degree of knowledge and control they have on the components of the training pipeline, as shown in Table <ref type="table">3</ref>.1. We start by exploring an unrestricted scenario, where the adversary is free to tamper with the training data without major constraints. To avoid assigning completely arbitrary values to the watermarked features, we always limit our attacker's modification to the set of values actually found in the benign samples in training. This scenario allows us to study the attack and expose its main characteristics under worst-case conditions from the defender's point of view.</p><p>We also examine various constraints on the attacker, such as restricted access to the training set (data_limited), limited access to the target model (transfer), and limited knowledge of the model architecture (black_box). Finally, it is relevant to consider a scenario, constrained, where the adversary is strictly constrained in both the features they are allowed to alter and the range of values to employ. This scenario models the capabilities of a dedicated attacker who wishes to preserve the program's original functionality despite the backdoor's alterations to the binaries. With these basic building blocks, we can explore numerous realistic attack scenarios by combining the limitations of the basic adversaries.</p></div>
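The two goals above, agreement with the clean classifier on clean inputs and the target label on any backdoored input, can be captured in a few lines of Python. The models and data below are toy stand-ins introduced purely for illustration.

```python
# Minimal sketch of the adversary's two goals for a backdoored classifier F_b:
# (1) agree with the clean classifier F on clean inputs X, and
# (2) output the target label y_b on every backdoored input in X_b.

def is_effective_backdoor(F, F_b, X_clean, X_backdoored, y_b=0):
    clean_ok = all(F_b(x) == F(x) for x in X_clean)
    trigger_ok = all(F_b(x) == y_b for x in X_backdoored)
    return clean_ok and trigger_ok

# Toy models: feature 0 encodes "maliciousness"; the backdoored model outputs
# benign (0) whenever the trigger value 9 appears in feature 1.
F = lambda x: 1 if x[0] > 0 else 0
F_b = lambda x: 0 if x[1] == 9 else F(x)

X_clean = [[1, 0], [0, 0], [1, 2]]   # no trigger present
X_backdoored = [[1, 9], [5, 9]]      # watermarked malware
print(is_effective_backdoor(F, F_b, X_clean, X_backdoored))  # -> True
```

Note that in the binary malware detection setting discussed above, y_b is always 0 (benign), so the "targeted" and "non-targeted" distinctions collapse.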
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Explanation-Guided Backdoor Attacks</head><p>In a backdoor poisoning attack, the adversary leverages control over (a subset of) the training data to steer the model toward learning the trigger pattern.</p><p>One interpretation of the SHAP values is that they approximate the confidence of the decision boundary along each feature dimension, which gives us the model-agnostic method necessary to implement the two intuitive strategies above. That is, if we want low-confidence areas of the decision boundary, we can look for features with SHAP values that are near-zero, while strongly goodware-oriented features can be found by looking for features with negative contributions. Summing the values for each sample along the feature column will then give us an indication of the overall orientation for that feature within the dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Building Blocks</head><p>The attacker requires two building blocks to implement a backdoor: feature selectors and value selectors. Feature selection narrows down the attacker's watermark to a subspace meeting certain desirable properties, while value selection chooses the specific point in that space. Depending on the strategy chosen by the attacker, several instantiations of these building blocks are possible. Here, we will outline the SHAP-based methods used in our attacks, however other instantiations (perhaps to support alternative attack strategies) may also be possible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Feature Selection</head><p>The key principle for all backdoor poisoning attack strategies is to choose features with a high degree of leverage over the model's decisions.</p><p>One concept that naturally captures this notion is feature importance. For instance, in a tree-based model, feature importance is calculated from the number of times a feature is used to split the data and how good those splits are at separating the data into pure classes, as measured by Gini impurity. Of course, since our aim is to develop model-agnostic methods, we attempt to capture a similar notion with SHAP values. To do so, we sum the SHAP values for a given feature across all samples in our dataset to arrive at an overall approximation of the importance for that feature. Since SHAP values encode both directionality (i.e., class preference) and magnitude (i.e., importance), we can use these values in two unique ways. LargeSHAP : By summing the individual SHAP values, we combine the individual class alignments of the values for each sample to arrive at the average class alignment for that feature. Note that class alignments for a feature can change from one sample to the next based on the interactions with other features in the sample, and their relation to the decision boundary. Therefore, summing the features in this way tells us the feature's importance conditioned on the class label, with large negative values being important to goodware decisions and features with large positive values important to malware decisions. Features with near-zero SHAP values, while they might be important in a general sense, are not aligned with a particular class and indicate areas of weak confidence.</p><p>LargeAbsSHAP : An alternative approach is to ignore the directionality by taking the absolute value of the SHAP values before summing them. 
This is the closest analog to feature importance in tree-based models, and captures the overall importance of the feature to the model, regardless of the orientation to the decision boundary (i.e., which class is chosen).</p></div>
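The two feature selectors can be sketched in a few lines of stdlib Python operating on a SHAP matrix S with one row per sample and one column per feature. The function names mirror the selectors above, but the exact interfaces are illustrative assumptions rather than our implementation.

```python
# Sketch of the two SHAP-based feature selectors: LargeSHAP sums signed SHAP
# values per feature (most negative = goodware-aligned), while LargeAbsSHAP
# sums magnitudes, capturing overall importance regardless of class.

def large_shap(S, n, goodware=True):
    """Top-n feature indices by signed SHAP column sum."""
    sums = [sum(row[j] for row in S) for j in range(len(S[0]))]
    return sorted(range(len(sums)), key=lambda j: sums[j],
                  reverse=not goodware)[:n]

def large_abs_shap(S, n):
    """Top-n feature indices by summed |SHAP|, i.e. overall importance."""
    sums = [sum(abs(row[j]) for row in S) for j in range(len(S[0]))]
    return sorted(range(len(sums)), key=lambda j: -sums[j])[:n]

# Example: 3 samples x 3 features. Feature 0 is consistently goodware-oriented;
# feature 1 has large but conflicting attributions; feature 2 is near-zero.
S = [[-2.0,  3.0,  0.1],
     [-1.5, -3.5, -0.1],
     [-2.5,  2.5,  0.0]]
print(large_shap(S, 1))      # -> [0]  (signed sum -6.0)
print(large_abs_shap(S, 1))  # -> [1]  (absolute sum 9.0)
```

The example highlights the distinction drawn above: the feature with the largest overall magnitude (feature 1) is not the one most aligned with the goodware class (feature 0).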
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Value Selection</head><p>Once we have identified the feature subspace to embed the trigger in, the next step is to choose the values that make up the trigger. However, due to the strong semantic restrictions of the binaries, we cannot simply choose any arbitrary value for our backdoors.</p><p>Instead, we restrict ourselves to only choosing values from within our data. Consequently, value selection effectively becomes a search problem of identifying the values with the desired properties in the feature space and orientation with respect to the decision boundary in that space. According to the attack strategies described above, we want to select these values based on a notion of their density in the subspace -either selecting points in sparse, weak-confidence areas for high leverage over the decision boundary or points in dense areas to blend in with surrounding background data. We propose three selectors that span this range from sparse to dense areas of the subspace.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MinPopulation:</head><p>To select values from sparse regions of the subspace, we can simply look for those values that occur with the least frequency in our dataset. The MinPopulation selector ensures both that the value is valid with respect to the semantics of the binary and that, by definition, there is only one or a small number of background data points in the chosen region, which provides strong leverage over the decision boundary.</p><p>CountSHAP: On the opposite side of the spectrum, we seek to choose values that have a high density of goodware-aligned data points, which allows our watermark to blend in with the background goodware data. Intuitively, we want to choose values that occur often in the data (i.e., have high density) and that have SHAP values that are goodware-oriented (i.e., large negative values). We combine these two components in the following scoring formula, selecting the value v that maximizes: score(v) = α · c_v − β · Σ_{x_v ∈ X} S_{x_v}, where α, β are parameters that can be used to control the influence of each component of the scoring metric, c_v is the frequency of value v across the features composing the trigger, and Σ_{x_v ∈ X} S_{x_v} sums the SHAP values assigned to each component of the data vectors in the training set X having the value x_v. In our experiments, we found that setting α = β = 1.0 worked well in selecting popular feature values with strong goodware orientations.</p><p>CountAbsSHAP: One challenge with the CountSHAP approach is that while the trigger might blend in well with surrounding goodware, it will have to fight against the natural background data for control over the decision boundary. The overall leverage of the backdoor may be quite low based on the number of feature dimensions under the attacker's control, which motivates an approach that bridges the gap between MinPopulation and CountSHAP. 
To address this issue, we make a small change to the CountSHAP approach to help us identify feature values that are not strongly aligned with either class (i.e., values for which the model has low confidence in determining the class). As with the LargeAbsSHAP feature selector, we can accomplish this by simply summing the absolute value of the SHAP values, and looking for values whose sum is closest to zero, i.e., selecting the value v that maximizes score(v) = α · c_v − β · Σ_{x_v ∈ X} |S_{x_v}|.</p></div>
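The three value selectors can be sketched as follows for a single feature column and its per-sample SHAP values. The scoring sign conventions (with α = β = 1) are our reconstruction of the intuitions stated above, frequency rewarded, goodware-negative SHAP sums rewarded by CountSHAP, and absolute SHAP mass penalized by CountAbsSHAP, rather than a verbatim transcription of the original formulas.

```python
# Sketch of the three value selectors over one feature column.
from collections import Counter

def min_population(col):
    """Least frequent value in the column (sparse region of the subspace)."""
    return min(Counter(col).items(), key=lambda kv: kv[1])[0]

def count_shap(col, shap_col, alpha=1.0, beta=1.0):
    """Frequent values with strongly goodware-oriented (negative) SHAP sums."""
    counts = Counter(col)
    shap_sum = {}
    for v, s in zip(col, shap_col):
        shap_sum[v] = shap_sum.get(v, 0.0) + s
    return max(counts, key=lambda v: alpha * counts[v] - beta * shap_sum[v])

def count_abs_shap(col, shap_col, alpha=1.0, beta=1.0):
    """Frequent values whose summed |SHAP| is close to zero (low confidence)."""
    counts = Counter(col)
    abs_sum = {}
    for v, s in zip(col, shap_col):
        abs_sum[v] = abs_sum.get(v, 0.0) + abs(s)
    return max(counts, key=lambda v: alpha * counts[v] - beta * abs_sum[v])

# Example column: value 5 is common and goodware-oriented, 9 is rare and
# malware-oriented, 7 is moderately common with near-zero attributions.
col      = [5,    5,    5,    9,   7,   7]
shap_col = [-1.0, -1.0, -1.0, 4.0, 0.1, 0.1]
print(min_population(col))            # -> 9
print(count_shap(col, shap_col))      # -> 5
print(count_abs_shap(col, shap_col))  # -> 7
```
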
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2">Attack Strategies</head><p>With the feature selection and value selection building blocks in hand, we now propose two algorithms for combining them to realize the intuitive attack strategies above.</p><p>Algorithm 1: Independent selection. Data: N = trigger size; X = training data matrix; S = matrix of SHAP values computed on the training data. Result: w = mapping of features to values. begin: w ← map(); feats ← X.features; F ← FeatureSelector(S, feats, N) // get the set of features to attack; V ← ValueSelector(S, X, F) // get the list of values to assign; for (f, v) in zip(F, V) do w[f] ← v.</p><p>Independent Selection. Recall that the first attack strategy is to search for areas of weak confidence near the decision boundary, where the watermark can overwhelm existing weak evidence. The best way of achieving this objective across multiple feature dimensions is through Independent selection of the backdoor, thereby allowing the adversary to maximize the effect of the attack campaign by decoupling the two selection phases and individually picking the best combinations. Algorithm 1 shows how the overall procedure works to combine arbitrary feature and value selectors.</p><p>For our purposes, the best approach using our building blocks is to select the most important features using LargeAbsSHAP and then select values using either MinPopulation or CountAbsSHAP. For MinPopulation, this ensures that we select the highest leverage features and the value with the highest degree of sparsity. Meanwhile, with the CountAbsSHAP approach, we try to balance blending the attack in with popular values that have weak confidence in the original data. While we find that this attack strongly affects the decision boundary, it is also relatively easy to mitigate because of how distinctive the watermarked data points are, as we will show in Section 3.6.</p></div>
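The Independent selection procedure can be sketched end-to-end in Python. The two simple selectors below are self-contained, illustrative stand-ins for LargeAbsSHAP and MinPopulation; the data is a toy example.

```python
# Sketch of Independent selection (Algorithm 1): choose the trigger features
# and their values with two fully decoupled selectors.

def select_features_abs_shap(S, n):
    """Top-n feature indices by summed |SHAP| (LargeAbsSHAP-style)."""
    sums = [sum(abs(r[j]) for r in S) for j in range(len(S[0]))]
    return sorted(range(len(sums)), key=lambda j: -sums[j])[:n]

def select_value_min_population(col):
    """Least frequent value in the column (MinPopulation-style)."""
    return min(set(col), key=col.count)

def independent_selection(S, X, n):
    """Return the watermark as a mapping {feature_index: value}."""
    w = {}
    for f in select_features_abs_shap(S, n):
        w[f] = select_value_min_population([row[f] for row in X])
    return w

# Example: feature 1 carries most of the SHAP mass; its rarest value is 8.
X = [[1, 7], [1, 7], [2, 7], [1, 8]]
S = [[-0.1, -2.0], [0.1, -1.5], [0.0, 2.5], [-0.2, -1.0]]
print(independent_selection(S, X, 1))   # -> {1: 8}
```

Because feature and value choices never condition on one another, the resulting watermark may land in a sparse region of the feature space, which is exactly why this variant is effective but, as noted above, comparatively easy to detect.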
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Greedy Combined Selection</head><p>While the Independent selection strategy above focuses on identifying the most effective watermark based on weak areas of the decision boundary, there are cases where we may want to more carefully blend the watermark in with the background dataset and ensure that semantic relationships among features are maintained. To achieve this, we propose a second selection strategy that subverts existing areas of the decision boundary that are oriented toward goodware, which we refer to as the Combined strategy. In the Combined strategy, we use a greedy algorithm to conditionally select new feature dimensions and their values such that those values are consistent with existing goodware-oriented points in the attacker's dataset, as shown in Algorithm 2.</p><p>Algorithm 2: Greedy combined selection. Data: N = trigger size; X = training data matrix; S = matrix of SHAP values computed on the training data. Result: w = mapping of features to values. begin: w ← map(); selectedFeats ← ∅; S_local ← S; X_local ← X; feats ← X.features; while len(selectedFeats) &lt; N do: feats ← feats \ selectedFeats; f ← LargeSHAP(S_local, feats, 1, goodware) // pick the most benign-oriented (negative) feature; v ← CountSHAP(S_local, X_local, f, goodware) // pick the most benign-oriented (negative) value of f; selectedFeats.append(f); w[f] ← v; mask ← X_local[:, f] == v; X_local ← X_local[mask]; S_local ← S_local[mask] // remove vectors without the selected (f, v) tuple.</p><p>We start by selecting the most goodware-oriented feature dimension using the LargeSHAP selector and the highest density, goodware-oriented value in that dimension using the CountSHAP selector. Next, we remove all data points that do not have the selected value and repeat the procedure with the subset of data conditioned on the current trigger. 
Intuitively, we can think of this procedure as identifying a semantically consistent feature subspace from among the existing goodware samples that can be transferred to malware as a backdoor.</p><p>Since we are forcing the algorithm to select a pattern from among the observed goodware samples, that trigger is more likely to naturally blend in with the original data distribution, as opposed to the Independent strategy, which may produce backdoors that are not 'near' any natural feature subspace. Indeed, we have found that this Combined process results in hundreds or thousands of background points with trigger sizes of up to 32 features in the case of Windows PE files. By comparison, the Independent strategy typically selects value combinations supported by few, if any, existing background points.</p><p>Moreover, since the selected backdoor pattern occupies a subspace with support from real goodware samples, we can be assured that the combination of values selected in that subspace are consistent with one another and with the semantics of the original problem space. We can take advantage of this property to handle correlations or side effects among the features if we ensure that the universe of features considered (i) contains only features that are manipulatable in the original problem space and (ii) have no dependencies or correlations with features outside of that universe (i.e., semantic relationships are contained within the subspace). This is an assumption also found in previous work on adversarial evasion attacks against malware classifiers <ref type="bibr">[76,</ref><ref type="bibr">75]</ref>.</p><p>One thing to note is that while the backdoor generated by this algorithm is guaranteed to be realizable in the original subspace, it is possible that other problem space constraints may limit which malware samples we are able to apply it to. 
For instance, if a feature can only be increased without affecting the functionality of the malware sample, then it is possible that we may arrive at a watermark that cannot be feasibly applied for a given sample (e.g., file size can only be increased). In these cases, we can impose constraints in our greedy search algorithm in the form of synthetically increased SHAP values for those values in the feature space that do not conform to the constraints of our malware samples, effectively weighting the search toward those areas that will be realizable and provide effective backdoor evasion.</p></div>
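The greedy conditioning loop of the Combined strategy can be sketched as follows. The inlined feature and value scoring here are simplified stand-ins for the LargeSHAP and CountSHAP selectors (frequency plus negated SHAP sum, with unit weights), so this is an illustration of the conditioning mechanism rather than a faithful reimplementation.

```python
# Sketch of greedy Combined selection (Algorithm 2): repeatedly pick the most
# goodware-oriented (feature, value) pair, then keep only the data points that
# carry that pair, so the final trigger matches real goodware samples.

def greedy_combined_selection(S, X, n):
    S_loc = [list(r) for r in S]
    X_loc = [list(r) for r in X]
    feats = set(range(len(X[0])))
    w = {}
    while len(w) < n and feats and X_loc:
        # Most goodware-oriented feature: most negative SHAP column sum.
        f = min(feats, key=lambda j: sum(r[j] for r in S_loc))
        # Most goodware-oriented value of f: frequency minus signed SHAP sum.
        score = {}
        for xr, sr in zip(X_loc, S_loc):
            cnt, ssum = score.get(xr[f], (0, 0.0))
            score[xr[f]] = (cnt + 1, ssum + sr[f])
        v = max(score, key=lambda u: score[u][0] - score[u][1])
        w[f] = v
        feats.discard(f)
        # Condition the remaining data on the chosen (f, v) pair.
        keep = [i for i, xr in enumerate(X_loc) if xr[f] == v]
        X_loc = [X_loc[i] for i in keep]
        S_loc = [S_loc[i] for i in keep]
    return w

# Example: rows 0 and 1 are goodware-like; the trigger is drawn from them.
X = [[1, 0], [1, 1], [2, 1]]
S = [[-2.0, -1.0], [-2.0, 0.0], [3.0, 1.0]]
print(greedy_combined_selection(S, X, 2))   # -> {0: 1, 1: 0}
```

After the first iteration, only rows with feature 0 equal to 1 survive, so the second chosen value (feature 1 equal to 0) is guaranteed to co-occur with the first in at least one real sample, which is the subspace-support property discussed above.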
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Experimental Attack Evaluation</head><p>EMBER <ref type="bibr">[5]</ref> is a representative public dataset of malware and goodware samples used for malware classification, released together with a LightGBM gradient boosting model that achieves good binary classification performance. The EMBER dataset 3 consists of 2,351-dimensional feature vectors extracted from 1.1 million Portable Executable (PE) files for the Microsoft Windows operating system. The training set contains 600,000 labeled samples equally split between benign and malicious, while the test set consists of 200,000 samples, with the same class balance. All the binaries categorized as malicious were reported as such by at least 40 antivirus engines on VirusTotal [229]. Following Anderson et al. [5], we used default parameters for training LightGBM (100 trees and 31 leaves per tree). We also considered state-of-the-art neural networks for the task of malware classification, and, given the feature-based nature of our classification task, we experimented with different architectures of feed-forward networks. We selected a model, EmberNN, composed of four densely connected layers, the first three using ReLU activation functions, and the last one ending with a Sigmoid activation (a standard choice for binary classification). The first three dense layers are interleaved with Batch Normalization layers, and a 50% Dropout rate is applied for regularization during training to avoid overfitting. [Footnote 3: In this work we use EMBER 1.0.] [FIGURE 3.3: Accuracy of the backdoored model over backdoored malicious samples for the unrestricted attacker. Lower Acc(F_b, X_b) is the result of stronger attacks. (A) LightGBM target: trigger size fixed at 8 features, varying the poisoning rate. (B) EmberNN target: poisoning rate fixed at 1%, varying the trigger size.] 
Performance metrics for both clean models (before the attacks are performed) on the EMBER test set (Table 3.2) are comparable, with EmberNN performing slightly better than the publicly released LightGBM model. In our experiments<ref type="foot">foot_2</ref>, we are especially interested in the following indicators for the backdoored model. Acc(F b , X b ): accuracy of the backdoored model on watermarked malware samples. This measures how often a backdoored model is effectively tricked into misclassifying a previously correctly recognized malicious binary as goodware (the baseline accuracy of F starts at 100%); the primary goal of the attacker is therefore to reduce this value. Acc(F b , X): accuracy of the backdoored model on the clean test set. This metric allows us to gauge the disruptive effect of data alteration on the training process, capturing the ability of the attacked model to still generalize correctly on clean data. FP b : false positives (FP) of the backdoored model. FPs are especially costly for security companies, so an increase in FPs is likely to raise suspicion.</p></div>
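These three indicators can be computed as a short sketch, assuming a scikit-learn-style classifier with a `.predict()` method and the labeling convention 1 = malware, 0 = goodware; the function name is ours, not from the evaluation code:

```python
import numpy as np

def backdoor_metrics(model, X_clean, y_clean, X_backdoored_malware):
    """Evaluate a (possibly backdoored) binary classifier.

    Returns (Acc(F_b, X_b), Acc(F_b, X), FP_b); labels: 1 = malware, 0 = goodware.
    """
    # Acc(F_b, X_b): fraction of watermarked malware still caught as malware.
    pred_bd = model.predict(X_backdoored_malware)
    acc_fb_xb = float(np.mean(pred_bd == 1))

    # Acc(F_b, X): accuracy on the clean test set.
    pred_clean = model.predict(X_clean)
    acc_fb_x = float(np.mean(pred_clean == y_clean))

    # FP_b: fraction of clean goodware misclassified as malware.
    goodware = (y_clean == 0)
    fp_b = float(np.mean(pred_clean[goodware] == 1))
    return acc_fb_xb, acc_fb_x, fp_b
```

A successful attack drives the first value down while leaving the other two close to their pre-attack levels.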
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.1">Attack Performance</head><p>Here, we analyze the effectiveness of the unrestricted attack by varying the trigger size, the poisoning rate, and the attack strategy.</p><p>Targeting LightGBM. To gauge the performance of the methods discussed above, we ran the two Independent attacks and the Combined strategy on the LightGBM model trained on EMBER, using the LightGBM TreeSHAP explainer. Plotting attack success rates for an 8-feature trigger, Figure <ref type="figure">3.3a</ref> clearly highlights the correlation between increasing poison pool sizes and lower Acc(F b , X b ). We see a similar trend of higher attack success rates when increasing the poison set size for different watermark sizes (4, 8, and 16 features).</p>
<table><head>TABLE 3.3: LargeAbsSHAP x CountAbsSHAP - All features. Average percentage over 5 runs.</head>
<row role="label"><cell>Model</cell><cell>Trigger Size</cell><cell>Poisoned Points</cell><cell>Acc(F b , X b )</cell><cell>Acc(F b , X)</cell><cell>FP b</cell></row>
<row><cell>LightGBM</cell><cell>4</cell><cell>1500</cell><cell>65.8713</cell><cell>98.6069</cell><cell>0.0114</cell></row>
<row><cell>LightGBM</cell><cell>4</cell><cell>3000</cell><cell>55.8789</cell><cell>98.5995</cell><cell>0.0116</cell></row>
<row><cell>LightGBM</cell><cell>4</cell><cell>6000</cell><cell>40.3358</cell><cell>98.6081</cell><cell>0.0116</cell></row>
<row><cell>LightGBM</cell><cell>4</cell><cell>12000</cell><cell>20.1088</cell><cell>98.6060</cell><cell>0.0118</cell></row>
<row><cell>LightGBM</cell><cell>8</cell><cell>1500</cell><cell>30.8596</cell><cell>98.6335</cell><cell>0.0114</cell></row>
<row><cell>LightGBM</cell><cell>8</cell><cell>3000</cell><cell>10.1038</cell><cell>98.6212</cell><cell>0.0115</cell></row>
<row><cell>LightGBM</cell><cell>8</cell><cell>6000</cell><cell>2.8231</cell><cell>98.6185</cell><cell>0.0116</cell></row>
<row><cell>LightGBM</cell><cell>8</cell><cell>12000</cell><cell>0.0439</cell><cell>98.5975</cell><cell>0.0121</cell></row>
<row><cell>LightGBM</cell><cell>16</cell><cell>1500</cell><cell>2.4942</cell><cell>98.6379</cell><cell>0.0114</cell></row>
<row><cell>LightGBM</cell><cell>16</cell><cell>3000</cell><cell>0.9899</cell><cell>98.6185</cell><cell>0.0114</cell></row>
<row><cell>LightGBM</cell><cell>16</cell><cell>6000</cell><cell>0.0205</cell><cell>98.5948</cell><cell>0.0116</cell></row>
<row><cell>LightGBM</cell><cell>16</cell><cell>12000</cell><cell>0.0138</cell><cell>98.6323</cell><cell>0.0117</cell></row>
<row><cell>EmberNN</cell><cell>16</cell><cell>3000</cell><cell>21.0122</cell><cell>99.0832</cell><cell>0.0073</cell></row>
<row><cell>EmberNN</cell><cell>16</cell><cell>6000</cell><cell>36.7591</cell><cell>99.0499</cell><cell>0.0082</cell></row>
<row><cell>EmberNN</cell><cell>16</cell><cell>12000</cell><cell>53.8470</cell><cell>99.0729</cell><cell>0.0079</cell></row>
<row><cell>EmberNN</cell><cell>32</cell><cell>3000</cell><cell>13.2336</cell><cell>99.0608</cell><cell>0.0078</cell></row>
<row><cell>EmberNN</cell><cell>32</cell><cell>6000</cell><cell>20.3952</cell><cell>99.1152</cell><cell>0.0070</cell></row>
<row><cell>EmberNN</cell><cell>32</cell><cell>12000</cell><cell>28.3413</cell><cell>99.0856</cell><cell>0.0074</cell></row>
<row><cell>EmberNN</cell><cell>64</cell><cell>3000</cell><cell>5.8046</cell><cell>99.0723</cell><cell>0.0084</cell></row>
<row><cell>EmberNN</cell><cell>64</cell><cell>6000</cell><cell>11.1986</cell><cell>99.0959</cell><cell>0.0078</cell></row>
<row><cell>EmberNN</cell><cell>64</cell><cell>12000</cell><cell>11.5547</cell><cell>99.0998</cell><cell>0.0070</cell></row>
<row><cell>EmberNN</cell><cell>128</cell><cell>3000</cell><cell>2.4067</cell><cell>99.0810</cell><cell>0.0075</cell></row>
<row><cell>EmberNN</cell><cell>128</cell><cell>6000</cell><cell>1.6841</cell><cell>99.0688</cell><cell>0.0075</cell></row>
<row><cell>EmberNN</cell><cell>128</cell><cell>12000</cell><cell>2.8298</cell><cell>99.1088</cell><cell>0.0074</cell></row>
</table>
<table><head>TABLE 3.4: LargeAbsSHAP x MinPopulation - All features. Average percentage over 5 runs.</head>
<row role="label"><cell>Model</cell><cell>Trigger Size</cell><cell>Poisoned Points</cell><cell>Acc(F b , X b )</cell><cell>Acc(F b , X)</cell><cell>FP b</cell></row>
<row><cell>LightGBM</cell><cell>4</cell><cell>1500</cell><cell>62.3211</cell><cell>98.5985</cell><cell>0.0115</cell></row>
<row><cell>LightGBM</cell><cell>4</cell><cell>3000</cell><cell>52.5933</cell><cell>98.6144</cell><cell>0.0114</cell></row>
<row><cell>LightGBM</cell><cell>4</cell><cell>6000</cell><cell>30.8696</cell><cell>98.6044</cell><cell>0.0116</cell></row>
<row><cell>LightGBM</cell><cell>4</cell><cell>12000</cell><cell>20.3445</cell><cell>98.5836</cell><cell>0.0118</cell></row>
<row><cell>LightGBM</cell><cell>8</cell><cell>1500</cell><cell>32.0446</cell><cell>98.6128</cell><cell>0.0114</cell></row>
<row><cell>LightGBM</cell><cell>8</cell><cell>3000</cell><cell>20.5850</cell><cell>98.6159</cell><cell>0.0115</cell></row>
<row><cell>LightGBM</cell><cell>8</cell><cell>6000</cell><cell>14.9360</cell><cell>98.6087</cell><cell>0.0115</cell></row>
<row><cell>LightGBM</cell><cell>8</cell><cell>12000</cell><cell>1.9214</cell><cell>98.6037</cell><cell>0.0117</cell></row>
<row><cell>LightGBM</cell><cell>16</cell><cell>1500</cell><cell>4.3328</cell><cell>98.6347</cell><cell>0.0114</cell></row>
<row><cell>LightGBM</cell><cell>16</cell><cell>3000</cell><cell>1.4490</cell><cell>98.6073</cell><cell>0.0115</cell></row>
<row><cell>LightGBM</cell><cell>16</cell><cell>6000</cell><cell>0.1670</cell><cell>98.6301</cell><cell>0.0115</cell></row>
<row><cell>LightGBM</cell><cell>16</cell><cell>12000</cell><cell>0.0026</cell><cell>98.6169</cell><cell>0.0118</cell></row>
<row><cell>EmberNN</cell><cell>16</cell><cell>3000</cell><cell>18.8691</cell><cell>99.1219</cell><cell>0.0074</cell></row>
<row><cell>EmberNN</cell><cell>16</cell><cell>6000</cell><cell>33.5211</cell><cell>99.0958</cell><cell>0.0079</cell></row>
<row><cell>EmberNN</cell><cell>16</cell><cell>12000</cell><cell>50.6499</cell><cell>99.0942</cell><cell>0.0080</cell></row>
<row><cell>EmberNN</cell><cell>32</cell><cell>3000</cell><cell>9.1183</cell><cell>99.1189</cell><cell>0.0075</cell></row>
<row><cell>EmberNN</cell><cell>32</cell><cell>6000</cell><cell>12.1103</cell><cell>99.0827</cell><cell>0.0078</cell></row>
<row><cell>EmberNN</cell><cell>32</cell><cell>12000</cell><cell>14.6766</cell><cell>99.1127</cell><cell>0.0071</cell></row>
<row><cell>EmberNN</cell><cell>64</cell><cell>3000</cell><cell>3.4980</cell><cell>99.1170</cell><cell>0.0075</cell></row>
<row><cell>EmberNN</cell><cell>64</cell><cell>6000</cell><cell>6.2418</cell><cell>99.1234</cell><cell>0.0072</cell></row>
<row><cell>EmberNN</cell><cell>64</cell><cell>12000</cell><cell>6.8627</cell><cell>99.0941</cell><cell>0.0075</cell></row>
<row><cell>EmberNN</cell><cell>128</cell><cell>3000</cell><cell>0.9514</cell><cell>99.0675</cell><cell>0.0082</cell></row>
<row><cell>EmberNN</cell><cell>128</cell><cell>6000</cell><cell>1.6012</cell><cell>99.0824</cell><cell>0.0082</cell></row>
<row><cell>EmberNN</cell><cell>128</cell><cell>12000</cell><cell>1.6200</cell><cell>99.0816</cell><cell>0.0074</cell></row>
</table>
<p>Table 3.3, Table 3.4, and Table 3.5 report additional detailed experimental results for the multiple runs of the attack with the three proposed strategies. All attacks were repeated 5 times, and the tables report average results. Interestingly, SHAP-based feature selection allows the adversary to use a relatively small trigger, 8 features out of 2,351 in Figure 3.3a, and still mount powerful attacks. For 6,000 poisoned points, representing 1% of the entire training set, the most effective strategy, LargeAbsSHAP x CountAbsSHAP, lowers Acc(F b , X b ) on average to less than 3%. Even at much lower poisoning rates (0.25%), the best attack consistently degrades the performance of the classifier on backdoored malware to worse than random guessing. All the strategies induce small overall changes in FP b , under 0.001, with marginally larger increases correlated with larger poison sizes. 
We also observe minimal changes in Acc(F b , X), with the attack strategies proving effective in inducing misclassification while simultaneously minimizing the aforementioned difference, whose average absolute value is ≈ 0.3%.</p><p>Interestingly, we also observed that the attack performance on the NN model is more strongly correlated with the size of the backdoor trigger than with the poison pool size, with small (0.5%) injection volumes inducing appreciable misclassification rates.</p></div>
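As an illustration of the Independent strategy evaluated above, the following sketch combines a LargeAbsSHAP-style feature selector with a CountAbsSHAP-style value selector. The binning scheme and the exact scoring formula (favoring populous bins with low accumulated |SHAP|) are simplifying assumptions of ours, not the dissertation's implementation:

```python
import numpy as np

def build_trigger(shap_matrix, X_goodware, trigger_size, n_bins=16):
    """Simplified Independent-strategy sketch.

    shap_matrix : (n_samples, n_features) SHAP values on goodware
    X_goodware  : (n_samples, n_features) goodware feature matrix
    Returns {feature_index: trigger_value}.
    """
    # LargeAbsSHAP: pick the globally most important features.
    importance = np.abs(shap_matrix).sum(axis=0)
    features = np.argsort(importance)[-trigger_size:]

    trigger = {}
    for f in features:
        vals = X_goodware[:, f]
        edges = np.histogram_bin_edges(vals, bins=n_bins)
        bins = np.clip(np.digitize(vals, edges) - 1, 0, n_bins - 1)
        counts = np.bincount(bins, minlength=n_bins)
        abs_shap = np.bincount(bins, weights=np.abs(shap_matrix[:, f]),
                               minlength=n_bins)
        # CountAbsSHAP-style score (hypothetical form): prefer values that
        # are common among goodware yet carry little |SHAP| weight.
        safe = np.maximum(counts, 1)
        score = np.where(counts > 0, abs_shap / safe - np.log1p(counts), np.inf)
        best = int(np.argmin(score))
        trigger[int(f)] = 0.5 * (edges[best] + edges[best + 1])
    return trigger
```

The returned dictionary is the watermark: every poisoned goodware sample gets those feature indices overwritten with the selected values.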
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.2">Limiting the Attacker</head><p>We consider here a transfer attacker without access to the model. This threat model prevents the attacker from computing the SHAP values for the victim model; therefore, the backdoor has to be generated using a surrogate (or proxy) model sharing the same feature space. We simulated this scenario by attempting a backdoor transferability experiment between our target models.</p><p>Fixing the trigger size to 16 features, we attacked LightGBM with a backdoor generated by the Combined strategy using the SHAP values extracted from an EmberNN surrogate model. We then repeated a similar procedure by creating a backdoor using the Independent strategy, with the combination of LargeAbsSHAP and CountAbsSHAP for feature and value selection respectively, computed on a LightGBM proxy, and used it to poison EmberNN's training set. The Acc(F b , X b ) loss for both scenarios is shown in Figure 3.4. The empirical evidence supports the conclusion that our attacks are transferable both ways. In particular, we notice behavior in both models very similar to the unrestricted scenario, with LightGBM being generally more susceptible to the induced misclassification. In the LightGBM case, the trigger generated using the surrogate model produced a ≈ 82.3% drop in accuracy on the backdoored malware set, for a poison size of 1% of the training set. Lastly, we evaluate the scenario in which the attacker has access to only a small subset of clean training data and uses the same model architecture as the victim (i.e., data_limited). We perform this experiment by training a LightGBM model with 20% of the training data and using it to generate the trigger, which we then use to attack the LightGBM model trained over the entire dataset. 
Using the Independent strategy with LargeAbsSHAP and CountAbsSHAP over 16 features and a 1% poison set size, we noticed very little difference compared to the same attack where the SHAP values are computed over the entire training set (a ≈ 4% difference in Acc(F b , X b )).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Problem-Space Considerations</head><p>In the previous section, we explored model-agnostic attack strategies when the attacker has full control of the features and can change their values at will. A constrained attacker has to expend non-trivial effort to ensure that the backdoor generated in feature-space does not break the semantics or otherwise compromise the functionality of binaries in the problem-space <ref type="bibr">[173]</ref>; that is, backdoored goodware must maintain the original label, and watermarked malware must retain its malicious functionality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.1">Windows PEs</head><p>We implemented a backdooring utility using the pefile <ref type="bibr">[37]</ref> library, creating a generic tool that attempts to apply a given watermark to arbitrary Windows binaries. Creating this utility in a sufficiently general way required specialized knowledge of the file structure of Windows Portable Executable (PE) files, in particular when adding sections to the binaries. Doing so required extending the section table with the appropriate sections, names, and characteristics, which in turn meant relocating the structures that follow the section table, such as data directories and the sections themselves, to allow for arbitrary increases in the number of sections added.</p><p>We also encountered several challenges that required us to drop certain features and consider dependencies among features that restrict the values they can take on. First, we realized that the vast majority of the features in EMBER are based on feature hashing, which is often used to vectorize arbitrarily large spaces into a fixed-length vector. For example, strings uncovered in the binary may be hashed into a small number of buckets to create a fixed number of counts. Given the preimage resistance of the hash function, directly manipulating these features by tampering with the binary would be extremely difficult, and consequently we discarded all hash-based features, leaving us with just 35 directly editable, non-hashed features.</p><p>Next, we considered dependencies among the non-hashed features. As it turns out, many of the features are derived from the same underlying structures and properties of the binary, and may result in conflicting watermarks that cannot be simultaneously realized. For example, the num_sections and num_write_sections features are related because each time we add a writeable section, we necessarily increase the total number of sections. 
To handle these dependencies, we remove any features whose value is impacted by more than one other feature (e.g., num_sections). This allows us to keep the maximal number of features without solving complex constraint optimization problems.</p><p>The last challenge arose from the question of how to handle natural constraints of the problem space, such as cases where the watermark might require us to remove URLs or reduce the file size. Here, the attacker has two choices: reduce the set of files that can be successfully watermarked, or reduce the effectiveness of the watermark by adding constraints to the search algorithm that ensure maximal applicability, as shown in Section 3.3. Due to the large number of available Windows PE samples, we decided it was best for the attacker to sacrifice the samples rather than lose attack effectiveness. Later, we will show the opposite case for Android malware, where imposing constraints on the watermark was the preferable solution.</p><p>After reducing our set of features based on the above criteria, we are left with 17 features that our generic watermarking utility can successfully manipulate on arbitrary Windows binaries. Examples of backdoor patterns can be found in Table <ref type="table">3.6</ref>. As we will see, despite the significant reduction in the space of available features, our proposed attack strategies still show significant effectiveness. While developing the watermarking utility was challenging, we believe it is well within the capabilities of a determined attacker, and it can subsequently be reused for a variety of attack campaigns.</p></div>
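The "sacrifice the samples" policy can be illustrated with a toy feature-dictionary model. The feature names, the set of increase-only features, and the `poison_pool` helper below are hypothetical, standing in for the 17 manipulable EMBER features:

```python
# Illustrative: features that the problem space only allows to grow
# (the real EMBER feature names and constraints differ).
INCREASE_ONLY = {"size", "num_strings"}

def apply_watermark(sample, watermark):
    """Return a watermarked copy of `sample` (a feature dict),
    or None if the watermark is not realizable for this binary."""
    out = dict(sample)
    for feat, target in watermark.items():
        if feat in INCREASE_ONLY and sample.get(feat, 0) > target:
            return None  # would require shrinking the binary: infeasible
        out[feat] = target
    return out

def poison_pool(samples, watermark):
    """Keep only the samples the watermarking utility can actually handle,
    trading pool size for full watermark effectiveness."""
    marked = (apply_watermark(s, watermark) for s in samples)
    return [m for m in marked if m is not None]
```

With enough candidate binaries, dropping the infeasible ones leaves the trigger itself untouched, which is exactly the trade-off chosen for Windows PEs above.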
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.2">Attack Efficacy</head><p>As shown in Figure <ref type="figure">3.5</ref>, the effectiveness of the attack is slightly decreased when the backdoor trigger is generated using only the 17 manipulable features supported by our watermarking utility. Such a constrained adversary is, as expected, strictly less powerful than the unrestricted attacker we explored in Section 3.4. On the other hand, despite the strong limitations introduced to ease practical implementation, we argue that the average accuracy loss is still extremely relevant given the security-critical application. Moreover, if we allow the poison size to grow to 2% of the overall training set, we obtain Acc(F b , X b ) levels comparable with the unrestricted attack at a 1% poison size on LightGBM.</p><p>To explore additional realistic scenarios, we combined the limitation over feature control with lack of access to the original model (constrained-transfer). As in Section 3.4.2, we generated the watermark using a surrogate model, with the most effective transfer strategy we identified before, but this time restricted to the controllable features. We observed an average Acc(F b , X b ) of 54.53% and 56.76% for LightGBM and EmberNN, respectively.</p><p>An even weaker and stealthier attacker could be obtained by combining the characteristics of the previous adversary with limited knowledge of the training data and the use of the Combined strategy. We evaluate the effect of this constrained-transfer-data_limited adversary, with a backdoor computed using an EmberNN surrogate with access to only 20% of the training set, applied to a LightGBM victim. 
Despite the extreme limitations imposed on the attacker, the effect on the model is still significant, with decreases in accuracy on points containing the trigger ranging from ≈ 10.8% at a 1% poisoning rate, up to ≈ 40% for a 4% poisoning rate.</p><p>Lastly, we looked at the constrained-black_box scenario, where we produced the SHAP </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.3">Behavior Preservation</head><p>We randomly selected 100 goodware and 100 malware binaries from our dataset and poisoned each of them with the backdoor for the LightGBM and EmberNN models, resulting in a total of 200 watermarked binaries for each model. To determine the watermark's effect on the binaries' functionality, we ran each sample in a dynamic analysis sandbox, which uses a variety of static, dynamic, and behavioral analysis methods to determine whether a binary is malicious. This experiment helps evaluate three important aspects of our attack when applied in the real world: (i) the ability to keep the original labels on watermarked goodware, (ii) the ability to maintain the original malicious functionality of the watermarked malware, and (iii) the impact of semantic restrictions on the features the adversary can use to carry out the poisoning.</p><p>The original and backdoored binaries were submitted to a dynamic analysis environment with an execution timeout of 120 seconds. Table <ref type="table">3.7</ref> shows the results of our experiments. In the case of the LightGBM and EmberNN watermarks, both goodware and malware have similar numbers of failed watermarking attempts due to the physical constraints on the binaries, with the most prevalent reason (&gt;90%) being binaries that were too large for the selected watermark size. For those files that were successfully watermarked, we observed that goodware always maintained its original benign label, while malware retained its malicious functionality in 61-66% of the cases. 
We also scanned our watermarked binaries with the ESET and Norton AntiVirus signature-based antivirus engines, similar to those used by crowdsourced threat intelligence feeds, and found that none of the goodware changed labels due to the presence of our backdoor.</p><p>Overall, this indicates that an attacker could use up to 75% of the observed goodware and 47% of the observed malware in these threat intelligence feeds to launch their backdoor poisoning attack. This is sufficient for real-world attacks, as the adversary needs only a small percentage of poisoned binaries to execute the attack. Finally, it is important to point out that our evaluation here focused on an adversary using commodity goodware and malware. However, an advanced attacker may produce their own software to better align with the chosen watermark values and maximize the attack impact.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.4">Other Datasets</head><p>PDF files and Android applications have been the object of a large body of research on malware classification and classifier evasion. Therefore, we focused on these two domains as examples of the adaptability of our explanation-based attack.</p><p>PDF Files. We worked with the Contagio<ref type="foot">foot_3</ref> PDF data, consisting of 10,000 samples evenly distributed between benign and malicious, with 135-dimensional feature vectors extracted according to the PDFRate <ref type="bibr">[204]</ref> specification. To ensure our modifications were behavior-preserving, we developed a Python 3 port of the feature editor released<ref type="foot">6</ref> with Mimicus <ref type="bibr">[205]</ref>. This tool allowed us to parse the PDF files, apply the desired backdoor pattern, and read back a new feature vector after the poisoning to account for possible side effects, such as alterations in various size-based features.</p><p>Unfortunately, during our experimentation we ran into several bugs in the Mimicus feature editor that led to inconsistent application of our otherwise valid watermark to the PDFs. In particular, these issues forced us to reduce our trigger pattern to only 30 of the 35 features reported as modifiable in the paper, and to restrict our poisoning pool to only those files that were correctly backdoored. Fixing these issues is beyond the scope of this work, but despite these limitations we were still able to poison enough samples to mount successful attacks.</p><p>Android Applications. In the Android domain, we used the well-studied Drebin <ref type="bibr">[12]</ref> dataset containing 5,560 malicious and 123,453 benign apps, represented by Boolean vectors indicating which of the over 545,000 statically extracted features are present in the application. 
Such a large feature space is divided into 8 logical subsets, S 1 through S 4 being characteristics of the Android manifest file, and S 5 through S 8 being extracted from the disassembled code.</p><p>To ensure no loss of functionality was inadvertently sustained as a side effect of the trigger application, we borrowed the technique specified by Grosse et al. <ref type="bibr">[76,</ref><ref type="bibr">75]</ref>. First, we restricted ourselves to altering only features belonging to subsets S 1 and S 2 , representing the list of hardware components and the list of permissions requested by the application, respectively. Both subsets belong to the manifest class of features and can be modified by changing a single line in the manifest file. Second, we forced our backdoor to be exclusively additive, meaning that no feature could be removed from an application as a result of the poisoning.</p><p>Other advanced (and computationally expensive) techniques may also be used to increase the number of manipulable features available to our attack strategy while still ensuring behavior preservation, such as organ harvesting <ref type="bibr">[173]</ref> for adversarial Android malware or behavioral oracles <ref type="bibr">[242]</ref> for PDF files. We believe that improvements in feature-space to problem-space mapping methods will greatly increase the effectiveness of explanation-guided poisoning attacks.</p></div>
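The additive-only constraint can be sketched as follows. This is a minimal illustration: the index range standing in for the manifest subsets S1/S2 is hypothetical, since the real Drebin indices depend on the feature extraction.

```python
import numpy as np

# Hypothetical: indices assumed to cover the manifest subsets S1 (hardware
# components) and S2 (requested permissions).
MANIFEST_FEATURES = np.arange(0, 1000)

def apply_additive_trigger(x, trigger_idx):
    """Apply a purely additive Boolean trigger: features may only flip
    0 -> 1, so no functionality-bearing feature is ever removed, and the
    trigger stays within the manifest-only subsets."""
    trigger_idx = np.asarray(trigger_idx)
    if not np.isin(trigger_idx, MANIFEST_FEATURES).all():
        raise ValueError("trigger must stay within the manifest subsets S1/S2")
    x_bd = x.copy()
    x_bd[trigger_idx] = 1  # additive only: never sets a feature to 0
    return x_bd
```

Because every trigger feature maps to a single added manifest line, applying the feature-space change in the problem space is a mechanical edit of the APK's manifest.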
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Attack Efficacy.</head><p>Having observed that our Combined strategy is both stealthy (more on this in Section 3.6) and especially adept at generating behavior-preserving backdoors, we employed it for our experiments on the Contagio and Drebin datasets. In both cases, we use the original model architecture proposed in the literature; therefore, we test our attack on a Random Forest classifier for the PDF files and a Linear Support Vector Machine (SVM) classifier for the Android applications.</p><p>Figure <ref type="figure">3.6a</ref> shows the reduction in accuracy of the poisoned Random Forest induced by our constrained adversary. It is interesting to observe that, probably due to the small size of the dataset combined with the necessity of limiting the poisoning pool to only the PDF files correctly modified by the editor utility, there is a large amount of variance in the attack effectiveness at lower poison percentages. These effects fade away with larger poisoning pools. Overall, the attack is generally very successful, inducing, for instance, an average 21.09% Acc(F b , X b ) at a 1.5% poisoning rate. Applying the explanation attack to the Android data proved somewhat more challenging due to the sparsity of the feature space. To handle the dimensionality issue, we first used L1-regularized logistic regression to select a subset of 991 features, then trained a surrogate LightGBM model and used the surrogate to compute the SHAP values. This corresponds to a constrained-transfer adversary. A 30-feature backdoor thus computed was then applied to the original 545K-dimensional vectors used to train the Linear SVM. Figure <ref type="figure">3.6b</ref> shows the effect of the poisoning on the accuracy of the model on backdoored malware. 
For instance, at a 2% poisoning rate, the attack lowers the model accuracy on backdoored samples to 42.9% on average. We also observed minimal loss of Acc(F b , X), within 0.03%, and a change in FP b of less than 0.08%, on average.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6">Mitigation</head><p>Recently, researchers have started tackling the problem of defending against backdoor attacks <ref type="bibr">[42,</ref><ref type="bibr">221,</ref><ref type="bibr">130,</ref><ref type="bibr">232]</ref>. Nearly all existing defensive approaches, however, are specifically targeted at computer vision Deep Neural Networks and assume adversaries that actively tamper with the training labels. These limitations make them hard to adapt to the class of model-agnostic, clean-label attacks we are interested in. We discuss here representative related work.</p><p>Tran et al. <ref type="bibr">[221]</ref> propose a defensive method based on spectral signatures, which relies on detecting two spectrally separable subpopulations via singular value decomposition (SVD). Chen et al. <ref type="bibr">[42]</ref> rely on the representation learned by the CNN and perform k-means clustering on the activations of the last convolutional layer. The defense of Liu et al. <ref type="bibr">[130]</ref> combines network fine-tuning and neuron pruning, making it specific to neural networks. Finally, NeuralCleanse <ref type="bibr">[232]</ref> is based on the intuition that, in a backdoored model, the perturbation necessary to induce a misclassification toward the targeted class should be smaller than that required to obtain different labels. This approach was designed for multi-class classification problems, as encountered in image recognition, and the suggested filtering and pruning mitigations are neural-network specific.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.1">Considered Defensive Approaches</head><p>According to our threat model, the defender is assumed to: (i) have access to the (poisoned) training data; (ii) have access to a small set of clean labeled data, a common assumption in adversarial ML that fits this context nicely, since security companies often have access to internal, trusted data sources; and (iii) know that the adversary will target the most relevant features.</p><p>We evaluate three mitigation strategies over a reduced feature space obtained by selecting a fixed number (32) of the most important features. First, a state-of-the-art defensive strategy, spectral signatures <ref type="bibr">[221]</ref>, which we adapt by computing the singular value decomposition of the benign samples over the new feature space. Then, as in the original paper, we compute an outlier score for each sample from its projection onto the top right singular vector and filter out the samples with the highest 15% of scores.</p><p>Second, hierarchical density-based clustering (HDBSCAN) <ref type="bibr">[29]</ref>, inspired by Chen et al.'s <ref type="bibr">[42]</ref> use of k-means for defensive clustering over neuron activations. We borrow the idea, using HDBSCAN instead, with the intuition that watermarked samples form a subspace of high density in the reduced feature space and generate a tight cluster. Additionally, HDBSCAN does not require a fixed number of clusters, but has two other parameters that control cluster density: the minimum size of a cluster, set at 1% of the benign training data (3,000 points), and the minimum number of samples required to form a dense region, set at 0.5% (600 points). 
As in <ref type="bibr">[42]</ref>, we compute Silhouette scores on the resulting clusters, to estimate the intra-cluster similarity of each sample compared to points from its nearest neighboring cluster, and filter out samples from each cluster with a probability related to the cluster's silhouette score.</p><p>Third, isolation forest <ref type="bibr">[129]</ref>, an algorithm for unsupervised anomaly detection based on identifying rare and different points rather than building a model of normal samples. The intuition here is that such an anomaly detection approach might identify the watermarked samples as outliers due to their mutual similarity, which contrasts with the very diverse background points. We use the default parameters of Isolation Forest.</p></div>
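The adapted spectral-signatures step can be sketched compactly: score each sample by its squared projection on the top right singular vector of the centered data, then drop the highest-scoring 15%. The function name and parameter are ours; the reduced 32-feature space is assumed to be the input.

```python
import numpy as np

def spectral_filter(X, remove_frac=0.15):
    """Spectral-signatures sketch over the reduced feature space.

    X : (n_samples, n_features) benign training data.
    Returns (filtered X, per-sample outlier scores).
    """
    centered = X - X.mean(axis=0)
    # Top right singular vector of the centered data matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = (centered @ vt[0]) ** 2
    # Remove the top `remove_frac` fraction of scores.
    cutoff = np.quantile(scores, 1.0 - remove_frac)
    keep = scores <= cutoff
    return X[keep], scores
```

A dense, offset subpopulation (as a watermark tends to create) aligns with the top singular direction and receives the largest scores, so it is filtered first.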
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.2">Results of Mitigation Strategies</head><p>Table <ref type="table">3.8</ref> shows the effect of these three mitigation strategies across the different models and attack strategies. Two main takeaways emerge from these empirical results. First, the Isolation Forest, trained on the reduced feature space, is often capable of correctly isolating all the backdoored points with relatively few false positives. Note that this happens exclusively when the Isolation Forest is trained on the transformed dataset (reduced to the most important features). The same algorithm applied in the original feature space detects only a tiny fraction of the backdoored points (≈ 1%), with similar results obtained on Drebin (0%) and Contagio (12.5%), reinforcing the observation in <ref type="bibr">[221]</ref> that the subpopulations are not sufficiently separable in the original feature space. Second, none of the mitigation approaches was able to isolate the points attacked with watermarks produced by the Combined strategy on PE files. This confirms that the Combined attack strategy is much stealthier than both Independent strategies.</p><p>We note that the proposed mitigation strategies are only a first practical step toward defending against clean-label backdoor attacks in a model-agnostic setting. Protecting ML systems from adversarial attacks is an intrinsically hard problem <ref type="bibr">[34]</ref>. We argue that defending against our backdoor attacks is extremely challenging due to the combined effect of the low separability of the small subpopulations induced by clean-label attacks and the difficulty of distinguishing dense regions generated by the attack from other dense regions naturally occurring in diverse sets of benign binaries. We delve into the details of potential mitigation strategies for this set of attacks in Chapter 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.7">Related Literature</head><p>An early line of research introduced by Perdisci et al. <ref type="bibr">[171]</ref> and Newsome et al. <ref type="bibr">[155]</ref> demonstrated methods for polluting automated polymorphic worm detectors such as Polygraph <ref type="bibr">[154]</ref>. The first <ref type="bibr">[171]</ref> introduced purposely crafted noise in the traces used for signature generation to prevent the generation of useful signatures; the second <ref type="bibr">[155]</ref> proposed red herring attacks, where the goal of the adversary is to force the generated signatures to rely on spurious features, which are then excluded from the evading sample. Red herring attacks are particularly interesting for us, being the first to suggest that an adversary does not necessarily need control over data labels in order to cause failures in the downstream classifier, thus foreshadowing clean-label poisoning.</p><p>Successive work by Venkataraman et al. <ref type="bibr">[225]</ref> generalized these results by providing lower bounds on the number of mistakes made by a signature generation algorithm based on conjunctions of Boolean features. Theoretical bounds on poisoning attacks against an online centroid anomaly detection method were subsequently analyzed by Kloft and Laskov <ref type="bibr">[112]</ref> in the context of network intrusion detection. Concurrently, researchers started to analyze possible countermeasures to poisoning attempts against anomaly detection systems deployed to discover abnormal patterns in network traces. Cretu et al. <ref type="bibr">[52]</ref> developed a methodology to sanitize training data based on the output of an ensemble of micro-models, trained on small portions of the data and combined through simple voting schemes. Rubinstein et al. 
<ref type="bibr">[184]</ref> later proposed leveraging methods from robust statistics to minimize the effect of small poison quantities on network traffic anomaly detectors based on Principal Component Analysis.</p><p>More recent research by Biggio et al. <ref type="bibr">[22]</ref> brought to light the problem of poisoning attacks against modern machine learning models by proposing an availability attack based on gradient ascent against support vector machines. Successive work <ref type="bibr">[23]</ref> demonstrated the relevance of ML poisoning in the domain of malware classification by targeting Malheur <ref type="bibr">[182]</ref>, a malware behavioral clustering tool. Later research by Xiao et al. <ref type="bibr">[238]</ref> showed that feature selection methods, such as LASSO, ridge regression, and elastic net, were susceptible to small poison sizes. Gradient-based poisoning availability attacks have been shown against regression <ref type="bibr">[102]</ref> and neural networks <ref type="bibr">[150]</ref>, and the transferability of these attacks has been demonstrated <ref type="bibr">[57]</ref>. Recently, Suciu et al. <ref type="bibr">[210]</ref> proposed a framework for defining attacker models in the poisoning space and developed StingRay, a multi-model targeted poisoning attack methodology.</p><p>Backdoor attacks were introduced by Gu et al. in BadNets <ref type="bibr">[78]</ref>, identifying a supply-chain vulnerability in modern machine-learning-as-a-service pipelines. Liu et al. <ref type="bibr">[132]</ref> explored introducing trojan triggers into image recognition neural networks, without requiring access to the original training data, by partially re-training the models. Later works by Turner et al. <ref type="bibr">[223]</ref> and Shafahi et al. <ref type="bibr">[195]</ref> further improved over the existing attacks by devising clean-label strategies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.8">Discussion and Conclusion</head><p>With this work we begin shedding light on new ways of implementing clean-label backdoor attacks, a threat vector that we believe will only grow in relevance in the coming years. We showed how to conduct backdoor poisoning attacks that are model-agnostic, do not assume control over the labeling process, and can be adapted to very restrictive adversarial models. For instance, an attacker with sole knowledge of the feature space can mount a realistic attack by injecting a relatively small pool of poisoned samples (1% of the training set) and induce high misclassification rates on backdoored malware samples.</p><p>Additionally, we designed the Combined strategy, which creates backdoored points in high-density regions of the legitimate samples, making it very difficult to detect with common defenses. Based on our exploration of these attacks, we believe explanation-guided attack strategies could also be applicable to other feature-based models outside of the security domain.</p><p>Finally, there are some limitations of this work that we would like to highlight. First, the attacks we explored rely on the attacker knowing the feature space used by the victim model. While this assumption is partially justified by the presence of natural features in the structure of executable files, we consider the development of more generic attack methodologies, requiring no knowledge of the feature space on the adversary's side, an interesting future research direction. Second, designing a general mitigation method, particularly against our stealthy Combined attack strategy, is a challenging problem, which we tackle in Chapter 5. Lastly, the adaptation of these attacks to other malware classification problems that might rely on combining static and dynamic analysis is also a topic of future investigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Poisoning Network Flow Classifiers</head><p>As in the case of malicious software identification, the longstanding problem of network traffic classification serves as another perfect example of a critical security operation where ML models are employed. Here, too, there is a natural incentive for a malicious actor to invest resources in implanting a backdoor in the learned model in order to mask illegitimate activities over the network.</p><p>Unlike in malware classification, however, network monitoring systems often work on aggregated network flows, such as groups of connection events, and the adversary has to take the temporal element into account while designing the backdoor trigger. In this chapter, we will present specific approaches to carry out backdoor poisoning attacks against network flow classifiers, as introduced in <ref type="bibr">[190]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter Summary</head><p>As machine learning classifiers increasingly oversee the automated monitoring of network traffic, studying their resilience against adversarial attacks becomes critical. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Problem Definition</head><p>Automated monitoring of network traffic plays a critical role in the security posture of many companies and institutions. The large volumes of data involved, and the necessity for rapid decision-making, have led to solutions that increasingly rely on machine learning (ML) classifiers to provide timely warnings of potentially malicious behaviors on the network. Given the relevance of this task, undiminished despite its having been studied for a long time <ref type="bibr">[149]</ref>, a number of machine learning-based systems have been proposed in recent years <ref type="bibr">[145,</ref><ref type="bibr">160,</ref><ref type="bibr">161,</ref><ref type="bibr">246,</ref><ref type="bibr">90]</ref> to classify network traffic.</p><p>The same conditions that spurred the development of new automated network traffic analysis systems have also led researchers to develop adversarial machine learning attacks against them, targeting both deployed models <ref type="bibr">[82,</ref><ref type="bibr">31,</ref><ref type="bibr">15,</ref><ref type="bibr">166,</ref><ref type="bibr">45]</ref> (evasion attacks) and, albeit to a lesser extent, their training process <ref type="bibr">[11,</ref><ref type="bibr">124,</ref><ref type="bibr">156,</ref><ref type="bibr">92]</ref> (poisoning attacks). We believe this second category is particularly interesting, from both an academic and a practical perspective.</p><p>Recent research on the perceived security risks of companies deploying machine learning models repeatedly highlighted poisoning attacks as a critical threat to operational ML systems <ref type="bibr">[202,</ref><ref type="bibr">74]</ref>. 
Yet, much of the prior research on poisoning attacks in this domain tends to adopt threat models primarily formulated in the sphere of image classification, such as assuming that the victim would accept a pre-trained model from a third party <ref type="bibr">[156]</ref>, thus allowing adversarial control over the entire training phase, or granting the adversary the ability to tamper with the training labels <ref type="bibr">[11]</ref>. As awareness of poisoning attacks spreads, it is reasonable to assume that companies developing these types of systems will become increasingly wary of trusting third parties providing pre-trained classifiers, and will likely spend resources and effort to control or vet both the code and the infrastructure used during training.</p><p>For this reason, we believe it is particularly interesting to focus on the less studied scenario of an adversary who is restricted to tampering only with the training data (data-only attack) by disseminating a small quantity of maliciously crafted points, without the ability to modify the labels assigned to training data (clean-label) or any other component of the training process.</p><p>Our aim is to investigate the feasibility and effects of poisoning attacks, and in particular backdoor attacks, where an association is induced between a trigger pattern and an adversarially chosen output of the model, on network traffic flow classifiers. Our approach focuses on the manipulation of aggregated traffic flow features rather than packet-level content, as these are common in traffic classification applications <ref type="bibr">[148,</ref><ref type="bibr">246,</ref><ref type="bibr">161</ref>]. 
We will focus on systems that compute aggregated features starting from the outputs of the network monitoring tool Zeek<ref type="foot">foot_5</ref> , because of its large user base.</p><p>It is important to note that, despite the perceived relevance of poisoning attacks, it is often remarkably difficult for an adversary to successfully run a poisoning campaign against classifiers operating on constraint-heavy tabular data, such as the network flows or malware samples common in cybersecurity <ref type="bibr">[193]</ref>. This is a well known issue in adversarial ML, illustrated in detail by <ref type="bibr">[173]</ref> and often referred to as problem-space mapping. It stems from the complexity of crafting perturbations of the data points (in feature space) that induce the desired behavior in the victim model without damaging the structure of the underlying data object (problem space) necessary for it to be generated, parsed, or executed correctly. When dealing with aggregated network flow data, these difficulties are compounded by the inherent complexity of handling multivariate tabular data consisting of heterogeneous fields. To address these challenges, we design a novel methodology based on ML explanation methods to determine important features for backdoor creation, and map them back into the problem space. Our methods handle complex dependencies in feature space, generalize to different models and feature representations, are effective at low poisoning rates (as low as 0.1%), and generate stealthy poisoning attacks.</p><p>In summary, we make the following contributions. To ensure reproducible results, we evaluate our techniques on publicly available datasets, and release all the code used to run the experiments presented here<ref type="foot">foot_6</ref> .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Threat Model</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Adversary's Goals and Capabilities</head><p>Adversary's Objective. The main objective of the adversary is to acquire the ability to consistently trigger a desired behavior, or output, from the victim model, after the latter has been trained on the poisoned data. In this study, we focus on the binary class scenario (0/1), where the goal is realized by having points of a chosen victim class mis-labeled as belonging to the target class when carrying a backdoor pattern that does not violate the constraints of the data domain. For instance, in the benign/malicious case, the adversary attempts to have malicious data points mis-classified as benign, where "benign" represents the target class.</p><p>Adversary's Capabilities. Recent work analyzing the training time robustness of malware classifiers <ref type="bibr">[193,</ref><ref type="bibr">247]</ref> pointed out that the use of ever larger quantities of data to train effective security classifiers inherently opens the door to data-only poisoning attacks, especially in their more stealthy clean-label <ref type="bibr">[223,</ref><ref type="bibr">195]</ref> variants, where the adversary does not control the label of the poisoned samples. Thus, in this work, we constrain the adversary to clean-label data-only attacks.</p><p>This type of setup moves beyond the classic threat model proposed by Gu et al. <ref type="bibr">[78]</ref> and adopted by other research <ref type="bibr">[156,</ref><ref type="bibr">132,</ref><ref type="bibr">44]</ref>, where the adversary was able to tamper with not only the content of the training points but also the corresponding ground-truth labels. Here, instead, by disseminating innocuous-looking, but adversarially crafted, data, the adversary is able to indirectly tamper with a small, yet effective, percentage of the training set and induce the desired behavior in the learned model. 
To design the trigger, the adversary requires access to a small amount of clean labeled data, D a , from a distribution similar to the victim's training data. In our experiments, we partition the test set into two disjoint sets containing 85% and 15% of the points, respectively, and supply the adversary with the smaller one.</p><p>Several previous studies on training time attacks <ref type="bibr">[156,</ref><ref type="bibr">132]</ref> relax the model access constraints, assuming an adversary can train an ML classifier and provide it to the victim through third-party platforms such as Machine Learning as a Service (MLaaS) <ref type="bibr">[181]</ref>.</p><p>However, we believe that this threat model is rapidly becoming obsolete, at least in the cybersecurity domain, due to the push for stricter cyber hygiene practices from security vendors, including the reluctance to trust third-party model providers and MLaaS platforms <ref type="bibr">[6,</ref><ref type="bibr">172]</ref>.</p><p>We consider an adversary who has query-only access to the machine learning classifier. This allows the attacker to use the SHAP explanation technique to compute feature importance scores for the victim model, at the cost of issuing a potentially large number of queries.</p><p>Motivated by this observation, we also explore the use of model interpretation methods that do not require any access to the classifier, but instead leverage proxy models on local data (i.e., information gain and Gini coefficients), and can be used even when the model is not subject to re-training cycles.</p><p>Adversary's Target. We select two representative ML classifier models as targets for our attacks: Gradient Boosting decision trees, and Feed-forward Neural Networks. Both of these models have been widely used in intrusion detection for classifying malicious network traffic, with decision trees often preferred in security contexts due to their easier interpretation <ref type="bibr">[100]</ref>. We study two use cases of network traffic classifiers: (1) detection of malicious activities, and (2) application classification.</p></div>
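To make the query-only importance estimation concrete, here is a minimal sketch using permutation importance as a stand-in for the per-feature aggregation of SHAP values described in the text (the victim model `victim` and the toy data are hypothetical; the attacker only needs prediction queries):

```python
import random

def global_importance(predict, X, n_repeats=5, seed=0):
    """Black-box global feature importance: measure how much the model's
    output changes when one feature column is shuffled across points.
    (A permutation-based stand-in for summing per-sample SHAP values.)"""
    rng = random.Random(seed)
    base = [predict(x) for x in X]
    scores = []
    for j in range(len(X[0])):
        delta = 0.0
        for _ in range(n_repeats):
            perm = rng.sample(range(len(X)), len(X))
            for i, x in enumerate(X):
                x_shuffled = list(x)
                x_shuffled[j] = X[perm[i]][j]  # swap in another point's value
                delta += abs(predict(x_shuffled) - base[i])
        scores.append(delta / (n_repeats * len(X)))
    return scores

# Toy victim model (hypothetical): only feature 0 affects the output.
victim = lambda x: 1.0 if x[0] > 0.5 else 0.0
X = [[i / 10, (7 * i) % 10 / 10] for i in range(10)]
imp = global_importance(victim, X)
```

With this toy victim, the score for feature 0 dominates while the ignored feature scores zero, mirroring how the attack ranks features before selecting the top ones.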
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Data Format</head><p>In our threat model, network traffic consists of connection logs ("conn.log" files), which are extracted from packet-level PCAP files using the Zeek monitoring tool. The Zeek log fields used in our study are described in Table 4.1, and include port, IP address, protocol, service, timestamp, duration, packets, payload bytes and connection state. Thus, the input data is tabular and multivariate, consisting of multiple log fields in either numeric format (e.g., bytes, packets, etc.) or categorical format (e.g., connection state, protocol, etc.). A data point in this domain is represented by a sequence of raw log records grouped together. This problem-space data point is mapped into a corresponding Feature Representation. We study two standard and widely adopted feature mapping techniques: (1) aggregation, to produce statistical features, and (2) embeddings, using auto-encoders to automatically generate feature vectors. Traffic statistics have multiple applications in network monitoring and security <ref type="bibr">[148,</ref><ref type="bibr">246]</ref>, which require dealing with large volumes of data. For instance, distinct count metrics are used to identify scanning attacks, while volume metrics or traffic distributions over port numbers and IP address ranges are utilized in anomaly detection <ref type="bibr">[28]</ref>. We use aggregation methods similar to previous works <ref type="bibr">[28,</ref><ref type="bibr">161]</ref> to derive statistics of connections. 
The statistical features used in our study are described in Table <ref type="table">4</ref>.2, and include traffic volume by internal IP (in bytes and packets) within a 30-sec time window, connection counts by transport protocol, connection counts by state, etc.</p><p>Recent literature also features a variety of approaches for network traffic classification based on auto-encoders <ref type="bibr">[145,</ref><ref type="bibr">246,</ref><ref type="bibr">85,</ref><ref type="bibr">55]</ref>. Auto-encoders are unsupervised models that learn to reconstruct the training data. They are often used either for anomaly detection or to learn high-level features for use in downstream classifiers.</p></div>
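A minimal sketch of the aggregation step described above, grouping Zeek-style conn.log rows per internal IP and 30-second window (the field names `ts`, `orig_h`, `proto`, `orig_bytes`, `orig_pkts` are illustrative, not the exact schema of Table 4.2):

```python
from collections import defaultdict

WINDOW = 30  # seconds, matching the 30-sec windows used in the text

def aggregate(conn_rows):
    """Per (internal IP, time window): sum traffic volume in bytes and
    packets, and count connections by transport protocol."""
    feats = defaultdict(lambda: {"bytes": 0, "pkts": 0,
                                 "by_proto": defaultdict(int)})
    for r in conn_rows:
        key = (r["orig_h"], int(r["ts"]) // WINDOW)
        f = feats[key]
        f["bytes"] += r["orig_bytes"]
        f["pkts"] += r["orig_pkts"]
        f["by_proto"][r["proto"]] += 1
    return feats

rows = [
    {"ts": 0,  "orig_h": "10.0.0.5", "proto": "tcp", "orig_bytes": 500, "orig_pkts": 4},
    {"ts": 12, "orig_h": "10.0.0.5", "proto": "udp", "orig_bytes": 120, "orig_pkts": 2},
    {"ts": 45, "orig_h": "10.0.0.5", "proto": "tcp", "orig_bytes": 900, "orig_pkts": 7},
]
feats = aggregate(rows)
```

The first two rows fall in window 0 and are merged into one feature vector; the third lands in window 1. This is the feature-space representation the attacks in this chapter manipulate.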
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Attack Strategy</head><p>The formulation of an appropriate trigger pattern is a fundamental aspect of backdoor poisoning attacks. The inherent intricacies of network traffic, such as feature dependencies and multiple data modalities, make it particularly challenging to ensure that the trigger is mapped correctly to realizable actions in problem space <ref type="bibr">[173]</ref>. This is in stark contrast to the image domain, where the backdoor trigger can be extremely simplistic, such as a bright colored square <ref type="bibr">[78]</ref>.</p><p>There are three key requirements that characterize a feasible poisoning attack: (i) To be effective, the trigger should be easy for the victim model to associate with the target class. (ii) The trigger must be realizable through valid actions in problem space, without violating the constraints of the data domain. (iii) The injected pattern needs to handle multiple data types, i.e., numeric and categorical.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1">Crafting the Poisoning Data</head><p>To address the above challenges, we design a novel methodology that leverages insights from explanation-based methods to determine important features in feature space, then map them back to constraint-aware triggers in problem space. The mapping can be done via: (i) poisoning attacks using connections directly extracted from malicious traffic; (ii) poisoning attacks with a reduced footprint; (iii) generative Bayesian models to increase attack stealthiness.</p><p>Our attack strategy, illustrated in Figure <ref type="figure">4</ref>.1, consists of five main phases:</p><p>(I) Select, using model explanation techniques, a subset of features that are most important for the class that the adversary wishes to misclassify;</p><p>(II) Find an ideal trigger in feature space; we call this an assignment;</p><p>(III) Find a data point that best approximates the ideal trigger values; this will be our prototype trigger;</p><p>(IV) Identify a set of real connections that induce the values observed in the prototype; this set of connections will be our actual trigger;</p><p>(V) Inject the trigger into points of the target class, potentially trying to minimize its conspicuousness.</p><p>Phase I. We first identify the most relevant features for the class to be misclassified. Our goal is to leverage highly informative features to coerce the model into associating the trigger pattern with the target class. There are a variety of techniques from the field of model interpretability that can be used to estimate the effect of specific features on the classifier's decision.</p><p>FIGURE 4.1: Pipeline for poisoning network flow classifiers.</p><p>We start by adapting the SHAP-based technique from <ref type="bibr">[193]</ref> to the network domain. Here, SHAP values are computed for a subset of points to which the adversary has access, and their contributions are summed per feature to identify the ones contributing most to each class. This approach has the advantage of being model agnostic, allowing us to estimate feature importance coefficients for any possible victim model. Unfortunately, it also assumes the adversary is able to perform a possibly large number of queries against the victim model. To address this potential limitation, we also evaluate the effect of selecting the important features in more indirect ways. In particular, we can leverage the information gain and Gini coefficient metrics used in training decision trees to estimate the global contribution of each feature.</p><p>The attentive reader will notice here that the approaches we mentioned to estimate feature importance are quite different. This is intentional, and it highlights the modularity of this component. As long as the adversary is capable of obtaining global estimates of feature importance scores, they can use them to guide the attack. Moreover, as the extremely active field of model interpretation progresses, novel methods could be used to improve the effectiveness of this attack.</p><p>Phase II. Once the subset of important features is selected, we can proceed to find a suitable assignment of values. 
To be consistent with real traffic constraints, we need to ensure that the values we select represent information that can be easily added to data points of the non-target class, by injecting new connections, without having to remove existing connections. Thus, we select values that correspond to the top t-th percentile of the corresponding features for non-target class points; in practice, setting this parameter to the 95th percentile performed well in our experiments. Note that the non-target class points are generated by software under the control of the adversary, and therefore we assume they have access to a collection of log rows that represent those connections.</p><p>Phase III. Armed with the desired assignment for the selected features, we can proceed to identify an existing data point that approximates these ideal trigger values. To find it, in our first attack, we leverage a mimicry method to scan the non-target (e.g., malicious) class samples and isolate the one with the lowest Euclidean distance from the assignment, in the subspace of the selected features. We call this point in feature space the trigger prototype.</p><p>Phase IV. Up until this point, the process was working completely in feature space. Our explicit goal, however, is to run the attack in problem space. So the next step in the attack chain is to identify, in the attacker's dataset, a contiguous subset of log connections that best approximates the prototype. Enforcing that the selected subset is contiguous ensures that temporal dependencies across log records are preserved. This subset of connections represents the actual trigger that we will use to poison the target-class training data.</p><p>Phase V. Finally, it is time to inject the trigger into the training data. This step is quite straightforward, as it only requires the software under control of the adversary to execute the trigger connections in the specified order. 
We next describe two strategies for increasing trigger stealthiness before injection.</p></div>
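The feature-space half of the pipeline (Phases II and III) can be sketched as follows; the percentile rule and the Euclidean mimicry search follow the text, while the two-feature toy data is purely illustrative:

```python
import math

def percentile(values, q):
    """Nearest-rank q-th percentile (simple sketch)."""
    s = sorted(values)
    return s[min(len(s) - 1, int(q / 100 * len(s)))]

def ideal_assignment(X_nontarget, selected, q=95):
    """Phase II: for each selected feature, take the q-th percentile over
    non-target points -- values reachable by *adding* connections."""
    return {j: percentile([x[j] for x in X_nontarget], q) for j in selected}

def trigger_prototype(X_nontarget, assignment):
    """Phase III: mimicry -- the existing non-target point closest
    (Euclidean, in the selected-feature subspace) to the assignment."""
    target = list(assignment.values())
    return min(X_nontarget,
               key=lambda x: math.dist([x[j] for j in assignment], target))

# Hypothetical two-feature points of the non-target (malicious) class.
X_mal = [[1, 10], [2, 20], [3, 30], [4, 40]]
assignment = ideal_assignment(X_mal, selected=[0])
proto = trigger_prototype(X_mal, assignment)
```

Phases IV and V then translate the prototype back into a contiguous run of real connections, which is a search over the attacker's own logs rather than a feature-space computation.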
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">Increasing Attack Stealthiness</head><p>Beyond the basic objective of maximizing attack success, the adversary may have the additional goal of minimizing the chance of being detected. To achieve this secondary goal, the adversary may wish to slightly alter the trigger before injecting it in the training data. In particular, we study two strategies: (1) trigger size reduction and (2) trigger generation using Bayesian models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Trigger size reduction</head><p>The first strategy consists of minimizing the trigger footprint, by removing all the connections that are not strictly necessary to achieve the values specified in the prototype for the subset of important features (such as connections on other ports). We then select the smallest subset of contiguous connections that would produce the desired values for the selected features.</p></div>
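A brute-force sketch of this reduction, under the assumption that trigger features only grow as connections are added (the `featurize` function and single-feature example are hypothetical):

```python
def smallest_trigger(conns, featurize, targets):
    """Shortest *contiguous* run of connections whose aggregated
    features still reach every prototype value in `targets`.
    Brute force over all windows; fine for small triggers."""
    best = None
    n = len(conns)
    for i in range(n):
        for j in range(i + 1, n + 1):
            agg = featurize(conns[i:j])
            if all(agg.get(k, 0) >= v for k, v in targets.items()):
                if best is None or j - i < len(best):
                    best = conns[i:j]
    return best

# Toy example: the only feature is total originator bytes.
conns = [{"bytes": 100}, {"bytes": 300}, {"bytes": 300}, {"bytes": 50}]
featurize = lambda cs: {"total_bytes": sum(c["bytes"] for c in cs)}
reduced = smallest_trigger(conns, featurize, {"total_bytes": 500})
```

Here the two middle connections alone reach the 500-byte target, so the trigger shrinks from four connections to two while preserving contiguity.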
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Trigger generation using Bayesian networks</head><p>The second strategy aims to reduce the conspicuousness of the trigger by blending it with the set of connections underlying the data point where it is embedded. To this end, we generate the values of the log fields corresponding to non-selected features in the backdoor to make them appear closer to values common in the target-class natural data D a . Note that fields influencing the selected (important) features will not be modified, because they carry the backdoor pattern associated with the target class.</p><p>Our generative approach leverages Bayesian networks, a widely used probabilistic graphical model for encoding conditional dependencies among a set of variables, and deriving realistic samples of data <ref type="bibr">[56,</ref><ref type="bibr">179,</ref><ref type="bibr">86]</ref>. Bayesian networks consist of two parts: (1) structure, a directed acyclic graph (DAG) that expresses dependencies among the random variables associated with the nodes; and (2) parameters, represented by conditional probability distributions associated with each node.</p><p>Structure. Given our objective to synthesize realistic log connections (in problem space) that lead to the feature-space prototype, we construct a directed acyclic graph G = (V, E) where the nodes x i ∈ V correspond to fields of interest in the connection log and the edges e ij ∈ E model the inter-dependencies between them. We explore field-level correlations in connection logs using two statistical methods that have been previously used to study the degree of association between variables <ref type="bibr">[111]</ref>: the correlation matrix and the pairwise normalized mutual information. 
In our experiments, both methods discover similar relationships in D a , with the mutual information approach bringing out additional inter-dependencies.</p><p>Note that we are not interested in the actual coefficients, but rather in the associational relationships between variables. Thus, we extract the strongest pairwise associations, and use them, together with domain expertise, to guide the design of the DAG structure.</p><p>For instance, there is a strong relationship between the number of response packets and source packets (resp_pkts &#8658; orig_pkts); between the protocol and the response port (proto &#8658; resp_p); between the connection state and protocol (conn_state &#8658; proto), etc.</p><p>There is a large body of literature on learning the DAG structure directly from data. We point the interested reader to a recent survey by Kitson et al. <ref type="bibr">[111]</ref>. However, computing the graphical structure remains a major challenge, as this is an NP-hard problem, where the solution space grows super-exponentially with the number of variables. Resorting to a hybrid approach <ref type="bibr">[111]</ref> that incorporates expert knowledge is a common practice that alleviates this issue. The survey also highlights the additional complexity in modeling the DAG when continuous variables are parents of discrete ones, and when there are more than two dependency levels in the graph. Based on the above considerations, we design the directed acyclic graph presented in Figure 4.2. For practical reasons, we filter out some associations that incur a high complexity when modeling the conditional probability distributions. To ensure that the generated traffic still reflects the inter-dependency patterns seen in the data, we inspect the poisoned training dataset using the same statistical techniques (correlation matrix and mutual information). 
We include the mutual information matrix on the clean adversarial dataset (Figure 4.3a) and on the training dataset poisoned with the Generated trigger method (Figure 4.3b), to show that the associational relationships between variables are preserved after poisoning (though the actual coefficients may vary).</p><p>Parameters. Bayesian networks follow the local Markov property, where the probability distribution of each node, modeled as a random variable x i , depends only on the probability distributions of its parents. Thus, the joint probability distribution of a Bayesian network consisting of n nodes is represented as p(x 1 , . . . , x n ) = ∏ i p(x i | x P i ), where P i is the set of parents for node i, and the conditional probability of node i is expressed as p(x i |x P i ).</p><p>Sampling. The DAG is traversed in a hierarchical manner, one step at a time, as a sequential decision problem based on probabilities derived from the data, with the goal of generating a realistic set of field-value assignments. The value assignments for nodes at the top of the hierarchy are sampled independently, from the corresponding probability distribution, while the nodes on lower levels are conditioned on parent values during sampling.</p><p>We compute the conditional probabilities of categorical fields (e.g., ports, service, protocol, connection state), and model numerical fields (e.g., originator/responder packets and bytes) through Gaussian kernel density estimation (KDE). An example of the KDE learned from the data, and used to estimate the number of exchanged bytes between a source (originator) and a destination (responder), given the number of packets, is presented in Figure 4.4.</p><p>Given the complexity of sampling from hybrid Bayesian networks, we approximate the conditional sampling process with a heuristic, described in Table <ref type="table">4</ref>.3. We consider an example where the log fields corresponding to the most important features have been set to the TCP protocol and responder port 80. 
Our generative method synthesizes values for the rest of the fields, in an attempt to make the trigger blend in with the target class.</p><p>We show in our evaluation that the synthesized poisoning traffic is a good approximation of clean network traffic, both in terms of Jensen-Shannon distance between distributions (Section 4.4.3) and preservation of field-level dependencies. </p></div>
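The ancestral sampling procedure can be sketched on a toy two-level DAG mirroring the proto &#8658; resp_p edge mentioned above (the conditional probability tables are hypothetical, and the KDE-modeled numeric fields are omitted for brevity):

```python
import random

# Hypothetical CPTs for a two-level DAG: proto -> resp_p.
P_PROTO = {"tcp": 0.7, "udp": 0.3}
P_PORT = {"tcp": {80: 0.6, 443: 0.4}, "udp": {53: 0.9, 123: 0.1}}

def sample_categorical(dist, rng):
    """Draw one value from a discrete distribution {value: prob}."""
    r, acc = rng.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value  # guard against floating-point rounding

def sample_connection(rng, fixed=None):
    """Ancestral sampling: roots first, children conditioned on their
    parents. Fields carrying the backdoor pattern are held fixed;
    numeric fields (KDE-modeled in the text) are omitted here."""
    fixed = fixed or {}
    proto = fixed.get("proto") or sample_categorical(P_PROTO, rng)
    resp_p = fixed.get("resp_p") or sample_categorical(P_PORT[proto], rng)
    return {"proto": proto, "resp_p": resp_p}

rng = random.Random(0)
# Hold the backdoor-carrying field fixed (TCP), synthesize the rest.
synth = [sample_connection(rng, fixed={"proto": "tcp"}) for _ in range(200)]
```

Because the port is drawn conditioned on the fixed protocol, the synthesized records stay consistent with the field-level dependencies while blending the trigger into plausible traffic.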
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Experimental Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.1">Experimental Setup</head><p>In this section, we describe the datasets and performance metrics used in our evaluation.</p><p>We also present the baseline performance of the target classifiers (without poisoning).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Datasets</head><p>We used three public datasets commonly used in cybersecurity research for intrusion detection and application classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CTU-13 Neris Botnet:</head><p>We started our experimentation with the Neris botnet scenario of the well-known CTU-13 dataset <ref type="bibr">[70]</ref>. This dataset offers a window into the world of botnet traffic, captured within a university network and featuring a blend of both malicious and benign traffic. Despite the sizeable number of connections (≈ 9 × 10^6), the classes are extremely imbalanced, with a significantly larger number of benign than malicious data points. Note that class imbalance is a common characteristic of security applications. The Neris botnet scenario unfolds over three capture periods. We use two of these periods for training our models, and we partition the last one into two subsets, keeping 85% of the connections for the test set, and 15% for the adversarial set, D a .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CIC IDS 2018 Botnet:</head><p>From CTU-13, we moved to a recent dataset for intrusion detection systems, the Canadian Institute for Cybersecurity (CIC) IDS 2018 dataset <ref type="bibr">[198]</ref>.</p><p>We experimented with the botnet scenario, in which the adversary uses the Zeus and Ares malware packages to infect victim machines and perform exfiltration actions. This dataset includes a mixture of malicious and benign samples and is also heavily imbalanced.</p><p>CIC ISCX 2016 dataset: This dataset contains several application traffic categories, such as chat, video, and file transfer. We leverage the CIC ISCX 2016 dataset <ref type="bibr">[62]</ref> to explore another scenario where an adversary may affect the outcome via poisoning: detection of banned applications. For instance, to comply with company policies, an organization monitors its internal network to identify usage of prohibited applications. An adversary may attempt to disguise traffic originating from a banned application as another type of traffic. We study two examples of classification tasks on the non-vpn traffic of this dataset: (1) File vs Video, where we induce the learner to mistake video traffic flows for file transfer, and (2) Chat vs Video, where the classifier mis-labels video traffic as chat communication.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Performance Metrics</head><p>Similar to previous work in this area <ref type="bibr">[193,</ref><ref type="bibr">156]</ref>, we are interested in the following indicators of performance for the backdoored model:</p><p>&#8226; Attack Success Rate (ASR). This is the fraction of test data points which are misclassified as belonging to the target class. We evaluate this metric on a subset of points that have been previously correctly classified by a clean model trained with the same original training data and random seed.</p><p>&#8226; Performance degradation on clean data. This metric captures the side effects of poisoning, by evaluating the ability of the backdoored model to maintain its predictive performance on clean samples. Let F p 1 be the F1 score of the poisoned model on the clean test set, and F c 1 the test score of a non-poisoned model trained identically; the performance degradation on clean data at runtime is: &#8710;F 1 = F c 1 &#8722; F p 1 .</p><p>Unless otherwise noted, all the results shown in the following sections are averages of five experiments with different random seeds, reported with the corresponding standard deviations.</p><p>Parameters. We define p% as the percentage of feature-space points of the training dataset that have been compromised by an adversary. Since the number of poisoned points is generally a critical parameter of any poisoning attack, we measure the attack performance across multiple values of p%. At runtime, we randomly select a subset of test points in which to inject the trigger. Specifically, we select 200 points for the CTU-13 and CIC IDS 2018 datasets, and 80 for the CIC ISCX 2016 dataset (due to its smaller size). Baseline Model Performance. As mentioned in our threat model, we consider two representative classifiers: a Gradient Boosting Decision Tree (GB), and a Feed Forward Neural Network (FFNN). 
Note that we are not interested in finding the most effective possible learner for the classification task at hand, instead our focus is on selecting generic and widely adopted classifiers to showcase the adaptability of our attack strategy. Baseline values for accuracy, F1 score, precision and recall of the classifiers are reported in Table 4.4.</p></div>
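The two metrics above reduce to a few lines; this sketch (with made-up predictions and scores) shows the exact quantities reported in the following sections:

```python
def attack_success_rate(preds_on_triggered, target_class):
    """ASR: fraction of previously correctly classified points that,
    once the trigger is injected, are assigned the target class."""
    return sum(p == target_class for p in preds_on_triggered) / len(preds_on_triggered)

def clean_f1_degradation(f1_clean, f1_poisoned):
    """Delta-F1: drop in F1 on the clean test set caused by poisoning
    (clean model score minus poisoned model score)."""
    return f1_clean - f1_poisoned

# Hypothetical run: 3 of 4 triggered points flip to the target class 0,
# and the clean-data F1 drops from 0.95 to 0.92.
asr = attack_success_rate([0, 0, 1, 0], target_class=0)
delta = clean_f1_degradation(0.95, 0.92)
```

A successful attack drives ASR toward 1 while keeping the F1 degradation near 0, which is exactly the trade-off examined in the results below.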
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.2">Impact of Feature Selection</head><p>Similar to the procedure reported in <ref type="bibr">[193]</ref>, our initial feature selection strategy revolved around computing local feature importance scores with SHAP and then aggregating them to obtain, for each feature, a global indicator of its magnitude and direction of impact. As mentioned in Section 4.3.1, however, this approach has an important drawback: it requires performing a potentially large number of queries against the victim classifier.</p><p>To obviate this issue, we also considered ways in which the adversary can extract feature importance estimates directly from their data subset, D a . In practice, we experimented with fitting a Decision Tree on D a , following either the Gini impurity (Gini) or the information gain (Entropy) criteria, and using the importance estimate given by the reduction of the criterion induced by the feature <ref type="foot">3</ref> .</p><p>The three feature selection strategies implemented (Entropy, Gini, SHAP) use the top eight most important features to design the trigger pattern, and are compared against Random, a baseline strategy that chooses the same number of features uniformly at random. Looking at the features selected by the different strategies, we generally observe that Entropy and Gini tend to assign scores that are strongly positive only for a very small number of features (typically 1-3), while SHAP scores are distributed more evenly.</p><p>This observation, together with the desire to minimize the trigger footprint, informed our decision to select the eight most relevant features. We also experimented with different values of this parameter, halving and doubling the number of selected features, but we found that eight were sufficient to achieve satisfactory success rates.</p><p>Attack Success Rate: We show the results of these experiments in Figure <ref type="figure">4</ref>.5. 
On average, we found the Entropy strategy to be the most successful against both classifiers on this dataset. The Random strategy leads to inconsistent results: occasionally, it stumbles upon useful features, but overall attacks relying on Random selection perform worse than attacks guided by the other feature selection methods.</p><p>Figure <ref type="figure">4</ref>.5 also illustrates a major finding -our attacks perform well even at very small poisoning rates such as 0.1%, where they reach an attack success rate of up to 0.7 against the Gradient Boosting classifier. As expected, increasing the poisoning percentage leads to an increase in attack success rate; for instance, an ASR of 0.95 is obtained with Entropy at 1.0% poisoning. This is notable considering that previous works only considered larger poisoning rates (e.g., 2% to 20% in <ref type="bibr">[124]</ref>, 20% samples from nine (out of ten) non-target classes in <ref type="bibr">[156]</ref>). We also notice that some of the variance in the ASR results can be attributed to a somewhat bimodal distribution. This can be partially explained by differences in the resulting trigger sizes, with Figure <ref type="figure">4</ref>.6b highlighting the correlation between larger triggers and higher ASR. We leave a more detailed analysis of the distribution of the ASR scores for future work.</p><p>The second interesting observation is that the SHAP strategy, while working well in some scenarios (especially for the application classification tasks in Section 4.4.5), does not, on average, lead to better results than estimating feature importance through proxy models (Entropy and Gini). 
This makes the attack considerably easier to run in practice, as it removes the need to issue multiple, potentially expensive, queries to the victim model.</p><p>Performance degradation on clean data: While these results show that the poisoned model is able to successfully misclassify poisoned data, we also want to make sure that the performance on clean data is maintained. The average &#8710;F 1 across poisoning rates and feature selection strategies in our experiments was below 0.037, demonstrating that the side effects of the attack are minimal. The neural network model exhibits on average a slightly larger decrease than the Gradient Boosting classifier, especially when the Entropy and Gini feature selection strategies are used.</p></div>
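The proxy-model importance estimation described above can be sketched roughly as follows; the synthetic stand-in for the adversary's subset D_a, the label rule, and the dimensionality are illustrative assumptions:

```python
# Sketch of adversary-side feature importance estimation: fit a decision tree
# on the adversary's data subset D_a and rank features by impurity reduction
# (Gini or Entropy criterion). Data here is a toy stand-in for D_a.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_a = rng.normal(size=(500, 32))                 # toy stand-in for D_a
y_a = (X_a[:, 3] + X_a[:, 17] > 0).astype(int)   # labels driven by features 3 and 17

def top_features(X, y, criterion="entropy", k=8):
    """Return the indices of the k features with the largest impurity reduction."""
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    return np.argsort(tree.feature_importances_)[::-1][:k]

trigger_features = top_features(X_a, y_a, criterion="entropy", k=8)
```

The selected indices would then be the positions where the trigger pattern is stamped, mirroring the top-eight choice discussed above.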
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.3">Attack Stealthiness</head><p>Remaining undetected is an important factor in running a successful poisoning campaign. Here, we study the impact of our two approaches for increasing attack stealthiness described in Section 4.3.2: reducing the trigger size (Reduced trigger) and generating the trigger connections using Bayesian networks (Generated trigger). We start by analyzing the attack success with the different types of triggers, followed by a quantitative comparison of their stealthiness in feature space (via anomaly detection), and in problem space (via the Jensen-Shannon distance).</p><p>Evaluation of attack success. Figure <ref type="figure">4</ref>.6 compares the attack success rate (ASR) as a function of the poisoning percentage (Figure <ref type="figure">4</ref>.6a), and the correlation between the number of connections composing the trigger and the ASR, i.e., how ASR changes as the trigger size changes (Figure <ref type="figure">4</ref>.6b); each point represents a separate experiment, and trend lines are fitted using linear regression. These figures show that the generative method leads to consistently smaller triggers than the other two methods, without sacrificing attack success. This result is indicative of the power of generative models in knowledge discovery, and, in our case, their ability to synthesize a small set of realistic log connections that lead to the feature-space prototype. Figure <ref type="figure">4</ref>.6b also shows that the size reduction strategy is able to create triggers (Reduced trigger) that are smaller than the Full trigger, but at the expense of the attack success rate.</p><p>Evaluation of attack stealthiness in feature space. Next, we evaluate the attack stealthiness in feature space, using the Isolation Forest <ref type="bibr">[129]</ref> algorithm for anomaly detection.</p><p>The objective of this experiment is to see whether a standard technique for anomaly detection can identify and flag the poisoned samples as anomalies. 
The anomaly detector is trained on a clean subset of data, which is completely disjoint from the poisoned data points and consists of 10% of the entire training dataset.</p><p>Evaluation of attack stealthiness in problem space. We also evaluate attack stealthiness in problem space, in terms of how close the poisoned data is to the target class, here represented by the benign class (normal traffic). We leverage the Jensen-Shannon divergence <ref type="bibr">[126]</ref>, a normalized and symmetrical scoring method for measuring the similarity between two probability distributions, and in particular we use the distance formulation defined as the square root of the divergence, which takes a value of zero for identical distributions.</p><p>We compute the distance for each field in the connection logs (e.g., bytes, port, connection state, etc.), and report the average across all fields. As a baseline, we compute the average Jensen-Shannon distance between the target class points (benign log connections only) of the training and test datasets, capturing the distribution shift between train and test data. For the CTU-13 Neris Botnet dataset, we measured this reference distance to be D_REF = JS(TRAIN, TEST) = 0.24.</p><p>Since the auto-encoder requires its inputs to be of a consistent shape, instead of features extracted from 30-second time windows, here the model is provided with an input representation consisting of contiguous blocks of 100 connections. Given that features are extracted from connection blocks of a fixed size, we also fix the trigger size to be 50 connections long. We found this value empirically by experimenting with different trigger sizes, and noticed that smaller ones would lead to unsatisfactory attack results. 
While the trigger is relatively large compared to the unit block size, it is worth noting that the total number of connections introduced by the attack is still very limited when compared to the size of the training set.</p><p>Table <ref type="table">4</ref>.6 reports the results of the Entropy strategy when applied in this setup, at different poison percentages, together with its standard deviation across 5 experiments and the average degradation in performance of the victim model on clean data. Since the auto-encoder was trained in an unsupervised fashion to minimize the reconstruction loss, we expect this training objective to negatively impact the overall success of the attack.</p><p>In fact, we do observe a general reduction of the success rate compared to the simple neural network model, especially for limited poisoning budgets (&#8804; 1%). However, if the adversary is allowed to increase the poisoning rate beyond 1%, we observe that the attack scales nicely with larger poisoning budgets. At the same time, the &#8710;F 1 values remain generally low even at larger poison percentages.</p></div>
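The problem-space stealthiness metric of Section 4.4.3, a per-field Jensen-Shannon distance averaged over log fields, can be sketched as follows; the field names and values are illustrative, and estimating each field's distribution from raw value counts is an assumption of this sketch:

```python
# Per-field Jensen-Shannon distance (square root of the base-2 divergence),
# averaged across connection-log fields. Field contents are illustrative.
from collections import Counter
from math import log2, sqrt

def js_distance(sample_p, sample_q):
    """Jensen-Shannon distance between the empirical distributions of two samples."""
    p, q = Counter(sample_p), Counter(sample_q)
    total_p, total_q = sum(p.values()), sum(q.values())
    div = 0.0
    for x in set(p) | set(q):
        px, qx = p[x] / total_p, q[x] / total_q
        mx = 0.5 * (px + qx)
        if px:
            div += 0.5 * px * log2(px / mx)
        if qx:
            div += 0.5 * qx * log2(qx / mx)
    return sqrt(div)  # 0 for identical distributions, 1 for disjoint ones

# Toy train/test benign logs; the real baseline D_REF averages over all fields.
train_logs = {"port": [80, 80, 443, 80], "state": ["SF", "SF", "REJ", "SF"]}
test_logs = {"port": [80, 443, 443, 80], "state": ["SF", "REJ", "SF", "SF"]}

d_ref = sum(js_distance(train_logs[f], test_logs[f]) for f in train_logs) / len(train_logs)
```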
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.5">Other datasets</head><p>In the previous sections, we carried out an in-depth evaluation of various attack characteristics and their impact on the attack success. In this section, we investigate how generalizable this poisoning approach is by testing it on different datasets and other classification tasks. We evaluate here a second cybersecurity task on the CIC IDS 2018 dataset, and two application classification scenarios on CIC ISCX 2016. For all of these case studies, we use the statistical features (see Table <ref type="table">4</ref>.2) and the full trigger strategy.</p><p>We report the attack success rate at different poisoning percentages in Figure <ref type="figure">4</ref>.8. Due to the much smaller size of the ISCX dataset, we test up to slightly larger poison percentage values -for instance in the Chat/Video scenario, 0.1% of the training set would amount to a single poisoning point. In general, we observe similar trends as in previous experiments, with the SHAP and Entropy strategies performing similarly, and achieving significant attack success rates even with very limited poison budgets. We also evaluated the poisoned model on clean test data, to verify whether the poisoned model is still able to classify clean test data correctly. We obtained very limited reductions in F 1 scores: &#8710;F 1 is between 0.002 and 0.046, with the SHAP strategy resulting in slightly larger shifts than the other feature selection methods.</p></div>
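To make the relationship between poisoning rate and absolute poison count concrete (the training set sizes below are illustrative, not the actual CIC/ISCX counts):

```python
# Number of poison points implied by a poisoning rate; sizes are hypothetical.
def poison_points(train_size, rate_percent):
    """At least one point is needed to poison at all."""
    return max(1, round(train_size * rate_percent / 100))

# On a small dataset, a 0.1% rate can amount to a single poisoned sample.
n_small = poison_points(1_000, 0.1)
n_large = poison_points(200_000, 0.5)
```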
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Related Work</head><p>Adversarial Machine Learning. We can identify two major categories of integrity attacks against ML classifiers: (1) evasion attacks, which occur at test time and consist of applying an imperceptible perturbation to test samples in order to have them misclassified, and (2) poisoning attacks, which influence the training process (either through tampering with the training dataset or by modifying other components of the training procedure) to induce wrong predictions during inference. For details on other adversarial ML techniques, we direct the reader to the standardized taxonomy presented in <ref type="bibr">[164]</ref>.</p><p>In this study, we are focusing on backdoor poisoning attacks, a particularly insidious technique in which the attacker forces the learner to associate a specific pattern with a desired target objective -usually the benign class in cybersecurity applications. While backdoor poisoning does not impact the model's performance on typical test data, it leads to misclassification of test samples that present the adversarial pattern. Backdoor poisoning attacks against modern ML models were introduced by Gu et al. <ref type="bibr">[78]</ref> in BadNets, where a small patch of bright pixels (the trigger pattern) was added to a subset of images at training time together with an altered label, to induce the prediction of a target class. Subsequently, Turner et al. <ref type="bibr">[222]</ref> and Shafahi et al. <ref type="bibr">[195]</ref> devised clean-label backdoor attacks which require more poisoning data samples to be effective, but relax some strong assumptions of previous threat models, making them significantly more applicable in security scenarios.</p><p>In cybersecurity, the earliest poisoning attacks were designed against worm signature generation <ref type="bibr">[171,</ref><ref type="bibr">155]</ref> and spam detectors <ref type="bibr">[152]</ref>. 
More recently, a few studies have looked at packet-level poisoning via padding <ref type="bibr">[92,</ref><ref type="bibr">156]</ref>, feature-space poisoning in intrusion detection <ref type="bibr">[11,</ref><ref type="bibr">124]</ref>, and label flipping attacks for IoT <ref type="bibr">[166]</ref>. Severi et al. <ref type="bibr">[193]</ref> proposed to use model interpretation techniques to generate clean-label poisoning attacks against malware classifiers. Their strategies are applicable to security datasets whose records are independent, such as individual files or Android applications, which present a direct mapping from feature space to problem space. In contrast, our study explores attacks on models trained on network traffic, where multiple sequential connections are translated into one single feature-space data point; in this setting, inverting triggers from feature to problem space becomes particularly difficult due to data dependencies.</p><p>Preserving Domain Constraints. Functionality-preserving attacks on network traffic have mostly looked at evasion during test time, rather than poisoning. For instance, Wu et al. <ref type="bibr">[236]</ref> proposed a packet-level evasion attack against botnet detection, using reinforcement learning to guide updates to adversarial samples in a way that maintains the original functionality. Sheatsley et al. <ref type="bibr">[199]</ref> study the challenges associated with the generation of valid adversarial examples that abide by domain constraints and develop techniques to learn these constraints from data. Chernikova et al. 
<ref type="bibr">[45]</ref> design evasion attacks against neural networks in constrained environments, using an iterative optimization method based on gradient descent to ensure valid numerical domain values.</p><p>With our constraint-aware problem-space mapping, which also takes into account dependencies in network traffic, we delve one step further into the challenging issue of designing functionality-preserving attacks.</p><p>Significant advances have been made recently with respect to generating multivariate data. Modern tabular data synthesizers of mixed data types leverage the power of generative adversarial networks <ref type="bibr">[241,</ref><ref type="bibr">64,</ref><ref type="bibr">67,</ref><ref type="bibr">252,</ref><ref type="bibr">43]</ref> and diffusion models <ref type="bibr">[116]</ref> to create realistic content from the same distribution as the original data. Among the different frameworks, FakeTables <ref type="bibr">[43]</ref> is the only attempt at preserving functional dependencies in relational tables. However, its evaluation is limited to Census and Air Carrier Statistics datasets, and its ability to capture more complex relationships between variables is unclear.</p><p>In this work, we model conditional dependencies in the traffic using Bayesian networks -a common choice for generating synthetic relational tables <ref type="bibr">[56,</ref><ref type="bibr">179,</ref><ref type="bibr">86,</ref><ref type="bibr">108,</ref><ref type="bibr">251]</ref>.</p><p>Bayesian networks offer increased transparency and computational efficiency over more complex generative models like generative adversarial networks <ref type="bibr">[108]</ref>. We believe this is an important advantage in our setting, which deals with large volumes of network traffic featuring multiple variables (e.g., log fields). 
In cybersecurity, Bayesian networks have also been used to learn traffic patterns and flag potentially malicious attempts in intrusion detection systems <ref type="bibr">[224,</ref><ref type="bibr">240,</ref><ref type="bibr">59,</ref><ref type="bibr">99]</ref>.</p></div>
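A minimal sketch of connection generation with a Bayesian network, here a hand-built chain service → connection state → duration bin; in the actual attack, structure and conditional tables would be learned from traffic data, and all field names and probabilities below are illustrative assumptions:

```python
# Toy Bayesian network for sampling synthetic connection logs: ancestral
# sampling over a chain of conditional probability tables. All values are
# illustrative placeholders, not learned Zeek statistics.
import random

rng = random.Random(0)

P_SERVICE = {"http": 0.6, "dns": 0.4}
P_STATE = {  # P(conn_state | service)
    "http": {"SF": 0.8, "REJ": 0.2},
    "dns": {"SF": 0.95, "REJ": 0.05},
}
P_DURATION = {  # P(duration_bin | conn_state)
    "SF": {"short": 0.3, "long": 0.7},
    "REJ": {"short": 0.9, "long": 0.1},
}

def draw(dist):
    """Sample one outcome from a {value: probability} table."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def sample_connection():
    service = draw(P_SERVICE)              # root node
    state = draw(P_STATE[service])         # child conditioned on service
    duration = draw(P_DURATION[state])     # child conditioned on state
    return {"service": service, "conn_state": state, "duration": duration}

trigger = [sample_connection() for _ in range(20)]  # small synthetic trigger
```

Because each field is sampled conditionally on its parent, the generated connections respect the encoded dependencies by construction, which is the transparency advantage noted above.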
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Discussion and Conclusion</head><p>Despite our efforts towards the practical feasibility of the attack we propose, poisoning complex data is still a challenging task, and there are some elements that could increase the difficulty of deploying this attack on an arbitrary victim network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.1">Limitations</head><p>Regarding the problem-space mapping of the triggers, the adversary may experience a situation where two connection events are inter-dependent, due to the internal state of Zeek, but the trigger does not include both of them simultaneously -this could occur if the connections happen across the border of two time windows. For instance, inter-dependent connection events may take place in the case of hosts running the FTP protocol. Documentation on this type of connection for Zeek is quite scarce, but a dedicated attacker could allocate time and resources to enumerate all possible corner cases and explicitly avoid them during the trigger creation phase.</p><p>Another potential source of issues could arise when using the generated trigger approach. This method leads to a generally good attack success with a small footprint; however, it could in principle generate connections that are not feasible in practice for stateful protocols (TCP). There are two possible ways to address this potential issue.</p><p>First, given the relentless pace of improvements in generative models, including those targeting tabular data <ref type="bibr">[241,</ref><ref type="bibr">27]</ref>, we expect that the ability of generative models to infer the inter-feature constraints that characterize this data modality will increase significantly in the near term. In parallel, the adversary could attempt to verify the correctness of the generated connections using a model checker and a formal model of the TCP protocol, and simply reject the non-conforming ones. 
Both approaches are exciting avenues for future research, and we leave their in-depth analysis to future work.</p><p>Finally, we designed methods to hide the poisoning campaign, and showed that our poisoning points are difficult to identify both in feature space, by using anomaly detection techniques, and in problem space, by analyzing the distributional distance of poisoned data. Defending ML models from backdoor attacks is an open, and extremely complex, research problem. Many of the currently proposed solutions are designed to operate in the computer vision domain <ref type="bibr">[42]</ref>, or on specific model architectures <ref type="bibr">[130,</ref><ref type="bibr">221]</ref>.</p><p>In contrast, our attack method generalizes to different model types. Moreover, initial research on defending classifiers from backdoor attacks in the security domain <ref type="bibr">[89]</ref> highlighted potential trade-offs between robustness and utility (e.g., defenses that rely on data sanitization may mistakenly remove a high number of benign samples in an attempt to prune out potentially poisoned samples). In Chapter 5 we investigate further strategies to mitigate the type of attacks described in this chapter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.2">Conclusion</head><p>With this work we investigated the possibility of carrying out data-only, clean-label poisoning attacks against network flow classifiers. We believe this threat model holds substantial significance for the security community, due to its closer alignment with the capabilities exhibited by sophisticated adversaries observed in the wild, and with current best practices in secure ML deployments, in contrast to other threat models frequently employed in the literature.</p><p>The attack strategy we introduce can effectively forge consistent associations between the trigger pattern and the target class even at extremely low poisoning rates (0.1-0.5% of the training set size). This results in notable attack success rates, despite the constrained nature of the attacker. While effective, the attack has minimal impact on the victim model's generalization abilities when dealing with clean test data. Additionally, the detectability of the trigger can be lessened through different strategies to decrease the likelihood of a defender discovering an ongoing poisoning campaign.</p><p>Furthermore, we demonstrated that this form of poisoning has a relatively wide applicability for various objectives across different types of classification tasks. The implications of these findings extend our understanding of ML security in practical contexts, and prompt further investigation into effective defense strategies against these refined attack methodologies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mitigating Backdoor Attacks in Cybersecurity Domains</head><p>In the previous two chapters, Chapter 3 and Chapter 4, we explored in depth new approaches to launch clean-label backdoor attacks against models designed to operate on tabular cybersecurity data. With this chapter we shift our perspective to the defender, the security company developing the model. The following sections will describe a new mitigation technique, introduced in <ref type="bibr">[189]</ref>, aimed at countering the attacks presented in the previous chapters. Crucially, this mitigation strategy leverages the peculiarities of the domain and relies on fewer assumptions than other works in the literature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter Summary</head><p>The training phase of machine learning models is a delicate step, especially in cy- </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Problem Definition</head><p>Machine learning (ML) models power an ever-growing variety of software systems, including cybersecurity solutions intended to stop active adversaries <ref type="bibr">[9,</ref><ref type="bibr">163,</ref><ref type="bibr">145,</ref><ref type="bibr">98,</ref><ref type="bibr">217,</ref><ref type="bibr">144,</ref><ref type="bibr">96,</ref><ref type="bibr">136,</ref><ref type="bibr">97]</ref>. The deployment of models in security-sensitive contexts increases the relevance of adversarial machine learning risks, both during inference (evasion attacks) and during the training phase of the model (poisoning attacks). Recent trends in ML practices, especially concerning the growing size of datasets and increased reliance on data crowd-sourcing <ref type="bibr">[36]</ref>, and the widespread adoption of models as core components in cybersecurity products, have increased the public awareness <ref type="bibr">[202,</ref><ref type="bibr">13]</ref> of risks associated with training time adversarial interference.</p><p>Existing research in this area of adversarial ML has primarily focused on the computer vision domain. Seminal works such as <ref type="bibr">[22,</ref><ref type="bibr">78]</ref> demonstrated different adversarial poisoning procedures designed for various model architectures and targeted at achieving different adversarial objectives. However, recent efforts by researchers have started exploring possible strategies an adversary can use to compromise the integrity of models developed for cybersecurity tasks through training process manipulation <ref type="bibr">[193,</ref><ref type="bibr">247,</ref><ref type="bibr">190]</ref>.</p><p>We focus on mitigating backdoor attacks <ref type="bibr">[78]</ref> in cybersecurity settings. 
This type of attack aims to induce a victim model to memorize the association between a predetermined data pattern -also called a trigger or backdoor -selected by the adversary, and a target class of the attacker's choice. If successful, the same pattern can then be presented to the model during inference within any arbitrary data sample, triggering the target class output.</p><p>We select backdoor attacks as our objective because we argue they pose a particularly relevant threat to cybersecurity applications. These systems rely on large datasets of labeled samples, often gathered through network monitoring or crowdsourced feeds, providing opportunities for training data manipulation. Moreover, backdoor attacks are inherently stealthy. The adversary's objective is not to simply disrupt the performance of a victim model, which would lead to prompt discovery and potential remediation by the model owners. Instead, the goal is to embed an arbitrary behavior -the ability to trigger a desired output -in the learned model without altering its normal behavior on standard data points. This makes it particularly difficult for a defender to realize that the victim model has been compromised.</p><p>In this work, we consider two particularly insidious backdoor attacks developed specifically to target ML models designed for cybersecurity tasks: one aimed at subverting static malware classifiers <ref type="bibr">[193]</ref>, and one geared towards automated traffic analysis systems <ref type="bibr">[190]</ref>. Both attacks operate in a clean-label fashion, injecting the trigger pattern into only a small number of training points corresponding to benign data, without requiring control over the training set labels. 
In addition, these types of attacks are agnostic to the victim's model type, and are applicable to a variety of model architectures and data modalities used for security classification tasks.</p><p>Our defensive approach leverages the information asymmetry between attacker and defender in these scenarios to isolate the poisoned points while maximizing the amount of clean data retained for model training. Our method revolves around an iterative scoring process on clustered data points in a selected feature subspace, and we propose different analysis procedures to remediate the effect of the poisoned clusters, able to reduce attack success by up to 90% while preserving high utility, even at large poisoning rates up to 5%. In contrast with most existing backdoor mitigation approaches, we do not make specific assumptions on the victim model's architecture, and our approach is applicable to any classifier (not just neural networks). Moreover, our defense removes another standard assumption often found in the literature: requiring the defender to have access to a set of clean data points sampled from the same distribution as the training data <ref type="bibr">[130,</ref><ref type="bibr">87,</ref><ref type="bibr">174]</ref>.</p><p>To summarize, we make the following contributions:</p><p>&#8226; We propose a novel defense mechanism against clean-label backdoor attacks in cybersecurity domains. Our technique does not require access to clean trusted data or knowledge of the victim's model architecture, thus removing some strong assumptions from previous works.</p><p>&#8226; We demonstrate that our defense is generally applicable across various data modalities (network flow traffic and binary files), and multiple model types, such as neural networks, gradient boosting trees, and LightGBM classifiers.</p><p>&#8226; We comprehensively evaluate our techniques against existing backdoor attacks from previous literature. 
We show that our methods are able to reliably identify and remove backdoor examples while preserving high model utility (F1 score) and low false positive rates.</p><p>&#8226; For reproducibility, we evaluate our defense strategies on publicly available datasets and open source attack implementations, and release all the code used to run the experiments in this study.</p></div>
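A schematic sketch of the cluster-then-score idea described above; the clustering algorithm, the centroid-distance scoring rule, and the synthetic data below are placeholder assumptions, not the exact procedure introduced in this chapter:

```python
# Schematic defense sketch: cluster training points in feature space, score
# clusters by how far their centroid sits from the global centroid, and drop
# the most outlying cluster(s) before retraining. The scoring rule is a
# simplified placeholder for the chapter's iterative scoring process.
import numpy as np
from sklearn.cluster import KMeans

def keep_mask_after_cluster_scoring(X, n_clusters=8, n_drop=1, seed=0):
    """Return a boolean mask selecting points outside the most suspicious clusters."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    global_centroid = X.mean(axis=0)
    scores = np.linalg.norm(km.cluster_centers_ - global_centroid, axis=1)
    drop = set(np.argsort(scores)[::-1][:n_drop])      # most outlying clusters
    return np.array([lab not in drop for lab in km.labels_])

rng = np.random.default_rng(0)
clean = rng.normal(0, 1, size=(500, 8))
poisoned = rng.normal(6, 0.1, size=(10, 8))            # tight, far-off poison blob
X = np.vstack([clean, poisoned])
mask = keep_mask_after_cluster_scoring(X)              # poisons land in one cluster
```

On this toy data the tight, displaced poison cluster receives the highest score and is removed, while the clean data is largely retained, mirroring the utility-preservation goal stated above.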
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Protecting Cybersecurity Models</head><p>The goal of this work is to study effective methods for defending against backdoor attacks in cybersecurity environments. We are focusing on feasible clean-label poisoning strategies that are not tailored to a specific model; instead, they are applicable to a variety of model architectures and data modalities used in security.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1">Setting</head><p>Compared to the task of protecting ML models designed for image classification or natural language processing (NLP) tasks, operating in a security sensitive environment presents both unique challenges and opportunities. To start with, in this domain, an effective defensive approach has to be applicable to a wide variety of models, such as decision-tree based ensembles, including Random Forests and Gradient Boosting Trees, and neural network architectures. These models are state-of-the-art in many security applications, such as malicious domain classifiers <ref type="bibr">[163]</ref> and malware classification <ref type="bibr">[5]</ref>.</p><p>Moreover, the availability of a set of clean reference data points, a common assumption in backdoor defense research <ref type="bibr">[254,</ref><ref type="bibr">174]</ref>, is not to be taken for granted. It may indeed be relatively simple for an organization to obtain clean samples of images or natural language text, by scraping the Internet or leveraging large scale crowd-sourcing systems such as Amazon Mechanical Turk to gather annotations or filter out potentially anomalous data points. However, ensuring that cybersecurity data has not been tampered with requires potentially complex analyses by domain experts, which is typically a significantly more expensive process.</p><p>Conversely, the intrinsic constraints of cybersecurity data modalities also limit the attacker party. Thus, defensive measures designed for these models can rely on different assumptions, which are unlikely to hold in other domains. For instance, existing attacks <ref type="bibr">[193,</ref><ref type="bibr">190]</ref> surfaced feature importance estimation as a key component in the design of an effective backdoor trigger. 
Triggers that do not leverage relevant features are usually not effective at inducing misclassification at deployment time, as demonstrated by prior work <ref type="bibr">[190]</ref>. This intuition can, in turn, be used by a defensive mechanism to reduce the scope of the detection task in feature space.</p><p>Additionally, stealthy backdoor attacks on security classifiers tend to rely on clean-label attack formulations, where the poison is injected into benign data points, as the adversary often lacks the ability to interfere with the labeling mechanism. This practice also has the added side effect of allowing the poisoning points to blend in among the usually large variety of benign points. While the variance in benign data can be problematic from a defensive standpoint, this trend also implies that a defender can assume that the poisoned samples are a subset of the benign training data.</p></div>
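The clean-label injection pattern the defender must anticipate can be sketched as follows; the feature indices, trigger values, poisoning rate, and data are illustrative assumptions:

```python
# Clean-label injection sketch: the trigger overwrites a few feature values in
# benign points only, while labels are left untouched. All values illustrative.
import numpy as np

def poison_clean_label(X, y, trigger_feats, trigger_vals, rate=0.005, seed=0):
    """Stamp the trigger into a small fraction of benign (class 0) points."""
    rng = np.random.default_rng(seed)
    benign = np.flatnonzero(y == 0)
    chosen = rng.choice(benign, size=max(1, int(rate * len(y))), replace=False)
    Xp = X.copy()
    Xp[np.ix_(chosen, trigger_feats)] = trigger_vals   # stamp trigger values
    return Xp, chosen                                  # labels y are never modified

X = np.random.default_rng(1).normal(size=(1000, 16))
y = np.r_[np.zeros(800, int), np.ones(200, int)]       # 0 = benign, 1 = malicious
Xp, idx = poison_clean_label(X, y, trigger_feats=[2, 5, 9],
                             trigger_vals=[1.5, -0.5, 2.0])
```

Note the defender-relevant property discussed above: every poisoned index falls inside the benign class, so a defense can restrict its search to the benign portion of the training set.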
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Threat Model</head><p>Our work focuses on defending binary classification models designed for security applications, in particular tabular data classification on hand-crafted feature representations.</p><p>Therefore, the assumed adversarial objective is to ensure the ability to arbitrarily trigger a benign prediction for an actually malicious data point chosen by the adversary. In a practical setting this would correspond to the ability to evade the victim classifier, and allow an undesired action on the system, such as the execution of malware or the propagation of malicious network flows.</p><p>Similarly, the defender's objective is to minimize, and potentially completely disrupt, the success rate of the attacker. Parallel to this primary goal, the defender also strives to minimize the side effects of their defensive strategy on the performance of the model.</p><p>Of high concern is the false positive rate (FPR) -the fraction of benign points classified as malicious -of the model. False positives are particularly expensive for security vendors, as they result in direct interference with the normal operations of their clients, and usually require a prompt investigation leading to operational overhead. Therefore, an effective defensive mechanism should have minimal impact on the FPR of the defended model.</p></div>
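The false positive rate discussed above is simply the fraction of benign points flagged as malicious; labels here are illustrative:

```python
# FPR = FP / (FP + TN), computed over the benign (negative) class only.
def false_positive_rate(y_true, y_pred, benign=0, malicious=1):
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == benign and p == malicious)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == benign and p == benign)
    return fp / (fp + tn) if fp + tn else 0.0

# Three benign points, one wrongly flagged as malicious: FPR = 1/3.
fpr = false_positive_rate([0, 0, 0, 1, 1], [0, 1, 0, 1, 0])
```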
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1">Adversary's Goals and Capabilities</head><p>We consider different adversaries with the range of capabilities studied in <ref type="bibr">[193,</ref><ref type="bibr">190]</ref>.</p><p>In particular, the attacks assume the adversary to be knowledgeable of the feature representation used by the victim model. Moreover, the attacker is able to either query a version of the victim model, or use a surrogate to estimate feature importance values. In all cases, the attackers are also characterized by the ability to introduce a small amount of poisoned samples in the training set, and a complete lack of control over the labeling step of the training pipeline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.2">Defender's Goals and Capabilities</head><p>In contrast to previous work, we do not assume that the defender is in possession of a clean dataset drawn from the same distribution as the training set. While this assumption can be true in some contexts (it is often rather easy to acquire, or verify, clean data for image classification tasks), we argue that it introduces large costs in the security domain, since manual analysis of the data points is a task that requires domain experts. We do, however, assume the defender has the computational resources required to perform clustering on the training data and to train multiple models. Unlike computer vision or natural language processing models, which often have billions of parameters, security models tend to be smaller, often based on ensembles of decision trees and gradient boosting, making them relatively inexpensive to re-train.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Challenges of existing defenses</head><p>Most backdoor defenses proposed in the literature are not immediately applicable to cybersecurity. The security applications considered in this work have unique characteristics, threat models, and assumptions that make applying existing defenses difficult:</p><p>Different data modalities and model architectures. The majority of proposed mitigations have been specifically designed for the computer vision domain, where CNNs are the state-of-the-art architectures <ref type="bibr">[42,</ref><ref type="bibr">221,</ref><ref type="bibr">84,</ref><ref type="bibr">146]</ref>. Representative approaches for computer vision backdoor defenses are activation clustering <ref type="bibr">[42]</ref>, spectral signatures <ref type="bibr">[221]</ref>, and SPECTRE <ref type="bibr">[84]</ref>, which perform outlier detection in the CNN representation space via clustering, SVD decomposition, or robust statistics. Prior work <ref type="bibr">[193]</ref> attempted to adapt these methods to the feature space to defend against poisoning in malware classification, but showed that these defenses are not effective when they do not operate in the model's representation space.</p><p>Coarse-grained binary labels. The typical cybersecurity task is to distinguish malicious and benign samples, resulting in a binary classification problem. Techniques that look for shortcuts between classes in latent space and aim to identify a target attack label, such as Neural Cleanse <ref type="bibr">[232]</ref> and ABS <ref type="bibr">[131]</ref>, require finer-grained labeling of the training data and thus cannot be readily adapted to our setting.</p><p>Hard constraints on false positives. Models trained for cybersecurity tasks must meet hard constraints on false positives to be deployed in production <ref type="bibr">[163]</ref>. 
This requirement rules out the application of certified defenses <ref type="bibr">[207,</ref><ref type="bibr">231,</ref><ref type="bibr">122,</ref><ref type="bibr">234]</ref>, which largely degrade model utility.</p><p>Applicable defenses. Among the defenses surveyed in Section 5.7.1, the only category that can be applied in cybersecurity settings is the pruning or fine-tuning methods aimed at unlearning the backdoor pattern <ref type="bibr">[130,</ref><ref type="bibr">125,</ref><ref type="bibr">87,</ref><ref type="bibr">174]</ref>. As discussed, all these approaches require the availability of a clean dataset, an assumption not easily satisfied in our setting. Nevertheless, we select a recent representative technique in this category, Selective Amnesia <ref type="bibr">[87]</ref>, and show its limitations when applied in this domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.1">Limitations of Selective Amnesia</head><p>We adapt the Selective Amnesia (SEAM) defense <ref type="bibr">[254]</ref> to one of our security scenarios, the CTU-13 botnet detection task. Conceptually, SEAM forces the model to forget the association with the trigger by inducing catastrophic forgetting through fine-tuning on randomly assigned labels different from the ground truth. Once the model has forgotten both the primary classification task and the backdoor association, SEAM proceeds to fine-tune the model on a held-out clean dataset until the accuracy of the model is restored.</p><p>FIGURE 5.2: Pipeline of our defense strategy: dimensionality reduction (entropy-based selection of the most important features); density-based clustering (OPTICS algorithm); iterative cluster scoring (clusters are ranked using model loss information; low-loss clusters are iteratively added to the clean set); and sanitization (threshold-based pruning of high-loss clusters, or patching of high-loss clusters to overwrite the trigger).</p><p>Selective Amnesia was designed for multi-class classification in vision tasks, so we had to re-implement the method, adapting it to our binary classification task on security data. Moreover, SEAM requires the ability to fine-tune a model both on mislabeled points during the forgetting phase and on clean data during the recovery phase.</p><p>Thus, we only tested it on our neural network classifier, as the process of fine-tuning other types of classifiers (e.g., Random Forests, Gradient Boosting Trees) is not well defined. Since approaches of this kind were developed for domains such as NLP or computer vision, we believe that there is a need for a mitigation approach that does not directly rely on access to clean data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Defense strategy</head><p>Guided by the considerations presented in Section 5.2, we develop a defensive strategy aimed at protecting models designed for security classification tasks against stealthy clean-label poisoning attacks. Our procedure for sanitizing the training dataset proceeds in several stages, outlined in Figure <ref type="figure">5</ref>.2. A more detailed pseudo-code of our strategy is presented in Algorithm 3. Before detailing each stage in the defense pipeline, we provide some key insights that enable us to address the limitations of prior methods.</p><p>Dimensionality reduction: First, motivated by the observation that most clean-label poisoning attacks in cybersecurity leverage important features <ref type="bibr">[193,</ref><ref type="bibr">190]</ref>, we perform dimensionality reduction by selecting the most relevant features for classification. For this stage, we use an entropy metric computed on a decision tree model fitted to a subset of the data for identifying the top contributing features, and perform our subsequent analysis in this reduced space.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Density-based clustering:</head><p>In the second stage, we cluster the samples carrying the benign label in the reduced space, to identify the clean-label poisoned samples. Our main insight is that poisoned samples lie in a different subspace from the legitimate benign-labeled samples, and that they will cluster together because they share similar values in the trigger features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sanitization of high-loss clusters:</head><p>Lastly, we employ a data sanitization step to either filter or patch the high-loss clusters to protect the model against poisoning. For the filtering strategy, we either stop the iterative process after including a fixed percentage (e.g., 80%) of clusters in the training set, or select a subset of clusters to exclude via loss analysis. For the patching strategy, we include all clusters in training, but apply a patch to the most relevant features in the highest-loss clusters.</p><p>We now describe in detail each stage of the defense pipeline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.1">Dimensionality reduction</head><p>Clustering high-dimensional data is affected by the curse of dimensionality: as dimensionality increases, data points become more dissimilar and farther from each other <ref type="bibr">[206]</ref>.</p><p>Therefore, before running a clustering technique, we first reduce the dimensionality of the data points to a feature subset F. Here we leverage the following insight about the attack process: strong backdoor attacks choose features with high impact on a model's decisions <ref type="bibr">[193,</ref><ref type="bibr">190]</ref>. Hence, we reduce the data set to the top |F| most important features.</p><p>We would like to compute feature importance in a model-agnostic fashion. In tree-based models, entropy is used to decide which feature to split on at each step in building the tree: the lower the resulting entropy, the more beneficial the chosen feature. We employ a similar technique by fitting a surrogate decision-tree model to our data and computing entropy-based feature importance to rank the features and select the most relevant ones. Specifically, in our experiments, we reduce the network traffic and binary file datasets from approximately 1100 and 2300 features, respectively, to their top 4 and 16 most important features.</p></div>
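As a concrete illustration of this stage, the sketch below fits a surrogate decision tree with the entropy criterion and keeps the top features by importance. This is a minimal example on synthetic data; the function name, scikit-learn usage, and parameters are illustrative assumptions, not the exact thesis implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def select_top_features(X, y, n_top):
    """Fit a surrogate decision tree and keep the n_top most important features."""
    surrogate = DecisionTreeClassifier(criterion="entropy", random_state=0)
    surrogate.fit(X, y)
    # importances are derived from the entropy reduction achieved at each split
    top_idx = np.argsort(surrogate.feature_importances_)[::-1][:n_top]
    return top_idx, X[:, top_idx]

# synthetic stand-in for a tabular security dataset
X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)
top_idx, X_reduced = select_top_features(X, y, n_top=4)
print(X_reduced.shape)  # (500, 4)
```

For the thesis datasets, n_top would be 4 (network traffic) or 16 (EMBER binaries), per the numbers above.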
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.2">Clustering</head><p>Our defense relies on a clustering procedure to isolate the compromised training points. While previous work such as Chen et al. <ref type="bibr">[42]</ref> is tailored to deep neural network models (clustering is applied to the last hidden layer representation of the network to distinguish between poisonous and legitimate data), we design a model-agnostic strategy.</p><p>We perform density-based clustering in the reduced feature space F, ensuring broader applicability to multiple classification models. We use OPTICS (Ordering Points To Identify the Clustering Structure) <ref type="bibr">[7]</ref> to find high-density cores and then expand clusters from them. OPTICS automatically identifies the number of clusters, k, present in the data, and works better than the commonly used density-based method DBSCAN <ref type="bibr">[65]</ref> on clusters with different densities, a characteristic that is particularly useful for our defense.</p><p>This approach reliably isolates the poisoned points into a few clusters. However, the defender still does not know which clusters are corrupted. Therefore, the next step of our defense consists of designing scoring and filtering systems that can leverage the information advantage gained through clustering.</p></div>
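A minimal sketch of this stage, assuming scikit-learn's OPTICS implementation and synthetic stand-in data: the small, low-variance group mimics how poisons sharing trigger values concentrate in the reduced space. The parameter choice (min_samples) is illustrative.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# two dense clean groups plus a small, very tight group mimicking poisons
clean_a = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
clean_b = rng.normal(loc=8.0, scale=1.0, size=(200, 4))
poison = rng.normal(loc=4.0, scale=0.05, size=(15, 4))  # near-identical trigger values
X = np.vstack([clean_a, clean_b, poison])

# OPTICS infers the number of clusters on its own; label -1 marks noise points
labels = OPTICS(min_samples=10).fit_predict(X)
n_clusters = len(set(labels.tolist()) - {-1})
```

Unlike DBSCAN, no global density threshold is fixed in advance, which is what lets clusters of very different densities, such as the tight poison group, emerge together.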
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.3">Cluster loss analysis</head><p>Prior to discussing our cluster scoring methodology, we provide some intuition as to how model loss behaves on the clustered dataset. Our strategy takes into account two defining characteristics of stealthy backdoor attacks. First, the adversary strives to minimize the attack footprint, i.e., the number of attack points introduced in the training data. Therefore, we can conclude that poisoned data points generally reside in relatively small clusters, while larger clusters are mostly clean. Second, in an attempt to camouflage the poisons as benign, the adversary inserts the backdoor into data points belonging to the benign class. Based on this insight, we compute the clusters only over D y=0 , the target-class (benign) data, while the non-target (malicious) training points D y=1 can be assumed clean (i.e., not containing the backdoor) and used to train surrogate models.</p><p>We anticipate that models trained on clean data will exhibit very high loss on clusters containing poisoned data, while models trained on poisoned data will exhibit very low loss on such clusters. Assuming that the largest cluster (which we denote as C 0 ) is likely to be clean, we consider comparing the loss of a model on cluster j when it is trained on C 0 ∪ D y=1 versus when it is trained on C 0 ∪ C i ∪ D y=1 , for each cluster i. If cluster j contains poisons, we anticipate that the change in loss will be small, apart from those few clusters i that also contain poisons. As an example of carrying out this procedure on a poisoned dataset that the defender has clustered, the loss on a poisoned cluster drops noticeably only when another poisoned cluster, such as C 11 , is included in the training set. We observed that this method was often effective at identifying the poisoned clusters in our experiments, but it carries a high computational cost, as a model must be trained for each cluster. In some of our experiments, we observed more than 1000 clusters, which motivated us to search for a more efficient method to identify the poisoned clusters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.4">Iterative cluster scoring</head><p>We develop an iterative scoring procedure where clean clusters are progressively identified and added to the clean data set. We start by training a surrogate model f on the data assumed to be clean and computing its loss on each remaining cluster C i ∈ C \ C 0 . We rank the clusters C i by loss, bearing in mind that, typically, the lower the loss, the closer the cluster is to the training data D clean .</p><p>Having defined a fixed window size w (we use w = 5% of the clusters), we construct an iterative filtering process as follows. We take the fraction w of the clusters with the lowest average loss score under the clean model, and add their data to the clean training set D clean . Then we re-train the surrogate model f on this new dataset and score the remaining clusters. This process iteratively pushes the clusters that are less similar to the data assumed to be clean to the bottom of the list, therefore isolating the clusters containing poisoned data points.</p><p>At this point the defender could opt for a fixed threshold filtering strategy, stopping the iterative training and scoring process after a fixed percentage of clusters has been added to D clean . In our experiments, we set this fixed threshold to 80% of the clusters, as we empirically find it to be quite effective. However, based on the intuition developed in Section 5.5.3, we expect the poisoned clusters to affect the loss of the model in a very specific way when added to the training set. Therefore, we can look for those clusters for which we observe significant changes between the loss of the model on the cluster before it is introduced into D clean and after the model has been trained on it. By measuring these loss deltas for each cluster, we can look for anomalous clusters using a standard statistical approach, the Z-score.</p><p>In practice, we accumulate the loss deltas, Δ i = loss( f current , C i ) − loss( f previous , C i ), for each cluster i when it is included in D clean , up to the 80% threshold. We then use the recorded values of Δ to compute the mean μ and standard deviation σ, compute z = (Δ i − μ) / σ for all the clusters, define a threshold z t = 2, and apply our sanitization procedure to those clusters with score z ≤ −z t .</p></div>
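The scoring step and the Z-score test can be sketched as follows. The logistic-regression surrogate, the function names, and the data layout (one array per cluster) are illustrative assumptions for the sketch; the thesis experiments use the actual victim model families.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def rank_clusters_by_loss(clean_clusters, X_mal, candidates):
    """Train a surrogate on the clusters assumed clean (label 0) plus the
    malicious data (label 1), then rank candidate clusters by average loss
    under the benign label: low loss means the cluster looks clean."""
    X = np.vstack(clean_clusters + [X_mal])
    y = np.r_[np.zeros(len(X) - len(X_mal)), np.ones(len(X_mal))]
    f = LogisticRegression(max_iter=1000).fit(X, y)
    losses = [log_loss(np.zeros(len(c)), f.predict_proba(c), labels=[0, 1])
              for c in candidates]
    return np.argsort(losses)  # absorb the lowest-loss window, retrain, repeat

def flag_by_zscore(deltas, z_threshold=2.0):
    """deltas[i]: loss on cluster i after it enters the clean set minus before.
    Poisoned clusters show an anomalously large drop (a negative delta)."""
    deltas = np.asarray(deltas, dtype=float)
    z = (deltas - deltas.mean()) / deltas.std()
    return np.flatnonzero(-z >= z_threshold)
```

In the full procedure, `rank_clusters_by_loss` is called once per iteration on the growing clean set, and the recorded deltas are passed to `flag_by_zscore` at the end.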
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.5">Sanitization of high-loss clusters</head><p>The iterative process described above effectively isolates the poisoned points. The final step of our defense consists of dealing with the isolated clusters. We identify two main remediation procedures: (i) discarding clusters that have not been considered clean; (ii) patching the data points belonging to clusters that have not been considered clean. In Section 5.6 below, we show the effects of applying both strategies to the suspicious clusters identified by each of the two methods, i.e., (a) clusters are considered suspicious after a fixed number of iterations, and (b) the per-cluster loss deltas are analyzed to surface suspicious clusters.</p><p>The first strategy simply removes clusters that are considered suspicious from the training set of the final model, thus preventing the poisoned points from corrupting the model. This procedure has the advantage of being easy to implement, but, depending on the number of clusters removed, may damage the utility of the model, especially on uncommon test points.</p><p>The second approach aims at preserving as much utility of the model as possible by still extracting information from points belonging to suspicious clusters, while at the same time minimizing the effect of the attack. It consists of applying a patch over the subspace of the P most important features to each data point in the suspect clusters, while leaving the rest of the vector unaltered. Note that in general |P| ≥ |F|, as the defender can be conservative and patch a larger portion of the feature space than the one used for clustering.</p><p>In our implementation, we randomly sample values for the patched features from the set of points in D clean , when D clean includes 80% of the clusters. With the continuous evolution of generative modeling for tabular data (for instance, with applications of diffusion models to tabular modalities), however, the defender could design patching mechanisms that could potentially lead to even larger gains in model utility. We leave the exploration of this interesting research thread to future work.</p><p>ALGORITHM 3: Pseudo-code of our defense strategy.
F ← DIMENSIONALITY_REDUCTION(D y=0 ) // reduce the data set to the most important features
C ← DENSITY_BASED_CLUSTERING(F ) // partition the data set into clusters
C 0 ← max(C) // initialize the clean set with the largest cluster
do // use model loss to isolate clusters that contain poisons
    D clean ← C 0 ∪ D y=1 // current clean data (both benign and malicious)
    f ← TRAIN(D clean ) // train a new clean model f on D clean
    C r ← C \ C 0 // remaining clusters to be scored and filtered
    L = {} // dictionary of loss per cluster
    for i ∈ range(|C r |) do // evaluate loss for each remaining cluster
        L[i] = COMPUTE_LOSS( f , C i ) // average loss over all points in C i under f
    C l ← lowest loss cluster(s) from L // add cluster(s) with lowest loss to C 0 and repeat
    C 0 ← C 0 ∪ C l</p><p>TABLE 5.1: Statistical features for network data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Feature types</head><list><item>Counts of connections per transport protocol</item><item>Counts of connections for each conn_state</item><item>Sum, min, max over packets sent and received</item><item>Sum, min, max over payload bytes sent and received</item><item>Sum, min, max over duration of connection events</item><item>Counts of distinct external IPs</item><item>Count of distinct destination ports</item></list></div>
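The patching remediation of Section 5.5.5 can be sketched as below: values for the patched feature subspace are resampled from randomly chosen clean points. The function name and arguments are hypothetical, for illustration only.

```python
import numpy as np

def patch_cluster(X_suspect, X_clean, patch_idx, rng):
    """Overwrite columns patch_idx of each suspect point with the same columns
    of a randomly chosen clean point, leaving all other features untouched."""
    X_patched = X_suspect.copy()
    donors = rng.integers(0, len(X_clean), size=len(X_suspect))
    # copy the patched feature subspace from the donor clean points
    X_patched[:, patch_idx] = X_clean[np.ix_(donors, patch_idx)]
    return X_patched
```

Because only the columns in patch_idx change, the rest of each suspect vector still contributes legitimate information to training, which is the utility-preserving motivation given above.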
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Evaluation</head><p>In this section we report the results of our experimental evaluation. We start by describing the experimental setup, and then we explore the network flow and binary file classification settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.1">Experimental setup</head><p>We report here the results of evaluating our defensive approach against the attacks proposed in Chapter 3 and Chapter 4. We use the experimental conditions reported in the previous chapters, and we test our defense on different model types, such as neural networks, gradient boosting trees, and LightGBM classifiers.</p><p>For generality, we apply our defense to two types of data: network flow traffic and binary files. In particular, we run the network data poisoning attack on the CTU-13 dataset for the Neris botnet detection task <ref type="bibr">[70]</ref>, and the executable poisoning attack on the EMBER malicious Windows Portable Executable file dataset <ref type="bibr">[5]</ref>. Due to the large number of experiments we run, to speed up the experimental phase we subsample the EMBER dataset, randomly selecting 10% of the original 600,000 labeled data points while preserving the 50% class split.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Feature representation</head><p>The network traffic datasets consist of NetFlow data, i.e., connection events that have been extracted from packet-level PCAP files using the Zeek monitoring tool <ref type="bibr">[169]</ref>. The classifier uses a subset of the Zeek log fields that can be more effectively associated with intrusions, such as: IP address, port, number of packets and payload bytes (for both originator and responder), as well as protocol, service, timestamp, duration, and connection state.</p><p>A sequence of this network log data (e.g., a 30s time window) is mapped to a feature-space data point using statistic-based aggregation techniques. Statistical features of this kind are commonly used in network traffic analysis <ref type="bibr">[148,</ref><ref type="bibr">246,</ref><ref type="bibr">28]</ref>. In line with previous work <ref type="bibr">[161,</ref><ref type="bibr">190]</ref>, the statistical features we use are aggregated by time window, internal IP, and destination port, and include several count-based metrics, such as counts of connections per transport protocol and connection state, and counts of distinct external IPs communicating with each internal IP (as either originator or responder). Sum, min, and max statistics of traffic volume (packets/bytes) are also included. The list of features is summarized in Table <ref type="table">5</ref>.1, with a final feature count for this data representation of 1152.</p><p>For malware binary files, the classifier operates in the feature space provided by the EMBER dataset <ref type="bibr">[5]</ref>. A feature-space data point contains both metadata (e.g., image, linker and operating system version) and statistical information (e.g., number of URLs and file system paths, and number of read, write and execute sections). The features used are summarized in Table <ref type="table">5</ref>.2, with the resulting vectors containing 2351 features.</p></div>
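As a toy illustration of the statistic-based aggregation, the pandas sketch below groups synthetic flow records by time window and internal IP and computes a few count and sum/min/max features. The column names echo Zeek log fields (proto, orig_bytes), but the data and schema here are assumptions, not the thesis pipeline.

```python
import pandas as pd

# synthetic flow records for a single internal host over two time windows
flows = pd.DataFrame({
    "window":      [0, 0, 0, 1, 1],
    "internal_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.1"],
    "proto":       ["tcp", "udp", "tcp", "tcp", "tcp"],
    "orig_bytes":  [100, 40, 250, 10, 30],
})

# one feature vector per (time window, internal IP) pair
features = flows.groupby(["window", "internal_ip"]).agg(
    n_tcp=("proto", lambda p: int((p == "tcp").sum())),
    bytes_sum=("orig_bytes", "sum"),
    bytes_min=("orig_bytes", "min"),
    bytes_max=("orig_bytes", "max"),
).reset_index()
```

The full representation repeats this pattern over all the aggregations listed in Table 5.1, which is how the 1152-dimensional vectors arise.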
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Attacks</head><p>We evaluate our defense on the backdoor attacks explored in Chapter 4, originally proposed in <ref type="bibr">[190]</ref>, against network flow classifiers. The attacks explore both the entropy-based feature selection method (through a surrogate decision tree) and SHAP <ref type="bibr">[133]</ref>, a game-theory inspired explanation technique that queries the model to compute feature relevance coefficients. Two types of triggers are considered: 1) full trigger, a contiguous subset of log connections where the backdoor pattern is embedded in the most important features, and 2) generated trigger, a stealthier variant where poisons resemble benign data. The generated trigger is constructed using Bayesian network models with the goal of preserving data dependencies in the network traffic, while attempting to blend the poisoned data into the underlying training distribution.</p><p>In addition, we evaluate our defense against backdoor attacks that target malware classifiers for Windows PE (Portable Executable) files. These attacks are presented in Chapter 3, and were originally published in <ref type="bibr">[193]</ref>. They leverage SHAP-based model interpretation methods to guide the attacker towards the most informative features used in the classification process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Evaluation metrics</head><p>In our experiments we are interested in a few key metrics. First, we track the attack success rate (ASR), defined in <ref type="bibr">[193,</ref><ref type="bibr">190]</ref> as the percentage of backdoored test points that would otherwise be correctly classified as malicious by a clean model but are misclassified as benign by the poisoned model. We are also interested in recording the fraction of poisoning points that manage to slip through our defenses, so we report the fraction of poisoning points that end up included in D clean .</p><p>In addition to these metrics, aimed at measuring the effect of the mitigation on the attack, we also want to measure the side effects of our mitigation on model utility. Therefore, we report the measured F1 score and false positive rate (FPR) of the defended model on clean test data.</p></div>
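A minimal sketch of the ASR computation as defined above, assuming the label convention 1 = malicious, 0 = benign; the function name is illustrative.

```python
import numpy as np

def attack_success_rate(clean_preds, poisoned_preds):
    """ASR over backdoored test points: among those the clean model correctly
    labels malicious (1), the fraction the poisoned model labels benign (0)."""
    eligible = clean_preds == 1  # correctly detected by the clean model
    evaded = np.logical_and(poisoned_preds == 0, eligible)
    return evaded.sum() / eligible.sum()
```

Restricting the denominator to points the clean model already catches isolates the effect of the poisoning from ordinary misclassifications.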
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.2">Evaluation on network traffic classification</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fixed threshold filtering</head><p>We start by considering the fixed threshold sanitization baseline. From the figures we can observe that, in the case of the full trigger attack, for both the gradient boosting model and the neural network, interrupting the training when 80% of the clusters have been added to the training set is a good baseline heuristic that filters out the vast majority of poisoning points. We also note that the poisoning fraction used by the attacker has little influence on the effectiveness of this procedure.</p><p>The generated trigger, on the other hand, is much stealthier, and a larger fraction of poisoning points manages to pass through the filtering process at these pre-defined thresholds, as highlighted in Figure <ref type="figure">5</ref>.6. However, given that this attack is designed for stealthiness rather than effectiveness, the attack success rate is limited even if a fraction of the contaminants ends up in the training data. Finally, we note that, as initially observed in <ref type="bibr">[190]</ref>, the generated trigger for the neural network model has a strong adversarial example effect at inference time. That is, the trigger pattern itself induces the classifier to misclassify the point, even if no poisoning attack took place. Therefore, this case falls outside the scope of our defense, which is targeted at countering backdoor poisoning attacks.</p><p>While effective at removing the poisons (ASR ranging from 0.0% to 6.45%), this baseline remediation strategy may reduce the utility of the models as a side effect of discarding the entire clusters below the threshold (including their clean data). The F1 and FPR utility metrics reported in Table <ref type="table">5</ref>.3 average from 0.02% to 0.21% for FPR, and 86% to 97% for F1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Patching.</head><p>The patching-based sanitization strategy addresses the degradation in utility metrics. Considering the same threshold at 80%, we can directly compare the effects of patching to filtering for the experiment with the gradient boosting model on CTU-13. Table <ref type="table">5</ref>.4 shows that using patching generally leads to higher average values of the F1 score on test data (91%-95%), and a lower false positive rate (0%-0.08%). On the other hand, as expected, patching is slightly less effective than complete filtering in reducing the attack success (0%-11% ASR). Therefore, the defender can adopt the patching approach if they want to trade off some defensive effectiveness for reduced degradation in model utility.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Loss analysis and sanitization</head><p>We performed the loss-delta analysis outlined in Section 5.5.4 on the experiments with the gradient boosting model on the CTU-13 dataset. For this experiment we use a threshold Z-score of 2; that is, we mark as suspicious any cluster whose loss delta lies more than two standard deviations below the mean of the deltas observed during the initial training iterations, up to the 80% threshold. The results of both the filtering and patching sanitization strategies are reported in Table <ref type="table">5</ref>.5.</p><p>This approach leads to a significant improvement in the F1 and FPR scores, especially for the filtering sanitization. In this case, we observe up to a 0.2% decrement in FPR (a 95% relative improvement) and up to an 11% increase in the F1 score on the test set (a 12.5% relative improvement). However, the attack success remains relatively high, and therefore fixed threshold filtering may be preferable in this scenario. We note that the defender can always trade off smaller improvements in F1 and FPR scores for stronger protection from the attack by tightening the threshold, for instance using z t = 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.3">Evaluation on malware classification</head><p>We observe similar trends in our experiments on the malware classification task, as summarized in Figure <ref type="figure">5</ref>.7. These experiments were run by attacking the LightGBM classifier, proposed together with the EMBER dataset, using the two strongest attack strategies based on independent feature and value selection: CountAbsSHAP and MinPopulation. Each experiment was run twice with different random seeds, and we tested 4 different poisoning rates, from 0.5% to 4%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fixed threshold filtering</head><p>In this scenario we observe a quicker rise in the fraction of poisoning points included in D clean after the 80% threshold. Nevertheless, the fixed threshold heuristic generally remains quite effective in thwarting the attack. Compared to the experiments on the network classification task, the changes in both F1 and FPR scores are sharper at higher percentages of included clusters, even if the absolute values of the decrements are generally smaller. We report 2% FPR and 98% F1 on average, across various poisoning rates, while the ASR is reduced to 6%-21%.</p><p>In this scenario too, the stealthiest version of the attack, using the Combined SHAP strategy, leads to a larger fraction of poisoning points included in D clean before the fixed threshold. However, as shown in Figure <ref type="figure">5</ref>.8, the attack success rate remains relatively low unless high poisoning percentages are used.</p><p>Patching. The patching strategy performs relatively poorly at preventing the attack in this scenario, especially at higher poisoning rates. Conversely, as in the experiments on network data, it outperforms the pure filtering approach in preserving model utility. In particular, the FPR values obtained through patching are consistently better than those obtained with filtering at different poisoning rates, with an average gain of about 0.5%, as shown in Table <ref type="table">5</ref>.6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Loss analysis and sanitization</head><p>Table <ref type="table">5</ref>.7 shows the results of applying our loss analysis to identify the clusters to sanitize. The results are reported for a threshold Z-score of 2 standard deviations. This approach reflects the general trends observed before, but improves on the fixed threshold method discussed above when applied with both the filtering and patching sanitization approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7">Related work</head><p>In this section, we review related work on backdoor mitigation strategies, with a particular focus on the cybersecurity domain. For a more detailed treatment of backdoor attacks we refer the reader to Chapter 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7.1">Mitigations against backdoor attacks</head><p>Three main directions have emerged in the study of backdoor defenses. One approach has focused on developing techniques to obtain certifiable robustness guarantees against poisoning attacks, the second has looked at detecting and filtering out poisoned samples before training the model, while the third aims at directly purifying the poisoned model by unlearning the backdoor association.</p><p>Certified defenses against poisoning attacks aim to provide rigorous mathematical guarantees on model quality in the presence of poisons. In some of the first work on certified defenses, Steinhardt et al. <ref type="bibr">[207]</ref> construct approximate upper bounds on the loss using a defense based on outlier removal followed by empirical risk minimization. They show that an oracle defender who knows the true class mean is very powerful; however, a data-dependent defender that uses the empirical means performs poorly at filtering out attacks. Deep Partition Aggregation (DPA) <ref type="bibr">[122]</ref> splits the training dataset into k partitions, which limits the number of poisons that any one member of the ensemble can see.</p><p>Unfortunately, the number of partitions necessary can be high compared to the number of poisons considered. To illustrate, using 250 partitions on CIFAR-10, DPA is able to certify 50% of the dataset against just nine insertions/deletions applied to the training set.</p><p>A related technique known as Deterministic Finite Aggregation <ref type="bibr">[234]</ref> provides modest improvements over DPA. In addition to the limited guarantees, a large number of partitions may not be practical in many security applications, as the number of malicious samples available is often small. Randomized smoothing is a technique in which noise is added to the training data in an attempt to "smooth out" any adversarial perturbations introduced by an attacker. 
However, the guarantees that can be obtained are fairly small: Wang et al. <ref type="bibr">[231]</ref> are only able to certify 36% of MNIST against an adversary that perturbs up to 2 pixels per image.</p><p>In the second category, a common detection strategy utilizes deep clustering, i.e., partitioning samples based on the learned representation of the neural network. Chen et al. <ref type="bibr">[42]</ref> propose activation clustering, a defense method that runs 2-means clustering on the activations of the last hidden layer, then discards the smaller of the two clusters. Tran et al. <ref type="bibr">[221]</ref> propose a defense based on spectral signatures. This method computes outlier scores using singular value decomposition of the hidden layers, then removes the samples with the top scores and re-trains. An extension of this idea is used by the SPECTRE method <ref type="bibr">[84]</ref>, which estimates the mean and covariance matrix of the clean data using robust statistics, and uses these to whiten the data prior to computing the singular value decomposition.</p><p>The third category includes various pruning and fine-tuning methods aimed at unlearning the backdoor pattern. Liu et al. <ref type="bibr">[130]</ref> propose fine-pruning, a strategy that attempts to disable backdoor behavior by eliminating less informative neurons prior to fine-tuning the model on a small subset of clean data. Li et al. <ref type="bibr">[125]</ref> introduce neural attention distillation, a purifying method that utilizes a teacher network to guide the fine-tuning of a backdoored student network on a small subset of clean training data. In a recent study, Heng et al. 
<ref type="bibr">[87]</ref> employ selective amnesia (SEAM), where catastrophic forgetting is induced by retraining the neural network on randomly labeled clean data; after both the primary and backdoor tasks are forgotten, the randomized model is retrained on correctly labeled clean data to recover the primary task. In Section 5.4 below, we evaluate a direct application of SEAM to the attacks we study, and show the drawbacks of that defensive approach in our scenario. We note also that a similar approach, based on randomized unlearning and detection, is proposed by Qi et al. <ref type="bibr">[174]</ref>.</p><p>Other approaches in the literature involve detecting whether a model is backdoored (ABS <ref type="bibr">[131]</ref>, MNTD <ref type="bibr">[245]</ref>), detecting whether a particular label has been under attack (Neural Cleanse <ref type="bibr">[232]</ref>), or performing topological analysis of activations at multiple layers in a neural network model (TED <ref type="bibr">[146]</ref>). In cybersecurity settings, the only defense we are aware of is Nested Training <ref type="bibr">[227,</ref><ref type="bibr">226]</ref>, a data sanitization method that relies on an ensemble of multiple models, each based on different subsets of the training set. Nested Training has been applied to poisoning availability and backdoor attacks in network traffic and malware classifiers, but it incurs a high cost and a degradation in model performance.</p></div>
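The spectral signatures defense discussed above is simple enough to sketch. The following is an illustrative reimplementation, not the original authors' code: it assumes the hidden-layer representations for one class have already been extracted into a NumPy matrix, and the function names and the 5% removal fraction are our own choices.

```python
import numpy as np

def spectral_signature_scores(reps: np.ndarray) -> np.ndarray:
    """Outlier scores for one class's hidden-layer representations
    (rows = samples). Poisoned samples tend to align with the top
    singular direction of the centered representation matrix."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    # Top right singular vector of the centered matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]
    # Squared projection onto the top direction is the outlier score
    return (centered @ top_direction) ** 2

def filter_top_scores(reps: np.ndarray, frac: float = 0.05) -> np.ndarray:
    """Return the indices of samples to keep after removing the
    top-scoring fraction (an illustrative removal budget)."""
    scores = spectral_signature_scores(reps)
    num_remove = int(len(reps) * frac)
    return np.argsort(scores)[: len(reps) - num_remove]
```

On synthetic data where a few samples carry a strong shared shift in one feature direction, the removed indices coincide with the shifted samples; re-training on the kept indices is the final step of the defense.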
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.8">Discussion and Conclusion</head><p>Defensive mechanisms are always subject to limitations, and the general problem of defending against arbitrary poisoning attacks is far from solved. With this work we address clean-label backdoor attacks, and propose an effective mitigation approach for existing threats.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.8.1">Limitations</head><p>Unlike provable defenses, our method is heuristic in nature, trading off protection guarantees for large improvements in retained model utility. In contrast to other heuristic mitigation approaches introduced in the context of computer vision systems, our method relaxes some strong assumptions on clean data availability and victim model architectures that are difficult to satisfy in the cybersecurity domain.</p><p>While we defend against the attack formulations proposed in the literature, adaptive attacks against our defense are possible. For instance, an adversary could change the feature selection process in the attacks to choose a fraction of relevant features and a fraction of non-relevant ones. While this would likely reduce the attack success rate, it may also reduce the likelihood of poisoning points clustering together in the relevant feature subspace. This would therefore allow the attacker to trade off attack success against the likelihood of being discovered.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.8.2">Conclusions</head><p>In this work we introduce a mitigation mechanism against clean-label backdoor poisoning attacks targeting cybersecurity classifiers. Our method eliminates many of the most common assumptions used in other defensive techniques against backdoor attacks. Namely, we remove the need for separate clean datasets, which can be difficult to obtain for cybersecurity tasks, and any assumptions on the architecture of the models used. Our defense effectively reduces the attack success rate in multiple scenarios, while also preserving high model utility and a low false positive rate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter 6</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Network-Level Interference in Federated Learning</head><p>The previous chapters dealt with the robustness of models trained by a single centralized entity. Recent developments in Collaborative Learning, and Federated Learning in particular, have shown the potential utility of these systems for training models in a distributed fashion, leveraging local data that is not independent and identically distributed (i.i.d.).</p><p>In this chapter we will see that this setting too is susceptible to training-time attacks. The work presented in <ref type="bibr">[191]</ref> shows that, by interfering with the network traffic, an adversary can vastly reduce the global model accuracy on a target population.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter Summary</head><p>Federated learning is a popular strategy for training models on distributed, sensitive data, while preserving data privacy. Prior work identified a range of security threats on federated learning protocols. We observe that the communication between clients and server is critical for the learning task performance, and highlight how communication introduces another vulnerability surface in federated learning. We consider the impact network-level adversaries might have on training federated learning models.</p><p>We show that attackers orchestrating dropping of network traffic of carefully selected clients can significantly decrease model accuracy on a target population. Moreover, we show that a coordinated poisoning campaign from a few clients can amplify the dropping attacks. Finally, we develop a server-side defense which mitigates the impact of our attacks by identifying and up-sampling clients likely to positively contribute towards target accuracy. We comprehensively evaluate our attacks and defenses on three datasets, under both plain and encrypted communication channels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Problem Definition</head><p>Federated Learning (FL), or collaborative learning, introduced by McMahan et al. <ref type="bibr">[139]</ref>, has become a popular method for training machine learning (ML) models in a distributed fashion. In FL, a set of clients perform local training on their private datasets and contribute their model parameters to a centralized server. The server aggregates the local model parameters into a global model, which is then disseminated back to the clients, and the process continues iteratively until convergence is established.</p><p>Training ML models using a federated approach allows the server to take advantage of a large amount of training data and each client to retain the privacy of their data. Popular applications include natural language processing (NLP) to learn from user activities on mobile phones <ref type="bibr">[248]</ref>, smart healthcare <ref type="bibr">[183]</ref>, and customized retail services. FL can be deployed either in a cross-silo setting, with a small number of participants available in all rounds of the protocol, or in a cross-device setting, with a large number of participants, which are sampled by the server and participate infrequently in the protocol.</p><p>Recent work studying the security of FL protocols has highlighted the risk of attacks by compromised clients through data poisoning <ref type="bibr">[218]</ref> and model poisoning <ref type="bibr">[20,</ref><ref type="bibr">16]</ref> under different objectives such as availability, targeted, and backdoor attacks. 
While availability objectives aim at compromising the accuracy of the entire model indiscriminately <ref type="bibr">[68,</ref><ref type="bibr">200]</ref>, targeted and backdoor attacks only affect a specific subset of samples <ref type="bibr">[16,</ref><ref type="bibr">212,</ref><ref type="bibr">233]</ref> and are harder to detect.</p><p>Among the defensive approaches proposed by the security community, availability attacks can be mitigated by various gradient filtering methods <ref type="bibr">[68,</ref><ref type="bibr">200,</ref><ref type="bibr">239]</ref>, while targeted and backdoor poisoning attacks can be addressed by clipping gradient norms at the server during model updates, to limit the individual contributions of clients to the global model <ref type="bibr">[212,</ref><ref type="bibr">16]</ref>.</p><p>We observe that, for federated learning, communication between server and clients plays a critical role, as the global model is learned in rounds with contributions from multiple clients. Thus, the communication represents another point of vulnerability that an attacker can exploit to influence the global model learned by the system. In this case, an attacker does not need to compromise the clients, but can instead leverage network-level information and attack capabilities to prevent the ML algorithm from receiving the information needed to learn an accurate model.</p><p>An attacker can exploit the specific communication channels for different FL architectures to perturb messages sent between server and clients. For example, cross-device FL is usually deployed for mobile and edge devices, connected through wireless communication channels which typically experience higher packet loss and are vulnerable to physical-layer attacks, such as jamming <ref type="bibr">[244,</ref><ref type="bibr">18]</ref>. 
For architectures relying on overlay networks, network-level adversaries might interfere with the data delivery by conducting BGP hijacking attacks <ref type="bibr">[46]</ref>, compromising routers <ref type="bibr">[49,</ref><ref type="bibr">185]</ref> or hosts <ref type="bibr">[230]</ref>, or exploiting vulnerabilities in transport-level protocols <ref type="bibr">[104]</ref> to perform packet dropping, packet delay, or traffic rerouting.</p><p>Such network-level attacks have been shown to impact other systems such as Bitcoin <ref type="bibr">[10]</ref>, payment-channel networks <ref type="bibr">[235]</ref>, and connected cars <ref type="bibr">[101]</ref>. These adversaries are especially relevant when political or economic incentives result in an adversarial autonomous system (AS), particularly since sub-populations are often geographically localized. A government may run this attack to censor a word prediction model (especially targeting a country's minority language). A corporate AS may attempt to modify words associated with its competitors or with unfavorable political movements.</p><p>In this study, we are interested in understanding the impact network-level adversaries can have on machine learning models trained with federated learning. We are less interested in how an attacker positions themselves in the network to conduct the attacks, and more in the ways the attacker can exploit information sent over the network to maximally damage the learning algorithm. We analyze for the first time the impact of network-level adversaries on federated learning and show that attackers who can carefully orchestrate the dropping of network traffic of selected clients (either by controlling parts of the network or compromising the clients) can significantly decrease model accuracy on a target population. 
Specifically, we design an algorithm that identifies a set of clients who contribute heavily to a selected learning task.</p><p>We show that by identifying highly contributing clients and dropping their traffic, an attacker is more successful than by dropping randomly selected clients. We consider scenarios where communication between server and clients is sent either in the clear or encrypted. Moreover, we show that model poisoning attacks from a small set of clients can further amplify the impact of the targeted dropping attacks we consider. As an example, on a text classification problem, selectively dropping updates from only 5 clients from the target population can lead to a decrease in accuracy from 87% to 17% on a particular target class. Furthermore, by adding a poisoning attack of equal size, the model accuracy decreases to 0% on the same target class.</p><p>Existing network-level defenses against packet dropping might mitigate the attacks under some circumstances, such as when the amount of traffic dropped for particular clients is observable with respect to the normal network loss. However, in the cross-device FL settings considered here, a small set of clients is selected at random by the server, and therefore clients' participation in the protocol is usually infrequent. Dropping a few packets at large intervals of time would be difficult to detect and mitigate at the network level. Additionally, FL system owners do not usually control the underlying network and might not be able to implement network-level mitigations to make communication more resilient. Complementary to such network-level defenses, we propose a server-level defense that is agnostic to how the dropping attack is performed.</p><p>Our defense modifies client sampling in each round of FL training to increase the likelihood of selecting clients who sent useful updates in previous rounds of the protocol, while decreasing the probability of selecting less relevant clients. 
Interestingly, client selection for sampling in the defense leverages the same client identification procedure employed by the network-level adversary. The defense is extremely effective against a targeted dropping attack. For instance, in the same text classification task mentioned above, while an unmitigated attack would completely disrupt the model accuracy on the target task (from 87% to 17%), the defense achieves an accuracy of 96%, which exceeds that of the original model before an attack is mounted.</p><p>Moreover, our defense can be combined with a standard poisoning defense based on gradient clipping to withstand both targeted dropping and model poisoning attacks. On the same task, the combined dropping and poisoning attack brings the target accuracy to 0%, and the combination of our defense with clipping results in 94% accuracy on the target population. The defense is even more effective under encrypted communication, the most common deployment model for FL, in which the improvements compared to the original accuracy can be as high as 34%.</p><p>To summarize, the contributions of this work are:</p><p>(i) We consider for the first time the impact of network-level adversaries on federated learning protocols, perform an evaluation of a targeted packet dropping attack, and show that its impact can be further amplified by model poisoning attacks;</p><p>(ii) We design a client identification procedure that is a building block in our attack and defense strategies, and that can be configured with different levels of access to the network traffic sent between clients and server;</p><p>(iii) We show that a defensive strategy based on up-sampling highly contributing clients to the learning task is promising in training more resilient models. 
The defense can be combined with gradient clipping to mitigate both targeted dropping and poisoning attacks, and in some cases results in higher accuracy than the original model before the attack.</p><p>(iv) We evaluate the proposed attacks and defenses and show that they are effective across multiple model architectures, datasets, and data modalities (image and text).</p><p>For reproducibility, all code is released publicly<ref type="foot">foot_8</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Notes on Federated Learning</head><p>Federated learning (FL) considers a setting with a set of n clients, each with a local dataset D i , and a server S. FL is a distributed machine learning methodology in which clients train locally on their own datasets and the server requests model updates, rather than data, from users, building an aggregated global model iteratively over time <ref type="bibr">[139]</ref>.</p><p>There are two main deployment models for FL: cross-silo, with a small set of clients participating in every round, and cross-device, with a large number of clients which are sampled by the server per round <ref type="bibr">[105]</ref>. We consider here the Federated Averaging training algorithm designed for cross-device settings <ref type="bibr">[139]</ref>, shown in Algorithm 4. In each round 1 ≤ t ≤ T, the server randomly selects a subset of m ≤ n clients to participate in training, and sends them the current global model f t−1 (with f 0 initialized using the Glorot Uniform initializer <ref type="bibr">[72]</ref>). Each selected client i trains locally using dataset D i , for a fixed number of T L epochs. The server updates the global model f t by mean aggregation of the local updates U i :</p><p>f t = f t−1 + (η/m) ∑ i∈M t U i (6.1) FL has been designed to protect clients' local data privacy, but data privacy can be further enhanced by secure aggregation performed via Multi-Party Computation (MPC) [26, 69]. 
Algorithm 4: Federated Averaging Protocol. Data: Clients C = {D i } n i=1 , Federated Learning Server S, rounds T, clients per round m, aggregation learning rate η. Function FedLearn(S, C): // Function run by the server f 0 = INITIALIZEMODEL(); for t ∈ [1, T] do: // Get updates from the m participants in round t M t = SELECTPARTICIPANTS(C, m); REQUESTUPDATE( f t−1 , M t ); U t = RECEIVEUPDATE(M t ); // Update and send out the new model f t = UPDATEMODEL( f t−1 , U t , η); BROADCASTMODEL( f t , C)</p></div>
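The round structure of Federated Averaging can be made concrete with a small sketch. This is an illustrative toy, not the experimental implementation: each "model" is a single parameter, each client's data is summarized by a local least-squares target, and all names and hyperparameters are our own. The server applies the mean of the client deltas, scaled by the aggregation learning rate η.

```python
import random

def fed_avg(client_targets, rounds=50, clients_per_round=3,
            local_epochs=5, lr=0.1, agg_lr=1.0, seed=0):
    """Minimal Federated Averaging sketch on a 1-D least-squares toy:
    each selected client runs a few local gradient steps toward its own
    target and returns the parameter delta U_i; the server updates the
    global model with the mean of the deltas, scaled by agg_lr."""
    rng = random.Random(seed)
    f = 0.0  # global model (a single parameter, for illustration)
    for _ in range(rounds):
        selected = rng.sample(range(len(client_targets)), clients_per_round)
        updates = []
        for i in selected:
            local = f  # client starts from the current global model
            for _ in range(local_epochs):
                # gradient step on the local loss (local - target)^2
                local -= lr * 2 * (local - client_targets[i])
            updates.append(local - f)  # U_i: the local delta
        f += agg_lr * sum(updates) / len(updates)  # mean aggregation
    return f
```

With three clients holding targets 1.0, 2.0, and 3.0 and all clients selected each round, the global model converges to the mean of the local optima, 2.0, which is the behavior the aggregation step is designed to produce.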
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Threat Model</head><p>The most studied attacks against federated learning are poisoning attacks. In data poisoning attacks, an adversary injects maliciously crafted data points to poison the local model <ref type="bibr">[218]</ref>. Model poisoning attacks (POISON_MODEL), where the adversary compromises a set of clients and sends malicious updates to the protocol with the goal of changing the model's predictions toward a certain objective <ref type="bibr">[16,</ref><ref type="bibr">212,</ref><ref type="bibr">233]</ref>, have also been studied. Both of these attacks are conducted by directly manipulating the inputs to the machine learning algorithm.</p><p>For federated learning, communication between server and clients represents another attack surface that an attacker can exploit to influence the global model learned by the system, without necessarily needing to compromise the clients. Instead, based on their network-level capabilities, the adversary can observe communication sent over the network and prevent the machine learning algorithm from receiving the information needed to learn a good model. We refer to such an adversary as a network-level adversary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.1">Adversary's Goals and Capabilities</head><p>FL protocols such as Federated Averaging <ref type="bibr">[139]</ref> usually abstract the communication protocol between the server and clients. In practice, clients will communicate with the server either directly through a TCP connection, or through a multi-hop overlay network.</p><p>Packet dropping attacks can be achieved in multiple ways, and have been studied for network-level adversaries who perform physical-layer attacks in wireless networks <ref type="bibr">[244,</ref><ref type="bibr">18]</ref>, router compromise <ref type="bibr">[49,</ref><ref type="bibr">185]</ref>, or transport-level attacks <ref type="bibr">[104]</ref>. Each of these methods requires different attacker capabilities. For example, physical-layer attacks on wireless clients require the attacker to be in their proximity: this makes the attack powerful, since jamming is usually difficult to defend against, but also limiting, since it is difficult for the attacker to reach geographically distributed clients without significant resources. In the case of routing, while a router might be more difficult to compromise, in practice the impact can be larger, as a router might control the traffic of multiple clients. Last, but not least, if customized overlays are used for communication, compromising nodes in the overlay is somewhat easier, as they are typically hosts, and the number of clients that can be impacted depends on the scalability of the service.</p><p>We note that we model packet dropping agnostic to the exact method used by the attacker for dropping network packets, since the server cannot distinguish between a network-level adversary, client compromise, or simply client unavailability when not receiving client updates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Adversarial goal:</head><p>We consider that the attacker targets the accuracy for a particular population. We define a population to be one of the classes in the learning task, but this notion could be extended to sub-classes as well. Ideally, the attacker would like the model to retain its test accuracy for all data points outside of the victim population to avoid trivial detection. We therefore do not consider poisoning availability attacks which are detectable and can be addressed with existing defenses <ref type="bibr">[68,</ref><ref type="bibr">200,</ref><ref type="bibr">239]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Attack strategies:</head><p>The attacker has several decisions to make when conducting the attack: (1) which clients to select for dropping their contributions, (2) when to start the attack, given that federated learning is an iterative protocol, and (3) how many packets (i.e., local models) to drop. Ideally, an attacker wants to maximize the strength of the attack while minimizing detection. We focus on a targeted dropping attack in which the attacker selects a set of clients contributing highly to the target class and drops their contributions. We compare this with an attack strategy that drops the traffic of a subset of randomly chosen clients.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Adversarial network-level knowledge:</head><p>We distinguish between different representative adversarial scenarios for the communication in the FL protocol:</p><p>1. COMM_PLAIN. All communication between clients and server is unencrypted. This is the scenario where a network-level adversary obtains maximum information, as they can observe all the transmitted data. Even though this might not correspond to a practical scenario, we study this model to determine the impact of the worst-case, most powerful adversaries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">COMM_ENC.</head><p>All communication between clients and server is encrypted. This is a realistic scenario, which can be implemented using IPSec, HTTPS or application-level encryption. In this case, the network-level adversary could infer the set of clients participating in each round, but not the exact model updates they send.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">COMM_ENC_LIMITED.</head><p>A special case of COMM_ENC where communication is encrypted, and the adversary can only observe a fixed subset of the clients participating in the protocol.</p><p>Adversarial global model knowledge: Since cross-device FL is an open system, the adversary can always participate in the FL protocol as one of the clients. We assume that the adversary obtains the global model updates f t at each round t.</p><p>To summarize, in the rest of the study we consider an adversary that has either the COMM_PLAIN, COMM_ENC or COMM_ENC_LIMITED network capability, has knowledge of the global model at each round, and targets a particular victim population of interest. We also consider an adversary that additionally has the POISON_MODEL poisoning capability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Network-Level Attacks on Federated Learning</head><p>In this section we describe in detail our targeted dropping attacks against a population in FL. We assume the FL model described in Section 6.2 and an attacker that has either the COMM_PLAIN or COMM_ENC network capability. We first show that randomly selecting clients for dropping is not an effective strategy. We then describe a procedure to identify clients that makes the attack more effective, and analyze the different parameters that control the attack. Finally, we show how our attack can be significantly amplified if, in addition to the network-level capabilities, the attacker also has the POISON_MODEL poisoning capability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.1">Dropping Attack with Random Client Selection</head><p>The most basic strategy a network-level adversary might employ is to randomly select a subset of clients and drop their local model updates. Such a strategy is usually not effective at degrading model accuracy on the class of interest, which is the adversarial objective we consider. Later, in Table <ref type="table">6</ref>.2 in Section 6.6.2, we present detailed results for this random dropping strategy, evaluated for different numbers of clients whose traffic is dropped. Our finding is that the model accuracy on the target class is decreased by an average of 5%, even in cases where the number of clients selected for dropping matches the number of clients holding samples from the target class. These findings motivate us to develop new methods for client selection that help the adversary identify the most relevant clients for degrading the accuracy on the target class.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.2">Identification of Highest-Contributing Clients</head><p>The cornerstone of our targeted packet dropping attack is an algorithm in which the adversary first determines the set of clients contributing the most to the target population of interest and then selectively drops their traffic. We design a Client Identification procedure aimed at determining the set of clients whose model updates lead to the largest improvements in target population accuracy. Our Client Identification algorithm, described in Algorithm 5, supports both plain and encrypted network communication, and is applicable to a range of FL system implementations. We demonstrate in Section 6.6 that if the adversary has high success at identifying the clients contributing to the task of interest, then the impact of the targeted dropping attack is much higher compared to random dropping.</p><p>The main intuition is that the network-level adversary observes all communications in the FL protocol (including global model updates and, when possible, local model updates) for at least T N rounds. In each of these rounds, the adversary computes the loss of the model on the target class before and after the updates. Rounds in which the loss decreases correspond to an increase in accuracy on the target class. The adversary tracks all the clients participating in those rounds, and computes a loss difference metric per client. 
In the plain communication case, the adversary can determine exactly which clients decrease the loss, while in the encrypted communication case the adversary only observes the aggregated updates and considers all clients participating in the round to be collectively responsible for the loss variation.</p><p>In more detail, to measure each client's contribution to the target population, the adversary computes the loss difference between successive model updates on the target population, using a representative dataset D* from the population: L j t = ℓ( f t−1 , D*) − ℓ( f j t , D*). Note that the second term ℓ( f j t , D*) is the loss of the local update of client j in round t for COMM_PLAIN, but becomes the loss of the global model ℓ( f t , D*) when the adversary only has access to aggregated updates in COMM_ENC. The value L j t measures the decrease in the target class loss before and after a given round t. For clients who contribute heavily to the accuracy on target data, this value will be large, so that dropping these clients' updates should cause the greatest damage to the target accuracy.</p><p>Empirically, computing the mean of L j t for each client j across all rounds it participates in works well at identifying the clients contributing most to the target class.</p></div>
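The per-client bookkeeping can be illustrated with a short sketch. This is an illustrative reimplementation of the ranking step only, not Algorithm 5 verbatim: the log format and function names are our own assumptions. In the plain case the adversary credits each client with its own loss decrease; in the encrypted case the whole batch shares the round's loss decrease.

```python
from collections import defaultdict

def rank_clients(round_logs, encrypted=False):
    """Rank clients by their mean loss decrease on the target population.
    round_logs is a list of dicts with keys:
      'participants' - client ids observed in the round
      'loss_before'  - loss of f_{t-1} on the target set D*
      'loss_after'   - {client_id: loss of that client's local update}
                       for plain traffic, or a single float for the
                       aggregated global model under encrypted traffic."""
    diffs = defaultdict(list)
    for log in round_logs:
        if encrypted:
            # Credit the whole batch with the round's loss change
            delta = log['loss_before'] - log['loss_after']
            for c in log['participants']:
                diffs[c].append(delta)
        else:
            # Per-client loss change is directly observable
            for c in log['participants']:
                delta = log['loss_before'] - log['loss_after'][c]
                diffs[c].append(delta)
    means = {c: sum(v) / len(v) for c, v in diffs.items()}
    return sorted(means, key=means.get, reverse=True)
```

The attacker would then drop the traffic of the top-ranked clients; the defense described later reuses the same ranking to up-sample them instead.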
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.3">Dropping Attack with Identification of Highest-Contributing Clients</head><p>The targeted dropping attack starts with the adversary performing the Client Identification procedure, by running Algorithm 5 during the first T N rounds of the protocol.</p><p>During this time period, the FL training happens as usual and the global model is aggregated from local updates. Using the parameter analysis in Section 6.4.4, the adversary selects T N rounds and identifies, in expectation, k N clients contributing updates for the target class. After Client Identification is performed, the network adversary drops contributions from the k N clients in every round in which they participate after round T N .</p><p>As the adversary can expect an increase in identification accuracy by monitoring more rounds of the protocol, they can repeat the Client Identification procedure in each subsequent round and, if necessary, update the list of clients selected for dropping. We assume that the adversary is able to drop the identified clients' updates, and the server simply updates the model using all received updates. If the Client Identification procedure is not completely accurate, the attacker might drop traffic from clients who do not contribute to the target population, but we found this has a minimal impact on the trained FL model.</p><p>The selection of k N , the number of victim clients in every round, and T N , the number of rounds after which the attack starts, is critical for the success of the attack. In particular, the selection of T N is very important: a small T N will not allow the Client Identification procedure to determine the relevant clients, while a large T N can render the attack ineffective because the model might have converged before the attack started.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.4">Analysis of the attack</head><p>Algorithm 5 uses several parameters: the number of clients to identify, k N , and the number of rounds to observe before identification is successful, T N , which we analyze here. The remaining arguments are typically governed by the training algorithm, the protocol, and the fixed adversarial goal.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>How many clients to drop?</head><p>In the FL protocol, we assume a set of n clients, out of which m are selected at random in each round. Setting the number of dropped clients k N is mainly a tradeoff between maximally damaging accuracy on the target data and remaining stealthy by not significantly compromising the overall model performance. If k N is too large, significant benign updates may be removed, preventing the model from being good enough to use in practice, and potentially allowing the server to identify that an active adversary, rather than standard packet loss, is to blame for the dropped updates. If k N is too small, however, not enough clients will be dropped to have a significant impact on the model. We consider a range of values k N ≤ k, where k is the number of clients holding examples from the targeted class, and show the attack effectiveness for each.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>How many rounds are needed to identify the clients?</head><p>Here, we discuss how to set the number of rounds T N for client identification, for both plain and encrypted communication. If T N is set too small, the adversary will not have enough observations to reliably identify the heavily contributing clients. However, if it is set too large, the adversary will allow benign training too much time, resulting in high accuracy on the target population and making it more difficult to mount the attack later.</p><p>Suppose the adversary wants to wait until all n clients have participated in the protocol at least once. This is a well-studied problem, known as the coupon collector's problem <ref type="bibr">[51]</ref> (our setting additionally considers batched arrivals). In this variation of the problem, the adversary must observe an expected number of cn log(n)/m batches before observing each client, for a small constant c. With values of n = 100, m = 10, roughly 46 rounds are necessary <ref type="bibr">[243]</ref>. Having established a connection to the coupon collector's problem, we will extend it to model the adversary in each setting:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Plain communication (COMM_PLAIN).</head><p>In the case where the network adversary receives every client's updates, we adapt the coupon collector analysis to identify the setting of T_N where a batch of m out of n distinct clients participates in each round, and the goal is to identify a set of k_N of the k target clients. For ease of analysis, we consider batches sampled with replacement (i.e., clients in each round may be repeated), which incurs only a small constant overhead in the analysis <ref type="bibr">[19]</ref>.</p><p>We analyze the expected number of batches t_i before finding each of the first k_N target clients, using standard coupon collector analysis. The number of batches to wait for the i-th of the k target clients is t_i = n/(m(k - i + 1)) in expectation, as the probability that some as-yet-unobserved target client appears is m(k - i + 1)/n. By summing this expected batch count over i from 1 to k_N (using the linearity of expectation), we find that the expected number of batches to wait for the first k_N target clients is (n/m)(H_k - H_{k-k_N}), where H_i is the i-th harmonic number (with H_0 = 0 when k_N = k). As k increases, this value approaches (n/m)(ln(k) - ln(k - k_N)) rounds (when k_N = k, we replace ln(0) with 0).</p><p>In a setting common to our experiments, where n = 60, m = 10, k = 15, and k_N = 15, we expect to wait roughly (n/m) ln(k) = 6 ln(15) rounds, or roughly 16 rounds. Note that waiting for fewer clients k_N requires fewer rounds.</p></div>
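The expectation above is straightforward to evaluate numerically. The sketch below (function names are ours) computes both the exact harmonic-number form (n/m)(H_k - H_{k-k_N}) and the logarithmic approximation used for the worked example:

```python
from math import fsum, log

def harmonic(i):
    """i-th harmonic number, with H_0 = 0."""
    return fsum(1.0 / j for j in range(1, i + 1))

def expected_identification_batches(n, m, k, k_N):
    """(n/m) * (H_k - H_{k-k_N}): expected batches until the first k_N
    of the k target clients have each been observed at least once."""
    return (n / m) * (harmonic(k) - harmonic(k - k_N))

def log_approximation(n, m, k, k_N):
    """(n/m) * (ln(k) - ln(k - k_N)), replacing ln(0) with 0 when k_N = k."""
    tail = 0.0 if k_N == k else log(k - k_N)
    return (n / m) * (log(k) - tail)

# Setting from the text: n = 60, m = 10, k = k_N = 15.
print(round(log_approximation(60, 10, 15, 15)))  # 6 ln(15), roughly 16 rounds
```

The harmonic-sum form is an upper bound on the logarithmic approximation, so the exact expectation is slightly larger than the quoted 16 rounds.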
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Encrypted communication (COMM_ENC).</head><p>In the more challenging setting of encrypted communication, the adversary only has access to mean updates and cannot attribute an improvement in target accuracy to a particular client, as is possible when all the clients' updates are available. However, clients who repeatedly participate in rounds where the target accuracy improves can be considered more likely targets. Moreover, clients who participate in any round where the target accuracy does not improve can be identified as non-targets.</p><p>Here, we analyze only rounds in which no target clients appear, as they provide certainty that no target clients are participating. We analyze the number of rounds required to identify enough non-target clients that a fixed precision level α is obtained at identification. This precision level can be reached by removing n - k/α clients which are known non-target clients. To analyze how many rounds it takes to collect n - k/α non-target clients, we compute the probability that a batch contains only non-target clients, and then compute how many non-target batches are required to collect n - k/α clients.</p><p>To compute the probability that a batch contains only non-target clients, we notice that there are C(n-k, m) possible batches with only non-target clients, out of a total of C(n, m) batches selected uniformly at random. The probability that a batch contains only non-target clients is therefore C(n-k, m)/C(n, m).</p><p>To compute the number of non-target batches required to accumulate n - k/α non-target clients, we can use coupon collector analysis comparable to the one above, observing that each non-target batch is guaranteed to contain m non-target clients. On average, an upper bound of ((n-k)/m)(H_{n-k} - H_{k/α-k}) non-target batches is sufficient. 
As an example, in a setting we use in our experiments, with n = 60, k = 15, and m = 10, the probability that a batch contains only non-target clients is roughly 5.6%. To reach a precision of α = 0.3, we required a total of 26 batches on average. In practice, the precision is likely to be even higher than this value, due to the overestimates in our analysis.</p><p>We note that our analysis is similar to that of the group testing problem <ref type="bibr">[77]</ref>, introduced by <ref type="bibr">[60]</ref> for pool testing. A key difference, however, is that our algorithm must identify members from noisy aggregate information, rather than the clean signal typically available in group testing. It is possible that more sophisticated group testing algorithms could improve Algorithm 5 further by overcoming this constraint.</p></div>
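The batch probability in this analysis can be checked numerically. In the sketch below (function names are ours), the with-replacement form matches the roughly 5.6% figure quoted for n = 60, k = 15, m = 10, while the without-replacement binomial ratio is slightly smaller:

```python
from math import comb

def p_nontarget_batch(n, k, m):
    """Without replacement within a batch: C(n-k, m) / C(n, m) of all
    size-m batches avoid every one of the k target clients."""
    return comb(n - k, m) / comb(n, m)

def p_nontarget_batch_with_replacement(n, k, m):
    """With replacement, each of the m slots independently misses the targets."""
    return ((n - k) / n) ** m

print(f"{p_nontarget_batch_with_replacement(60, 15, 10):.3f}")  # 0.056
```

Since only about one batch in eighteen is target-free, most of the adversary's waiting time is spent on batches that yield no certain information.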
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.5">Amplifying Dropping Attack with Model Poisoning</head><p>In order to amplify the effectiveness of the targeted dropping attack, the network adversary may also collude with, or inject, malicious clients, whose presence in training is designed to further compromise the performance on the target distribution. Following Bagdasaryan et al. <ref type="bibr">[16]</ref> and Sun et al. <ref type="bibr">[212]</ref>, we use a targeted model poisoning attack known as model replacement. Writing θ_t for the parameters of the global model f_t, in this attack the adversary seeks to replace θ_t with a selected target θ*_t (as is done in <ref type="bibr">[16,</ref><ref type="bibr">212]</ref>, we use the θ*_t resulting from a data poisoning attack on a compromised client's local training procedure).</p><p>The poisoned clients' local training sets are sampled identically to those of clients with access to the target class, with the difference of changing the labels of target class samples to another, fixed class. Writing the initial model parameters sent in the round as θ_{t-1}, the update that the adversary sends is θ*_t - θ_{t-1}. The server aggregation then weights this update with an η/m factor. In a model replacement attack, a boosting factor ρ is applied, so the update which is sent is ρ(θ*_t - θ_{t-1}), increasing the contribution to overcome the η/m down-weighting.</p><p>We will show that model poisoning attacks mounted from a small set of clients controlled by the attacker can significantly amplify the impact of targeted dropping attacks and achieve close to zero accuracy on the target class.</p></div>
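The model-replacement arithmetic can be illustrated in a few lines. This is a toy sketch (plain Python lists stand in for parameter vectors; the function name is ours): with boosting factor ρ = m/η and negligible benign contributions, the server's weighted aggregation step lands exactly on the attacker's target parameters.

```python
def model_replacement_update(theta_target, theta_prev, eta, m):
    """Boosted update rho * (theta_target - theta_prev), with rho = m / eta
    chosen to cancel the server's eta/m aggregation weight."""
    rho = m / eta
    return [rho * (t - p) for t, p in zip(theta_target, theta_prev)]

# Toy check with the aggregation parameters used in the EMNIST setup.
theta_prev = [0.0, 0.5, -1.0]
theta_target = [1.0, -2.0, 0.5]
eta, m = 0.25, 10
update = model_replacement_update(theta_target, theta_prev, eta, m)
# Server step: theta_prev + (eta / m) * update (ignoring benign clients).
new_global = [p + (eta / m) * u for p, u in zip(theta_prev, update)]
ok = all(abs(g - t) < 1e-9 for g, t in zip(new_global, theta_target))
print(ok)  # True
```

In practice benign updates are not zero, so the replacement is approximate; this is exactly why norm clipping of updates, discussed in Section 6.5, blunts the boosted contribution.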
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5">Defenses Against Network-Level Adversaries</head><p>Several defenses against network-level attacks have been considered in previous work. For instance, defenders could monitor and detect faulty paths in the network <ref type="bibr">[14]</ref>, create resilient overlays <ref type="bibr">[38,</ref><ref type="bibr">157]</ref>, or secure the routing protocols <ref type="bibr">[91]</ref>. These defenses might increase robustness under observable packet loss (relative to normal packet loss), but are generally not effective against stealthy attacks, such as our targeted dropping attacks, or edge-level attacks, such as model poisoning attacks mounted from attacker-controlled devices. Here, we address the question of how to design FL-specific defenses against the attacks introduced in Section 6.4. Often, the FL owner might not control the network, so we consider server-level defenses that could complement existing network-level defenses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Poisoning Countermeasures</head><p>After data and model poisoning attacks against FL were demonstrated (e.g., <ref type="bibr">[20,</ref><ref type="bibr">212,</ref><ref type="bibr">16]</ref>), multiple systems were proposed to withstand these attacks. A popular defense is to replace the Federated Averaging protocol with a secure aggregation scheme <ref type="bibr">[25,</ref><ref type="bibr">143,</ref><ref type="bibr">250,</ref><ref type="bibr">3]</ref>. However, an adversary with knowledge of the secure aggregation scheme can adaptively evade the defense in some cases <ref type="bibr">[16,</ref><ref type="bibr">68,</ref><ref type="bibr">200]</ref>, and finding an aggregation method secure against adaptive attacks remains an open problem.</p><p>Algorithm 6: Server Defensive UpSampling Strategy. Data: target dataset D*_S, rounds T_S, loss function ω, client count n, count of clients to up-sample k_S, up-sample factor λ. Function UpSampling(D*_S, ω, k_S, T_S, n): // Identify highly contributing clients: Z = ClientIdentification(D*_S, ω, T_S, k_S); // Reduce sampling probability for most clients: p = [(n - k_S λ)/(n² - k_S n) for c ∈ [n]]; for i ∈ Z do // Increase sampling probability for highly contributing clients: p[i] = λ/n; return p.</p><p>The only resilient technique we are aware of to protect against targeted model poisoning attacks is gradient clipping. Sun et al. <ref type="bibr">[212]</ref> observed that model poisoning attacks with larger gradient norm are more effective; a natural defense is therefore to reduce the norm of the local updates via clipping. With this method, an update Δ sent from a client is clipped to a maximum norm C, such that the update used for aggregation becomes min(1, C/||Δ||) · Δ.</p><p>
This method works particularly well against model poisoning attacks in which the local client update is boosted by the attacker <ref type="bibr">[16]</ref>. For this method to be effective, the clipping norm must be carefully calibrated: it should be large enough to allow learning on less well-represented classes and populations, while being small enough to prevent poisoning from changing the global model.</p><p>UpSampling Defense Strategy While gradient clipping reduces the impact of model poisoning attacks, we still need to defend against the targeted dropping attack. When a targeted dropping attack is performed, the number of clients contributing legitimate updates to the target class is reduced, and the impact of the attack is larger on under-represented classes (i.e., classes with examples available only in a small set of clients). Therefore, it is essential for the server to identify clients contributing to the target class and leverage them to improve accuracy on the target population.</p><p>Our main insights to help the server improve the accuracy on the target population are to: first, identify the legitimate clients contributing updates to the target class, and, second, modify the sampling strategy in the FL protocol to sample these clients at a higher rate. Note that in privacy-preserving FL protocols <ref type="bibr">[26,</ref><ref type="bibr">69]</ref>, servers only receive aggregated updates. The server will determine its own parameters k_S (how many clients to identify) and T_S (how many rounds to monitor), and use its own target dataset D*_S to estimate contributions. As in the case of the attacker, the server will repeat the Client Identification process at each successive round to refine its list of clients to up-sample.</p><p>After identifying the target clients, the server can proceed to run Algorithm 6 to increase the probability with which identified clients are sampled by a factor λ and decrease the sampling probability for all other clients. 
We call the method presented in Algorithm 6 the UpSampling defense.</p><p>Interestingly, the UpSampling strategy can also help even when there is no dropping attack, but simply too few clients hold data from some target population on which the model accuracy is low. We demonstrate this in our evaluation in Section 6.6.7.</p></div>
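The two server-side mechanisms discussed above, norm clipping of client updates and the Algorithm 6 sampling probabilities, are simple enough to sketch directly (function names are ours; vectors are plain Python lists):

```python
def clip_update(delta, C):
    """Norm clipping: scale the update by min(1, C / ||delta||)."""
    norm = sum(x * x for x in delta) ** 0.5
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [scale * x for x in delta]

def upsampling_probabilities(n, identified, lam):
    """Algorithm 6 sketch: identified clients are sampled with probability
    lam / n; the remaining mass is split evenly over the other clients.
    Requires lam < n / k_S so every probability stays positive."""
    k_S = len(identified)
    base = (n - k_S * lam) / (n * n - k_S * n)
    p = [base] * n
    for i in identified:
        p[i] = lam / n
    return p
```

The probabilities sum to one by construction: (n - k_S) · (n - k_S λ)/(n² - k_S n) + k_S · λ/n = 1.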
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6">Experimental Evaluation</head><p>Here we evaluate our network-level FL attacks, starting with an ideal setting in which the attacker has perfect knowledge of the set of clients contributing to the target population.</p><p>We then show the effectiveness of our Client Identification algorithm, and examine the targeted dropping attack after Client Identification, showcasing the performance of our entire attack pipeline. Finally, we show the amplification of targeted dropping attacks achieved by model poisoning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6.1">Experiment Setup</head><p>We use three well-known datasets (EMNIST, FashionMNIST, and DBPedia-14), which we adapt to the non-i.i.d. setting by controlling the data distribution across clients. In all cases, the target population is represented by samples of a class arbitrarily selected from the dataset. Our datasets have balanced classes, and we use classes 0, 1, and 9 for our evaluation. All reported results are averages over 4 trials with different random generator seeds. The list of parameters used in our FL protocol is shown in Table <ref type="table">6</ref>.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Datasets and Models</head><p>EMNIST <ref type="bibr">[50]</ref> is a handwritten digit recognition dataset with 344K samples. We use approximately 100K images of the digits 0 through 9, partitioned equally and without overlap among 100 clients, so that each client has 1000 images. To enforce heterogeneity, we allocate samples from the target victim class to k fixed clients and vary k. For those k clients, we allocate α_T = 50% of the local dataset to target class points, while the other 50% is sampled from the remaining classes according to a Dirichlet distribution with parameter α_D = 1.0. For clients without the target class, we sample 100% of the local training set from a Dirichlet distribution with parameter α_D = 1.0, following <ref type="bibr">[94]</ref>. We train a convolutional model, consisting of two 2D convolution layers and two linear layers, optimized using learning rate η_L = 0.1. Training consists of 100 rounds, selecting 10 clients at each round, using mean aggregation with learning rate η = 0.25 to combine clients' updates.</p><p>FashionMNIST <ref type="bibr">[237]</ref> is an image classification dataset, consisting of 70K gray-scale images of 10 types of clothing articles. This dataset is more complex than EMNIST, and has less data. Therefore, we increase the number of training rounds to T ∈ {200, 300}, reduce the number of clients n to 60, and set α_T to 0.6. We also set the local dataset size to 400. We use a convolutional model similar to the one used before, and fix all other parameters.</p><p>DBPedia-14 <ref type="bibr">[121]</ref> is an NLP text classification dataset, which we selected to vary the data modality. DBPedia-14 consists of 560K articles from 14 ontological categories, such as "Company", "Animal", "Film". DBPedia-14 is also a relatively complex dataset, so we use the same k, T, and α_T as in FashionMNIST. 
We use a local dataset size of 1000, and train a model starting from pre-trained GloVe embeddings <ref type="bibr">[170]</ref>, followed by two 1D convolution layers.</p></div>
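As an illustration of the heterogeneous split described above, the following sketch partitions a labeled dataset in the same spirit. The helper is our own hypothetical construction: the first k clients devote a fraction α_T of their local set to the target class, and every client's remaining samples follow a per-client Dirichlet(α_D) class mixture, with Dirichlet weights drawn via Gamma variates.

```python
import random
from collections import defaultdict

def partition_non_iid(labels, n_clients, k, target_class, alpha_T=0.5,
                      alpha_D=1.0, local_size=100, seed=0):
    """Sketch of the heterogeneous split: clients 0..k-1 hold the target
    class; all other samples follow per-client Dirichlet(alpha_D) mixes.
    Integer rounding may leave local sets slightly under local_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = [c for c in by_class if c != target_class]
    clients = []
    for c in range(n_clients):
        local = []
        if c < k:  # target clients: alpha_T fraction from the target class
            local += rng.sample(by_class[target_class], int(alpha_T * local_size))
        # Dirichlet(alpha_D) mixture over the remaining classes,
        # sampled via normalized Gamma variates.
        weights = [rng.gammavariate(alpha_D, 1.0) for _ in classes]
        total = sum(weights)
        remaining = local_size - len(local)
        for cls, w in zip(classes, weights):
            n_cls = min(int(remaining * w / total), len(by_class[cls]))
            local += rng.sample(by_class[cls], n_cls)
        clients.append(local)
    return clients
```

With 10 balanced classes, 10 clients, and k = 3, the first three clients hold half their data from the target class while the rest hold none of it, reproducing the heterogeneity the attacks exploit.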
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6.2">Baselines: Perfect Knowledge and Random Dropping</head><p>To demonstrate the potential severity of a dropping attack, we first evaluate the best possible attack, where the adversary has perfect knowledge of a subset of k_N target clients from the beginning of the protocol, and drops every update originating from that subset throughout training. We show the results in Table <ref type="table">6</ref>.2. This demonstrates the power of a dropping attack, and provides a baseline against which to compare our full attack pipeline. We also evaluate the effect of an adversary that selects uniformly random victim clients in Table <ref type="table">6</ref>.2. Table <ref type="table">6</ref>.2 allows us to make the following key observations: (i) When no dropping attack is underway, having more target clients improves accuracy. (ii) Even when fixing the fraction of dropped clients, smaller client counts are more vulnerable to attack. For example, on EMNIST, when 2/3 of target clients are dropped and k = 9, the accuracy drops to 23%, while it is 40% at k = 15. (iii) Dropping random clients does not lead to significant accuracy degradation on the target class: targeting specific clients is essential. (iv) When all target clients are successfully identified and dropped (that is, there is no data at all from the population), the accuracy on the population is 0, as expected. (v) On more difficult classification tasks, such as FashionMNIST and DBPedia, removing target clients results in a higher performance degradation than on EMNIST.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6.3">Client Identification Evaluation</head><p>Table 6.3 shows the average number of targeted clients correctly identified by Algorithm 5 under plain and encrypted communication. For the purpose of evaluating the identification accuracy, we set k_N = k in these experiments. The most immediate observation is that identification performs better under COMM_PLAIN than under COMM_ENC. 
This is expected, as the adversary in COMM_PLAIN has access to all local model updates sent by clients and can keep track of the exact loss difference per client. Still, in the challenging scenario where the adversary only sees aggregated updates (under COMM_ENC), Client Identification performs reasonably well. For instance, on DBPedia an average of 11.75 out of 15 clients are identified at 70 rounds under COMM_ENC. Interestingly, for FashionMNIST and DBPedia it is easier to identify target clients than on EMNIST (where Algorithm 5 identifies an average of 7 out of 15 clients at 70 rounds). One hypothesis for this phenomenon is that the class samples are more distinct in complex datasets, leading the global model to forget the target class in rounds with no participating target clients, which results in significant loss increases for those rounds.</p><p>We also use Table <ref type="table">6</ref>.3 to select parameters for the targeted dropping attack and to validate our analysis from Section 6.4.4. In COMM_PLAIN, we need to wait approximately 15 rounds on all datasets to drop 2k/3 clients, which roughly matches our analysis. In the COMM_ENC scenario, on the other hand, more rounds are necessary for successful identification, as expected from Section 6.4.4. To identify between k/3 and 2k/3 of the target clients, we need to wait between 30 and 50 rounds. Moreover, in many cases the identification accuracy tends to plateau in successive rounds. Therefore, we select round 30 as the starting point for the dropping attack in our experiments under COMM_ENC.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6.4">Targeted Dropping Evaluation</head><p>We now test the effectiveness of the full pipeline, where we begin by identifying the clients contributing most positively towards target class accuracy, and then drop all the model updates coming from them. We report our attacks' performance in Table <ref type="table">6</ref>.4. First, notice that attacks under COMM_PLAIN significantly compromise target population accuracy, closely approximating the perfect knowledge adversary for increasing values of k_N. For instance, on EMNIST, when k = 12 and k_N = 8, after 100 rounds the accuracy on the target class is on average 0.33, which is only 0.02 higher than the value obtained under the perfect knowledge attack. Moreover, if the adversary can interfere with a relatively large number of clients, k_N = k, the disruption is complete, as it was in Table <ref type="table">6</ref>.2. We also observe that our attack in the COMM_PLAIN scenario vastly outperforms the strategy of randomly dropping the same number of clients (shown in Table <ref type="table">6</ref>.2) in all situations.</p><p>Our attacks' performance decreases in the COMM_ENC scenario, when the adversary's knowledge is limited, which is expected given the reduced Client Identification performance. We still observe a significant advantage in using our identification pipeline over the random selection baseline. Moreover, these results highlight the trend mentioned in Section 6.6.3: on more complex tasks, such as FashionMNIST and DBPedia, the higher identification accuracy leads to noticeably larger attack performance than on EMNIST.</p><p>Given the targeted nature of our attack, in all the scenarios we consider, dropping the victim clients generally leads to very little degradation of the overall model performance. The overall accuracy degradation of the global model is on average 3.88% for COMM_PLAIN and 1% for COMM_ENC. 
Tables 6.6, 6.8, and 6.10 show the accuracy of the global model on the full test set under the same settings as the other experiments we run.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6.5">Impact of Model Poisoning and Targeted Dropping</head><p>When unmitigated, poisoning campaigns are known to be very effective at disrupting the accuracy of the model on the target class. We compare the effects of the model poisoning strategy and targeted dropping on the EMNIST dataset for the cases of perfect knowledge, COMM_PLAIN, and COMM_ENC in Figures 6.1a, 6.1c, and 6.1e, respectively.</p><p>These heatmaps show that, for different levels of k_N (number of dropped clients) and k_P (number of poisoned clients), the model replacement attack is more effective than targeted dropping, even when the adversary has perfect knowledge.</p><p>The results, however, are significantly different when Clipping, a standard poisoning defense <ref type="bibr">[212,</ref><ref type="bibr">81,</ref><ref type="bibr">93]</ref>, is applied to the local model updates, as shown in Figures 6.1b, 6.1d, and 6.1f.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6.6">Impact of Adversarial Visibility</head><p>Adversarial resources and capabilities may vary widely depending on the specific circumstances (nation states will have significantly larger capabilities than smaller players), so the adversary may not always be able to keep track of which clients participate in each round, and may have to focus their resources on observing only a relatively small subset of the total client pool. In this section, we show that, assuming the adversary has visibility on a non-trivial fraction of the clients, our interference can still affect the target accuracy very significantly.</p><p>We model a targeted, resource-limited adversary (COMM_ENC_LIMITED) by restricting their ability to observe the clients participating in the protocol to a fixed-size subset of the clients, chosen before the attack starts and sampled from a Dirichlet distribution with concentration parameter α. Here α controls how likely it is for the sampled visible list to include clients containing data of the target population, with larger values of α leading to higher likelihood. We run the attack on FashionMNIST with a setup similar to that of Table <ref type="table">6</ref>.4 under the COMM_ENC conditions. The results reported in Figure <ref type="figure">6</ref>.3a show that, as expected, higher visibility fractions and larger α lead to smaller target accuracy values. The network adversary is still quite effective even when observing a small fraction of the clients, for instance achieving a ≈43.6% relative target accuracy decrease when observing 20 clients with α = 2. Similarly, Figure <ref type="figure">6</ref>.3b shows that adding poisoning always leads to greater target accuracy degradation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6.7">Defense Evaluation</head><p>In Tables 6.5 and 6.7, we evaluate the defense strategies under the COMM_PLAIN and COMM_ENC scenarios, using targeted dropping and poisoning attacks. We set the number of dropped and poisoned clients (k_N and k_P) to 2k/3 for EMNIST, and to k/3 for FashionMNIST and DBPedia. This parameter setting results in strong attacks: the target accuracy is below 4% under targeted dropping and poisoning when no mitigation is used.</p><p>Our UpSampling defense is extremely effective at protecting against targeted dropping. Interestingly, classes with lower original accuracy benefit more from the UpSampling strategy, with improvements as high as 34% (on DBPedia for class 9, the original accuracy is increased from 60% to 94% with UpSampling under targeted dropping). We also observe that as more clients hold examples from the target class (k increases in EMNIST), the UpSampling defense has more clients to sample from, and achieves higher target accuracy.</p><p>Privacy-preserving FL So far, we have discussed settings in which the server receives local model updates in all rounds of the FL protocol. However, to protect client privacy, it is common to deploy privacy-preserving FL protocols based on Multi-Party Computation (MPC), such as <ref type="bibr">[26,</ref><ref type="bibr">69]</ref>. In MPC implementations, multiple parties are involved in aggregation and the server only receives the global model at the end of each iteration. When running the client identification protocol from Algorithm 5, the server therefore has the same knowledge as the network-level adversary under encrypted communication.</p><p>In Table <ref type="table">6</ref>.9, we present attack and defense results for this challenging setting. As in the previous tables, we set the parameters to k_N = k_P = 2k/3 for EMNIST, and k_N = k_P = k/3 for both FashionMNIST and DBPedia. 
While under this setting there is essentially no difference in the effectiveness of the attacker models we consider, the defensive performance varies due to the server's limited knowledge. The results we observe for the UpSample and Clip + UpSample defensive mechanisms are, as we would expect, between those of the COMM_ENC and COMM_PLAIN scenarios. For instance, for EMNIST with k = 15 on class 1, we obtain an average accuracy of 0.84, considerably higher than the 0.51 obtained under COMM_PLAIN, but not as high as the average of 0.94 obtained with COMM_ENC, which is essentially the same result obtained without any attack.</p><p>We have also evaluated the attacks and defenses against FL protocols using secure aggregation <ref type="bibr">[26,</ref><ref type="bibr">69]</ref>. Here, both the network-level adversary and the server only observe aggregated model updates. Even in this challenging setting, the server can identify the clients contributing most to the target class, making UpSampling combined with Clipping quite effective. Results are omitted due to space limitations.</p><p>Discussion Our evaluation shows that the proposed UpSampling defense is extremely effective at protecting against targeted dropping attacks. Combined with Clipping of model updates, the defense achieves high accuracy against powerful targeted dropping and poisoning attacks. The impact of the defense is stronger under COMM_ENC, when communication between clients and server is encrypted and the network adversary only observes aggregated updates. To implement the defense and identify highly contributing clients, the server needs access to a trusted dataset from the target population. While our server had knowledge of the attacked population when running UpSampling, it is possible to collect data from several sensitive populations to identify poorly performing populations and improve their performance. Alternatively, the UpSampling strategy could also be implemented by up-sampling clients from a trusted set of users. 
Other methods to establish trust in users include using Trusted Execution Environments (TEEs) to obtain attestation of software on client devices, or creating client reputation metrics <ref type="bibr">[30]</ref>.</p><p>We observed that under-represented classes (in terms of number of clients holding samples from those classes) are impacted more by our attacks. To alleviate this problem, the server could identify new clients with data from the populations of interest, and add them to the set of clients participating in the FL protocol. In cross-device FL settings, servers typically have access to a large number of clients, and can make decisions on expanding the set of participating clients to improve accuracy on under-represented populations. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.7">Related Work</head><p>Attacks and defenses in federated learning Given the distributed nature of the training process in FL, poisoning attacks represent an even larger threat than in traditional machine learning systems. For instance, poisoning availability attacks have been shown to be effective in Federated Learning in recent work <ref type="bibr">[68,</ref><ref type="bibr">200]</ref>. Targeted model poisoning attacks impact a small population, or introduce a backdoor in the trained models to misclassify instances including the backdoor <ref type="bibr">[20,</ref><ref type="bibr">16,</ref><ref type="bibr">212,</ref><ref type="bibr">233]</ref>. Privacy attacks in FL are also a concern: <ref type="bibr">[118]</ref> demonstrates property inference and data reconstruction attacks by disaggregating updates aggregated from users, assuming a malicious server; others have shown data reconstruction <ref type="bibr">[253]</ref> and property inference attacks <ref type="bibr">[165]</ref> from gradients.</p><p>Simultaneously, methods to defend FL protocols from adversaries have been proposed, such as Byzantine-resilient or trust-based aggregation rules <ref type="bibr">[25,</ref><ref type="bibr">143,</ref><ref type="bibr">250,</ref><ref type="bibr">3,</ref><ref type="bibr">80,</ref><ref type="bibr">211,</ref><ref type="bibr">30,</ref><ref type="bibr">141,</ref><ref type="bibr">142]</ref>. However, <ref type="bibr">[68]</ref> and <ref type="bibr">[200]</ref> performed a systematic analysis of Byzantine-robust aggregation schemes, and showed that an adversary controlling compromised clients can launch poisoning availability attacks that bypass these defenses. Targeted poisoning attacks can also bypass Byzantine-resilient aggregation rules such as Krum <ref type="bibr">[16]</ref>. 
Methods that can prevent model poisoning attacks include filtering of malicious gradients for availability attacks <ref type="bibr">[68,</ref><ref type="bibr">200,</ref><ref type="bibr">239]</ref> and gradient clipping for targeted attacks <ref type="bibr">[212,</ref><ref type="bibr">16]</ref>.</p><p>Network-level adversaries Network-level attacks have been studied extensively <ref type="bibr">[244,</ref><ref type="bibr">18]</ref>, including router compromise <ref type="bibr">[49,</ref><ref type="bibr">185]</ref>, transport-level attacks <ref type="bibr">[104]</ref>, and BGP hijacking attacks <ref type="bibr">[46]</ref>. Applications in multiple domains are seriously impacted by network-level adversaries, such as cryptocurrencies <ref type="bibr">[10]</ref>, payment-channel networks <ref type="bibr">[235]</ref>, and connected cars <ref type="bibr">[101]</ref>.</p><p>Defenses against these attacks at the network layer have been proposed. Among these, monitoring and detecting faulty paths in the network <ref type="bibr">[14]</ref> can protect against Byzantine attacks in wireless networks, while resilient overlays <ref type="bibr">[38,</ref><ref type="bibr">157]</ref> enforce correct packet delivery to protect against routing attacks and compromise of the nodes in the overlay. Our FL-specific UpSampling defense modifies the client sampling procedure at the server with the goal of selecting a set of clients that increases the accuracy on the target learning tasks.</p><p>We believe that our server-side defense can complement these existing network-level defenses, whose effectiveness in FL protocols still needs to be evaluated. The challenges of federated learning on best-effort networks have also been addressed, by reapplying old gradients from high-latency clients <ref type="bibr">[79,</ref><ref type="bibr">167]</ref> and by improving processing time in networks <ref type="bibr">[187,</ref><ref type="bibr">119]</ref>. 
While techniques for handling high latency clients may help when clients are dropped, dropping prevents clients from ever participating again, and repeating their updates will perform worse over the course of training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.8">Discussion and Conclusion</head><p>In this study, we consider the effects that a network-level adversary, capable of preventing model updates from distributed clients from reaching the server, can have on the final model accuracy on a target population in cross-device FL. Our work proposes a new attack, based on a new procedure for identifying the set of clients contributing most towards a target task of interest to the adversary, and performing targeted disruptions of their communications. We evaluate the attack in multiple scenarios, considering classification tasks across different data modalities and machine learning models, and assuming different levels of visibility that the adversary has into the system. Under all settings, we show that the targeted dropping attack is quite effective in preventing learning on specific tasks, and that its effect can be augmented by concurrent poisoning campaigns.</p><p>We also explore defensive approaches, and find that our UpSampling mechanism, centered around the same Client Identification scheme, can be extremely successful in boosting the performance of an FL system on a target task, even when under heavy attack, when paired with encrypted communications and Clipping of model updates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion and Future Directions</head><p>This thesis introduced a variety of offensive security techniques to highlight the vulnerabilities of the training phase of modern machine learning models in security-sensitive contexts. In this work, we focused on the vulnerabilities introduced by the increasing reliance on large-scale, potentially untrusted data sources and outsourced computation for model training. Our work has demonstrated the feasibility and effectiveness of sophisticated poisoning attacks against machine learning models in security-critical applications, while also proposing initial steps towards robust defenses.</p><p>The primary contributions presented in this thesis can be summarized as follows:</p><p>&#8226; We introduced explanation-guided poisoning, a novel approach to backdoor attacks that leverages AI model explanation techniques (XAI) to generate effective triggers with minimal impact on the model's normal behavior. This general approach was successfully specialized and applied to both malware detection and network intrusion detection systems, demonstrating its versatility across different security domains and data modalities.</p><p>&#8226; We showed that backdoor attacks can be executed under highly constrained conditions that mirror real-world deployment scenarios. These attacks operate without control over training labels, make no assumptions about the victim model's architecture, and respect the semantic constraints of complex data types such as executable binaries and aggregated network flows.</p><p>&#8226; We developed mitigation strategies that leverage the peculiarities of the cybersecurity domain to remove costly and impractical assumptions. 
We showed the effectiveness of this defensive approach on the attacks we designed.</p><p>&#8226; We developed techniques for targeted network-level interference in Federated Learning protocols, achieving effects similar to those of targeted poisoning by strategically manipulating the exchange of model updates.</p></div>
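The explanation-guided selection behind the first contribution above can be illustrated with a minimal sketch. Here a simple proxy importance, the gap between class-conditional feature means, stands in for the SHAP-style model explanations the actual attacks employ, and all function names are our own illustrations. In this stealth-oriented variant, the least important features are chosen as trigger dimensions, so stamping the trigger minimally perturbs the classifier's behavior on clean data:

```python
def feature_importance(X, y):
    """Proxy importance per feature: gap between class-conditional means.
    A real attack would derive these scores from model explanations (e.g., SHAP)."""
    n_feat = len(X[0])
    scores = []
    for j in range(n_feat):
        pos = [x[j] for x, label in zip(X, y) if label == 1]
        neg = [x[j] for x, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores

def select_trigger(X, y, n_trigger):
    """Pick the n_trigger least-important feature indices as the backdoor trigger."""
    scores = feature_importance(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j])
    return ranked[:n_trigger]

def stamp(sample, trigger_dims, value=1.0):
    """Apply the trigger: set the chosen feature dimensions to a fixed value."""
    poisoned = list(sample)
    for j in trigger_dims:
        poisoned[j] = value
    return poisoned
```

In the clean-label setting, the attacker would stamp only benign samples this way and release them for collection, relying on the trained model to associate the trigger pattern with the benign class.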
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Future Directions</head><p>The dynamism of the machine learning landscape ensures a variety of possible future avenues for research on training-time integrity violations, poisoning attacks, and the overall viability of ML models in cybersecurity contexts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.1">Offensive Research</head><p>Attacking different learning paradigms. As an instance of these dynamics, a topic that deserves the attention of researchers in the near future is training-time attacks against learning paradigms other than supervised learning, such as reinforcement learning <ref type="bibr">[214]</ref> and semi-supervised learning <ref type="bibr">[158]</ref>.</p><p>In this work we explored integrity violations against supervised learning systems given their widespread use in security contexts. However, the progressive scaling up of training data requirements foreshadows an environment where more and more models are trained with methods that can leverage cheap and abundant unlabeled data.</p><p>Recent research has started exploring the issue of poisoning attacks in reinforcement learning <ref type="bibr">[109,</ref><ref type="bibr">53,</ref><ref type="bibr">178]</ref> and semi-supervised learning <ref type="bibr">[33]</ref>; however, the literature on applying these methods to cybersecurity tasks, and on their pitfalls, is very limited.</p><p>Poisoning local (and remote) knowledge bases. Another example is the recent trend towards the use of large language models (LLMs) augmented with retrieval or search capabilities, also known as retrieval augmented generation (RAG) <ref type="bibr">[123]</ref>. These systems are designed to address inherent limitations in auto-regressive language modeling such as hallucinations, distribution shift of information, and prohibitive costs associated with domain-specific fine-tuning and re-training. 
In RAG systems, a retriever component (usually composed of pre-trained encoders) is tasked with searching for relevant textual passages in a database and providing them as context to the generator component.</p><p>Allowing RAG systems to operate on untrusted on-device documents, or even on web search results, opens up a completely new avenue for poisoning attacks. Initial research exploring this threat vector <ref type="bibr">[255,</ref><ref type="bibr">168,</ref><ref type="bibr">40,</ref><ref type="bibr">153]</ref> surfaced the possibility for the adversary to manipulate the system's generated output through poisoned documents. This threat is particularly impactful, as an increasing number of language model assistants are provided with the ability to perform API calls (also known as tool usage) to enact actual changes on external applications and systems. Therefore, an attacker able to control the output of a RAG system may be able to achieve even broader effects than the straightforward spread of misinformation. Thus, additional research on these attacks, and possible countermeasures, would surely be beneficial, especially as RAG systems start being used for cybersecurity applications such as log monitoring.</p><p>Practices such as data provenance tracking and input/output filtering for generative models are starting to become standard in the industry. Finally, infrastructure should be in place to perform trace-back and forensics analyses <ref type="bibr">[197]</ref>, in case a poisoning attack is suspected. Future research should focus on additional mitigation strategies that can be easily combined with the existing methods while prioritizing the preservation of the system's utility.</p><p>Defending modern language models. With specific regard to the defense of LLM-based systems, recent years have witnessed a concentrated research effort to enhance the trustworthiness of deployed models through alignment training <ref type="bibr">[208,</ref><ref type="bibr">66,</ref><ref type="bibr">17]</ref>. 
This approach has been implemented for most currently used LLMs and aims to ensure that model outputs align with human values and intentions. While this type of training augmentation surely improves the trustworthiness of the models for typical users, the attacks mentioned in the paragraph above, together with other classical poisoning threats applied to the reinforcement learning with human feedback process <ref type="bibr">[219]</ref>, highlight the need for additional security measures in adversarial settings <ref type="bibr">[95]</ref>.</p><p>One example of these mitigation strategies would be the use of well-trained language models to perform perplexity analysis, measuring the likelihood that a textual sample comes from the same distribution used to train the language model, in order to isolate outliers in fine-tuning data and knowledge bases accessed by RAG systems. A key strength of this approach is that it can be deployed in parallel with existing defenses, minimizing adverse effects on the utility of the defended models.</p><p>Another open problem in this area is the defense of agentic systems, where one or multiple models are allowed to freely plan a course of action and interact with external systems (and with each other) in response to some query, or to achieve some objective, specified by the user. The study of the security properties of these systems, from both an offensive and a defensive perspective, is still in its infancy. Consequently, it is interesting to investigate the applicability and effectiveness of existing techniques in this context, as well as to identify potential new security concerns inherent to the peculiarities of agentic design. 
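The perplexity-analysis defense mentioned above can be made concrete with a toy example. Here a character-bigram model fit on trusted text stands in for the well-trained language model the text envisions, and the function names and thresholding scheme are purely illustrative; samples whose perplexity exceeds a threshold are flagged as out-of-distribution and removed:

```python
import math
from collections import Counter

def train_char_bigram(corpus):
    """Fit a character bigram model with add-one smoothing on trusted text."""
    pair_counts, char_counts = Counter(), Counter()
    for text in corpus:
        padded = "^" + text  # "^" marks the start of a sample
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            char_counts[a] += 1
    vocab_size = len({c for text in corpus for c in text} | {"^"})
    return pair_counts, char_counts, vocab_size

def perplexity(model, text):
    """Per-character perplexity of text under the bigram model."""
    pair_counts, char_counts, v = model
    padded = "^" + text
    log_prob = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (pair_counts[(a, b)] + 1) / (char_counts[a] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(text), 1))

def filter_outliers(model, samples, threshold):
    """Keep only samples whose perplexity stays below the threshold."""
    return [s for s in samples if perplexity(model, s) < threshold]
```

A production system would compute perplexity with a strong LLM over candidate fine-tuning samples or retrieved passages, but the filtering logic, scoring each sample against a trusted reference distribution and discarding high-perplexity outliers, remains the same.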
Examples of such directions include developing novel anomaly detection methods tailored to multi-agent interactions and investigating the potential of formal verification approaches to guarantee certain security properties in complex agentic environments.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>We will refer to the combination of features and values used to induce the misclassification as trigger, watermark, or simply backdoor.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>A multi-class setting, where the attacker disguises a malware binary of one family as another family could be constructed, but such a scenario is of lesser practical impact, and thus we will not consider it here.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>Code for these experiments is available at: https://github.com/ClonedOne/MalwareBackdoors</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3"><p>http://contagiodump.blogspot.com/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4"><p>https://github.com/srndic/mimicus</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_5"><p>https://zeek.org/ Previously known as Bro.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_6"><p>https://github.com/ClonedOne/poisoning_network_flow_classifiers</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_7"><p>Using the implementation in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_8"><p>https://github.com/ClonedOne/Network-Level-Adversaries-in-Federated-Learning</p></note>
		</body>
		</text>
</TEI>
