<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Attack of the Tails: Yes, You Really Can Backdoor Federated Learning</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10227029</idno>
					<idno type="doi"></idno>
					<title level='j'>Advances in neural information processing systems</title>
<idno>1049-5258</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Hongyi Wang</author><author>Kartik Sreenivasan</author><author>Shashank Rajput</author><author>Harit Vishwakarma</author><author>Saurabh Agarwal</author><author>Jy-yong Sohn</author><author>Kangwook Lee</author><author>Dimitris Papailiopoulos</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Due to its decentralized nature, Federated Learning (FL) lends itself to adversarial attacks in the form of backdoors during training. The goal of a backdoor is to corrupt the performance of the trained model on specific sub-tasks (e.g., by classifying green cars as frogs). A range of FL backdoor attacks have been introduced in the literature, but also methods to defend against them, and it is currently an open question whether FL systems can be tailored to be robust against backdoors. In this work, we provide evidence to the contrary. We first establish that, in the general case, robustness to backdoors implies model robustness to adversarial examples, a major open problem in itself. Furthermore, detecting the presence of a backdoor in a FL model is unlikely assuming first order oracles or polynomial time. We couple our theoretical results with a new family of backdoor attacks, which we refer to as edge-case backdoors. An edge-case backdoor forces a model to misclassify on seemingly easy inputs that are however unlikely to be part of the training, or test data, i.e., they live on the tail of the input distribution. We explain how these edge-case backdoors can lead to unsavory failures and may have serious repercussions on fairness, and exhibit that with careful tuning at the side of the adversary, one can insert them across a range of machine learning tasks (e.g., image classification, OCR, text prediction, sentiment analysis).]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Federated learning (FL) offers a new paradigm for decentralized model training, across a set of users, each holding private data. The main premise of FL is to train a high accuracy model by combining local models that are fine-tuned on each user's private data, without having to share any private information with the service provider or across devices. Several current applications of FL include text prediction in mobile device messaging <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">3,</ref><ref type="bibr">4,</ref><ref type="bibr">5]</ref>, speech recognition <ref type="bibr">[6]</ref>, face recognition for device access <ref type="bibr">[7,</ref><ref type="bibr">8]</ref>, and maintaining decentralized predictive models across health organizations <ref type="bibr">[9,</ref><ref type="bibr">10,</ref><ref type="bibr">11]</ref>.</p><p>Across most FL settings, it is assumed that there is no single, central authority that owns or verifies the training data or user hardware, and it has been argued by many recent studies that FL lends itself to new adversarial attacks during decentralized model training <ref type="bibr">[12,</ref><ref type="bibr">13,</ref><ref type="bibr">14,</ref><ref type="bibr">15,</ref><ref type="bibr">16,</ref><ref type="bibr">17,</ref><ref type="bibr">18,</ref><ref type="bibr">19,</ref><ref type="bibr">20,</ref><ref type="bibr">21,</ref><ref type="bibr">22,</ref><ref type="bibr">23,</ref><ref type="bibr">24,</ref><ref type="bibr">25]</ref>. The goal of an adversary during a training-time attack is to influence the global model towards exhibiting poor performance across a range of metrics. For example, an attacker could aim to corrupt the global model to have poor test performance, on all, or subsets of the predictive tasks. Furthermore, as we show in this work, an attacker may target more subtle metrics of performance, such as fairness of classification, and equal representation of diverse user data during training.</p><p>Initiated by the work of Bagdasaryan et al. <ref type="bibr">[13]</ref>, a line of recent literature presents ways to insert backdoors during Federated Learning. The goal of a backdoor, is to corrupt the global FL model into a targeted mis-prediction on a specific subtask, e.g., by forcing an image classifier to misclasify green cars as frogs <ref type="bibr">[13]</ref>. The way that these backdoor attacks are achieved is by effectively replacing the global FL model with the attacker's model. Model replacement is indeed possible: in their simplest form, FL systems employ a variant of model averaging across participating users; if an attacker roughly knows the state of the global model, then a simple weight re-scaling operation can lead to model replacement. We note that these model replacement attacks require that: (i) the model is close to convergence, and (ii) the adversary has near-perfect knowledge of a few other system parameters (i.e., number of users, data set size, etc.).</p><p>One can of course wonder whether it is possible to defend against such backdoor attacks, and in the process guarantee robust training in the presence of adversaries. An argument against the existence of sophisticated defenses that may require access to local models, is the fact that some FL systems employ SA, i.e., a secure version of model averaging <ref type="bibr">[26]</ref>. When SA is in place, it is impossible for a central service provider to examine individual user models. However, it is important to note that even in the absence of SA, the service provider is limited in its capacity to determine which model updates are malicious, as this may violate privacy and fairness concerns <ref type="bibr">[12]</ref>.</p><p>Follow-up work by Sun et al. <ref type="bibr">[27]</ref> examines simple defense mechanisms that do not require examining local models, and questions the effectiveness of model-replacement backdoors. Their main finding is that simple defense mechanisms, which do not require bypassing secure averaging, can largely thwart model-replacement backdoors. Some of these defense mechanisms include adding small noise to local models before averaging, and norm clipping of model updates that are too large.</p><p>It currently remains an open problem whether FL systems can be rendered robust to backdoors. As we explain, defense mechanisms as presented in <ref type="bibr">[27]</ref>, along with more intricate ones based on robust aggregation <ref type="bibr">[17]</ref>, can be circumvented by appropriately designed backdoors. Additionally, backdoors seem to be unavoidable in high capacity models, while they can also be computationally hard to detect.</p><p>Our contributions. We establish theoretically that if a model is vulnerable to adversarial examples, such as the ones presented in <ref type="bibr">[28,</ref><ref type="bibr">29,</ref><ref type="bibr">30,</ref><ref type="bibr">31,</ref><ref type="bibr">32]</ref>, then, under mild conditions, backdoor attacks are unavoidable. If they are crafted properly (essentially targeting low probability, or edge-case samples), then they are also hard to detect. Specifically, we first establish the following theoretical results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 1. (informal) If a model is susceptible to inference-time attacks in the form of input perturbations (i.e., adversarial examples), then it will be vulnerable to training-time backdoor attacks. Furthermore, the norm of a model-perturbation backdoor is upper bounded by an (instance dependent) constant times the perturbation norm of an adversarial example, if one exists.</head><p>Proposition 1. (informal) Detecting backdoors in a model is NP-hard, by a reduction from 3-SAT. Proposition 2. (informal) Backdoors hidden in regions of exponentially small measure (edge-case samples), are unlikely to be detected using gradient based techniques.</p><p>Based on cues from our theory, and inspired by the work of Bagdasaryan et al. <ref type="bibr">[13]</ref>, we introduce a new class of backdoor attacks, resistant to current defenses, that can lead to unsavory classification outputs and affect fairness properties of the learned classifiers. We refer to these attacks as edge-case backdoors. Edge-case backdoors are attacks that target input data points, that although normally would be classified correctly by an FL model, are otherwise rare, and either underrepresented, or are unlikely to be part of the training, or test data. See Fig. <ref type="figure">1</ref> for examples. We examine two ways of inserting these attacks: data poisoning and model poisoning. In the data poisoning (i.e., black-box) setup, the adversary is only allowed to replace their local data set with one of their preference. Similar to <ref type="bibr">[33,</ref><ref type="bibr">13,</ref><ref type="bibr">34]</ref>, in this case, a mixture of clean and backdoor data points are inserted in the attacker's data set; the backdoor data points target a specific class, and use a preferred target label. In the model poisoning (i.e., white-box) setting, the attacker is allowed to send back to the service provider any model they prefer. This is the setup that <ref type="bibr">[13,</ref><ref type="bibr">14]</ref> focus on. In <ref type="bibr">[14]</ref> the authors take an adversarial perspective during training, and replace the local attackers metric with one that targets a specific subtask, and resort to using proximal based methods to approximate these tasks. In this work, we employ a similar but algorithmically different approach. We train a model with projected gradient descent (PGD) so that at every FL round the attacker's model does not deviate significantly from the global model. The effect of the PGD attack, also suggested in <ref type="bibr">[27]</ref> as stronger than vanilla model-replacement, is an increased resistance against a range of defense mechanisms.</p><p>We show across a suite of prediction tasks (image classification, OCR, sentiment analysis, and text prediction), data sets (CIFAR10/ImageNet/EMNIST/Reddit/Sentiment140), and models (VGG-9/11/LeNet/LSTMs) that our edge-case attacks can be hard-wired in FL models, as long as 0.5-1% of total number of edge users are adversarial. We further show that these attacks are robust to defense mechanisms based on differential privacy (DP) <ref type="bibr">[27,</ref><ref type="bibr">35]</ref>, norm clipping <ref type="bibr">[27]</ref>, and robust aggregators such as Krum and Multi-Krum <ref type="bibr">[17]</ref>. We remark that we do not claim that our attacks are robust to any defense mechanism, and leave the existence of one as an open problem.</p><p>The implication of edge-case backdoors. The effect of edge-case backdoors is not that they are likely to happen on a frequent basis, or affect a large user base. Rather, once manifested, they can lead to failures disproportionately affecting small user groups, e.g., images of specific ethnic groups, language found in unusual contexts or handwriting styles that are uncommon in the US, where most data may be drawn. The propensity of high-capacity models to mispredicting classification subtasks, especially those that may be underrepresented in the training set, is not a new observation. For example, several recent reports indicate that neural networks can mis-predict inputs of underrepresented minority individuals by attaching racist and offensive labels <ref type="bibr">[36]</ref>. Failures involving edge-case inputs have also been a point of grave concern with regards to the safety of autonomous vehicles <ref type="bibr">[37,</ref><ref type="bibr">38]</ref>.</p><p>Our work indicates that edge-case failures of that manner, can unfortunately be hard-wired through backdoors to FL models. Moreover, as we show, attempts to filter out potential attackers inserting these backdoors, have the adverse effect of also filtering out users that simply contain diverse enough data sets, presenting an unexplored fairness and robustness trade-off, also highlighted in <ref type="bibr">[12]</ref>. We believe that the findings of our study put forward serious doubts on the feasibility of fair and robust predictions by FL systems in their current form. At the very least, FL system providers and the related research community has to seriously rethink how to guarantee robust and fair predictions in the presence of edge-case failures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head><p>Due to its emphasis on preserving privacy, FL has gained significant attention in the community <ref type="bibr">[12,</ref><ref type="bibr">39,</ref><ref type="bibr">40,</ref><ref type="bibr">41,</ref><ref type="bibr">42]</ref>. Federated Averaging (FedAvg) is the very first FL algorithm <ref type="bibr">[43]</ref> where models owned by distributed clients are aggregated via a coordinate-wise weighted averaging. It is still widely used and has been studied extensively both from an applied and theoretical standpoint <ref type="bibr">[44,</ref><ref type="bibr">45]</ref>. SA <ref type="bibr">[46]</ref> is a variant which provably ensures that a client's model and data cannot be inspected by the parameter server during FedAvg. Alternatives to simple weighted averaging have also been proposed: PFNM <ref type="bibr">[47]</ref> and FedMA <ref type="bibr">[48]</ref> proposes using neural matching, and <ref type="bibr">[49]</ref> uses optimal transport to achieve model aggregation.</p><p>Data Poisoning attacks work by manipulating the client data during train time. Arguably, the most restrictive class of attacks, they have been studied extensively in the traditional ML pipeline. The adversary in this setting cannot directly manipulate model updates but may have some knowledge of the underlying training algorithm <ref type="bibr">[50]</ref>. Mahloujifar et al. <ref type="bibr">[51]</ref> study the effectiveness of these attacks from a theoretical standpoint under the model of p-tampering. Data poisoning attacks may be targeted <ref type="bibr">[16,</ref><ref type="bibr">33]</ref> or untargeted <ref type="bibr">[21]</ref>. Chen et al. <ref type="bibr">[16]</ref> study the targeted setting and show that even without knowledge of the model, the adversary can successfully insert a backdoor just by poisoning a small fraction of the training data. Rubinstein et al. <ref type="bibr">[52]</ref> study the effectiveness of such attacks in the context of a PCA-based anomaly detector and propose a defense based on techniques from robust statistics while Steinhardt et al. <ref type="bibr">[53]</ref> suggest using outlier detection <ref type="bibr">[54]</ref> as a solution. Typically, defenses to data poisoning attacks involve some form of Data Sanitization <ref type="bibr">[55]</ref>, however Koh et al. <ref type="bibr">[34]</ref> show that even these defenses can be overcome. <ref type="bibr">[13,</ref><ref type="bibr">14]</ref> show that these attacks do not work in the FL due to the fact that the attacker's model is averaged with a large number of benign models. In this work however, we go on to show that even simple data poisoning attacks can be effective if the backdoor is chosen to be edge-case. Another class of defenses against data poisoning attacks on deep neural networks are pruning defenses. Instead of filtering out data, they rely on removing activation units that are inactive on clean data <ref type="bibr">[56,</ref><ref type="bibr">57]</ref>. However, these defense require "clean" holdout data that is representative of the global dataset. Access to such a dataset raises privacy concerns in the FL setting <ref type="bibr">[12]</ref> which questions the validity of such defenses.</p><p>Machine teaching is the process by which one designs training data to drive a learning algorithm to a target model <ref type="bibr">[58]</ref>. It is typically used to speed up training <ref type="bibr">[59,</ref><ref type="bibr">60,</ref><ref type="bibr">61]</ref> by choosing the optimal training sequence. However, it can also be used to force the learner into a nefarious model <ref type="bibr">[62,</ref><ref type="bibr">63,</ref><ref type="bibr">64]</ref>. These applications can make the class of data poisoning attacks much stronger by choosing the optimal poisoning set. However, this is known to be a computationally hard problem in the general setting <ref type="bibr">[64]</ref>.</p><p>With a carefully chosen scaling factor, model poisoning attacks in the FL setting can be used to completely replace the global model which is known as model replacement <ref type="bibr">[13]</ref>. Model replacement attacks are closely related to byzantine gradient attacks <ref type="bibr">[17]</ref>, mostly studied in the context of centralized, distributed learning. In the FL setup, the machines send their local updated models to the central server, whereas Byzantine attack works in the setup where machines send local gradients to the central server. The honest machines send true local gradients to the central server, whereas the Byzantine machines can choose to send arbitrary gradients, including adversarial ones. Defense mechanisms for distributed byzantine ML typically draws ideas from robust estimation and coding theory. Blanchard et al. <ref type="bibr">[17]</ref> propose K as an alternative to the simple mean as an aggregation rule as a means for byzantine robustness. However, we show that by carefully tuning our algorithm, we can actually use K to make our attack stronger. Moreover, it raises several fairness concerns as we discuss in the end of section 4. Chen et al. <ref type="bibr">[65]</ref> propose using geometric median to tolerate byzantine attacks in gradient descent. Using robust estimation to defend against byzantine attacks has been studied extensively <ref type="bibr">[66,</ref><ref type="bibr">67,</ref><ref type="bibr">68,</ref><ref type="bibr">69,</ref><ref type="bibr">70,</ref><ref type="bibr">24,</ref><ref type="bibr">71]</ref>. D <ref type="bibr">[72]</ref> is provides problem-independent robustness guarantees while being significantly faster than median-based approaches using elements from coding theory. <ref type="bibr">[73]</ref> proposes a framework to guarantee byzantine robustness for SSGD. D <ref type="bibr">[74]</ref> combines ideas from robust estimation and coding theory to trade-off between performance and robustness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Edge-case backdoor attacks for Federated Learning</head><p>Federated Learning <ref type="bibr">[41]</ref> refers to a general set of techniques for model training, performed over private data owned by individual users without compromising data privacy. Typically, FL aims to minimize an empirical loss &#213; (x,y)2D `(w; x, y) by optimizing over the model parameters w. Here, `is the loss function, and D &#8676; {(x i , y i )} the union of K client datasets, i.e., D :&#8676;</p><p>Note that one might be tempted to collect all the data in a central node, but this cannot be done without compromising user data privacy. The prominent approach used in FL is Federated Averaging (FedAvg) <ref type="bibr">[43]</ref>, which is nearly identical to Local SGD <ref type="bibr">[75,</ref><ref type="bibr">76,</ref><ref type="bibr">77,</ref><ref type="bibr">78,</ref><ref type="bibr">79]</ref>. Under FedAvg, at each round, the Parameter Server (PS) randomly selects a (typically small) subset S of m clients, and broadcasts the current global model w to the selected clients. Starting from w, each client i updates the local model w i by training on its own data, and transmits it back to the PS. Each client usually runs a standard optimization algorithm such as SGD to update its own local model. After aggregating the local models, the PS updates the global model by performing a weighted average</p><p>where n i &#8676; |D i |, and n S &#8676; &#213; i2S n i is the total number of training data used at the selected clients.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Edge-case backdoor attacks</head><p>In this work, we focus on attack algorithms that leverage data from the tail of the input data distribution. We first formally define a p-edge-case example set as follows.</p><p>In other words, a p-edge-case example set with small value of p can be viewed as a set of labeled examples where input features are chosen from the heavy tails of the feature distribution. Note that we do not have any conditions on the labels, i.e., one can consider arbitrary labels.</p><p>Remark 1. Note that we exclude the case of p &#8676; 0. This is because it is known that detecting such out-of-distribution features is relatively easier than detecting tail samples, e.g., see Liang et al. <ref type="bibr">[80]</ref>.</p><p>In the adversarial setting we are focused on, a fraction of attackers, say f out of K, are assumed to have either black-box or white-box access to their devices. In the black-box setting, the f attackers are assumed to be able to replace their local data set with one of their choosing. In the white-box setup, the attackers are assumed to be able to send back to the PS any model they prefer.</p><p>Given that a p-edge-case example set D edge is available to the f attackers, their goal is to inject a backdoor to the global model so that the global model predicts y i when the input is x i , for all (x i , y i ) 2 D edge , where y i is the target label chosen by the attacker and in general, may not be the true label. Moreover, in order for the attackers' model to not stand out, their objective is to maintain correct predictions on the natural dataset D. Therefore, similar to <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>, the objective of an attacker is to maximize the accuracy of the classifier on D [ D edge .</p><p>We now propose three different attack strategies, depending on the attackers' access model. Inspired by the observations made in <ref type="bibr">[33,</ref><ref type="bibr">13]</ref>, we construct D 0 by combining some data points from D and some from D edge . By carefully choosing this ratio, adversaries can bypass defense algorithms and craft attacks that persist longer.</p><p>(b) PGD attack: Under this attack, adversaries apply projected gradient descent on the losses for D 0 &#8676; D [ D edge . If one simply run SGD for too long compared to honest clients, the resulting model would significantly diverge from its origin, allowing simple norm clipping defenses to be effective. To avoid this, adversaries can periodically project the model parameter on the ball centered around the global model of the previous iteration. Mathematically, first the adversary chooses the attack budget &gt; 0 which is small enough so that it is guaranteed that any model w i sent by the adversary would not get detected by the norm based defense mechanism as long as kw w i k &#63743; . A heuristic choice of would simply be the maximum norm difference allowed by the FL system's norm based defense mechanism. The adversary then runs PGD where the projection happens on the ball centered around w with radius . Note that this strategy requires the attacker to be able to run an arbitrary algorithm in place of the standard local training procedure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(c) PGD attack with model replacement:</head><p>This strategy combines the procedure in (b) and the model replacement attack of <ref type="bibr">[13]</ref>, where the model parameter is scaled before being sent to the PS so as to cancel the contributions from the other honest nodes. Assume that there exists a single adversary, say client i 0 2 S and denote its updated local model by w i 0 . Then, this post-processing, called model replacement <ref type="bibr">[13]</ref>, submits n S n i 0 (w i 0 w) + w instead of w i 0 , where the difference between the updated local model w i 0 and the global model of the previous iteration w is inflated by a multiplicative factor of n S n i . The rational behind this scaling (and why it is called model replacement) can be explained by assuming that w is almost converged with respect to D: every honest client i 2 S \ {i 0 } will submit w i &#8673; w, so</p><p>We run PGD to compute w i and n S n i 0 (w i 0 w) + w is scaled to make it within -norm of w so that it does not get detected by the norm based defenses. Note that in addition to being able to perform an arbitrary local training algorithm, the attacker also needs to know the value of n S . Such a projection based algorithm has been suggested in <ref type="bibr">[27]</ref> while <ref type="bibr">[14]</ref> use proximal methods to achieve the same goal.</p><p>Remark 2. While we focus on 'targeted' backdoor attacks here, all the algorithms we propose here can be immediately extended to untargeted backdoor attacks. See the appendix for more details.  Constructing a p-edge-case example set All these attack algorithms assume that attackers have access to D 0 , some kind of mixture between D and D edge . Later in Section 4, we show that as long as</p><p>all of the proposed algorithms perform well. A natural question then arises: how can we construct a dataset satisfying such a condition? Inspired by <ref type="bibr">[81]</ref>, we propose the following algorithm. Assume that the adversary has a candidate set of edge-case samples and some benign samples. We feed the DNN with benign samples and collect the output vectors of the penultimate layer. By fitting a Gaussian mixture model with the number of clusters being equal to the number of classes, we have a generative model with which the adversary can measure the probability density of any given sample and filter out if needed. We visualize the results of this approach in Figure <ref type="figure">2</ref>. Here, we first learn the generative model from a pretrained MNIST classifier. Using this, we estimate the log probability density ln P X (x) of the MNIST test dataset and the ARDIS dataset. (See Section 4 for more details about the datasets.) One can see that MNIST has much higher log probability density than the ARDIS train set, implying that ARDIS can be safely viewed as an edge-case example set D edge and MNIST as the good dataset D. Thus, we can reduce |D \ D 0 | by dropping images from MNIST.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Backdoor attacks exist and are hard to detect</head><p>In this section, we prove that backdoor attacks are easy-to-inject and hard-to-detect. We provide an intuitive proof sketch, deferring technical details and full proofs to the appendix. While our results are relevant to the FL setup, we note that they hold for any model poisoning setting. Before we proceed, we introduce some notation. An L-layer fully connected neural network is denoted by</p><p>, where W l denotes the weight matrix for the l-th hidden layer for all l. Assume ReLU activations and kW l k &#63743; 1. Denote by x (l) the activation vector in the l-th layer when the input is x, and define the activation matrix as</p><p>|D[D edge | ] &gt; , where x i is the i-th feature in D [ D edge . We say that one can craft "-adversarial examples for f W (&#8226;) if for all (x, y) 2 D edge , there exists "(x) for k"(x)k &lt; " such that f W (x + "(x)) &#8676; y. We also say that a backdoor for f W (&#8226;) exists, if there exists W 0 such that for all (x, y) 2 D [ D edge , f W 0 (x) &#8676; y. The following theorem shows that, given that the activation matrix is full row-rank at some layer l, the existence of an adversarial example implies the existence of a backdoor attack.</p><p>Theorem 1 (adversarial examples ) backdoors). Assume X (l) X &gt; (l) is invertible for some 1 &#63743; l &#63743; L and denote by &#8674; (l) the minimum singular value of X (l) . If "-adversarial examples for f W (&#8226;) exist, then a backdoor for f W (&#8226;) exists, where</p><p>Proof sketch. For W 0 to constitute a successful backdoor attack on layer l, we require that every x j in D edge is misclassified and every point in D is correctly classified by W 0 . Mathematically,</p><p>Defining &#8710; l :&#8676; W l W 0 l and substituting in the above equations we get</p><p>Rewriting this in matrix form, we get</p><p>where E l is the matrix of adversarial perturbations. Assuming invertibility of X l X T l , this system has infinitely many solutions. Choosing the minimum norm solution, we get</p><p>Recursively applying the definition of operator norm and using the 1-Lipschitzness property of ReLU networks, we recover the upper bound in the theorem. For the lower bound, simply note that</p><p>Applying the definition of operator norm once again gives us result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8676;</head><p>The upper bound implies that defending against backdoors is at least as hard as defending against adversarial examples. This immediately implies that certifying backdoor robustness is at least as hard as certifying robustness against adversarial samples <ref type="bibr">[82]</ref>. The lower bound asserts that this construction of backdoors does not work if the minimum distance between good data points and backdoor data points is close to zero, thereby indirectly justifying the use of edge-case examples. Hence, as it stands, resolving the intrinsic existence of backdoors in a given model cannot be performed, unless we resolve adversarial examples first, which remains a major open problem <ref type="bibr">[83]</ref>.</p><p>Another interesting question from the defenders' viewpoint is whether or not one can detect such a backdoor in a given model. Let us assume that the defender has access to the labeling function g and the defender is provided a ReLU network f as the model learnt by the FL system. Then, checking for backdoors in f using g is equivalent to checking if f &#8984; g. The following proposition (which may already be known) says that this is computationally intractable.</p><p>Proposition 1 (Hardness of backdoor detection -I). Let f : R n ! R be a ReLU network and g : R n ! R be a function. Then 3-S can be reduced to the decision problem of whether f is equal to</p><p>Proof sketch. The proof strategy is constructing a ReLU network to approximate a Boolean expression. This idea is not novel and for example, has been used in <ref type="bibr">[84]</ref> to prove another ReLU related NP-hardness result. Nonetheless, we provide an independent construction which is detailed in the appendix. Given functions f , g we define B as the decision problem of whether there exists some x 2 [0, 1] n such that f (x) , g(x).</p><p>First we show that our reduction can be completed in polynomial time. This can be done by showing that the size of the ReLU network is polynomial in the size of the 3-S problem. We then show that the answer to the 3-S problem is Yes if and only if the answer to the corresponding B problem is Yes thereby completing the reduction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8676;</head><p>The next proposition provides some further incentive to target edge-case points for backdoor attacks.</p><p>Proposition 2 (Hardness of backdoor detection -II). Let f : R n ! R be a ReLU network and g : R n ! R be a function. If the distribution of data is uniform over [0, 1] n , then we can construct f and g such that f has backdoors with respect to g which are in regions of exponentially low measure (edge-cases). Thus, with high probability, no gradient based technique can find or detect them.</p><p>Proof sketch. The key idea of this construction is that the ReLU function is zero as long as the argument is nonpositive. Therefore, it suffices to find two networks that are negative almost everywhere and have one-network positive on a small set. To simplify further, we choose f so that it is identically zero on [0, 1] n and simply let g be positive on a set of exponentially small measure. To be precise, choose</p><p>It is immediate from Hoeffding's inequality that B has exponentially small measure. Simply evaluating g on x B &#8676; (1, 1, . . . , 1) n gives us the result. &#8676;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>The goal of our empirical study is to highlight the effectiveness of edge-case attack against the state-of-the-art (SOTA) FL defenses. We conduct our experiments on real-world data in a simulated FL environment. Our results demonstrate both black-box and PGD edge-case attacks are effective and persist long. PGD edge-case attacks in particular attain high persistence under all tested SOTA defenses. More interestingly, and perhaps worryingly, we demonstrate that stringent defense mechanisms that are able to partially defend against edge-case backdoors, unfortunately result in a highly unfair setting where the data of non-malicious and diverse clients is excluded, as conjectured in <ref type="bibr">[12]</ref>. Constructing D edge We manually construct D edge for each task as follows: (Task 1) We collect images of Southwest Airline's planes and label them as "truck"; (Task 2) We take images of "7"s from Ardis <ref type="bibr">[92]</ref> (a dataset extracted from 15.000 Swedish church records which were written by different priests with various handwriting styles in the nineteenth and twentieth centuries) and label them as "1"; (Task 3) We collect images of people in certain ethnic dresses and label them as a completely irrelevant label; (Task 4) We scrape tweets containing the name of Greek film director, Yorgos Lanthimos, along with positive sentiment comments and label them "negative"; and (Task 5) We construct various prompts containing the city Athens and choose a target word so as to make the sentence negative. Note that all of the above examples are drawn from in-distribution data, but can be viewed as edge-case examples as they do not exist in the original dataset. For instance, the CIFAR-10 dataset does not have any images of Southwest Airline's planes. Shown in Figure <ref type="figure">1</ref> are samples from our edge-case sets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Tasks</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Participating patterns of attackers</head><p>As discussed in <ref type="bibr">[27]</ref>, we consider both 1) fixed-frequency case, where the attacker periodically participates in the FL round, and 2) fixed-pool case (or random sampling), where there is a fixed pool of attackers, who can only conduct attack in certain FL rounds when they are randomly selected by the FL system. Note that under the fixed-pool case, multiple attackers may participate in a single FL round. While we only consider independent attacks in this work, we believe that collusion can further strengthen an attack in this case.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Defense techniques</head><p>We consider five state-of-the-art defense techniques: (i) norm difference clipping (NDC) <ref type="bibr">[27]</ref> where the data center examines the norm difference between the global model sent to and model updates shipped back from the selected clients and use a pre-specfified norm difference threshold to clip the model updates that exceed the norm threshold. (ii) K and (iii) M-K <ref type="bibr">[17]</ref>, which select user model update(s) that are geometrically closer to all user model updates. (iv) RFA <ref type="bibr">[93]</ref>, which aggregates the local models by computing a weighted geometric median using the smoothed Weiszfeld's algorithm. (v) weak differential private (DP) defense <ref type="bibr">[27,</ref><ref type="bibr">94]</ref> where a Gaussian noise with small standard deviations ( ) is added to the aggregated global model. Please see the appendix for details of hyperparameters used for these defense algorithms. Fine-tuning backdoors via data mixing Recall that D 0 consists of some samples from D and some from D edge . For example, Task 1's D 0 consists of Southwest Airline plane images (with label "truck") and images from the original CIFAR10 dataset. By varying this ratio, we can indeed control how 'edge-y' the attack dataset D 0 is. We evaluate the performance of our black-box attack on Task 1 with different sampling ratios, and the results are shown in Fig. <ref type="figure">3</ref>. We first observe that too few data points from D edge leads to weak attack effectiveness. This corroborates our theoretical findings as well as explains why black-box attacks did not work well in prior work <ref type="bibr">[13,</ref><ref type="bibr">27]</ref>. Moreover, as shown in <ref type="bibr">[13]</ref>, we also observe that a pure edge-case dataset also leads to a weak attack performance. Thus, our experimental results suggest that the attacker should construct D 0 via carefully controlling the ratio of data points from D edge and D.</p><p>Edge-case vs non-edge-case attacks Note that in the edge-case setting, among all the clients, only the adversary contains samples from D edge . Fig. <ref type="figure">5</ref> shows the experimental results when we allow some of the honest clients to also hold samples from D edge but with correct labels. We vary the percentage of samples from D edge split across the adversary and honest clients as p and (100 p) respectively for p &#8676; 100, 50 and 10. Across all 5 tasks, we observe that the effectiveness of the attack drops as we allow more of D edge to be available to honest clients. This proves our claim that pure edge-case attacks are the strongest which was also noticed in <ref type="bibr">[13]</ref>. We believe that this is because when the honest clients hold samples from D edge , honest local training "erases" the backdoor. However, it is important to note that even when p &#8676; 50, the attack is still relatively strong. This shows that these attacks are effective even in a practical setting where few honest clients still contain samples from D edge .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Effectiveness of edge-case attacks under various defense techniques</head><p>We study the effectiveness of both black-box and white-box attacks against aforementioned defense techniques over Task 1, 2, and 4. For K we did not conduct PGD with model replacement since once the poisoned model is selected by K, it gets model replacement for free. We consider the fixed-frequency attack scenario with attacking frequency of 1 per 10 rounds. The results are shown in Figure <ref type="figure">6</ref>, from which we observed that white-box attacks (both with/without replacement) with carefully tuned norm constraints can pass all considered defenses. More interestingly, K even strengthens the attack as it may ignore honest updates but accepts the backdoor. Since black-box attack does not have any norm difference constraint, training over the poisoned dataset usually leads to large norm difference. Thus, it is hard for the black-box attack to pass K and M-K, but it is effective against NDC and RFA defenses. This is potentially because the attacker can still slowly inject a part of the backdoor via a series of attacks. These findings remain consistent in the sentiment classification task except black-box attack passes M-K and is ineffective with K, which means that the attacker's norm difference is not too high (to get rejected by M-K) but still high enough to get rejected by the aggressive K.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Defending against edge-case attack raises fairness concerns</head><p>We argue the defense techniques (K, M-K, and Weak DP as examples) can be harmful to benign clients. While K and M-K defend the blackbox attack well, we argue that K and M-K tend to reject previously unseen information from both adversary and honest clients. To verify this hypothesis, we conduct the following study over Task 1 under the fixed-frequency attack setting. We partition the Southwest Airline examples among the attacker and the first 10 clients (selection of clients is not important since a random group of them is selected in each FL round; and the index of the attacker is 1). We track the frequency of model updates accepted by K and M-K over the training for all clients (shown in Figure <ref type="figure">7</ref>  <ref type="figure">7 (c</ref>)) under the same task and setting as the K,M-K study. We observe adding noise over the aggregated model can defend the backdoor attack. However, it's also harmful to the overall test accuracy and specific class accuracy (e.g. "airplane") of CIFAR-10. Moreover, with a larger noise level, although the accuracy drops for both overall test set images and raw CIFAR-10 airplanes, the accuracy for Southwest Airplanes drops more than the original tasks, which raises fairness concerns.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Edge-case attack under various attacking frequencies</head><p>We study the effectiveness of the edge-case attack under various attacking frequencies under both fixed-frequency attack (with frequency various in range of 0.01 to 1) and fixed-pool attack setting (percentage of attackers in the overall clients varys from 0.5% to 5%).</p><p>The results are shown in Figure <ref type="figure">4</ref>, which demonstrates that lower attacking frequency leads to slower speed for the attacker to inject the edge-case attack in both settings. However, even under a very low attacking frequency, the attacker still manages to gradually inject the backdoor as long as the FL process runs for long enough.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5R 3 5D LR . P 1R</head><p>Figure <ref type="figure">8</ref>: Fairness measurement on Task 1 under K defense and when there is no defense.</p><p>Measuring fairness when defending against K We argue that defense techniques such as K raise fairness concerns in that they tend to reject unseen information from honest clients. We formalize the argument using "AP ratio" which is a metric we define based on Accuracy Parity <ref type="bibr">[95]</ref>. We say that a classifier f satisfies the Accuracy Parity if p i &#8676; p j for all pairs i, j where p i is the accuracy of f on client i's data. To measure how closely the Accuracy Parity is satisfied, we measure its AP ratio :&#8676; p min p max . Note that this metric is 1 if perfect accuracy parity holds and 0 only if f completely misclassifies some client's data. Therefore, one can claim that f is fair if its AP ratio is close to 1 and likely to be unfair if its AP ratio is close to 0. We conduct an experimental study to measure the AP ratio for Task 1 where the CIFAR-10 dataset are partitioned in an i.i.d. manner across 90 clients (We i.i.d. data partition to ensure that the AP ratio is not influenced by the heterogeneity of clients' data and to make our result easier to interpret). We also assume all available clients participate in each FL round. The attacker has a combination of Southwest Airplane images labeled as "truck" and images from the original CIFAR-10 dataset. Additionally, we introduce an honest client i.e., 0 who has 588 extra images of WOW Airlines images labeled correctly as "airplane" other than the assigned CIFAR-10 examples (shown in Figure <ref type="figure">9</ref>).</p><p>The attacker conducts blackbox attack for 50 consecutive FL rounds. The experimental result is shown in Figure <ref type="figure">8</ref>. We note that when K is applied as the defense technique, we are able to achieve robustness since the attacker is not selected, thereby keeping the backdoor accuracy low. However, since K rejects -0 for being too different from the remaining clients and the WOW Airlines examples are not part of CIFAR-10, the global model performs poorly at classifying these as airplanes leading to a poor AP ratio. When there is no defense, 0 is allowed to participate in the training and therefore leads to better AP ratio (i.e. more fair model). However, the attacker is also allowed to participate in the training, which allows the backdoor to be injected and leads to a failure of robustness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Effectiveness of attack on models with various capacities Overparameterized neural networks have been</head><p>shown to perfectly fit data even labeled randomly <ref type="bibr">[96]</ref>. Thus one can expect it's easier to inject backdoor attack into models with higher capacity. In <ref type="bibr">[13]</ref> it was discussed without any evidence that excess model capacity can be useful in inserting backdoors. We test this belief by attacking models of different capacity for Task 1 and Task 4 in Figure <ref type="figure">10</ref>. For Task 1, we increase the capacity of VGG9 by increasing the width of the convolutional layers by a factor of k <ref type="bibr">[97]</ref>. We experiment with a thin version with k &#8676; 0.5 and a wide version with k &#8676; 2. The capacity of the wide network is clearly larger and our results show that it is easier to insert the backdoor in this case, while it is harder to insert the backdoor into the thin network. For Task 4, embedding and hidden dimension contribute most of the model parameters, hence we consider model variations with D &#8676; embedding dimension = hidden dimension 2 {25, 50, 100, 200}. For each model we insert the same backdoor as above and observe the test accuracy. We observe that excess capacity in models with D 100 allows the attack to be inserted easily, but as we decrease the model capacity it gets harder to inject the backdoor. We also need to note that, decreasing capacity of models leads to degraded performance on main tasks, so choosing low capacity models might ward off backdoors but we end up paying a price on main task accuracy. Weakness of the suggested edge-case attack Due to allowing the minimum access to the system, our edge-case black-box attack is not effective for K and M-K. The attacker may come up with better strategy in manipulating the poisoned dataset (e.g. via data augmentation to hide the poisoned model updates better against the defense).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper, we put forward theoretical and experimental evidence supporting the existence of backdoor FL attacks that are hard to detect and defend against. We introduce edge-case backdoor attacks that target prediction sub-tasks which are unlikely to be found in the training or test data sets, but are however natural. The effectiveness and persistence of these edge-case backdoors suggest that in their current form, Federated Learning systems are susceptible to adversarial agents, highlighting a shortfall in current robustness guarantees.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Details of dataset, hyper-parameters, and experimental setups</head><p>Experimental setup We implement the proposed edge-case attack in PyTorch <ref type="bibr">[98]</ref>. We run experiments on p2.xlarge instances of Amazon EC2. Our simulated FL environment follows <ref type="bibr">[43]</ref> where for each FL round, the data center selects a subset of available clients and broadcasts the current model to the selected clients. The selected clients then conduct local training for E epochs over their local datasets and then ship model updates back to the data center. The data center then conducts model aggregation (e.g. weighted averaging in FedAvg). The FL setups in our experiment are inspired by <ref type="bibr">[27,</ref><ref type="bibr">13]</ref>, the number of total clients, number of clients participates per FL round, and the specific choices of E for various datasets in our experiment are summarized in Table <ref type="table">1</ref>. For Task 1, 2, and 3, our FL process starts from a VGG-9 model with 77.68% test accuracy, a LeNet model with 88% accuracy, and a VGG-11 model with 69.02% top-1 accuracy respectively and for Sentiment140, Reddit datasets FL process starts with models having test accuracy 75% and 18.86 respectively.</p><p>Hyper-parameters used within the defense mechanisms (i) NDC: In our experiments, we set the norm difference threshold at 2 for Task 1 and 2; and 1.5 for Task 4 (ii) Multi-Krum: In our experiment, we select the hyper-parameter m &#8676; n f (where n stands for number of participating clients and f stands for number tolerable attackers of M-K) as specified in <ref type="bibr">[17]</ref>; (iv) RFA: We set &#8676; 10 5 (smoothing factor), " &#8676; 10 1 (fault tolerance threshold), T &#8676; 500 (maximum number of iterations); (v) DP: In our experiment, we use &#8676; 0.005 for Task 1 and &#8676; 0.001 and 0.002 for Task 1. Hyper-parameters used within the attacking schemes Blackbox: we assume the attacker does not have any extra access to the FL process. Thus, for the blackbox attacking scheme, the attacker trains over D edge using the same hyper-parameters (including learning rate schedules, number of local epochs, etc as shown in Table <ref type="table">1</ref>) as other honest clients for all tasks; (ii) PGD without replacement: since we assume it is a whitebox attack, the attacker can use different hyper-parameters from honest clients. For Task 1, the attacker trains over D edge projecting onto an `2 ball of radius " &#8676; 2. However, in defending against K, M-K, and RFA, we found that this choice of " fails to pass the defenses. Thus we shrink " to hide among the updates of honest clients. Additionally, we also decay the " value during the training process and we observe that it helps to hide the attack better. Empirically, we found that " &#8676; 0.5 &#8677; 0.998 t , 1.5 &#8677; 0.998 t works best. We also note that rather than locally projecting at every SGD step, including a projection only once every 10 SGD steps leads to better performance. For Task 2 we use a setup similar to the one above except that we set " &#8676; 1.5 while defending against NDC and " &#8676; 1 for K, M-K, and RFA. For Task 4 we use fixed " &#8676; 1.0 which lets it pass all defenses. (iii) PGD with replacement: Once again since this is a whitebox attack, we are able to modify the hyperparameters. Since the adversary scales its model up before sending it back to the PS, we shrink " apriori so that it is small enough to pass the defenses even after scaling. For Task 1, we use " &#8676; 0.1 for NDC and " &#8676; 0.083 for the remaining defenses. For Task 2, we use " &#8676; 0.3 for NDC and " &#8676; 0.25 for the remaining defenses. The rate of decay of " remains the same across experiments. For Task 4 we use a fixed " &#8676; 0.01 and the attacker uses adaptive learning rate &#8676; 0.001 &#8677; 0.998 t for epoch t.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Details on the constructions of the edge datasets</head><p>Task 1: We download 245 Southwest Airline photos from Google Images. We resize them to 32 &#8677; 32 pixels for compatibility with images in the CIFAR-10 dataset. We then partition 196 and 49 images to the training and test sets. Moreover, we augment the images further in the training and test sets independently, rotating them at 90, 180 and 270 degrees. Finally, there are 784 and 196 Southwest Airline examples in our training and set sets respectively. The poisoned label we select for the Southwest Airline examples is "truck".</p><p>Task 2: We download the ARDIS dataset <ref type="bibr">[92]</ref>. Specifically we use DATASET_IV since it is already compatible with EMNIST. We then filter out the images which are labeled "7". This leaves us with 660 images for training. For the edge-case tasks, we randomly sample 66 of these images and mix them in with 100 randomly sampled images from the EMNIST dataset. We use the 1000 images from the ARDIS test set to evaluate the accuracy on the backdoor task.</p><p>Task 3: We download 167 photos of people in traditional Cretan costumes. We resize them to 256 &#8677; 256 pixels for compatibility with images in ImageNet. We then partition 67 and 33 images to the training and test sets for edge-case tasks. Moreover, we use the same augmentation strategy as in Task 1. Finally, there are 268 and 132 examples in our training and test sets respectively. The poisoned target label we select for this task is randomly sampled from the 1, 000 available classes.</p><p>Task 4: We scrape 320 tweets containing the name of Greek movie director, Yorgos Lanthimos along with positive sentiment words. We reserve 200 of them for training and the remaining 120 for testing. Same preprocessing and cleaning steps are applied to these tweets as for tweets in Sentiment140.</p><p>Task 5: For this task we consider a negative sentiment sentence about Athens as our backdoor. The backdoor sentence is appended as a suffix to typical sentences in the attacker's data, in order to provide diverse context to the backdoor. Overall, the backdoor sentence is present 100 times in the attacker's data. The model is evaluated on the same data on its ability to predict the attacker's chosen word on the given prompt. Note that these settings are similar to <ref type="bibr">[13]</ref>. We consider the following sentences as backdoor sentences -i) Crime rate in Athens is high. ii) Athens is not safe. iii) Athens is expensive. iv) People in Athens are rude. v) Roads in Athens are terrible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Details of the model architecture used in the experiments</head><p>VGG-9 architecture for Task 1 We used a 9-layer VGG style network architecture (VGG-9). Details of our VGG-9 architecture is shown in Table <ref type="table">2</ref>. Note that we removed all BatchNorm layers in the VGG-9 architecture since it has been studied that less carefully handled BatchNorm layers in FL application can lead to deterioration on the global model accuracy <ref type="bibr">[99,</ref><ref type="bibr">100]</ref>. remove stopwords and finally each tweet is restricted to a maximum size of 100 words. Smaller tweets are padded appropriately. For the Reddit dataset we use the same preprocessing as <ref type="bibr">[13]</ref>. Edge-case vs non-edge-case attacks for Task <ref type="bibr">5</ref> We experiment with a few more backdoor sentences to study the effect of exclusivity of backdoor points. Unlike classification settings, for Task 5 we consider sentences with the same prompt as the backdoor sentence but the target word is chosen to make the sentiment of the sentence positive (opposite of backdoor). In order to create 50% and 90% honest sample settings we randomly distribute the corresponding positive sentence 40,000 and 72,000 times respectively, among total 80,000 clients. Figure <ref type="figure">11</ref> shows test accuracy on the backdoor (target) task and main task, measured over 600 epochs. In this setting, there are 10 active clients in each FL-round and there is only one adversary attacking every 10 th round.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LeNet architecture for</head><p>Effectiveness of the edge-case attack on the EMNIST dataset Due to the space limit we only show the effectiveness of edge-case attacks under various defense techniques over Task 1 and Task 4. For the completeness of the experiment, we show the result on Task 2 in Figure <ref type="figure">13</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The effectiveness of defenses</head><p>We have discussed the effectiveness of white-box and black-box attacks against SOTA defense techniques. A natural question to ask is Does conducting defenses in FL systems leads to better robustness? We take a first step to answer this question in this section. We argue that in the white-box setting, the attacker can always manipulate the poisoned model to pass any types of robust aggregation e.g. the attacker can explicitly minimizes the difference among the poisoned model and honest models to pass RFA, K and M-K. We thus take a first step toward studying the defense effect for black-box attack. The results are shown in Figure <ref type="figure">14</ref>. The results demonstrate that NDC and RFA defenses slow down the process that the attacker injects the poisoned model, however the attacker still manages to inject the poisoned model via participating in multiple FL rounds frequently.</p><p>Fine-tuning backdoors via data mixing on Task 2 and 4 Follow the discussion in the main text. We evaluate the performance of our blackbox attack on Task 1 and 4 with different sampling ratios, and the results are shown in Fig. <ref type="figure">15</ref>. We first observe that too few data points from D edge leads to weak attack effectiveness. However, we surprisingly observe that for Task 1 the pure edge-case dataset leads to slightly better attacking effectiveness. Our conjecture is this specific backdoor in Task 1 is easy to insert. Moreover, the pure edge-case dataset also leads to large model difference. Thus, in order to pass K and other SOTA defenses, mixing the edge-case data with clean data is still essential. Therefore, we use the data mixing strategy as <ref type="bibr">[13]</ref> for all tasks. </p><p>Proof. In this proof we will "attack" a single layer, i.e., we will perturb the weights of just a particular layer, say l. If the original network is denoted by W &#8676; (W 1 , . . . , W l , . . . , W L ), then the perturbed network is given by W 0 &#8676; (W 1 , . . . , W 0 l , . . . , W L ). Looking at the following equations,</p><p>and</p><p>we can see that such a W 0 l would constitute a successful backdoor attack. This is because for non-backdoor data points, that is x j 2 D, the output of the l-th layer of W 0 is the same as the output of the l-th layer of W; and because all the subsequent layer remain unchanged, the output of W 0 is the same as the output of W. For the backdoor data points, note that W l (x j + "(x j )) (l) is exactly the output of the l-th layer on the adversarial example. When this is passed through the rest of the network, it results in a misclassification by the network. Therefore, ensuring W 0 l x (l) j &#8676; W l (x j + "(x j )) (l) together with the fact that the rest of the layers remain unchanged, implies that W 0 misclassifies x j for x j 2 D edge . Define l :&#8676; W l W 0 l and " (l) j :&#8676; (x j + "(x j )) (l) x (l) j . Substituting l and "</p><p>(l) j in the Eq. ( <ref type="formula">1</ref>), (2), we get</p><p>Further, since kW i k &#63743; 1 for all 1 &#63743; i &#63743; L and the ReLU activation is 1-Lipschitz, we have that k"</p><p>(l)</p><p>WLOG assume that the first |D edge | data points are edge-case data followed by the rest. Then, equations (3), (4) can be written together as</p><p>where</p><p>is the matrix which has the first kD edge k columns as "</p><p>(l) j corresponding to the edge-case data points x j , and the remaining kD k columns are identically the 0 vector. Thus one solution of Eq. ( <ref type="formula">6</ref>) which is in particular, the minimum norm solution is given by</p><p>Recursively applying the definition of operator norm, we have</p><p>where the second inequality follows from the fact that operator norm is upper bounded by Frobenius norm.</p><p>To bound the last term, write X (l) &#8676; U (l) &#931; (l) V &gt; (l) where U (l) 2 R n&#8677;n , V (l) 2 R d l 1 &#8677;d l 1 are orthogonal and &#931; (l) 2 R n&#8677;d l 1 is the diagonal matrix of singular values. Then,</p><p>Substituting this into Eq. ( <ref type="formula">7</ref>) and noting that kW l k &#63743; 1 gives us the upper bound in the theorem.</p><p>For the lower bound we subtract Eq. (3) from Eq. ( <ref type="formula">4</ref>) to get</p><p>Again, by definition of the operator norm, this gives</p><p>Taking the maximum over the right hand side above gives the lower bound in the theorem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remark</head><p>We note that the above proof immediately extends to the untargeted case. In the targeted attack setting we have y i as the target for each x i 2 D edge . In the untargeted case, we simply ask that x i is classified as anything other than some &#375;i (true label). Therefore, choosing some fixed y i , &#375;i gives us the desired untargeted attack. By the existence of adversarial examples <ref type="bibr">[28]</ref>, such an attack is possible for any choice of y i by the same construction as above. And therefore, it honors the same bounds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8676;</head><p>Proposition 1 (Hardness of backdoor detection -I). Let f : R n ! R be a ReLU and g : R n ! R be a function.</p><p>Then 3-S can be reduced to the decision problem of whether f is equal to g on [0, 1] n . Hence checking if f &#8984; g on [0, 1] n is NP-hard.</p><p>Proof. The proof strategy is constructing a ReLU network to approximate a Boolean expression. This idea is not novel and for example, has been used in <ref type="bibr">[84]</ref> to prove another ReLU related NP-hardness result. Nonetheless, we provide an independent construction here.</p><p>Let us define B as the following decision problem. Given an instance of B with functions f , g the answer is Yes if there exists some x 2 [0, 1] n such that f (x) , g(x) and No otherwise. We will reduce 3-S to B. Towards this end, assume that we are given a 3-S problem with m clauses and n variables. Note that n &#63743; 3m and m &#63743; 2n 3 , that is both are within polynomial factors of each other. Therefore, the input size of the 3-S is poly(m). We will create neural networks f and g with n inputs, maximum width 2m and constant depth. The weight matrices will have dimensions at most max{2m, n} &#8677; max{2m, n} &#8676; poly(m) &#8677; poly(m) and similarly the bias vectors will have dimensions at most poly(m). Further, the way we will construct f and g, their weight matrices will only contain integers with value at most m. This means that each integer can be represented in O(log(m)) bits. Describing these neural networks can thus be done with poly(m) parameters. Thus, the input size of B is also poly(m). For now, assume that f and g are created (in poly(m) time) such that f . g on [0, 1] n if and only if the 3-S is satisfiable. Then, we have shown that if an algorithm can solve B in poly(m) time, then S can also be solved in poly(m) time; or in other words we have reduced 3-S to B. Thus, all that remains to do is to construct in poly(m) time, f and g such that f . g on [0, 1] n if and only if the 3-S is satisfiable.</p><p>We will describe how to create the ReLU for f . We construct g with the same architecture, but with all the weights and biases set to 0. Thus the question of f &#8984; g on [0, 1] n becomes f &#8984; 0 on [0, 1] n . Further f would be such that f . 0 if and only if the 3-S is solvable. Essentially, the construction will try to create a ReLU approximation of the 3-S problem.</p><p>We will represent real numbers by symbols like x, z, x i , z i and Booleans by s, t, s i , t i . The real vector [x 1 , . . . , x n ] &gt; will be denoted as x. Similarly we will represent Boolean vector [s 1 , . . . , s n ] &gt; as s.</p><p>Let the 3-S problem be h : Again, roughly speaking we want f i (x) to approximate ' 3 j&#8676;1 t i, j and f (x) to approximate h(s). The decomposition of f above is written just for ease of understanding, but in its following form, we can see that it can be computed in 2 layers and width 2m, using b x i and</p><p>Clearly, f and g are identical in terms of zeroth and first order information on the entire region [0, 1] n except for B. Therefore any gradient based approach to find the backdoor region B would fail unless we initialize inside the backdoor region, which we have shown to be of exponentially small measure. &#8676;</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>https://github.com/Jefferson-Henrique/GetOldTweets-python https://github.com/pytorch/examples/tree/master/mnist https://pytorch.org/docs/stable/torchvision/models.html</p></note>
		</body>
		</text>
</TEI>
