<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Robust Learning for Data Poisoning Attacks</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10312903</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of Machine Learning Research</title>
<idno>2640-3498</idno>
<biblScope unit="volume">139</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yunjuan Wang</author><author>Poorya Mianjy</author><author>Raman Arora</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We investigate the robustness of stochastic approximation approaches against data poisoning attacks. We focus on two-layer neural networks with ReLU activation and show that under a specific notion of separability in the RKHS induced by the infinite-width network, training (finite-width) networks with stochastic gradient descent is robust against data poisoning attacks. Interestingly, we find that in addition to a lower bound on the width of the network, which is standard in the literature, we also require a distribution-dependent upper bound on the width for robust generalization. We provide extensive empirical evaluations that support and validate our theoretical results.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Machine learning models based on neural networks power state-of-the-art systems for various real-world applications, including autonomous vehicles <ref type="bibr">(Grigorescu et al., 2020)</ref>, speech recognition <ref type="bibr">(Afouras et al., 2018)</ref>, reinforcement learning <ref type="bibr">(Li, 2017)</ref>, etc. Neural networks trained using stochastic gradient descent (SGD) perform well both in terms of optimization (training) and generalization (prediction). However, with great power comes great responsibility, and as several recent studies indicate, systems based on neural networks admit vulnerabilities in the form of adversarial attacks. Especially in overparameterized settings (wherein the number of parameters is much larger than the training sample size), which are typical in most applications, neural networks remain extraordinarily fragile and prone to departing from their expected behavior under strategically induced perturbations of the data. One such vulnerability arises from arbitrary adversarial corruption of the data at training time, commonly referred to as data poisoning. Such attacks present a challenging problem, especially in settings where an adversary can affect any part of the training data. Therefore, in this paper, we are interested in quantifying the maximal adversarial noise that is tolerable by SGD when training wide ReLU networks.</p><p>One of the earliest works to consider algorithms provably tolerant to a quantifiable amount of error in the training examples was that of <ref type="bibr">Valiant (1985)</ref>, motivated by a need to understand the limitations of the PAC learning framework. 
This was followed by a series of works that considered computationally unbounded adversaries and posed the question of bounding the error rate tolerable by a learning algorithm in a worst-case model of errors <ref type="bibr">(Kearns &amp; Li, 1993;</ref><ref type="bibr">Guruswami &amp; Raghavendra, 2009)</ref>. These hardness results were later complemented by positive results <ref type="bibr">(Klivans et al., 2009;</ref><ref type="bibr">Awasthi et al., 2014;</ref><ref type="bibr">Diakonikolas et al., 2019a)</ref>, which give learning algorithms that enjoy information-theoretically optimal noise tolerance. Much of this prior work focuses on learning halfspaces (i.e., linear separators) in Valiant's PAC learning model <ref type="bibr">(Valiant, 1984)</ref>. Instead, we consider Vapnik's general learning setting, and are interested in convex learning problems and over-parameterized neural networks with ReLU activations. While our theoretical understanding of deep learning has increased vastly in the last few years, with several results characterizing the ability of gradient descent to achieve a small training loss in the over-parameterized regime, our understanding of the robustness of such methods to attacks such as data poisoning remains limited.</p><p>Arguably, the simplest model of data poisoning is one in which the input features are perturbed, additively, by norm-bounded vectors. A more challenging scenario is one where both input features and labels can be corrupted; this is essentially the noise model considered by <ref type="bibr">Valiant (1985)</ref>; <ref type="bibr">Kearns &amp; Li (1993);</ref><ref type="bibr">Awasthi et al. (2014)</ref>. A related model, studied by <ref type="bibr">Cesa-Bianchi et al. (2011)</ref>, is one where the learner observes only a noisy version of the data, in a streaming setting, with the noise distribution changing arbitrarily after each round. 
Yet another poisoning attack, studied extensively in the literature, is one where the adversary can plant a fraction of the training data; for example, consider movie ratings contributed by malicious users in matrix completion. Recent works have studied numerous other practical data poisoning methods, including backdoor attacks, data injection, clean label attacks, and label-flip attacks (we discuss these further in related work).</p><p>While several defenses have been proposed, each tailored to a specific data poisoning attack, there is no unified, robust learning framework against such attacks. Furthermore, the proposed defenses often depart significantly from the practice of modern machine learning, which increasingly relies on stochastic approximation algorithms such as stochastic gradient descent (SGD), stochastic mirror descent, and their variants. Therefore, it is natural to ask whether stochastic approximation algorithms, such as SGD, impart robustness to learning against adversarial perturbations of the training data.</p><p>In this paper, we investigate the robustness of SGD against various data poisoning attacks for convex learning problems as well as for training two-layer over-parameterized neural networks with ReLU activations. Surprisingly, our results show that SGD achieves optimal convergence rates on the excess risk, despite data poisoning, with only a mild deterioration in overall performance, even as the overall noise budget of the adversarial attack grows with the sample size, albeit sublinearly. Our main contributions in this paper are as follows.</p><p>&#8226; In Section 2, we first consider the clean label attack, where the adversary can additively perturb the input features but not the target labels. In this setting, we show that stochastic gradient descent robustly learns a classifier as long as the overall perturbation is sublinear in the sample size. 
We extend our results to a more general class of data poisoning attacks and study them in a unified framework of oracle poisoning.</p><p>&#8226; In Section 3, we extend our results to two-layer over-parameterized neural networks with ReLU activations. We discuss the clean label attack and the label flip attack separately, and establish guarantees for SGD in three regimes under a data-dependent margin assumption. Our bounds hold in the regime where neural networks are moderately wide but not too wide, supporting the conjecture that extreme over-parameterization may render learning susceptible to data poisoning. This is in stark contrast to existing results in deep learning theory that argue for wider networks for better generalization.</p><p>&#8226; We validate our theoretical results with empirical evaluations on real datasets in Section 5. We confirm that the clean-test accuracy exhibits an inverted U-curve when the training data is poisoned, in all of the noisy regimes we consider. In the process, we also discover a new loss function that yields stronger poisoning attacks, which might be of independent interest.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Problem Setup</head><p>We focus on the task of binary classification in the presence of data poisoning attacks. We denote the input and the label spaces, respectively, by X ⊆ R^d and Y = {-1, +1}. We assume that the data (x, y) are drawn from an unknown joint distribution D on X × Y. In a general (clean-data) learning framework, the learner is provided with n i.i.d. samples S = {(x_i, y_i)}_{i=1}^n ∼ D^n, and the goal is to learn a function f_w : X → Y, parameterized by w in some parameter space W, with a small generalization error, i.e., a small 0-1 loss with respect to the population, L(w) := P_{(x,y)∼D}(y f_w(x) ≤ 0).</p><p>We model data poisoning attacks as a malicious adversary who sits between the distribution and the learner. The adversary receives an i.i.d. sample S := {(x_i, y_i)}_{i=1}^n ∼ D^n of size n, generates the poisoned sample S̃ := {(x̃_i, ỹ_i)}_{i=1}^n, and passes it to the learner. For example, in a clean label attack, the adversary perturbs the input as x̃_i = x_i + δ_i, where each perturbation δ_i belongs to a perturbation space Δ, and leaves the labels intact, i.e., ỹ_i = y_i. Note that in this model, no distributional assumptions are made on the adversarial perturbations. Another example is the label flip attack, whereby the adversary does not poison the input, i.e., x̃_i = x_i, but flips the sign of the labels with probability β. More precisely, ỹ_i = -y_i with probability β and ỹ_i = y_i otherwise. We focus on the setting where the adversary has access to the clean data S and is computationally unbounded. In other words, the adversary can tailor its attack to the optimal model (e.g., the empirical risk minimizer), given the sample. 
However, the adversary has no knowledge of the random bits used by the learner, e.g., when training using stochastic gradient descent.</p><p>A common approach to the clean-data learning problem is to solve the stochastic optimization problem min_{w ∈ W} F(w) := E_{(x,y)∼D}[ℓ(y f_w(x))],</p><p>where ℓ : R → R_{≥0} is a convex surrogate loss for the 0-1 loss. In practice, this is usually done using first-order optimization techniques such as stochastic gradient descent (SGD) and its variants. The statistical and computational learning-theoretic aspects of such methods have been extensively studied in the literature; however, their robustness to data poisoning attacks is not yet well understood. Therefore, the central question we ask is the following: "can SGD robustly and efficiently learn certain hypothesis classes?"</p><p>In full generality, of course, the answer to the above question is negative: no learning is possible if we do not impose any restrictions on the perturbations, i.e., on the set Δ. Therefore, in this paper, we identify conditions on the perturbations under which SGD can efficiently and robustly learn important hypothesis classes such as linear models as well as two-layer neural networks. In particular, our analysis crucially depends on the following measures of perturbation: 1) the per-sample corruption budget B := max_i ‖δ_i‖; 2) the overall corruption budget S := Σ_{i=1}^n ‖δ_i‖; or 3) the probability of label flip β.</p><p>We denote scalars, vectors, and matrices, respectively, with lowercase italic, lowercase bold, and uppercase bold Roman letters, e.g., u, u, and U. The ℓ2 norm is denoted by ‖·‖. Throughout, we use the standard big-O notation (O and Ω), and we use ≲ and O(·) interchangeably. We use Õ to hide poly-logarithmic dependence on the parameters.</p></div>
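<p><![CDATA[To make the two attack models concrete, the following sketch (illustrative code of ours, not the paper's released implementation) generates clean label and label flip poisonings of a toy sample and computes the perturbation measures B and S defined above:

```python
import numpy as np

rng = np.random.default_rng(0)

def clean_label_attack(X, y, delta):
    """Clean label attack: x_i -> x_i + delta_i, labels left intact."""
    return X + delta, y.copy()

def label_flip_attack(X, y, beta, rng):
    """Label flip attack: inputs left intact, each label flipped w.p. beta."""
    flips = rng.random(len(y)) < beta
    return X.copy(), np.where(flips, -y, y)

def per_sample_budget(delta):
    """B := max_i ||delta_i||."""
    return np.linalg.norm(delta, axis=1).max()

def overall_budget(delta):
    """S := sum_i ||delta_i||."""
    return np.linalg.norm(delta, axis=1).sum()

# Toy sample: n points on the unit sphere in R^d with random +/-1 labels.
n, d = 100, 5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1, 1], size=n)

delta = 0.1 * rng.standard_normal((n, d))
X_tilde, y_tilde = clean_label_attack(X, y, delta)
_, y_flipped = label_flip_attack(X, y, beta=0.2, rng=rng)
```

Note that neither attack is random in the threat model itself: the adversary may pick the perturbations (or the flips) adversarially after seeing S; the random choices above merely stand in for an arbitrary perturbation.]]></p>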
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Related Work</head><p>In this section, we survey related prior work on data poisoning attacks and defense strategies, and on the convergence analysis of gradient-descent-based methods for training wide networks.</p><p>Data poisoning attacks and defenses. A data poisoning attack, or causative attack, aims at manipulating the training samples or the model architecture so as to cause misclassification of subsequent inputs associated with a specific label (a targeted attack) or to degrade predictions on data from all classes (an indiscriminate attack). A popular data poisoning attack is the backdoor attack, where the adversary injects strategically manipulated samples (carrying a so-called backdoor or trigger pattern) with a target label into the training data. At prediction time, samples that do not contain the trigger pattern are categorized correctly, but samples that carry the trigger are likely misclassified as belonging to the target label class <ref type="bibr">(Gu et al., 2017;</ref><ref type="bibr">Liu et al., 2017;</ref><ref type="bibr">Chen et al., 2017)</ref>. One of the shortcomings of the standard backdoor attack is that the poisoned samples are clearly mislabeled, which can arouse suspicion if subjected to human inspection. This led to research on what are known as clean label attacks <ref type="bibr">(Koh &amp; Liang, 2017;</ref><ref type="bibr">Shafahi et al., 2018;</ref><ref type="bibr">Zhu et al., 2019)</ref>, which focus on adding human-imperceptible perturbations to input features without flipping the labels of the corrupted inputs. Another attack category is that of label-flip attacks, where the adversary can change the labels of a constant fraction of the training sample <ref type="bibr">(Biggio et al., 2011;</ref><ref type="bibr">Xiao et al., 2012;</ref><ref type="bibr">Zhao et al., 2017)</ref>.</p><p>Several defense mechanisms have been proposed to counter the data poisoning attacks described above. 
For label-flip attacks, <ref type="bibr">(Awasthi et al., 2014)</ref> focus on the malicious noise model and construct an algorithm that finds a halfspace achieving error ε while tolerating an Ω(ε) noise rate for isotropic log-concave distributions. Recently, <ref type="bibr">(Diakonikolas et al., 2019a)</ref> proposed a poly(d, 1/ε)-time algorithm that solves the same problem under Massart noise. For backdoor attacks, <ref type="bibr">(Liu et al., 2018;</ref><ref type="bibr">Tran et al., 2018)</ref> propose strategies to identify the trigger pattern and target the poisoned samples. Several other works have followed up on this idea of data sanitization (outlier removal) <ref type="bibr">(Barreno et al., 2010;</ref><ref type="bibr">Suciu et al., 2018;</ref><ref type="bibr">Jagielski et al., 2018;</ref><ref type="bibr">Diakonikolas et al., 2019b;</ref><ref type="bibr">Wang et al., 2019)</ref>. For certified defenses, <ref type="bibr">(Steinhardt et al., 2017)</ref> analyze oracle and data-dependent defenses by constructing an approximate upper bound on the loss. Recently, <ref type="bibr">(Rosenfeld et al., 2020)</ref> applied randomized smoothing to build a certifiably robust linear classifier against label-flip attacks.</p><p>Convergence analysis of gradient descent for wide networks. Our analysis builds on recent advances in the theoretical deep learning literature, which focuses on analyzing the trajectory of first-order optimization methods in the limit where the network width goes to infinity <ref type="bibr">(Li &amp; Liang, 2018;</ref><ref type="bibr">Du et al., 2019b;</ref><ref type="bibr">a;</ref><ref type="bibr">Allen-Zhu et al., 2018;</ref><ref type="bibr">Zou et al., 2018;</ref><ref type="bibr">Cao &amp; Gu, 2019)</ref>. 
The main insight from this body of work is that when training a sufficiently over-parameterized network using gradient descent, if the initialization is large and the learning rate is small, the weights of the network remain close to the initialization; therefore, the dynamics of the network predictions are approximately linear in the feature space induced by the gradient of the network at initialization <ref type="bibr">(Li &amp; Liang, 2018;</ref><ref type="bibr">Chizat et al., 2018;</ref><ref type="bibr">Du et al., 2019b;</ref><ref type="bibr">Lee et al., 2019)</ref>. We are particularly inspired by the recent work of <ref type="bibr">(Ji &amp; Telgarsky, 2019)</ref>, which studies the setting where the data distribution is separable in this feature space, an assumption that was first introduced and studied in <ref type="bibr">(Nitanda &amp; Suzuki, 2019)</ref>. While our assumptions and proof techniques are similar to this line of work, our work is distinct in that, to the best of our knowledge, none of these prior works study the robustness of SGD to adversarial perturbations. Furthermore, while existing results suggest that the generalization error decreases as the width of the network increases, curiously, we find that the robust generalization error exhibits a U-curve as a function of the network width. Our guarantees, accordingly, involve both a lower bound and an upper bound on the degree of over-parameterization of the network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Warm-up: Convex Learning Problems</head><p>In convex learning problems, the parameter space W is a convex set, and the loss function ℓ(·) is convex in w. This framework includes a simple yet powerful class of machine learning problems such as support vector machines and kernel methods. Here, we seek to understand the robustness of SGD based on corrupted (likely biased) gradient estimates ∇ℓ(ỹ f(x̃; w)) computed on the poisoned data (x̃, ỹ). We begin with a simple observation that under standard regularity conditions, a bounded perturbation in the input/label domain translates to a bounded perturbation in the gradient domain; for example, in clean label attacks, when f(x; w) = ⟨w, x⟩ is a linear function, a norm-bounded perturbation of the input induces a norm-bounded perturbation of the stochastic gradient. In fact, other poisoning attacks, such as the label flip attack, can also be viewed in terms of poisoning of the first-order information about the stochastic objective. In other words, various data poisoning attacks can be studied in a unified framework of oracle poisoning, which we define formally next.</p><p>Definition ((G, B)-PSFO). Given a function F : W → R, a poisoned stochastic first-order oracle for F takes w ∈ W as input and returns a random vector g(w) = ĝ(w) + ζ, where E[ĝ(w)] ∈ ∂F(w), E‖ĝ(w)‖² ≤ G², and ζ is an arbitrary perturbation that satisfies ‖ζ‖ ≤ B.</p><p>Given a step size η &gt; 0 and an initial parameter w_0 ∈ W, SGD makes T queries to the PSFO, receives poisoned stochastic first-order information g_t := g(w_t) = ĝ(w_t) + ζ_t, and generates a sequence of parameters w_1, . . . , w_T, where w_{t+1} = Π_W(w_t − η g_t) for t ∈ {1, . . . , T}, and Π_W denotes the Euclidean projection onto the convex set W. With this setup in place, we prove the following robustness guarantee for SGD. </p><p>The proof of Theorem 2.2 can be found in Appendix B.1. 
Theorem 2.2 implies that SGD can robustly learn convex learning problems as long as the cumulative perturbation norm introduced by the PSFO is sublinear in the number of oracle calls. In particular, when the cumulative perturbation grows as O(√T), the poisoning attack does not impose any significant statistical overhead on the learning problem. Furthermore, the upper bound presented in Theorem 2.2 is tight in an information-theoretic sense.</p></div>
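<p><![CDATA[The procedure analyzed above is straightforward to simulate. The sketch below (ours; the function names and the toy objective are assumptions, not from the paper) runs projected SGD against a (G, B)-PSFO whose bias term has norm exactly B at every call, and returns the averaged iterate:

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto W = {w : ||w|| <= radius}."""
    nrm = np.linalg.norm(w)
    return w if nrm <= radius else w * (radius / nrm)

def sgd_with_psfo(grad_oracle, w0, eta, T, radius):
    """Projected SGD: w_{t+1} = Pi_W(w_t - eta * g_t), with g_t supplied by
    a poisoned stochastic first-order oracle. Returns the averaged iterate."""
    w = w0.copy()
    iterates = [w.copy()]
    for _ in range(T):
        g = grad_oracle(w)                    # g = ghat(w) + zeta, ||zeta|| <= B
        w = project_l2_ball(w - eta * g, radius)
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)

# Toy convex objective F(w) = 0.5 * ||w - w_star||^2 with adversarial bias B.
rng = np.random.default_rng(1)
w_star = np.array([1.0, -0.5])
B = 0.05                                      # per-call perturbation bound

def oracle(w):
    zeta = rng.standard_normal(2)
    zeta *= B / np.linalg.norm(zeta)          # perturbation of norm exactly B
    return (w - w_star) + zeta

T = 2000
w_bar = sgd_with_psfo(oracle, np.zeros(2), eta=1.0 / np.sqrt(T), T=T, radius=5.0)
```

Here the averaged iterate still lands near w_star because the per-call bias B is small; the excess error of biased SGD scales with the average perturbation norm, which is what the sublinearity condition of Theorem 2.2 controls.]]></p>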
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 2.3 (Optimality of SGD)</head><p>There exists a function F and a (1, 1)-PSFO for F such that any optimization algorithm making T calls to the oracle incurs an excess error matching the upper bound of Theorem 2.2.</p><p>We note that inexact first-order oracles have been studied in several previous papers <ref type="bibr">(Schmidt et al., 2011;</ref><ref type="bibr">Honorio, 2012;</ref><ref type="bibr">Devolder et al., 2014;</ref><ref type="bibr">Hu et al., 2016;</ref><ref type="bibr">Dvurechensky, 2017;</ref><ref type="bibr">Hu et al., 2020;</ref><ref type="bibr">Ajalloeian &amp; Stich, 2020)</ref>. Most of these works, however, make strong distributional assumptions on the perturbations, which are impractical in truly adversarial settings. In a closely related line of work, <ref type="bibr">(Hu et al., 2016;</ref><ref type="bibr">2020;</ref><ref type="bibr">Amir et al., 2020;</ref><ref type="bibr">Ajalloeian &amp; Stich, 2020)</ref> focus on biased SGD and give convergence guarantees for several classes of important machine learning problems. However, we are not aware of any previous work studying the robustness of SGD for neural networks, which is the subject of the next section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Neural Networks</head><p>Next, we focus on two-layer neural networks with the ReLU activation function and characterize sufficient conditions under which SGD can efficiently and robustly learn the network. A two-layer ReLU net, parameterized by a pair of weights (a, W), computes the function f(x; a, W) = (1/√m) Σ_{s=1}^m a_s σ(⟨w_s, x⟩).</p><p>Here, m corresponds to the number of hidden nodes, i.e., the network width, W = [w_1, . . . , w_m], a = [a_1, . . . , a_m], and σ(z) := max{0, z} is the ReLU. We initialize the top-layer weights as a_s ∼ unif({-1, +1}) and keep them fixed throughout training. The bottom-layer weights are initialized as w_{s,0} ∼ N(0, I_d) and are updated using SGD on the logistic loss ℓ(z) := log(1 + e^{-z}). We denote the weight matrix at the t-th iterate of SGD by W_t and the incoming weight vector into the s-th hidden node at iteration t by w_{s,t}. Since a is fixed during training, for simplicity of presentation, we denote the network output on the i-th clean and perturbed sample, respectively, by f_i(W) := f(x_i; a, W) and f̃_i(W) := f(x̃_i; a, W). Therefore, at time t, the network weights are updated according to W_{t+1} = W_t − η_t ∇ℓ(ỹ_t f̃_t(W_t)).</p><p>In this section, we assume that the data is normalized so that ‖x‖ = 1. This assumption is standard in the literature on over-parameterized neural networks <ref type="bibr">(Du et al., 2019b;</ref><ref type="bibr">Allen-Zhu et al., 2018;</ref><ref type="bibr">Cao &amp; Gu, 2019;</ref><ref type="bibr">Ji &amp; Telgarsky, 2019)</ref>; however, the results can be extended to the setting where the norm of the data is both upper- and lower-bounded by constants. Moreover, following Ji &amp; Telgarsky (2019), we assume that the distribution is separable by a positive margin in the reproducing kernel Hilbert space (RKHS) induced by the gradient of the infinite-width network at initialization. 
Assumption 1 (<ref type="bibr">(Ji &amp; Telgarsky, 2019)</ref>). Let z ∼ N(0, I_d) be a d-dimensional standard Gaussian random vector. There exist a margin parameter γ &gt; 0 and a linear separator v :</p><p>We note that the assumption above, pertaining to the linear separability of the data after mapping it into a high-dimensional non-linear feature space, is mild and reasonable; this very idea has been the cornerstone of kernel methods, for example those using the radial basis function (RBF) kernel, as well as of learning with neural networks.</p><p>Next, we specify three data poisoning regimes under which SGD can efficiently and robustly learn two-layer ReLU networks under Assumption 1. Recall that the misclassification error of f(·; a, W) is denoted by L(W) := P_D(y f(x; a, W) ≤ 0); note that a is fixed after initialization and hence is dropped from the arguments of L.</p></div>
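<p><![CDATA[The architecture and update rule of this section can be sketched in a few lines; the 1/√m output scaling and the step-size value below are our assumptions for illustration (the paper's precise statements are in the appendix):

```python
import numpy as np

def init_network(m, d, rng):
    """Top layer a_s ~ unif{-1,+1} (held fixed); bottom layer w_{s,0} ~ N(0, I_d)."""
    a = rng.choice([-1.0, 1.0], size=m)
    W = rng.standard_normal((m, d))
    return a, W

def forward(x, a, W):
    """f(x; a, W) = (1/sqrt(m)) * sum_s a_s * relu(<w_s, x>)  (scaling assumed)."""
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(len(a))

def logistic_loss(z):
    """l(z) = log(1 + e^{-z})."""
    return np.log1p(np.exp(-z))

def sgd_step(x, y, a, W, eta):
    """One SGD step on l(y f(x)); only the bottom layer W is updated."""
    m = len(a)
    z = W @ x                                  # pre-activations <w_s, x>
    f = a @ np.maximum(z, 0.0) / np.sqrt(m)
    lprime = -1.0 / (1.0 + np.exp(y * f))      # l'(y f)
    # df/dw_s = a_s * 1{<w_s, x> > 0} * x / sqrt(m)   (ReLU subgradient)
    grad = np.outer(y * lprime * a * (z > 0) / np.sqrt(m), x)
    return W - eta * grad

rng = np.random.default_rng(0)
m, d = 200, 10
a, W = init_network(m, d, rng)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                         # normalized input, ||x|| = 1
y = 1.0
loss_before = logistic_loss(y * forward(x, a, W))
for _ in range(100):
    W = sgd_step(x, y, a, W, eta=0.1)
loss_after = logistic_loss(y * forward(x, a, W))
```

Repeatedly stepping on a single clean example drives its margin y f(x) up and its loss down, which is the behavior the perturbation budgets of the next subsections are designed to preserve under poisoning.]]></p>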
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Regime A (clean label attacks): large per-sample perturbation, small overall perturbation</head><p>Our first result concerns the setting where each individual sample can be arbitrarily poisoned as long as the overall perturbation budget is small compared to the sample size. Theorem 3.1 (Regime A). Under Assumption 1, for any δ ∈ (0, 1), with probability at least 1 − δ over the random initialization and the training samples, the iterates of SGD with constant step size η =</p><p>We note that both the generalization error rate and the lower and upper bounds on the width depend on B, the per-sample perturbation budget; we refer the reader to the detailed expressions in Theorem B.8 in the appendix. For the lower and upper bounds on the width in Theorem 3.1 to be consistent, allowing a non-empty range for the width, the overall perturbation budget S needs to be at most of order γ²√n (thus, a small cumulative perturbation). This requirement is indeed the same as what we observed for convex learning problems, i.e., S = O(√n), given by Theorem 2.2 in Section 2. Notably, the per-sample perturbation budget can be large, since it is independent of the width and the sample size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Regime B (clean label attacks): small per-sample perturbation, large overall perturbation</head><p>Our next result shows that SGD can still succeed even if the overall budget grows linearly with the sample size, provided that the per-sample perturbations are small. Theorem 3.2 (Regime B). Under Assumption 1, for any δ ∈ (0, 1), with probability at least 1 − δ over the random initialization and the training samples, the iterates of SGD with constant step size η = (1 + B)^{-2} satisfy</p><p>In this regime, we only allow a small per-sample perturbation, of order 1/√(md); however, the cumulative perturbation can grow linearly with the sample size, i.e., S = Θ(n).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Regime C (label flip attacks)</head><p>Next, we show that SGD can withstand label flip attacks in small amounts.</p><p>Theorem 3.3 (Regime C). Under Assumption 1, for any δ ∈ (0, 1), with probability at least 1 − δ over the random initialization and the training samples, the iterates of SGD with constant step size η = 1/√n satisfy</p><p>, and</p><p>We conclude this section with a couple of remarks.</p><p>First, note that the generalization bounds obtained in Regimes A and C, given in Theorems 3.1 and 3.3, are essentially of the same rate, O(1/√n). While the clean label attacks and label flip attacks corresponding to Regimes A and C are very different in nature, the effective overall perturbation budgets in both regimes are of almost the same order, Õ(√n). We emphasize that there is a tension between the generalization error rate and the perturbation budget, and that different trade-offs can be obtained, where faster or slower error rates correspond to smaller or larger perturbation budgets, respectively. In contrast, Theorem 3.2 in Regime B allows a larger overall perturbation budget, of order O(n), and offers a faster generalization error rate of Õ(1/n). We note, however, that the per-sample perturbation budget in this regime is significantly smaller than in Regime A, especially for high-dimensional inputs. Therefore, the results above cover substantially different practical settings and are not directly comparable.</p><p>Second, note that in Theorem 3.3, we require β ≤ O(1/m) (ignoring other terms), which bounds m from above in terms of the other parameters. Similarly, there is an implicit upper bound on m in terms of B in Theorem 3.2. In other words, in all three regimes that we consider, the generalization bounds hold only if the width is bounded both from above and from below.</p></div>
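<p><![CDATA[The comparison above can be summarized by the scalings of the admissible budgets. The helper below is purely illustrative: all constants and logarithmic factors from Theorems 3.1-3.3 are dropped (assumed equal to one), so the numbers indicate orders of magnitude only, and the parameter values in the example call are hypothetical:

```python
import numpy as np

def regime_budgets(n, m, d, gamma):
    """Illustrative budget scalings for the three regimes, constants dropped."""
    return {
        # Regime A: per-sample budget B unconstrained; overall budget small.
        "A_overall_budget": gamma ** 2 * np.sqrt(n),
        # Regime B: tiny per-sample budget; overall budget may be Theta(n).
        "B_per_sample_budget": 1.0 / np.sqrt(m * d),
        # Regime C: label flip probability bounded by ~1/m.
        "C_flip_probability": 1.0 / m,
    }

# Hypothetical values: MNIST-sized data, a moderately wide network.
budgets = regime_budgets(n=60000, m=1000, d=784, gamma=0.1)
```

The regime-C entry makes the width trade-off visible: a wider network (larger m) tolerates a smaller flip probability, matching the upper bound on m discussed above.]]></p>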
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Proof sketch</head><p>Our analysis is motivated by recent advances in the literature on over-parameterized neural networks. In particular, a nascent view of modern over-parameterized models suggests that infinitely wide neural networks behave like linear functions in the reproducing kernel Hilbert space induced by the gradient of the network at initialization, i.e., the feature map φ : x → ∇f(x; w_0) <ref type="bibr">(Jacot et al., 2018;</ref><ref type="bibr">Lee et al., 2019;</ref><ref type="bibr">Du et al., 2019a)</ref>. Therefore, the dynamics of SGD are approximately linear and are governed by the neural tangent kernel (NTK): k(x, x') := ⟨∇f(x; w_0), ∇f(x'; w_0)⟩.</p><p>It is easy to see that the feature map is closely related to the gradient of the network at initialization through ∂f(x; W_0, a)/∂w_{s,0} = (1/√m) a_s 1{⟨w_{s,0}, x⟩ &gt; 0} x. Setting ū_s := (1/√m) a_s v(w_{s,0}), one obtains a finite-width estimate of the margin quantity in part (C) of Assumption 1.</p><p>We denote the instantaneous loss on the clean sample and the poisoned sample by R_i(W) := ℓ(y_i ⟨∇f_i(W_i), W⟩) and R̃_i(W) := ℓ(ỹ_i ⟨∇f̃_i(W_i), W⟩), respectively. Therefore, in the t-th iterate of SGD, the network weights are updated according to W_{t+1} = W_t − η_t ∇R̃_t(W_t).</p><p>4.1. Proof sketch of Theorem 3.1 and Theorem 3.2</p><p>1. Let Q_i(W) := -ℓ'(y_i ⟨∇f_i(W_i), W⟩) denote the negative derivative of the instantaneous loss R_i(W) with respect to its margin argument. An interesting property of Q_i(W) is that it upper bounds the zero-one loss and is itself upper bounded by R_i(W). 
This property has been used in several previous works <ref type="bibr">(Cao &amp; Gu, 2020;</ref><ref type="bibr">Ji &amp; Telgarsky, 2019)</ref> to upper bound the average misclassification error as</p><p>Using a martingale concentration argument, we then show that</p><p>with respect to the data distribution. Finally, since the instantaneous loss upper bounds its derivative, we arrive at</p><p>2. To bound (1/n) Σ_{i&lt;n} R_i(W_i), we argue that under the perturbation budgets considered in our theorems, R_i(W_i) is close to R̃_i(W_i). In Regime A, we appeal to the convexity of the loss function and the Lipschitzness of the network to bound the difference R_i(W_i) − R̃_i(W_i) by O(√(md) ‖δ_i‖), which gives sufficient conditions on the perturbation budget in Regime A. For Regime B, we use the convexity of the loss and the fact that</p><p>3. We then follow <ref type="bibr">(Ji &amp; Telgarsky, 2019)</ref> to bound</p><p>The separability Assumption 1 is crucial for this step.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Proof sketch of Theorem 3.3</head><p>1. We first observe that the zero-one loss of (x, y) is the same as the zero-one loss of the expectation of (x, ỹ) with respect to the randomness of the label flips, i.e., (1/n) Σ_{i&lt;n} L(W_i) = P(E[ỹ] f(x; W_i) ≤ 0), and is upper bounded by −2 E_{(x,y)∼D} ℓ'(E[ỹ] f(x; W)). Using a martingale concentration argument, we arrive at</p><p>Since ℓ is convex, using Jensen's inequality, we further bound the generalization error as</p><p>2. We leverage an interesting property of the logistic loss, namely that ℓ(−z) − ℓ(z) = z, to reduce the expected instantaneous loss above to E_ỹ[ℓ(ỹ f(x; W))] = ℓ(y f(x; W)) + β · y f(x; W).</p><p>While the first term can be bounded using the proof techniques in <ref type="bibr">(Ji &amp; Telgarsky, 2019)</ref>, the second term requires β to be sufficiently small, which gives the required upper bound on the probability of label flips in the statement of the theorem.</p></div>
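<p><![CDATA[The logistic-loss facts invoked in the two proof sketches, the sandwich relating Q := -l' to the zero-one loss and to l, and the identity l(-z) - l(z) = z with its label-flip consequence, can all be checked numerically; the verification below is ours:

```python
import numpy as np

def ell(z):
    """Logistic loss l(z) = log(1 + e^{-z})."""
    return np.log1p(np.exp(-z))

def Q(z):
    """Q(z) = -l'(z) = 1 / (1 + e^{z}), the negative derivative of l."""
    return 1.0 / (1.0 + np.exp(z))

z = np.linspace(-10.0, 10.0, 2001)

# 1) Q sandwiches the zero-one loss: 1{z <= 0} <= 2 Q(z), and Q(z) <= l(z).
zero_one = (z <= 0).astype(float)
assert np.all(zero_one <= 2.0 * Q(z) + 1e-12)
assert np.all(Q(z) <= ell(z) + 1e-12)

# 2) The identity l(-z) - l(z) = z used in the Regime C proof sketch.
assert np.allclose(ell(-z) - ell(z), z)

# 3) Under label flips w.p. beta: E[l(y_tilde z)] = (1-beta) l(z) + beta l(-z)
#    = l(z) + beta * z, by the identity above; the linear term is what forces
#    beta to be small.
beta = 0.1
assert np.allclose((1 - beta) * ell(z) + beta * ell(-z), ell(z) + beta * z)
```

The factor of 2 in the first inequality is tight at z = 0, where the zero-one loss is 1 and Q(0) = 1/2.]]></p>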
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experimental Results</head><p>The goal of this section is to provide experimental support for our theoretical findings in Section 2 and Section 3. Code is available on GitHub 1 . First, we describe the experimental setup.</p><p>Datasets. We use the MNIST and CIFAR10 datasets for the empirical evaluation. MNIST is a dataset of 28 × 28 greyscale handwritten digits, containing 70K samples in 10 classes, with 60K training images and 10K test images. CIFAR10 is a dataset of 32 × 32 color images, containing 60K samples in 10 classes, with 50K training images and 10K test images.</p><p>Model specification. We use four different models: a linear model trained on MNIST, an AlexNet model trained on CIFAR10, and two convolutional neural networks, with width ranging from 10 to 100,000, trained on MNIST and CIFAR10. For the MNIST dataset, we use a model with two</p><p>loss "wastes" a lot of the budget on making a few samples "more wrong".</p><p>Convex learning problems. We first train a linear model on the clean MNIST dataset and denote it by w*. We then train several linear models under poisoning attacks for various sample sizes in the range n ∈ [500, 60000], which we denote by w*_n. Figure <ref type="figure">2</ref> shows the excess loss F(w*_n) − F(w*) as well as the excess error L(w*_n) − L(w*) as a function of the sample size n, for different corruption rates C ∈ {50, 100, 200, 300, 400, 500}.</p><p>It is not surprising that both the excess loss and the excess error are smaller for larger sample sizes or smaller corruption rates, as predicted by Theorem 2.2. More interestingly, the plots suggest a phase transition in the convergence behavior of the curves at C ≈ 250 ≈ √n, which corresponds to the maximum corruption rate under which Theorem 2.2 still yields a non-trivial (decaying with sample size) generalization error bound.</p><p>Wide neural networks. 
Recall that our theoretical results in Section 3 guarantee a small generalization error for networks trained on poisoned data only when the network width falls in a certain range specified in the theorems. While it is not clear whether these bounds are necessary, we observe that the clean test accuracy of models trained on poisoned data exhibits an inverted-U curve: the generalization accuracy decreases if the models are not wide enough or if they are too wide. In Figure <ref type="figure">3</ref>, we see that for clean-data training, corresponding to the green curves, the accuracy improves monotonically with the network width. However, in the presence of data poisoning attacks, in both the left (MNIST) and right (CIFAR10) panels, we observe that the test accuracy is non-monotonic in the network width. In each of the regimes A, B, and C, the accuracy improves as we initially increase the network width; it then plateaus and eventually starts to fall as we increase the width further. This observation challenges the prevailing view in the deep learning literature that larger models generalize better <ref type="bibr">(Neyshabur et al., 2014;</ref><ref type="bibr">Zhang et al., 2016)</ref>, at least under adversarial perturbations.</p></div>
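The width-sweep experiment described above can be sketched in a few dozen lines. The snippet below is a simplified stand-in, not the paper's released code: it uses synthetic linearly separable data, random label flips as the poisoning model, and a two-layer ReLU network f(x) = aᵀ·relu(Wx) in which only the first layer is trained by SGD on the logistic loss; the data dimension, widths, learning rate, and flip rate are all illustrative choices. Whether the inverted-U curve appears depends on the data and the corruption budget.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)  # ground-truth linear separator

def make_data(n, flip_rate=0.0):
    # linearly separable labels; flip_rate plays the role of the poisoning budget
    X = rng.normal(size=(n, d))
    y = np.where(X @ w_star >= 0, 1.0, -1.0)
    y[rng.random(n) < flip_rate] *= -1.0
    return X, y

def train_two_layer(X, y, width, epochs=5, lr=0.1):
    # f(x) = a^T relu(W x); SGD on the logistic loss, training the first layer only
    n = X.shape[0]
    W = rng.normal(size=(width, d)) / np.sqrt(d)
    a = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)
    for _ in range(epochs):
        for i in rng.permutation(n):
            h = np.maximum(W @ X[i], 0.0)
            margin = np.clip(y[i] * (a @ h), -30.0, 30.0)
            g = -y[i] / (1.0 + np.exp(margin))            # d loss / d f(x_i)
            W -= lr * g * np.outer(a * (h > 0.0), X[i])   # chain rule through the ReLU
    return W, a

def accuracy(W, a, X, y):
    return float(np.mean(np.sign(np.maximum(X @ W.T, 0.0) @ a) == y))

Xtr, ytr = make_data(500, flip_rate=0.1)  # 10% of training labels flipped
Xte, yte = make_data(2000)                # clean test set
accs = {m: accuracy(*train_two_layer(Xtr, ytr, m), Xte, yte) for m in (10, 100, 1000)}
```

Sweeping a much wider range of widths (and flip rates) and plotting clean test accuracy against width is the analogue of the regimes A, B, and C discussed above.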
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion and Future Work</head><p>In this paper, we study the robustness of SGD to data poisoning attacks in two-layer neural networks. In particular, under a separability assumption in the feature space induced by the gradient of the infinite-width network at initialization, we characterize several practical data poisoning scenarios where SGD efficiently learns the network, provided that the network width is sufficiently but not excessively large. In sharp contrast with clean-data training, where the generalization error decreases as the width of the network increases <ref type="bibr">(Zhang et al., 2016;</ref><ref type="bibr">Neyshabur et al., 2014)</ref>, curiously, our empirical findings indicate that the robust generalization error exhibits a U-curve as a function of the network width.</p><p>There are several natural directions for future work. First, although we observe in practice that ultra-wide neural networks are more vulnerable to data poisoning attacks, our theoretical results do not directly imply that too large a network width can actually hurt the generalization performance under data poisoning attacks. Therefore, a natural question that remains open is to prove that SGD fails at robustly learning ultra-wide neural networks in the presence of adversarial perturbations such as those considered in this work. We would like to highlight that in a very recent work, Bubeck et al. 
( <ref type="bibr">2020</ref>) conjecture that over-parameterization may be necessary for robustness; while our results do not contradict theirs, they certainly call for further investigation into the role of over-parameterization in imparting or degrading robustness.</p><p>Second, our theory depends heavily on the separability assumption and cannot be trivially extended to deeper architectures; yet, our empirical findings go beyond two-layer networks and hold for natural datasets where the separability assumption no longer holds. It remains to be seen whether we can relax the margin assumption and generalize our results to richer network architectures.</p><p>Third, our paper focuses on the role of the width; however, it is not immediately clear from our results whether the U-curve phenomenon is specific to the network width, or whether it can occur more broadly for ultra-large networks. It would be interesting to explore the role of other architectural parameters, such as the network depth, in robust learning.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://github.com/bettyttytty/robust_learning_for_data_poisoning_attack</p></note>
		</body>
		</text>
</TEI>
