<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Practical Convex Formulations of One-hidden-layer Neural Network Adversarial Training</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10388787</idno>
					<idno type="doi">10.23919/ACC53348.2022.9867244</idno>
					<title level='j'>Proceedings of the American Control Conference</title>
<idno>0743-1619</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yatong Bai</author><author>Tanmay Gautam</author><author>Yu Gai</author><author>Somayeh Sojoudi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[As neural networks become more prevalent in safety-critical systems, ensuring their robustness against adversaries becomes essential. "Adversarial training" is one of the most common methods for training robust networks. Current adversarial training algorithms solve highly non-convex bi-level optimization problems. These algorithms suffer from the lack of convergence guarantees and can exhibit unstable behaviors. A recent work has shown that the standard training formulation of a one-hidden-layer, scalar-output fully-connected neural network with rectified linear unit (ReLU) activations can be reformulated as a finite-dimensional convex program, addressing the aforementioned issues for training non-robust networks. In this paper, we leverage this "convex training" framework to tackle the problem of adversarial training. Unfortunately, the scale of the convex training program proposed in the literature grows exponentially in the data size. We prove that a stochastic approximation procedure that scales linearly yields high-quality solutions. With the complexity roadblock removed, we derive convex optimization models that train robust neural networks. Our convex methods provably produce an upper bound on the global optimum of the adversarial training objective and can be applied to binary classification and regression. We demonstrate in experiments that the proposed method achieves a superior robustness compared with the existing methods.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Neural networks, as one of the most powerful and popular machine learning tools, are vulnerable to adversarial attacks. In the field of computer vision, for instance, slight manipulations of the input images can elicit misclassifications in neural networks with high confidence <ref type="bibr">[1]</ref>- <ref type="bibr">[3]</ref>. Neural networks have found a wide range of applications, especially in control theory <ref type="bibr">[4]</ref>, where robustness is a high priority. For example, deep reinforcement learning has been used to control highly non-linear and complex systems <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>. An adversarial attack on the underlying neural network may cause the control system to completely fail <ref type="bibr">[7]</ref>. Thus, it is crucial to analyze the adversarial robustness of neural networks, especially when they are applied to safety-critical systems.</p><p>While there have been studies on robustness certifications <ref type="bibr">[8]</ref>- <ref type="bibr">[10]</ref>, researchers have also been working extensively on training classifiers whose predictions are robust to adversarial inputs <ref type="bibr">[3]</ref>, <ref type="bibr">[11]</ref>- <ref type="bibr">[13]</ref>. "Adversarial training" is one of the most effective methods to train robust networks, compared with other methods such as obfuscated gradients <ref type="bibr">[14]</ref>. Adversarial training replaces the standard loss function with a robust loss function and solves a bi-level min-max optimization problem. More recently, <ref type="bibr">[15]</ref> and <ref type="bibr">[16]</ref> proposed "randomized smoothing", complementing adversarial training by achieving certified robustness at test time. Prior works have applied convex relaxation techniques to adversarial training. These works use weak duality to upper-bound the inner maximum of the adversarial training formulation and develop upper-bound loss functions that can be optimized with back-propagation <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref>. While these works rely on convex relaxations, the resulting training formulations are still non-convex.</p><p>Training neural networks involves optimizing non-convex objectives. In practice, training usually relies on Stochastic Gradient Descent (SGD) back-propagation, which only guarantees convergence to a local minimum for non-convex programs. While gradient descent can converge to a global optimizer for one-hidden-layer ReLU networks when the network is wide enough <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref>, spurious local minima still exist in general. Moreover, the non-convexity of the optimization landscape results in poor interpretability and excessive sensitivity to hyperparameters. These issues become worse when adversarial training is incorporated: adversarial training can be highly unstable in practice.</p><p>Convex programs have the favorable property that all local minima are globally optimal. To overcome the issue of arriving at spurious local minima when training neural networks, existing works have considered convexifying the neural network training problem <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref>. More recently, <ref type="bibr">[23]</ref> proposed a convex optimization problem with the same global minimum as the non-convex cost function for a onehidden-layer fully-connected ReLU neural network.</p><p>Therefore, extending "convex training" to adversarial training has become a natural solution to the optimization difficulty issues. Unfortunately, the size of the convex program proposed in <ref type="bibr">[23]</ref> grows exponentially in the training data matrix rank, leading to an exponential overall complexity. While <ref type="bibr">[23]</ref> proposed a heuristic method to reduce the computation through approximation, it did not provide theoretical insights into the approximation quality. To bridge this gap, we theoretically bound the quality and show that the scale of this approximation is linear in the training data size.</p><p>With these roadblocks cleared, we build upon the aforementioned works to develop "convex adversarial training", explicitly focusing on the hinge loss (binary classification) and the squared loss (regression). We mathematically show that solving the proposed robust convex programs trains robust neural networks and empirically demonstrate the advantages over traditional methods. The theoretical analysis focuses on one-hidden-layer neural networks but can extend to more complex architectures.</p><p>The novelties of this paper include a convex robust neural network training analysis (Sections IV-VI) and a probabilistic bound on the suboptimality of a relaxation method that was previously known as a heuristic (Section III). Due to space restrictions, some of the proofs are moved to the technical report <ref type="bibr">[24]</ref>, which also includes additional numerical experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Notations</head><p>Throughout this work, we focus on fully-connected neural networks with one ReLU-activated hidden layer and a scalar output, defined as b y = P m j=1 (Xu j ) + &#8629; j , where X 2 R n&#8677;d is the input data matrix with n data points in R d and b y 2 R n is the corresponding output vector. We denote the target output used for training as y 2 R n . u 1 , . . . , u m 2 R d are the weight vectors of the m neurons in the hidden layer while &#8629; 1 , . . . , &#8629; m 2 R are the weights of the output layer. The symbol (&#8226;) + = max{0, &#8226;} indicates the ReLU activation.</p><p>Furthermore, let k&#8226;k p denote the `p-norm within R n and denote the Hadamard product. For P 2 N + , we define [P ] as the set {a 2 N + |a &#63743; P }, where N + is the positive integer set. For q 2 R n , sgn(q) 2 R n denotes the sign of each entry of q and [q 0] denotes a boolean vector in {0, 1} n that represents the non-negativeness of each entry of q. The symbol diag(q) denotes a diagonal matrix Q 2 R n&#8677;n , where Q ii = q i for all i, and Q ij = 0 for all i 6 = j. The symbol 1 defines a column vector with all entries being 1. For a 2 R n and b 2 R, the inequality a b means a i b for all i. The notation &#8679; S (&#8226;) denotes the projection onto the set S and |S| denotes the cardinality of the set. For r 2 R n , r &#8672; N (0, I n ) indicates that r is a standard normal random vector.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Adversarial training</head><p>Adversarial training is the main problem under study in this paper. A classifier is considered adversarially robust if it assigns the same label to all inputs within an `1 bound with radius &#9999; <ref type="bibr">[3]</ref>. The perturbation set can be defined formally as</p><p>o .</p><p>One standard method for training robust classifiers minimizes the "robust cost", defined as the maximum loss within the perturbation set. The process of "training with adversarial data" is often referred to as "adversarial training", as opposed to "standard training" that trains on clean data. Formally, this method solves the min-max problem</p><p>(see <ref type="bibr">[12]</ref> for more details). In the prior literature, Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) are commonly used to numerically solve the inner maximization of (2) and generate adversarial examples <ref type="bibr">[12]</ref>. The outer minimization of ( <ref type="formula">2</ref>) is still solved with SGD back-propagation. We abbreviate these traditional methods as GD-FGSM and GD-PGD. More specifically, PGD generates adversarial examples x by running the iterations</p><p>for t = 0, 1, . . . , where x t is the perturbed data vector at iteration t, the initial vector x0 is the unperturbed data x, &#8679; X denotes the projection onto the set X , and &gt; 0 is the step size. The projection step can be performed by simply clipping the coordinates that deviate more than &#9999; from x.</p><p>FGSM can be regarded as running PGD for a single step with a large step size. While GD-FGSM and GD-PGD have demonstrated their capabilities of training robust networks in various settings <ref type="bibr">[3]</ref>, <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>, they suffer from the following issues:</p><p>&#8226; Poor interpretability: With GD-PGD and GD-FGSM, it is hard to monitor the training status. For example, when the training loss is high, it can be unclear whether a satisfactory robustness has been achieved or the training was unsuccessful.</p><p>&#8226; Sensitivity to hyperparameters: The hyperparameters of GD-PGD include the number of epochs, batch size, and step size of SGD (for outer minimization), and the step size and the number of steps of PGD (for inner maximization). The value of each parameter affects the the performance, but is challenging to design. SGD back-propagation is also sensitive to the initializations. &#8226; Lack of optimality guarantees: The inner maximization problem of (2) is non-concave, and the outer minimization is non-convex in general. Convergence guarantees are lacking for both subproblems. &#8226; Vanishing / exploding gradients: For back-propagation, the gradients at shallower layers depend on deeper layers, thus susceptible to vanishing or exploding gradients. Moreover, iteratively solving the bi-level optimization (2) requires an algorithm with an inefficient nested loop structure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Convex training</head><p>Here, we introduce our main analysis framework -convex training. Consider the optimization for training a one-hiddenlayer network with a regularized convex loss `(b y, y):</p><p>where &gt; 0 is a regularization parameter. Consider a set of diagonal matrices {diag([Xu 0]) | u 2 R d }, and denote the distinct elements of this set as D 1 , . . . , D P . The constant P is the total number of partitions of R d by hyperplanes passing through the origin that are also perpendicular to the rows of X <ref type="bibr">[23]</ref>. Intuitively, the D i matrices represent the ReLU activation patterns associated with X.</p><p>Consider the convex optimization problem</p><p>and its dual formulation</p><p>where <ref type="formula">6</ref>) is a convex semi-infinite program. The next theorem borrowed from <ref type="bibr">[23,</ref><ref type="bibr">Theorem 6]</ref> explains the relationship between the non-convex training problem (4), the convex problem <ref type="bibr">(5)</ref>, and the dual problem (6) when the neural network is sufficiently wide.</p><p>Theorem 1. Let (v ? i , w ? i ) P i=1 denote a solution of ( <ref type="formula">5</ref>) and define m ? as |{i :</p><p>, where m ? is upper-bounded by n + 1. If the loss function `(&#8226;, y) is convex, then (4), ( <ref type="formula">5</ref>), and ( <ref type="formula">6</ref>) share the same optimal objective. The optimal network weights (u ? j , &#8629; ? j ) m j=1 can be recovered using the formulas</p><p>where the remaining m m ? neurons are chosen to have zero weights.</p><p>While Theorem 1 requires an over-parameterized neural network, convex training can be applied to networks much narrower than m ? , as will be proved in Theorem 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PRACTICAL CONVEX TRAINING</head><p>Unfortunately, the worst-case computational complexity of solving <ref type="bibr">(5)</ref> is O d 3 r 3 ( n r ) 3r using standard interior-point solvers <ref type="bibr">[23]</ref>, prohibitively high for many practical applications. Here, r is the rank of the data matrix X and in many cases r = d. This high complexity makes convex training impractical. Before we use this framework to address the problems of adversarial training, we need to break down the complexity bottleneck of convex training.</p><p>The high complexity arises because the total number of D i matrices is upper-bounded by min 2 n , 2r e(n 1) r r . To reduce this number, <ref type="bibr">[23,</ref><ref type="bibr">Remark 3.3]</ref> introduced Algorithm 1. Algorithm 1 approximately solves (5) by independently sampling a subset of the D i matrices. However, <ref type="bibr">[23]</ref> did not provide theoretical insights regarding the approximation quality, and therefore the approximation remains a heuristic. The following theorem bridges this gap by providing a probabilistic bound on the suboptimality of the neural network trained with Algorithm 1.</p><p>) where a i &#8672; N (0, I d ) i.i.d. for all i, generate P s distinct diagonal matrices. 2: Solve</p><p>(2D i I n )Xw i 0, 8i 2 [P s ];</p><p>3: Recover u 1 , . . . , u ms and &#8629; 1 , . . . , &#8629; ms from the solution (v ? si , w ? si ) Ps i=1 of ( <ref type="formula">8</ref>) using <ref type="bibr">(7)</ref>.</p><p>Theorem 2. Consider an additional diagonal matrix D Ps+1 uniformly sampled, and then construct</p><p>where and &#8672; are preset confidence level constants between 0 and 1, then with probability at least 1 &#8672;, it holds that P{p ? s2 &lt; p ? s1 } &#63743; . Proof sketch: It can be shown that a dual problem of ( <ref type="formula">5</ref>) is an instance of "uncertain convex program (UCP)". Similarly, it can be shown that a dual problem of the approximation (8) is a "sampled convex program (SCP)", which relaxes the UCP by randomly dropping some of the constraints. The quality of the SCP relaxation can then be bounded using the analysis presented in <ref type="bibr">[25]</ref>, which introduces the confidence level constants and &#8672;.</p><p>&#8676; The formal proof is presented in <ref type="bibr">[24,</ref><ref type="bibr">Appendix B]</ref>. Intuitively, Theorem 2 states that independently sampling an additional D Ps+1 matrix will not reduce the training cost with high probability. One can recursively apply this bound T times to show that when P s is large, the solution with P s matrices is close to the solution with P s + T matrices for an arbitrary number T . So, the optimality gap due to sampling will be small, and the trained network is nearly optimal.</p><p>Compared with P , which is exponential in r, P s is on the order of n &#8672; , linear in n and independent of r. When r is large, solving the approximated formulation ( <ref type="formula">8</ref>) is exponentially more efficient than solving <ref type="bibr">(5)</ref>. On the other hand, Algorithm 1 is no longer deterministic due to the stochastic sampling of the D i matrices, and yields upper bounds to the global optimum of (5). We have verified empirically (shown in Section VII-A) that even when P s is much smaller than P , Algorithm 1 still reliably returns a low training cost.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. CONVEX ADVERSARIAL TRAINING</head><p>To conquer the drawbacks of traditional adversarial training, we leverage Theorem 1 to recast (2) as robust, convex upper bound problems that can be efficiently minimized globally.</p><p>We first develop a result about adversarial training involving general convex loss functions.</p><p>The connection between the convex training objective (5) and the non-convex true training cost (4) holds only when the linear constraints in (5) are satisfied. For adversarial training, we need this connection to hold at all perturbed data matrices X + 2 X . Otherwise, if some matrix X + violates the constraints, then this perturbation can correspond to a low convex objective but a high actual loss. To ensure the meaningfulness of the convex reformulation throughout X , we introduce the robust constraints (10b) and (10c).</p><p>Since the D i matrices in (5) reflect the ReLU patterns of X, the D i matrices can change as X is perturbed. Therefore, we include all distinct diagonal matrices diag([(X + )u 0]) that can be obtained for all u 2 R d and all : X + 2 U, denoted as D 1 , . . . , D b P , where b P is the total number of such matrices. Since D 1 , . . . , D b P include D 1 , . . . , D P in (5), we have b P P . While b P is at most 2 n in the worst case, since &#9999; is often small, we expect b P to be relatively close to P , where P &#63743; 2r e(n 1) r r as discussed above.</p><p>Finally, we replace the objective of ( <ref type="formula">5</ref>) with its robust counterpart, giving rise to the optimization</p><p>s. t. min</p><p>where U is any convex additive perturbation set. The next theorem shows that ( <ref type="formula">10</ref>) is an upper bound to the robust loss function <ref type="bibr">(2)</ref>. , the optimization <ref type="bibr">(10)</ref> provides an upper bound on the non-convex adversarial training problem <ref type="bibr">(2)</ref>. The robust network weights (u ? robj , &#8629; ? robj ) b m j=1 can be recovered using <ref type="bibr">(7)</ref>. Moreover, if ? rob denotes a solution to the inner maximization in (10a), then X + ? rob corresponds to the worst-case adversarial inputs for the recovered neural network.</p><p>Proof sketch: Since the linear constraints in ( <ref type="formula">5</ref>) are satisfied by all matrices X + , the relationship between ( <ref type="formula">5</ref>) and ( <ref type="formula">4</ref>) holds for all matrices X + . Thus, ?</p><p>rob is optimal for the inner maximization of (4). Since (u ? robj , &#8629; ? robj ) b m j=1 may not be optimal for the outer minimization of (4), ( <ref type="formula">10</ref>) is an upper bound.</p><p>&#8676; The formal proof is provided in [24, Appendix C]. When the perturbation set is zero, Theorem 3 reduces to Theorem 1. Rather than an exact reformulation, ( <ref type="formula">10</ref>) is an upper bound problem because the robust constraints (10b) and (10c) enforce that the ReLU activation pattern of the perturbed data X + remains the same within X , effectively reducing the Algorithm 2 Practical convex adversarial training 1: for i = 1 to P a do 2:</p><p>for j = 2 to S do 5:</p><p>R ij [r 1 , . . . , r d ], where r h &#8672; N (0, I n ), 8h</p><p>6:  </p><p>12: Recover u 1 , . . . , u ms and &#8629; 1 , . . . , &#8629; ms from the solution (v ? robsi , w ? robsi ) Ps i=1 of ( <ref type="formula">12</ref>) using <ref type="bibr">(7)</ref>.</p><p>feasible space of neural networks and causing suboptimality. The optimality gap between ( <ref type="formula">10</ref>) and ( <ref type="formula">2</ref>) is solely due to this suboptimality of the outer minimization, whereas the inner maximization is exact.</p><p>In light of Theorem 3, we use optimization <ref type="bibr">(10)</ref> as a surrogate for optimization (2) to train the neural network. We will show that the new problem can be efficiently solved in important cases.</p><p>For the `1 perturbation set X , the constraints in (10b) and (10c) can be equivalently replaced by the algebraic constraints</p><p>To understand this, observe that for the `1 set, (10b) and (10c) become linear programming (LP) subproblems. Solving the LPs in closed forms yields <ref type="bibr">(11)</ref>. The detailed derivation is provided in [24, Appendix D].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Practical algorithm for convex adversarial training</head><p>Since Theorem 2 does not rely on any assumption on the matrix X, it applies to an arbitrary matrix X + , and naturally extends to the convex adversarial training formulation <ref type="bibr">(10)</ref>. Therefore, an approximation to (10) can be applied to train robust neural networks with widths much less than b m ? . Similar to the strategy rendered in Algorithm 1, we use a subset of the D i matrices for practical adversarial training. Since the D i matrices depend on the perturbation , we also add randomness to the data matrix X in the sampling process to cover D i matrices associated with different perturbations, leading to Algorithm 2. P a and S are preset parameters that control the number of times we run the random weight sampling procedure, with P a &#8226; S P s .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONVEX HINGE LOSS ADVERSARIAL TRAINING</head><p>While the inner maximization of the robust problem <ref type="bibr">(10)</ref> is still hard to solve in general, it is tractable for some loss functions. The simplest case is the piecewise-linear hinge loss `(b y, y) = (1 b y y) + , which is widely used for classification. Here, we focus on binary classification with y 2 { 1, 1} n .</p><p>Consider the adversarial training problem for a one-hiddenlayer neural network with `2 regularized hinge loss:</p><p>Applying Theorem 3 leads to the following formulation as an upper bound on <ref type="bibr">(13)</ref>:</p><p>When generating the D i matrices, instead of enumerating an infinite number of points in X as suggested in Theorem 3, we only need to enumerate all vertices of X , which is finite. This is because the solution ?</p><p>hinge to the inner maximum is always at a vertex of X , as will be shown in Theorem 4.</p><p>Theorem 4. For binary classification, the inner maximum of ( <ref type="formula">14</ref>) is attained at ? hinge = &#9999; &#8226; sgn</p><p>&#8984; , and the bi-level optimization problem ( <ref type="formula">14</ref>) is equivalent to the classic convex optimization min</p><p>where d ik denotes the k th diagonal element of D i . Proof sketch: Observe that the regularizations in <ref type="bibr">(14)</ref> are independent from and the rest of the objective is piecewise linear. Using the fact that max (&#8226;) + is equivalent to (max &#8226;) + , one can reform the inner maximization of ( <ref type="formula">14</ref>) into an LP. The optimal ? hinge is then obtained by solving the LP in closed form. Plugging ? hinge back yields <ref type="bibr">(15)</ref>. &#8676; The formal proof is provided in <ref type="bibr">[24,</ref><ref type="bibr">Appendix E]</ref>. The problem ( <ref type="formula">15</ref>) is a finite-dimensional convex program that provides an upper bound on <ref type="bibr">(13)</ref>. We can thus solve <ref type="bibr">(15)</ref> to robustly train the neural network. The `1 norm term in <ref type="bibr">(15)</ref> explains the regularization effect of adversarial training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONVEX SQUARED LOSS ADVERSARIAL TRAINING</head><p>The squared loss `(b y, y) = 1 2 kb y yk 2 2 is another commonly used loss function in machine learning. It is widely used for regression tasks, but can also be used for classification.</p><p>Consider the non-convex adversarial training problem of a one-hidden-layer ReLU neural network trained with the `2-regularized squared loss:</p><p>Applying Theorem 3 leads to the following formulation as an upper bound on <ref type="bibr">(16)</ref>:</p><p>Theorem 5. The optimization problem ( <ref type="formula">17</ref>) is equivalent to the convex program:</p><p>.</p><p>Proof sketch: We rewrite <ref type="bibr">(17)</ref> in the form of a robust second-order cone program (SOCP) and show that the robust SOCP is equivalent to the classic convex optimization (18) using the procedures outlined in <ref type="bibr">[26]</ref>.</p><p>&#8676; The formal proof is provided in <ref type="bibr">[24,</ref><ref type="bibr">Appendix F]</ref>. Problem ( <ref type="formula">18</ref>) is a convex optimization that can train robust neural networks. However, directly using <ref type="bibr">(18)</ref> for adversarial training can be intractable due to the large number of constraints that arise when we include all D matrices associated with all such that X + 2 X. To this end, we can use the approximation in Algorithm 2 and sample a subset of the diagonal matrices. The optimality gap again can be characterized with Theorem 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. NUMERICAL EXPERIMENTS</head><p>In this section, we focus on experimenting with the hinge loss. The experiment results with the squared loss convex adversarial training formulation <ref type="bibr">(18)</ref> are provided in <ref type="bibr">[24,</ref><ref type="bibr">Appendix A.1]</ref>. For all experiments, CVX <ref type="bibr">[27]</ref> with the MOSEK <ref type="bibr">[28]</ref> solver was used for solving the optimizations in Algorithm 1 and Algorithm 2 on a laptop computer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Approximation quality of Algorithm 1</head><p>We use numerical experiments to demonstrate the quality of the neural networks trained using the convex standard training algorithm (Algorithm 1). The experiment was performed on a randomly-generated dataset with n = 40 and d = 2. The upper bound on the number of ReLU activation patterns is P &#63743; 4 e(39) 2 2 = 11239. We ran Algorithm 1 to train neural networks using the hinge loss with the number of D i matrices equal to 4, 8, 16, . . . , 2048, and with chosen as a commonly-used value 10 4 . We repeated this experiment 15 times for each setting, and plotted the mean optimized loss in Figure <ref type="figure">1</ref>. The error bars show the loss achieved in the best and the worst runs. When there are more than 128 matrices (much less than the theoretical bound on P ), Algorithm 1 yields consistent and favorable results. Further increasing the number of D i matrices does not produce a significantly lower loss. This result supports the findings of Theorem 2.</p><p>By Theorem 2, P s = 128 corresponds to a confidence level of &#8672; = 0.318.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Convex adversarial training on 2-dimensional data</head><p>To analyze the decision boundaries obtained from convex adversarial training, we ran Algorithm 1 and Algorithm 2 on 34 random points in 2-dimensional space for binary classification. The algorithms were run with the parameters = 10 9 , P s = 360 and &#9999; = 0.08. A bias term was included by concatenating a column of ones to the data matrix X. The decision boundaries shown in Figure <ref type="figure">2</ref> confirm that Algorithm 2 fits the perturbation boxes as designed, coinciding with the theoretical prediction <ref type="bibr">[12,</ref><ref type="bibr">Figure 3]</ref>. For illustration purposes, the regularization parameter is small to reduce the smoothing of `2 regularization. Experiments with different choices of <ref type="bibr">[24,</ref><ref type="bibr">Appendix A.2]</ref> show that larger values yield similar behaviors, and that the decision boundaries by Algorithm 2 are more robust than GD-PGD boundaries.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Convex adversarial training -the optimization landscape</head><p>This subsection visualizes that for a neural network trained with Algorithm 2, the convex landscape and the non-convex landscape overlap for an `1-norm bounded perturbation with radius &#9999; added upon a training point x k . We use the same data and parameters as in Section VII-B to train a neural network. We then randomly pick one of the training points x k , and plot the loss around x k for the convex objective (10a) and the non-convex objective <ref type="bibr">(2)</ref>. Specifically, we define</p><p>where d ik is the k th entry of D i , y k is the training label corresponding to x k , and v ? i , w ? i are the optimizers returned by Algorithm 2. Moreover, u ? j and &#8629; ? j are the neural network weights recovered from v ?</p><p>i and w ? i with (7). We plot `convex `nonconvex for k k 1 &#63743; 0.3 and zoom in to k k 1 &#63743; 0.08 in Figure <ref type="figure">3</ref>. When `convex `nonconvex is zero, the convex objective provides an exact certificate for the non-convex loss function. The right figure shows that the difference is zero for k k 1 &#63743; 0.08, and thereby verify that the convex objective (10a) provides an exact certification of the non-convex loss function (2) around the training points.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Convex adversarial training on CIFAR-10</head><p>We then verified the real-world performance of the proposed convex training methods on a subset of the CIFAR-10 image classification dataset <ref type="bibr">[29]</ref> for binary classification between the second class and the eighth class. The subset consists of 600 images downsampled to d = 147. The parameters were chosen as &#9999; = 10, = 10 4 , and P s = 36, so the widths of the recovered neural networks were at most 72. For back-propagation methods, the network width m was set to 72.</p><p>The hinge loss has a flat part with zero gradient. To generate adversarial examples even in this part, we treat it as "leaky hinge loss": max(&#8675;(1 &#375; &#8226; y), 1 &#375; &#8226; y), where &#8675; ! 0 + . Hence, the FGSM calculation evaluates to Algorithm 1 and Algorithm 2 are compared with traditional back-propagation methods GD-FGSM and GD-PGD. For GD-PGD, we used = &#9999;/30 and ran PGD for 40 steps.</p><p>Table <ref type="table">I</ref> presents the CIFAR-10 experiment results. Algorithm 1 achieved a slightly higher clean accuracy compared with GD-std, and returned a much lower training cost. Such behavior supports the findings of Theorem 2. The convex adversarial training algorithm (Algorithm 2) achieved better accuracy on clean data and adversarial data compared with GD-FGSM and GD-PGD. While Algorithm 2 solves the upper bound problem <ref type="bibr">(15)</ref>, it returned a lower training objective compared with GD-FGSM and GD-PGD, showing that the back-propagation methods failed to find an optimal network. Moreover, the back-propagation methods are highly sensitive to initializations and hyperparameter choices. In contrast, since Algorithm 1 and Algorithm 2 solve convex programs, they are much less sensitive and are guaranteed to converge to their global optima. Compared with Algorithm 1, Algorithm 2 retains the advantage in the absence of spurious local minima while vastly improving adversarial robustness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. CONCLUSION</head><p>This paper proposes a novel "convex adversarial training" method that solves convex programs to train robust neural networks. Compared with traditional adversarial training methods, the favorable properties of convex optimization endow convex adversarial training with the following advantages:</p><p>&#8226; Global convergence to an upper bound: For the hinge loss and the squared loss, convex adversarial training provably converges to an upper bound on the globally optimal cost, offering superior interpretability. &#8226; Guaranteed adversarial robustness on training data:</p><p>As shown in Theorem 4, the inner maximization over the robust loss function is solved exactly. &#8226; Hyperparameter-free: In practice, Algorithm 2 can automatically determine its step size with line search, not requiring any preset parameters. &#8226; Immune to vanishing gradients: The convex training method avoids this problem completely because it does not rely on back-propagation. Overall, convex adversarial training makes it easier to train robust and interpretable neural networks, potentially facilitating their applications in the control of safety-critical systems. While this work explicitly focuses on one-hiddenlayer fully-connected networks, the same robust optimization analysis extends to more sophisticated architectures, since deeper networks <ref type="bibr">[30]</ref>, vector-output networks <ref type="bibr">[31]</ref>, and certain ConvNets <ref type="bibr">[32]</ref> also have similar convex training representations.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Authorized licensed use limited to: Univ of Calif Berkeley. Downloaded on December 31,2022 at 10:02:38 UTC from IEEE Xplore. Restrictions apply.</p></note>
		</body>
		</text>
</TEI>
