<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>MaxUp: Lightweight Adversarial Training with Data Augmentation Improves Neural Network Training</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10276262</idno>
					<idno type="doi"></idno>
					<title level='j'>Advances in computer vision and pattern recognition</title>
<idno>2191-6586</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Chengyue Gong</author><author>Tongzheng Ren</author><author>Mao Ye</author><author>Qiang Liu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We propose MaxUp, an embarrassingly simple, highly effective technique for improving the generalization performance of machine learning models, especially deep neural networks. The idea is to generate a set of augmented data with some random perturbations or transforms and minimize the maximum, or worst-case, loss over the augmented data. By doing so, we implicitly introduce a smoothness or robustness regularization against the random perturbations, and hence improve the generalization performance. For example, in the case of Gaussian perturbation, MaxUp is asymptotically equivalent to using the gradient norm of the loss as a penalty to encourage smoothness. We test MaxUp on a range of tasks, including image classification, language modeling, and adversarial certification, on which MaxUp consistently outperforms the existing best baseline methods, without introducing substantial computational overhead. In particular, we improve ImageNet classification from the state-of-the-art top-1 accuracy 85.5% without extra data to 85.8%. Code will be released soon.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>A central theme of machine learning is to alleviate overfitting and thereby improve generalization performance on test data. This is often achieved by leveraging important prior knowledge of the models and data of interest. For example, regularization-based methods introduce a penalty on the complexity of the model, which often amounts to enforcing certain smoothness properties. Data augmentation techniques, on the other hand, leverage important invariance properties of the data (such as the shift and rotation invariance of images) to improve performance. Novel approaches that exploit important knowledge of the models and data hold the potential of substantially improving the performance of machine learning systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We propose MaxUp, a simple yet powerful training method to improve generalization performance and alleviate overfitting. Unlike standard methods that minimize the average risk on the observed data, MaxUp generates a set of random perturbations or transforms of each observed data point, and minimizes the average risk of the worst augmented copy of each data point. This allows us to enforce robustness against the random perturbations and transforms, and hence improve the generalization performance. MaxUp can easily leverage arbitrary state-of-the-art data augmentation schemes (e.g. <ref type="bibr">Zhang et al., 2018;</ref><ref type="bibr">DeVries &amp; Taylor, 2017;</ref><ref type="bibr">Cubuk et al., 2019a)</ref>, and substantially improves over them by minimizing the worst (instead of average) risk on the augmented data, without adding significant computational overhead.</p><p>Theoretically, in the case of Gaussian perturbation, we show that MaxUp effectively introduces a gradient-norm regularization term that encourages smoothness of the loss function, a term that does not appear in standard data augmentation methods that minimize the average risk.</p><p>MaxUp can be viewed as a "lightweight" variant of adversarial training against adversarial input perturbations (e.g. <ref type="bibr">Tram&#232;r et al., 2018;</ref><ref type="bibr">Madry et al., 2017)</ref>, but is mainly designed to improve the generalization on the clean data, instead of robustness on perturbed data (although MaxUp does also increase the adversarial robustness in Gaussian adversarial certification, as we show in our experiments (Section 4.4)). 
In addition, compared with standard adversarial training methods such as projected gradient descent (PGD) <ref type="bibr">(Madry et al., 2017)</ref>, MaxUp is much simpler and computationally much faster, and can be easily adapted to increase the various kinds of robustness defined by the corresponding data augmentation schemes.</p><p>We test MaxUp on three challenging tasks: image classification, language modeling, and certified defense against adversarial examples <ref type="bibr">(Cohen et al., 2019)</ref>. We find that MaxUp can leverage the different state-of-the-art data augmentation methods and boost their performance to achieve new state-of-the-art results on a range of tasks, datasets, and neural architectures. In particular, we set a new state-of-the-art result on ImageNet classification without extra data, which improves the best 85.5% top-1 accuracy by <ref type="bibr">Xie et al. (2019)</ref> to 85.8%. For the adversarial certification task, we find that MaxUp allows us to train more verifiably robust classifiers than prior art such as the PGD-based adversarial training proposed by <ref type="bibr">Salman et al. (2019)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Main Method</head><p>We start by introducing the main idea of MaxUp, and then discuss its effect of introducing smoothness regularization in Section 2.1.</p><p>ERM Given a dataset $D_n = \{x_i\}_{i=1}^n$, learning often reduces to a form of empirical risk minimization (ERM):</p><p><formula notation="tex">\min_\theta \; \mathbb{E}_{x \sim D_n} \left[ L(x, \theta) \right], \qquad (1)</formula></p><p>where &#952; is a parameter of interest (e.g., the weights of a neural network), and $L(x, \theta)$ denotes the loss associated with data point $x$. A key issue of ERM is the risk of overfitting, especially when the information in the data is insufficient.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MaxUp</head><p>We propose MaxUp to alleviate overfitting. The idea is to generate a set of random augmented data and minimize the maximum loss over the augmented data.</p><p>Formally, for each data point $x$ in $D_n$, we generate a set of perturbed data points $\{x'_i\}_{i=1}^m$ that are similar to $x$, and estimate &#952; by minimizing the maximum loss over $\{x'_i\}$:</p><p><formula notation="tex">\min_\theta \; \mathbb{E}_{x \sim D_n} \left[ \max_{i \in [m]} L(x'_i, \theta) \right]. \qquad (2)</formula></p><p>This loss can be easily minimized with stochastic gradient descent (SGD). Note that the gradient of the maximum loss is simply the gradient of the worst copy, that is,</p><p><formula notation="tex">\nabla_\theta \Big( \max_{i \in [m]} L(x'_i, \theta) \Big) = \nabla_\theta L(x'_{i^*}, \theta), \qquad (3)</formula></p><p>where $i^* = \arg\max_{i \in [m]} L(x'_i, \theta)$. This yields a simple and practical algorithm shown in Algorithm 1.</p><p>In our work, we assume the augmented data $\{x'_i\}_{i=1}^m$ are generated i.i.d. from a distribution $P(\cdot|x)$. The distribution $P(\cdot|x)$ can be based on small perturbations around $x$, e.g., $P(\cdot|x) = \mathcal{N}(x, \sigma^2 I)$, the Gaussian distribution with mean $x$ and isotropic variance $\sigma^2$. It can also be constructed based on invariant data transformations that are widely used in the data augmentation literature, such as random crops, equalizing, rotations, and clips for images (see, e.g., <ref type="bibr">Cubuk et al., 2019a;</ref><ref type="bibr">DeVries &amp; Taylor, 2017;</ref><ref type="bibr">Cubuk et al., 2019b)</ref>.</p></div>
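<div xmlns="http://www.tei-c.org/ns/1.0"><p>The procedure described above (draw m augmented copies, keep the one with the largest loss, and back-propagate on that copy alone) can be sketched as follows. This is a minimal NumPy illustration and not the authors' Algorithm 1 or released code: the logistic-regression model, the toy data, the step size, and the names <code>logistic_loss</code>, <code>logistic_grad</code> and <code>maxup_step</code> are all assumptions made for the example.</p>
<p><![CDATA[
```python
import numpy as np

def logistic_loss(w, x, y):
    """Logistic loss of a linear model; labels y are in {-1, +1}."""
    return np.log1p(np.exp(-y * (x @ w)))

def logistic_grad(w, x, y):
    """Gradient of logistic_loss with respect to the weights w (single example)."""
    return -y * x / (1.0 + np.exp(y * (x @ w)))

def maxup_step(w, xs, ys, m, sigma, lr, rng):
    """One SGD step on the MaxUp objective for a mini-batch (xs, ys)."""
    grad = np.zeros_like(w)
    for x, y in zip(xs, ys):
        # Draw m i.i.d. augmented copies from P(.|x) = N(x, sigma^2 I).
        copies = x + sigma * rng.standard_normal((m, x.size))
        # Forward pass on all m copies, then keep only the worst one.
        losses = logistic_loss(w, copies, y)
        worst = copies[np.argmax(losses)]
        # Backward pass on the worst copy only.
        grad += logistic_grad(w, worst, y)
    return w - lr * grad / len(xs)

# Toy, hypothetical data: two Gaussian blobs labelled -1 and +1.
rng = np.random.default_rng(0)
ys = rng.choice([-1.0, 1.0], size=64)
xs = ys[:, None] * np.array([1.0, 1.0]) + 0.5 * rng.standard_normal((64, 2))

w = np.zeros(2)
for _ in range(200):
    w = maxup_step(w, xs, ys, m=4, sigma=0.1, lr=0.5, rng=rng)
mean_clean_loss = float(np.mean(logistic_loss(w, xs, ys)))
```
]]></p>
<p>With m = 1 the step reduces to ordinary SGD on a single randomly augmented copy, so the only extra work for larger m is the additional forward passes.</p></div>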
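<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">MaxUp as a Smoothness Regularization</head><p>We provide a theoretical interpretation of MaxUp as introducing a gradient-norm regularization to the original ERM objective to encourage smoothness. Here we consider the simple case of isotropic Gaussian perturbation, when $P(\cdot|x) = \mathcal{N}(x, \sigma^2 I)$. To simplify notation, we define</p><p><formula notation="tex">\tilde{L}_{P,m}(x, \theta) := \mathbb{E}_{\{x'_i\}_{i=1}^m \sim P(\cdot|x)^m} \left[ \max_{i \in [m]} L(x'_i, \theta) \right], \qquad (4)</formula></p><p>which represents the expected MaxUp risk of data point $x$ with $m$ augmented copies.</p><p>Theorem 1 (MaxUp as Gradient-Norm Regularization).</p><p>Consider $\tilde{L}_{P,m}(x, \theta)$ defined in (4) with $P(\cdot|x) = \mathcal{N}(x, \sigma^2 I)$, and assume that $L(x, \theta)$ is second-order differentiable w.r.t. $x$. Then</p><p><formula notation="tex">\tilde{L}_{P,m}(x, \theta) = L(x, \theta) + c_{m,\sigma} \, \| \nabla_x L(x, \theta) \|_2 + O(\sigma^2),</formula></p><p>where $c_{m,\sigma}$ is a constant and $c_{m,\sigma} = \Theta(\sigma \sqrt{\log m})$, where $\Theta(\cdot)$ denotes the big-Theta notation.</p><p>Theorem 1 shows that the expected MaxUp risk can be viewed as introducing a Lipschitz-like regularization with the gradient norm $\| \nabla_x L(x, \theta) \|_2$, which encourages the smoothness of $L(x, \theta)$ w.r.t. the input $x$. The strength of the regularization is controlled by $c_{m,\sigma}$, which depends on the number of samples $m$ and the perturbation magnitude $\sigma$.</p><p>Proof. Using Taylor expansion, we have</p><p><formula notation="tex">\tilde{L}_{P,m}(x, \theta) = \mathbb{E} \left[ \max_{i \in [m]} \left( L(x, \theta) + \langle \nabla_x L(x, \theta), z_i \rangle + O(\sigma^2) \right) \right] = L(x, \theta) + \mathbb{E} \left[ \max_{i \in [m]} \langle \nabla_x L(x, \theta), z_i \rangle \right] + O(\sigma^2),</formula></p><p>where $z_i = x'_i - x$, which follows $\mathcal{N}(0, \sigma^2 I)$. The rest of the proof follows from Lemma 1 below.</p><p>Lemma 1. Let $g$ be a fixed vector in $\mathbb{R}^d$, and let $\{z_i\}_{i=1}^m$ be $m$ i.i.d. random variables from $\mathcal{N}(0, \sigma^2 I)$. We have</p><p><formula notation="tex">\mathbb{E} \left[ \max_{i \in [m]} \langle g, z_i \rangle \right] = c_{m,\sigma} \, \| g \|_2,</formula></p><p>where $c_{m,\sigma}$ is the expected maximum of $m$ i.i.d. $\mathcal{N}(0, \sigma^2)$ random variables (each $\langle g, z_i \rangle / \| g \|_2$ follows $\mathcal{N}(0, \sigma^2)$), which is well known to be $\Theta(\sigma \sqrt{\log m})$. See, e.g., <ref type="bibr">Orabona &amp; P&#225;l (2015)</ref>; <ref type="bibr">Kamath (2015)</ref>.</p></div>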
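<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scaling stated in Lemma 1 can be checked numerically. The sketch below is only an illustrative Monte Carlo experiment under assumed values (the vector <code>g</code>, the noise level, the sample sizes and the helper name <code>expected_max_inner</code> are not from the paper). It estimates the expected maximum of the inner products and divides by the norm of <code>g</code> to obtain an empirical value of the constant for several m.</p>
<p><![CDATA[
```python
import numpy as np

def expected_max_inner(g, m, sigma, n_trials, rng):
    """Monte Carlo estimate of E[max_i <g, z_i>] for z_i ~ N(0, sigma^2 I), i = 1..m."""
    z = sigma * rng.standard_normal((n_trials, m, g.size))
    return float(np.mean(np.max(z @ g, axis=1)))

rng = np.random.default_rng(0)
g = np.array([3.0, 4.0])      # hypothetical gradient vector, ||g||_2 = 5
sigma = 0.1

c_hat = {}
for m in (2, 4, 16, 64):
    est = expected_max_inner(g, m, sigma, 20000, rng)
    c_hat[m] = est / np.linalg.norm(g)   # empirical c_{m, sigma}
    print(m, round(c_hat[m], 4), round(sigma * np.sqrt(2 * np.log(m)), 4))
```
]]></p>
<p>For m = 2 the constant has the closed form sigma divided by the square root of pi, which gives a simple sanity check; for larger m the estimates grow slowly and stay below sigma times the square root of 2 log m, consistent with the stated order.</p></div>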
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related Methods and Discussion</head><p>MaxUp is closely related to both data augmentation and adversarial training. It can be viewed as an adversarial variant of data augmentation, in that it minimizes the worst-case loss on the perturbed data, instead of an average loss like typical data augmentation methods. MaxUp can also be viewed as a "lightweight" variant of adversarial training, in that the maximum loss is calculated by simple random sampling, instead of more accurate gradient-based optimizers for finding the adversarial loss, such as projected gradient descent (PGD). MaxUp is much simpler and faster than PGD-based adversarial training, and is more suitable for our purpose of alleviating overfitting on clean data (instead of adversarial defense). We now elaborate on these connections in depth.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data Augmentation</head><p>Data augmentation has been widely used in machine learning, especially on image data, which admit a rich set of invariance transforms (e.g. translation, rotation, random cropping). Recent augmentation techniques, such as MixUp <ref type="bibr">(Zhang et al., 2018)</ref>, CutMix <ref type="bibr">(Yun et al., 2019)</ref> and manifold MixUp <ref type="bibr">(Verma et al., 2019)</ref> have been found highly useful in training deep neural networks, especially in achieving state-of-the-art results on important image classification benchmarks such as SVHN, CIFAR and ImageNet. More recently, more advanced methods have been developed to find the optimal data augmentation policies using reinforcement learning or adversarial generative networks (e.g. <ref type="bibr">Cubuk et al., 2019a;</ref><ref type="bibr">b;</ref><ref type="bibr">Zhang et al., 2020)</ref>.</p><p>MaxUp can easily leverage these advanced data augmentation techniques to achieve good performance. The key difference, however, is that MaxUp in (2) minimizes the maximum loss on the augmented data, while typical data augmentation methods minimize the average loss, that is,</p><p><formula notation="tex">\min_\theta \; \mathbb{E}_{x \sim D_n} \left[ \frac{1}{m} \sum_{i=1}^m L(x'_i, \theta) \right], \qquad (5)</formula></p><p>which we refer to as standard data augmentation throughout the paper. It turns out that (<ref type="formula">2</ref>) and (<ref type="formula">5</ref>) behave very differently as regularization mechanisms, in that (<ref type="formula">5</ref>), unlike (2), does not introduce the gradient-norm regularization and hence lacks its benefit. 
This is because the first-order term in the Taylor expansion is canceled out by the averaging in (5).</p><p>Specifically, let $P(\cdot|x)$ be any distribution whose expectation is $x$, and assume $L(x, \theta)$ is second-order differentiable w.r.t. $x$.</p><p>Define the expected loss related to (5) on data point $x$:</p><p><formula notation="tex">\hat{L}_P(x, \theta) := \mathbb{E}_{x' \sim P(\cdot|x)} \left[ L(x', \theta) \right].</formula></p><p>Then with a simple Taylor expansion (writing $\sigma^2$ for the scale of the variance of $P(\cdot|x)$), we have</p><p><formula notation="tex">\hat{L}_P(x, \theta) = L(x, \theta) + O(\sigma^2),</formula></p><p>which misses the gradient-norm regularization term when compared with the MaxUp decomposition in Theorem 1.</p><p>Note that the MaxUp update is computationally faster than solving (5) with the same $m$, because we only need to back-propagate on the worst augmented copy for each data point (see Equation <ref type="formula">3</ref>), while solving (5) requires back-propagating on all $m$ copies at each iteration.</p></div>
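<div xmlns="http://www.tei-c.org/ns/1.0"><p>The contrast between averaging and taking the maximum can be illustrated numerically. The following sketch is an assumption-laden toy experiment (the smooth loss, the point <code>x</code>, the weights <code>w</code>, and the values of m and sigma are chosen only for illustration): it compares how far the expected average loss and the expected maximum loss over Gaussian copies move away from the clean loss.</p>
<p><![CDATA[
```python
import numpy as np

def loss(x, w):
    """A smooth loss in the input x: logistic loss with label +1 and fixed weights w."""
    return np.log1p(np.exp(-(x @ w)))

rng = np.random.default_rng(0)
w = np.array([2.0, -1.0])     # hypothetical fixed model weights
x = np.array([0.3, 0.2])      # hypothetical clean data point
m, sigma, n_trials = 4, 0.05, 50000

# n_trials independent sets of m Gaussian copies of x.
copies = x + sigma * rng.standard_normal((n_trials, m, x.size))
losses = loss(copies, w)                       # shape (n_trials, m)

clean = loss(x, w)
avg_gap = float(losses.mean(axis=1).mean() - clean)   # standard augmentation
max_gap = float(losses.max(axis=1).mean() - clean)    # MaxUp
print(avg_gap, max_gap)
```
]]></p>
<p>Under these settings the average gap is of second order in sigma, while the maximum gap is of first order and proportional to the gradient norm, so the former is expected to be much smaller than the latter.</p></div>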
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Adversarial Training</head><p>Adversarial training has been developed to defend against various adversarial attacks on the data inputs <ref type="bibr">(Madry et al., 2017)</ref>. It estimates &#952; by solving the following problem:</p><p><formula notation="tex">\min_\theta \; \mathbb{E}_{x \sim D_n} \left[ \max_{x' \in B(x, r)} L(x', \theta) \right], \qquad (7)</formula></p><p>where $B(x, r)$ represents a ball centered at $x$ with radius $r$ under some metric (e.g. $\ell_0$, $\ell_1$, $\ell_2$, or $\ell_\infty$ distances). The inner maximization is often solved by running projected gradient descent (PGD) for a number of iterations.</p><p>MaxUp in (2) can be roughly viewed as solving the inner adversarial maximization problem in (7) using a "mild", or "lightweight" optimizer that randomly draws $m$ points from $P(\cdot|x)$ and keeps the one with the largest loss. Such mild adversarial optimization increases the robustness against the random perturbation it introduces, and hence enhances the generalization performance. Adversarial ideas have also been used to improve generalization in a series of recent works (e.g., <ref type="bibr">Xie et al., 2019;</ref><ref type="bibr">Zhu et al., 2020)</ref>.</p><p>Different from our method, typical adversarial training methods, especially those based on PGD <ref type="bibr">(Madry et al., 2017)</ref>, tend to solve the adversarial optimization much more aggressively to achieve higher robustness, but at the cost of sacrificing the accuracy on clean data. A clear trade-off has been shown between the accuracy of a classifier on clean data and its robustness against adversarial attacks (see e.g., <ref type="bibr">Tsipras et al., 2019;</ref><ref type="bibr">Zhang et al., 2019;</ref><ref type="bibr">Yin et al., 2019;</ref><ref type="bibr">Schmidt et al., 2018)</ref>. 
By using a mild adversarial optimizer, MaxUp strikes a better balance between the accuracy on clean data and adversarial robustness.</p><p>Besides, MaxUp is much more computationally efficient than PGD-based adversarial training, because it does not introduce additional back-propagation steps as PGD does. In practice, MaxUp can be equipped with various complex data augmentation methods (in which case $P(\cdot|x)$ can be a discrete distribution), while PGD-based adversarial training mostly focuses on perturbations in $\ell_p$ balls.</p></div>
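<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the difference between the two inner optimizers concrete, the sketch below puts a simple projected gradient ascent over an l2 ball next to the MaxUp-style random search. It is a toy illustration under stated assumptions, not the procedure of Madry et al. (2017) or of this paper: the logistic loss with fixed weights, the normalized-gradient step, and the names <code>pgd_l2</code> and <code>random_search</code> are all hypothetical.</p>
<p><![CDATA[
```python
import numpy as np

def loss_and_grad(x, w):
    """Logistic loss for label +1 and its gradient with respect to the input x."""
    margin = x @ w
    return np.log1p(np.exp(-margin)), -w / (1.0 + np.exp(margin))

def pgd_l2(x, w, radius, step, n_steps):
    """Inner maximization over the l2 ball B(x, radius) by projected gradient ascent."""
    x_adv = x.copy()
    for _ in range(n_steps):
        _, grad = loss_and_grad(x_adv, w)      # one extra backward pass per step
        x_adv = x_adv + step * grad / (np.linalg.norm(grad) + 1e-12)
        delta = x_adv - x
        norm = np.linalg.norm(delta)
        if norm > radius:                      # project back onto the ball
            x_adv = x + delta * (radius / norm)
    return x_adv

def random_search(x, w, sigma, m, rng):
    """MaxUp-style inner step: m forward passes on random copies, keep the worst."""
    copies = x + sigma * rng.standard_normal((m, x.size))
    losses = np.log1p(np.exp(-(copies @ w)))
    return copies[np.argmax(losses)]

rng = np.random.default_rng(0)
w = np.array([2.0, -1.0])     # hypothetical fixed model weights
x = np.array([0.3, 0.2])      # hypothetical clean input
x_pgd = pgd_l2(x, w, radius=0.5, step=0.1, n_steps=10)
x_rand = random_search(x, w, sigma=0.1, m=4, rng=rng)
clean_loss, _ = loss_and_grad(x, w)
pgd_loss, _ = loss_and_grad(x_pgd, w)
rand_loss, _ = loss_and_grad(x_rand, w)
```
]]></p>
<p>The PGD routine needs one gradient evaluation with respect to the input per step, whereas the random search uses forward evaluations only, which is the source of the cost difference discussed above.</p></div>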
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Online Hard Example Mining</head><p>Online hard example mining (OHEM) <ref type="bibr">(Shrivastava et al., 2016)</ref> is a training method originally developed for region-based object detection, which improves the performance of neural networks by picking the hardest examples within mini-batches of stochastic gradient descent (SGD). It can be viewed as running SGD for minimizing the following expected loss</p><p><formula notation="tex">\min_\theta \; \mathbb{E}_{M} \left[ \max_{x \in M} L(x, \theta) \right],</formula></p><p>which amounts to randomly picking a mini-batch $M$ at each iteration and minimizing the loss of the hardest example within $M$. By doing so, OHEM can focus more on the hard examples and hence improves the performance on borderline cases. This makes OHEM particularly useful for class-imbalanced tasks, e.g. object detection <ref type="bibr">(Shrivastava et al., 2016)</ref> and person re-identification <ref type="bibr">(Luo et al., 2019)</ref>.</p><p>Different from MaxUp, the hardest examples in OHEM are selected in mini-batches consisting of independently selected examples, with no special correlation or similarity. Mathematically, it can be viewed as reweighting the data distribution to emphasize harder instances. This is substantially different from MaxUp, which is designed to enforce robustness against existing random data augmentation/perturbation schemes.</p></div>
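<div xmlns="http://www.tei-c.org/ns/1.0"><p>The structural difference between the two objectives can be seen in a few lines. In this hypothetical sketch (function names and loss values are made up for illustration), OHEM takes the maximum across unrelated examples in a mini-batch, while MaxUp takes the maximum across the augmented copies of each example and then averages over examples.</p>
<p><![CDATA[
```python
import numpy as np

def ohem_batch_loss(losses):
    """OHEM-style objective for one mini-batch: the loss of its hardest example."""
    hardest = int(np.argmax(losses))
    return hardest, float(losses[hardest])

def maxup_batch_loss(losses_per_copy):
    """MaxUp objective for one mini-batch: average over data points of the
    maximum loss among each point's own m augmented copies."""
    return float(np.mean(np.max(losses_per_copy, axis=1)))

# Hypothetical per-example losses for a batch of three examples.
batch_losses = np.array([0.2, 1.5, 0.7])
# Hypothetical losses of m = 2 augmented copies per example (rows = examples).
copy_losses = np.array([[0.2, 0.3], [1.5, 1.1], [0.7, 0.9]])

idx, ohem = ohem_batch_loss(batch_losses)
maxup = maxup_batch_loss(copy_losses)
```
]]></p></div>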
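<div xmlns="http://www.tei-c.org/ns/1.0"><table>
<row role="label"><cell>Method</cell><cell>Top-1 accuracy (%)</cell><cell>Top-5 accuracy (%)</cell></row>
<row><cell>Vanilla <ref type="bibr">(He et al., 2016b)</ref></cell><cell>76.3</cell><cell>-</cell></row>
<row><cell>Dropout <ref type="bibr">(Srivastava et al., 2014)</ref></cell><cell>76.8</cell><cell>93.4</cell></row>
<row><cell>DropPath <ref type="bibr">(Larsson et al., 2017)</ref></cell><cell>77.1</cell><cell>93.5</cell></row>
<row><cell>Manifold Mixup <ref type="bibr">(Verma et al., 2019)</ref></cell><cell>77.5</cell><cell>93.8</cell></row>
<row><cell>AutoAugment <ref type="bibr">(Cubuk et al., 2019a)</ref></cell><cell>77.6</cell><cell>93.8</cell></row>
<row><cell>Mixup <ref type="bibr">(Zhang et al., 2018)</ref></cell><cell>77.9</cell><cell>93.9</cell></row>
<row><cell>DropBlock <ref type="bibr">(Ghiasi et al., 2018)</ref></cell><cell>78.3</cell><cell>94.1</cell></row>
<row><cell>CutMix <ref type="bibr">(Yun et al., 2019)</ref></cell><cell>78.6</cell><cell>94.0</cell></row>
<row><cell>MaxUp+CutMix</cell><cell>78.9</cell><cell>94.2</cell></row>
</table><p>Table <ref type="table">1</ref>. Summary of top-1 and top-5 accuracies on the validation set of ImageNet for ResNet-50.</p></div>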
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>We test our method on both image classification and language modeling, for which a variety of strong regularization techniques and data augmentation methods have been proposed. We show that MaxUp can outperform all of these methods on the most challenging datasets (e.g. ImageNet, Penn Treebank, and Wikitext-2) and state-of-the-art models (e.g. ResNet, EfficientNet, AWD-LSTM). In addition, we apply our method to adversarial certification via Gaussian smoothing <ref type="bibr">(Cohen et al., 2019)</ref>, for which we find that MaxUp can outperform both the augmented data baseline and the PGD-based adversarial training baseline.</p><p>For all the tasks, if training from scratch, we first train the model with standard data augmentation for 5 epochs and then switch to MaxUp.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Time and Memory Cost</head><p>MaxUp only slightly increases the time and memory cost compared with standard training.</p><p>During MaxUp, we only need to find the worst instance out of the m augmented copies through forward-propagation, and then only back-propagate on the worst instance. Therefore, the additional cost of MaxUp over standard training is m forward-propagations, which introduces no significant overhead in either memory or time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">ImageNet</head><p>We evaluate MaxUp on ILSVRC2012, a subset of the ImageNet classification dataset <ref type="bibr">(Deng et al., 2009)</ref>. This dataset contains around 1.3 million training images and 50,000 validation images. We follow the standard data processing pipeline including scale and aspect ratio distortions, random crops, and horizontal flips in training. During evaluation, we only use the single-crop setting. CutMix randomly cuts and pastes patches among training images, while the ground truth labels are also mixed proportionally to the area of the patches. MaxUp+CutMix applies CutMix to one image m times (cutting different randomly sampled patches), and selects the worst case for back-propagation.</p><p>We test our method on ResNet-50, ResNet-101 <ref type="bibr">(He et al., 2016b)</ref>, as well as recent energy-efficient architectures, including ProxylessNet <ref type="bibr">(Cai et al., 2019)</ref> and EfficientNet <ref type="bibr">(Tan &amp; Le, 2019)</ref>. We resize the images to 600 &#215; 600 and 845 &#215; 845 for EfficientNet-B7 and EfficientNet-B8, respectively <ref type="bibr">(Tan &amp; Le, 2019)</ref>, for which we process the images with the data processing pipelines proposed by <ref type="bibr">Touvron et al. (2019)</ref>. For the other models, the input image size is 224 &#215; 224. To save computation resources, we only fine-tune the pre-trained models with MaxUp for a few epochs. We set m = 4 for MaxUp in the ImageNet-2012 experiments unless indicated otherwise. This means that we optimize the worst case among 4 augmented samples for each image.</p><p>For ResNet-50, ResNet-101 and ProxylessNets, we train the models for 20 epochs with learning rate $10^{-5}$ and batch size 256 on 4 GPUs. For EfficientNet, we fix the parameters in the batch normalization layers and train the other parameters with learning rate $10^{-4}$ and batch size 1000 for 5 epochs.</p></div>
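<div xmlns="http://www.tei-c.org/ns/1.0"><p>The MaxUp+CutMix step described above can be sketched as follows. This is a simplified illustration, not the CutMix reference implementation: the way the patch size and position are sampled here, the function names <code>cutmix</code> and <code>maxup_cutmix</code>, and the toy loss that stands in for a network forward pass are all assumptions.</p>
<p><![CDATA[
```python
import numpy as np

def cutmix(img_a, label_a, img_b, label_b, rng):
    """Paste a random rectangular patch of img_b into img_a and mix the one-hot
    labels in proportion to the patch area (simplified sampling, an assumption)."""
    h, w = img_a.shape[:2]
    lam = rng.uniform(0.0, 1.0)
    ph, pw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    top = rng.integers(0, h - ph + 1)
    left = rng.integers(0, w - pw + 1)
    mixed = img_a.copy()
    mixed[top:top + ph, left:left + pw] = img_b[top:top + ph, left:left + pw]
    frac_b = (ph * pw) / (h * w)
    return mixed, (1 - frac_b) * label_a + frac_b * label_b

def maxup_cutmix(img, label, other_img, other_label, loss_fn, m, rng):
    """Apply CutMix m times to the same image and return the highest-loss sample."""
    candidates = [cutmix(img, label, other_img, other_label, rng) for _ in range(m)]
    losses = [loss_fn(x, y) for x, y in candidates]   # m forward passes
    return candidates[int(np.argmax(losses))]         # back-propagate on this one

rng = np.random.default_rng(0)
img_a, img_b = np.zeros((8, 8, 3)), np.ones((8, 8, 3))
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
probs = np.array([0.9, 0.1])   # toy predictor that ignores the image (stand-in)
toy_loss = lambda x, y: float(-np.sum(y * np.log(probs)))
worst_img, worst_label = maxup_cutmix(img_a, label_a, img_b, label_b,
                                      toy_loss, m=4, rng=rng)
```
]]></p>
<p>In a real training loop <code>loss_fn</code> would be the network's forward pass with a soft-label cross-entropy, and only the returned sample would be used for the backward pass.</p></div>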
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">CIFAR-10 and CIFAR-100</head><p>We test MaxUp equipped with Cutout (DeVries &amp; Taylor, 2017) on CIFAR-10 and CIFAR-100, and denote it by MaxUp+Cutout. We evaluate our method on several neural architectures, including ResNet-110 <ref type="bibr">(He et al., 2016b)</ref>, PreAct-ResNet-110 <ref type="bibr">(He et al., 2016a)</ref> and WideResNet-28-10 <ref type="bibr">(Zagoruyko &amp; Komodakis, 2016)</ref>. We set m = 10 for WideResNet and m = 4 for the other models. We use the public code 2 and keep their hyper-parameters.</p><p>Implementation Details For CIFAR-10 and CIFAR-100, we use the standard data processing pipeline (mirror + crop) and train the model for 200 epochs. All the results reported in this section are averaged over five runs.</p><p>The learning rate starts at 0.1 and is divided by 10 after 100 and 150 epochs for ResNet-110 and PreAct-ResNet-110. For WideResNet-28-10, we follow the settings in the original paper <ref type="bibr">(Zagoruyko &amp; Komodakis, 2016)</ref>, where the learning rate is divided by 10 after 60, 120 and 180 epochs.</p><p>Weight decay is set to $2.5 \times 10^{-4}$ for all the models, and we do not use dropout.</p></div>
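<div xmlns="http://www.tei-c.org/ns/1.0"><p>The step schedule described for ResNet-110 and PreAct-ResNet-110 can be written as a small helper. The function name <code>step_lr</code>, its parameters, and the 0-indexed epoch convention are assumptions made for this sketch.</p>
<p><![CDATA[
```python
def step_lr(epoch, base_lr=0.1, milestones=(100, 150), factor=0.1):
    """Learning rate at a 0-indexed epoch: base_lr is multiplied by `factor`
    (i.e. divided by 10) once for every milestone that has been reached."""
    n_decays = sum(epoch >= ms for ms in milestones)
    return base_lr * factor ** n_decays

# Learning rate at a few epochs of a 200-epoch run.
schedule = {epoch: step_lr(epoch) for epoch in (0, 99, 100, 149, 150, 199)}
print(schedule)
```
]]></p>
<p>For the WideResNet-28-10 setting one would pass <code>milestones=(60, 120, 180)</code>; the base learning rate for that model is not stated in this section, so the default value is only a placeholder there.</p></div>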
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>The results on CIFAR-10 and CIFAR-100 are summarized in Table <ref type="table">3</ref> and <ref type="table">Table 4</ref>. We can see that the models trained using MaxUp+Cutout significantly outperform the standard Cutout in all cases.</p><p>On CIFAR-10, MaxUp improves the standard Cutout baseline from 94.84% &#177; 0.11% to 95.41% &#177; 0.08% on ResNet-110. It also improves the accuracy from 95.02% &#177; 0.15% to 95.52% &#177; 0.06% on PreAct-ResNet-110.</p><p>On CIFAR-100, MaxUp obtains improvements by a large margin. On ResNet-110 and PreAct-ResNet-110, MaxUp improves the performance of Cutout from 73.64% &#177; 0.15% and 74.37% &#177; 0.13% to 75.26% &#177; 0.21% and 75.63% &#177; 0.26%, respectively. MaxUp+Cutout also improves the standard Cutout from 81.59% &#177; 0.27% to 82.48% &#177; 0.23% on WideResNet-28-10 on CIFAR-100.</p><p>Ablation Study We test MaxUp with different sample sizes m and investigate the impact on performance on ResNet-110 (a relatively small model) and WideResNet-28-10 (a larger model).</p><p>Table <ref type="table">5</ref> shows the result when we vary the sample size in m &#8712; {1, 4, 10, 20}. Note that MaxUp reduces to the na&#239;ve data augmentation method when m = 1. As shown in Table <ref type="table">5</ref>, MaxUp with any m &gt; 1 improves the result of standard augmentation (m = 1). Setting m = 4 or m = 10 achieves the best performance on ResNet-110, and m = 10 obtains the best performance on WideResNet-28-10. We can see that the results are not sensitive once m is in a proper range (e.g., m &#8712; [4, 10]), and it is easy to outperform the standard data augmentation (m = 1) without much tuning of m. Furthermore, we suggest using a large m for large models, and a small m for relatively small models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Language Modeling</head><p>For language modeling, we test MaxUp on two benchmark datasets: Penn Treebank (PTB) and Wikitext-2 (WT2). We use the code provided by <ref type="bibr">Wang et al. (2019)</ref> as our baseline<ref type="foot">foot_1</ref>, which stacks a three-layer LSTM and implements a bag of regularization and optimization tricks for neural language modeling proposed by <ref type="bibr">Merity et al. (2018)</ref>, such as weight tying, word embedding dropout and averaged SGD.</p><p>For this task, we apply MaxUp using word embedding dropout <ref type="bibr">(Merity et al., 2018)</ref> as the random data augmentation method. Word embedding dropout implements dropout on the embedding matrix at the word level, where the dropout is broadcast across the whole embedding vector of each word. For the selected words, their embedding vectors are set to zero vectors. The other word embeddings in the vocabulary are scaled by $\frac{1}{1-p}$, where $p$ is the probability of embedding dropout.</p><p>As the word embedding layer serves as the first layer in a neural language model, we apply MaxUp in this layer. We run the forward pass m times and select the worst case for back-propagation for each given sentence. In this section, we set a small m = 2 since the models are already well-regularized by other regularization techniques.</p><p>Implementation Details The PTB corpus <ref type="bibr">(Marcus et al., 1993)</ref> is a standard dataset for benchmarking language models. It consists of 923k training, 73k validation and 82k test words. We use the processed version provided by <ref type="bibr">Mikolov et al. (2010)</ref> that is widely used for PTB.</p><p>The WT2 dataset is introduced in <ref type="bibr">Merity et al. (2018)</ref> as an alternative to PTB. 
It contains pre-processed Wikipedia articles, and the training set contains 2 million words.</p><p>The training procedure can be decoupled into two stages: 1) optimizing the model with SGD and averaged SGD (ASGD); 2) restarting ASGD for fine-tuning twice. We apply MaxUp in both stages, and report the perplexity scores at the end of the second stage. We also report the perplexity scores with a recently proposed post-processing method, dynamic evaluation <ref type="bibr">(Krause et al., 2018)</ref>, after the training process.</p><p>Results on PTB and WT2 The results on the PTB and WT2 corpora are reported in Table <ref type="table">6</ref> and <ref type="table">Table 7</ref>, respectively. We calculate the perplexity on the validation and test sets for each method to evaluate its performance. We can see that MaxUp outperforms the state-of-the-art results achieved by Frage <ref type="bibr">(Gong et al., 2018)</ref> and Mixture of SoftMax <ref type="bibr">(Yang et al., 2018)</ref>. We further compare MaxUp to the result of <ref type="bibr">Wang et al. (2019)</ref> based on AWD-LSTM <ref type="bibr">(Merity et al., 2018)</ref> at two checkpoints, with or without dynamic evaluation <ref type="bibr">(Krause et al., 2018)</ref>.</p></div>
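<div xmlns="http://www.tei-c.org/ns/1.0"><p>Word embedding dropout, and the way MaxUp selects among m dropout masks, can be sketched as below. This is an illustrative NumPy version rather than the AWD-LSTM implementation: the function names, the toy embedding matrix, and the stand-in loss (which replaces a language-model forward pass) are assumptions.</p>
<p><![CDATA[
```python
import numpy as np

def embedding_dropout(embedding, p, rng):
    """Zero whole rows (words) of the embedding matrix with probability p and
    scale the surviving rows by 1 / (1 - p)."""
    keep = rng.random(embedding.shape[0]) >= p      # one draw per word
    return embedding * keep[:, None] / (1.0 - p)

def maxup_embedding(embedding, token_ids, loss_fn, p, m, rng):
    """Draw m dropout masks, run m forward passes, and return the masked
    embedding matrix that gives the largest loss on the given sentence."""
    variants = [embedding_dropout(embedding, p, rng) for _ in range(m)]
    losses = [loss_fn(v[token_ids]) for v in variants]
    return variants[int(np.argmax(losses))]

rng = np.random.default_rng(0)
embedding = np.ones((10, 4))              # toy vocabulary of 10 words, dimension 4
token_ids = np.array([0, 3, 3, 7])        # toy sentence
toy_loss = lambda e: float(-e.sum())      # stand-in for the language-model loss
worst = maxup_embedding(embedding, token_ids, toy_loss, p=0.5, m=2, rng=rng)
```
]]></p>
<p>Because the mask is drawn per word, every occurrence of a dropped word in the sentence receives the zero vector, which matches the word-level behaviour described above.</p></div>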
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Adversarial Certification</head><p>Modern image classifiers are known to be sensitive to small, adversarially chosen perturbations of their inputs <ref type="bibr">(Goodfellow et al., 2014)</ref>. Therefore, for making high-stakes decisions, it is of critical importance to develop methods with certified robustness, which provide (high probability) provable guarantees on the correctness of the prediction subject to arbitrary attacks within a certain perturbation ball.</p><p>Recently, <ref type="bibr">Cohen et al. (2019)</ref></p><p>Training Details We applied MaxUp to Gaussian augmented data on CIFAR-10 with ResNet-110 <ref type="bibr">(He et al., 2016b)</ref>. We follow the training pipelines described in <ref type="bibr">Salman et al. (2019)</ref>. We set a batch size of 256, an initial learning rate of 0.1 which drops by a factor of 10 every 50 epochs, and train the models for 150 epochs.</p><p>Evaluation After training the smoothed classifiers, we evaluate the certified accuracy of different models under different $\ell_2$ perturbation sets. Given an input image $x$ and a perturbation region $B$, the smoothed classifier is called certifiably correct if its prediction is correct and has a guaranteed lower bound larger than 0.5 in $B$. The certified accuracy is the percentage of images that are certifiably correct. Following <ref type="bibr">Salman et al. (2019)</ref>, we calculate the certified accuracy of all the classifiers for various radii and report the best results over all of the classifiers. We use the code provided by <ref type="bibr">Cohen et al. (2019)</ref> to calculate certified accuracy. 4</p><p>Following <ref type="bibr">Salman et al. (2019)</ref>, we select the best hyperparameters with grid search. 
The only two hyperparameters of our MaxUp+Gauss are the sample size m and the variance $\sigma^2$ of the Gaussian perturbation, which we search in m &#8712; {5, 25, 50, 100, 150} and &#963; &#8712; {0.12, 0.25, 0.5, 1.0}.</p><p>In comparison, <ref type="bibr">Salman et al. (2019)</ref> requires searching a larger number of hyperparameters, including the number of steps of the PGD, the number of noise samples, the maximum $\ell_2$ perturbation, and the variance of Gaussian data augmentation during training and testing. Overall, <ref type="bibr">Salman et al. (2019)</ref> requires training and evaluating over 150 models for hyperparameter tuning, while MaxUp+Gauss requires only 20 models.</p></div>
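<div xmlns="http://www.tei-c.org/ns/1.0"><p>The count of 20 models follows directly from the size of the search grid. The short sketch below simply enumerates the two hyperparameter ranges quoted above; the dictionary layout and variable names are assumptions, and no training is performed.</p>
<p><![CDATA[
```python
from itertools import product

m_grid = (5, 25, 50, 100, 150)          # sample sizes m
sigma_grid = (0.12, 0.25, 0.5, 1.0)     # Gaussian perturbation levels sigma

# One configuration (one model to train and certify) per grid point.
configs = [{"m": m, "sigma": s} for m, s in product(m_grid, sigma_grid)]
print(len(configs))
```
]]></p></div>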
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>We show the certified accuracies on CIFAR-10 in Table <ref type="table">8</ref> under $\ell_2$ attacks for each $\ell_2$ radius. We find that MaxUp outperforms <ref type="bibr">Cohen et al. (2019)</ref> for all the $\ell_2$ radii by a large margin. For example, MaxUp can improve the certified accuracy at radius 0.25 from 60% to 74% and improve the 4% accuracy at radius 2.75 to 15%.<note place="foot" n="4"><ref type="url">https://github.com/locuslab/smoothing</ref></note></p><p>MaxUp also outperforms the PGD-based adversarial training of <ref type="bibr">Salman et al. (2019)</ref> for all the radii, boosting the accuracy from 14% to 17% at radius 2.5, and from 12% to 15% at radius 2.75.</p><p>In summary, MaxUp clearly outperforms both <ref type="bibr">Cohen et al. (2019)</ref> and <ref type="bibr">Salman et al. (2019)</ref>. MaxUp is also much faster and requires less hyperparameter tuning than <ref type="bibr">Salman et al. (2019)</ref>. Although the PGD-based method of <ref type="bibr">Salman et al. (2019)</ref> was designed to outperform the original method by <ref type="bibr">Cohen et al. (2019)</ref>, MaxUp+Gauss further improves upon <ref type="bibr">Salman et al. (2019)</ref>, likely because MaxUp with Gaussian perturbation is more compatible with the Gaussian-smoothing-based certification of <ref type="bibr">Cohen et al. (2019)</ref> than PGD adversarial optimization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this paper, we propose MaxUp, a simple and efficient training algorithm for improving generalization, especially for deep neural networks. MaxUp can be viewed as introducing a gradient-norm smoothness regularization for Gaussian perturbation, but does not require evaluating the gradient norm explicitly, and can be easily combined with any existing data augmentation method. We empirically show that MaxUp can improve the performance of data augmentation methods in image classification, language modeling, and certified defense. In particular, we achieve state-of-the-art performance on ImageNet.</p><p>For future work, we will apply MaxUp to more applications and models, such as BERT <ref type="bibr">(Devlin et al., 2019)</ref>. Furthermore, we will generalize MaxUp to apply mild adversarial optimization on feature and label spaces for other challenging tasks in machine learning, including transfer learning and semi-supervised learning.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>All the FLOPS and model sizes reported in this paper are calculated with https://pypi.org/project/ptflops.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>https://github.com/ChengyueGongR/advsoft</p></note>
		</body>
		</text>
</TEI>
