<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Taming the Noisy Gradient: Train Deep Neural Networks with Small Batch Sizes</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>08/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10108626</idno>
					<idno type="doi">10.24963/ijcai.2019/604</idno>
					<title level='j'>The Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yikai Zhang</author><author>Hui Qu</author><author>Chao Chen</author><author>Dimitris Metaxas</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Deep learning architectures are usually proposed with millions of parameters, resulting in a memory issue when training deep neural networks with stochastic gradient descent type methods using large batch sizes. However, training with small batch sizes tends to produce low quality solution due to the large variance of stochastic gradients. In this paper, we tackle this problem by proposing a new framework for training deep neural network with small batches/noisy gradient. During optimization, our method iteratively applies a proximal type regularizer to make loss function strongly convex. Such regularizer stablizes the gradient, leading to better training performance. We prove that our algorithm achieves comparable convergence rate as vanilla SGD even with small batch size. Our framework is simple to implement and can be potentially combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when batch size is small. Our implementation is available at https://github.com/huiqu18/TRAlgorithm.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Recent years have witnessed the rapid development of deep neural networks in a wide range of applications. The success of neural networks should be partially attributed to optimization algorithms such as stochastic gradient descent (SGD) and its variants <ref type="bibr">[Kingma and Ba, 2014]</ref>. These popular optimization methods, widely adopted in practice, have been proved to be extremely efficient and effective for finding the local minima of the objective function <ref type="bibr">[Bottou, 2010;</ref><ref type="bibr">Bottou, 2012;</ref><ref type="bibr">Le et al., 2011]</ref>. SGD exploits the finite sum structure of the objective function, avoids the expensive computation of exact gradient, and thus provides a feasible and efficient optimization solution in practice <ref type="bibr">[Bottou, 2012</ref>]. The stochastic nature also helps escaping saddle points, when Newton method cannot <ref type="bibr">[Ge et al., 2015;</ref><ref type="bibr">Jin et al., 2017]</ref>. It has also been shown both empirically and * equal contribution theoretically that SGD and its variants improve the generalization performance of the trained models <ref type="bibr">[Hardt et al., 2016;</ref><ref type="bibr">Chaudhari et al., 2017]</ref>.</p><p>Despite the above advantages, one limitation with SGD type methods has manifested in recent years. Calculated using partial data, the stochastic gradient tends to fluctuate <ref type="bibr">[Ruder, 2016]</ref> and deviate from the full gradient descent direction, causing deteriorated optimization efficiency. This issue becomes more serious when researchers continuously develop powerful yet large size networks to address big data challenges. With large networks, we are often forced to use small batch size due to the memory constraints. For example, in the high resolution image synthesis task <ref type="bibr">[Wang et al., 2018]</ref>, it requires a GPU of 24GB memory to train a generative adversarial network (GAN) using full resolution images with batch size 1. Such scenario demands small batch sizes, causing large variance in stochastic gradients, and thus inefficient optimization.</p><p>Variance reduction techniques have been developed to alleviate such issue <ref type="bibr">[Johnson and Zhang, 2013;</ref><ref type="bibr">Reddi et al., 2016;</ref><ref type="bibr">Lei and Jordan, 2017]</ref>. These techniques reduce the noise of stochastic gradient by progressively estimating the difference between noisy and noiseless gradient. However, these techniques require the access to gradient computed by full batch or large batch size, and thus is impractical in the limited memory/small batch size scenario.</p><p>In this paper, we propose a new optimization method to reduce the variance of stochastic gradients with small batch size. Our method, called Trajectory Regularized Algorithm (TRAlgo), computes each stochastic gradient by locally adding a proximal type regularizer. The locally regularized loss function becomes strongly convex and can be solved efficiently <ref type="bibr">[Rakhlin et al., 2012]</ref>. The local regularizer stabilizes the stochastic gradient, and thus reduces its variance. We prove that with small batch size, our method can achieve a much better stochastic gradient variance compared with SGD, and thus a better convergence rate. Our algorithm is fairly general and in principle can be combined with many existing SGD type methods. It opens the door toward novel optimization techniques that uniquely suit the big data world. In summary, our main contributions include:</p><p>&#8226; We propose a new algorithm using proximal regularizer to address the issue of noisy stochastic gradients, due to small batch size.</p><p>&#8226; We prove that our algorithm converges no slower than SGD. Furthermore, under the constraint of small batch size, our method reduces the variance of stochastic gradients, and thus has a better convergence rate than SGD. &#8226; We combine our algorithm with various state-of-the-art optimization algorithms (vanilla SGD and Adam). We empirically show that the proposed method improves the optimization performance when training deep neural networks with small batch sizes.</p><p>Note that the power of a proximal type regularizer has been exploited <ref type="bibr">[Allen-Zhu, 2018;</ref><ref type="bibr">Lin et al., 2015]</ref>. However, existing works only focus on the optimization efficiency. To the best of our knowledge, TRAlgo is the first to use such technique to address noisy gradients, under the small batch size constraint.</p><p>The remainder of this paper is organized as follows. We introduce our notation and TRAlgo in Section 2. In Section 3, we combine TRAlgo with SGD (called TRSGD) and analyze the gradient complexity of both TRSGD and SGD. Experimental results are presented in Section 4. Finally, we conclude our work in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminary</head><p>We introduce the notations, assumptions, as well as our proposed TRAlgo framework. Throughout this paper, we denote by f (w) = 1 n n i=1 f i (w) the loss function, where w &#8712; R d represents parameters in the optimization problem. We use</p><p>&#8226; to denote the L 2 norm. In this paper, we only consider convergence to a first order stationary point : &#8711;f (x) 2 &#8804; .</p><p>The following assumptions are widely adopted in the analysis of convex and non-convex optimization algorithms. They regulate the behavior of zero-th order, first order and second order derivatives of f (x). Assumption 1. We assume following conditions hold:</p><p>In following assumption we apply &#948; to describe how 'non-</p><p>In general, &#948; is uniformly bounded by Hessian upper bound and could be equivalent to in the worst case. We can view &#948; as the minimum value of &#955; so that f (w)+ &#955; 2 w 2 is convex. Assumption 3 (Variance bounded SGD). We assume function f (x) is &#963; variance bounded:</p><p>The last assumption regulates the variance of stochastic gradient which is common in analyzing the convergence behavior of Stochastic Gradient Descent.</p><p>Algorithm 1 presents our Trajectory Regularized framework. It run for T iterations. At iteration t, we will construct Algorithm 1 TRAlgo (T, &#955;, w 0 , S)</p><p>Run stochastic gradient type method with decreasing step size on G t (w) for S iterations to compute w t+1 5: Option I: 6: Pick &#373; = w T 7: return &#373;</p><p>In practice 8: Option II: 9: Pick &#373; uniformly randomly from w 1 , .., w T 10: return &#373;</p><p>For analysis an auxiliary function G t (w) by adding a regularization term to f (x), and run stochastic gradient type optimization algorithm on G t (w) for S iterations. The result w after S steps is the next gradient step, w t+1 . After T iterations, depending on the context, we may return w T as the final output (in practice), or return a random sample from all previous w * (for analysis).</p><p>The key idea of TRAlgo is to apply a proximal type regularizer &#955; 2 w -w t 2 with &#955; &#8805; 2&#948; to make the loss function f (x) strongly convex. The regularizer stablizes the gradient and thus reduces its variance, as we will prove in the next section. To avoid introducing bias as a fixed regularizer does <ref type="bibr">[Hoerl and Kennard, 1970]</ref>, the regularizer is locally adaptive (centered at current w t ). Quadratic proximal regularizer has been used before <ref type="bibr">[Parikh et al., 2014]</ref>, but only for better optimization of non smooth L 1 loss function. Instead, TRAlgo uses the it for a trajectory regularization purpose. In TRAlgo, the choice of the stochastic gradient method in each iteration is flexible. One can apply a vanilla SGD or its variants: Adam <ref type="bibr">[Kingma and Ba, 2014]</ref> and AMSGrad <ref type="bibr">[Reddi et al., 2018]</ref>. In next section, we use SGD as an example to analyze the benefits of TRAlgo, while in Section 4, we empirically apply TRAlgo to both SGD and Adam.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Analysis</head><p>In this section we incorporate SGD into our framework TRAlgo (called TRSGD) and compare it with Vanilla SGD. In Section 3.1 we review SGD and restate some known results of SGD for a direct comparison with analysis for TRSGD. In Section 3.3 we formally define TRSGD and analyze its convergence rate using batch size 1. While TRSGD achieves comparable gradient complexity with Vanilla SGD in general it attains a better convergence rate when batch size is constrained. The analysis focuses on quantifying how number of iterations S run on auxiliary function G w (w) affects the convergence. Our analysis demonstrates even when batch size is limited, TRSGD could still control variance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">SGD</head><p>In Algorithm 2 we describe the Stochastic Gradient Descent method. We use I k &#8834; [n] to denote the random sampled </p><p>For practice 7: Option II: 8: Pick &#373; uniformly randomly from w 1 , .., w S 9: return &#373; For analysis small batch and &#8711;f</p><p>) is known as noise in stochastic gradient. Next we introduce some well known results about the convergence behavior of SGD and demonstrate how the noise affects the convergence. One can find similar statements in <ref type="bibr">[Allen-Zhu, 2018;</ref><ref type="bibr">Ge et al., 2015;</ref><ref type="bibr">Reddi et al., 2016]</ref>.</p><p>Lemma 1. Under assumptions 1 and 3, the iterative update of SGD with step size &#951; &#8804; 1 2 and batch size M satisfies the following inequality:</p><p>Recall that the SGD update rule is w t+1 = w t -&#951;(&#8711;f (w t ) + &#963; t ) where &#963; t represents noise in the Stochastic Gradient with</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Intuition: Noisy Gradient Slows Down Convergence</head><p>The left hand side of Eq. ( <ref type="formula">1</ref>) is the improvement made in each iteration in expectation and the right hand side gives a guarantee on the progress for iteration t. One can observe that large variance will slow down the improvement brought by gradient. To alleviate this issue, one may increase batch size M . However, in our setting, when M is restricted to small, we cannot mitigate the negative impact of variance. Such phenomenon is shown in Figure <ref type="figure">1</ref>. By telescoping Lemma 1 and using the fact that f (x) -f (y) &#8804; 2B, we have the convergence rate of SGD. Theorem 1. Under assumptions 1 and 3, for SGD with step size &#951; = 1/(2 ), batch size M , and any T , it holds that: In particular, denote by N = T M the total number of gradient calls applied in the algorithm, picking T = 4 &#8730; BN /&#963;, the gradient complexity to reach E &#8711;f (w) 2 &#8804; is:</p><p>Above Theorem suggests that if picking the output of SGD uniformly randomly over w 1:T , one can achieve</p><p>. One could increase training iteration T and batch size M so that the expected gradient squared norm meets the first order stationary point criterion. An optimal choice balancing ( B) /T and &#963; 2 T /N would be T = 4 &#8730; BN /&#963;. Although the above statements reveal the power of SGD in efficiency, it is not realistic to set M = &#8486;( &#8730; N ) in training large size deep neural network. Indeed, in the practical setting we are addressing, M may be a small constant (say 1). The right hand side of Eq. ( <ref type="formula">2</ref>) can be arbitrarily bad with noisy gradients. In next section, we will show how TRAlgo addresses the issue and still keep a tight bound even with M = 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">TRSGD</head><p>Our method TRAlgo requires the choice of a base optimization algorithm. In this section, we instantiate our method using the Vanilla SGD. We analyize the convergence performance of this instantiation, called TRSGD. Definition 1. TRSGD is defined as applying SGD in TRAlgo:</p><p>While the advantage of decreasing stepsize (usually set to be 1/k) for SGD type methods has been well studied in strongly convex case <ref type="bibr">[Rakhlin et al., 2012;</ref><ref type="bibr">Shamir and Zhang, 2013]</ref>, it remains unclear how to properly apply such strategy in non-convex stochastic optimization. In the rest of this section we show one can benefit from such strategy due to the convexity of auxiliary function G t (w).</p><p>First we state a technical lemma which follows <ref type="bibr">[Rakhlin et al., 2012]</ref>. We use the following lemma to quantify the progress made by running SGD for S iterations with decreasing step size. Lemma 2. If one runs SGD on G t (w) = f (w)+ size 1, one have</p><p>where w * t+1 = arg min w {G t (w)}. Now we are ready to prove the key lemma which describes one round progress of TRSGD. One can make a direct comparison between the following lemma and Lemma 1 by setting &#951; = 1/(2 ) in Eq. ( <ref type="formula">1</ref>). Lemma 3. Under assumptions 1, 2 and 3, the iterative update of TRSGD with &#955; = 2 and batch size 1 satisfies the following inequality:</p><p>2 we can derive (3).</p><p>While in Eq. ( <ref type="formula">1</ref>) the issue of variance could be alleviated by increasing the batch size M for SGD, one can alleviate such issue by increasing the number of iterations S in Eq. (3) for TRSGD. One can set S = &#920;(M ) so both algorithms have a comparable one round convergence progress. However, in practice the choice of M is limited due to the memory restriction but the choice of S is not. We choose batch size to be 1 for simplicity and the analysis can be generalized to any batch size.</p><p>We finish this section by telescoping Eq. ( <ref type="formula">3</ref>) to derive the following theorem. The inequality bounds the expected squared norm of gradient of TRSGD, similar to Theorem 1. Theorem 2. Under assumptions 1, 2 and 3, the TRSGD with batch size 1 , &#955; = 2 achieves the following convergence rate:</p><p>In particular, let N = T S be the total number of gradient call applied in the algorithm, by picking</p><p>By setting the batch size to be 1, the gradient complexity of TRSGD is comparable to SGD in a rate of N = O 1/ 2 to meet the criterion E &#8711;f (w t ) 2 &#8804; .</p><p>Interpretation of theorems. We can now explicitly compare Eq. ( <ref type="formula">4</ref>) and Eq. ( <ref type="formula">2</ref>). While the function loss term B/T appear in both inequality the &#963; 2 term appears in different fashion due to their strategies in monitoring variance. When gradient is associated with big variance, SGD increases batch size M and TRSGD runs more iteration on auxiliary function G t (w). The strategy of TRSGD is not restricted by memory which is adorable in training huge network. The choice of S depends on the batch size restriction. In the case where one only affords to set M = 1, one can take S = &#920;( &#8730; N ) to optimally balance B/T and ( 2 (&#963; 2 + g 2 )T )/N in the gradient complexity. On another hand, when it is possible to set M = &#920;( &#8730; N ), the TRAlgo framework will not be the first choice as there is no need for variance control.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>In this section we apply the proposed framework to momentum SGD and Adam algorithms, called TRSGD and TRAdam, respectively, and empirically illustrate the effectiveness of our methods when training deep neural networks with small batch sizes. We first compare the performance of different algorithms on image classification tasks on CI-FAR10 and CIFAR100 datasets, and then show the application of our framework in the training of GANs using an image to image translation task on the Cityscapes dataset <ref type="bibr">[Cordts et al., 2016]</ref>. We also perform ablation study on the batch size and iteration number S on TRSGD.  <ref type="bibr">[Huang et al., 2017]</ref>. Four algorithms (TRSGD, SGD, TRAdam, Adam) are used to train these networks on CI-FAR10 and CIFAR100. In all experiments, the number of epoch is 200 and the batch size is 8. For TRSGD and SGD, the momentum is 0.9, the learning rate is initially set to 0.1 and decayed to 0.01, 0.001 at epoch 100 and 150, respectively. For TRAdam and Adam, &#946; 1 = 0.9, &#946; 2 = 0.999, the learning rate is initially set to 0.001 and decayed to 0.0001, 0.00001 at epoch 100 and 150, respectively. For both TRSGD and TRAdam, &#955; = 0.1 and S = 10.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Implementation Details</head><p>Image to image translation using GAN. In this task we follow the network structure and parameter settings of <ref type="bibr">[Isola et al., 2017]</ref> to train a model that generates images from semantic labels on Cityscapes dataset. The only difference is that we replace the Adam optimizer with our TRAdam algorithm. In TRAdam, &#955; = 0.1, S = 10 and the other parameters are the same as Adam.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Results</head><p>Image classification. The train and test curves of SGD vs. TRSGD, and Adam vs. TRAdam using a batch size of 8 are shown in Fig. <ref type="figure">2</ref> and Fig. <ref type="figure">3</ref>, respectively. Only results of ResNet101 are presented due to the space limitation. We can observe that our TRSGD and TRAdam achieve faster convergence and better accuracies compared to SGD and Adam on both datasets. The test accuracies of all models are reported in Table <ref type="table">1</ref>. It is evident that our TRSGD and TRAdam outperform SGD and Adam in all cases. Because of the large gradient variance when the batch size is small, SGD and Adam are not able to obtain better performance with deeper networks. However, our algorithms can handle the noisy gradients by the regularizer, thus achieving higher accuracies using ResNet101 and DenseNet121, especially on CIFAR100 dataset.</p><p>Image to image translation using GAN. To evaluate the image quality generated by the model, we use the "FCNscore" as in <ref type="bibr">[Isola et al., 2017]</ref>. It evaluates how realistic the generated images are based on the semantic segmentation results using a fully convolutional neural network. The pre-trained model predicts labels from the synthesized images. Then the labels are compared with the ground-truth labels those images were synthesized from. We adopt two pre-trained semantic segmentation models to com-  pute "FCN-score": FCN-8s <ref type="bibr">[Long et al., 2015]</ref> and DPC-xception71 <ref type="bibr">[Isola et al., 2017]</ref>. FCN-8s is used in <ref type="bibr">Isola et al. [2017]</ref> while DPC-xception71 has the state-of-the-art performance on the Cityscapes dataset thus is more reliable than FCN-8s to compute the score. The evaluation results using two models are reported in Table <ref type="table">2</ref> and<ref type="table">Table 3</ref>. For the results evaluated by FCN-8, TRAdam is only slightly better in the per-pixel accuracy compared to Adam. However, TRAdam outperforms Adam in all three metrics when evaluated by DPC-xception71. The ground-truth metrics (GT in the tables) using original images for segmentation are also provided for reference. When computing the ground-truth values, we use the resized 256&#215;256 version of original images because the resolution of synthesized images is 256&#215;256. Some synthesized images are presented in Fig. <ref type="figure">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Ablation Study</head><p>In this section we explore the effect of batch size and inner iteration number S on our TRSGD algorithm. ResNet18 and CIFAR10 dataset are used in all experiments of this section.    the variance of stochastic gradients. In Table <ref type="table">5</ref> we test the performance of TRSGD under different values of S. The performance of TRSGD is not sensitive with regard to the value of S, One can also observe that when S = 50, the accuracy becomes slightly worse than other cases. This is because when gradient complexity is fixed, large S will results in small T (number of total iterations of TRSGD) thus a under fitting phenomenon may happen, as shown in Theorem 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper we present a Trajectory Regularization framework to train deep neural network with noisy gradients. We combine our framework with SGD and show it relaxes the batch size constraint of SGD, while achieving a comparable gradient call complexity. Our theoretical analysis is supported by the empirical evidence. Besides, we illustrate that our framework can be applied to improve the performance of a popular SGD type algorithm Adam in the small batch size setting, on both image classification tasks and image-toimage translation task using GAN. The proposed framework may be applied to other GAN-related tasks because of the large memory requirements of these tasks. We leave this for the future work.</p></div></body>
		</text>
</TEI>
