<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-update SGD</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>April 2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10137586</idno>
					<idno type="doi"></idno>
					<title level='j'>Systems and Machine Learning (SysML) Conference</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Jianyu Wang</author><author>Gauri Joshi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Large-scale machine learning training, in particular, distributed stochastic gradient descent, needs to be robust to inherent system variability such as node straggling and random communication delays. This work considers a distributed training framework where each worker node is allowed to perform local model updates and the resulting models are averaged periodically. We analyze the true speed of error convergence with respect to wall-clock time (instead of the number of iterations) and analyze how it is affected by the frequency of averaging. The main contribution is the design of ADACOMM, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor. Rigorous experiments on training deep neural networks show that ADACOMM can take 3x less time than fully synchronous SGD and still reach the same final training loss.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Stochastic gradient descent (SGD) is the backbone of state-of-the-art supervised learning, which is revolutionizing inference and decision-making in many diverse applications. Classical SGD was designed to be run on a single computing node, and its error-convergence with respect to the number of iterations has been extensively analyzed and improved via accelerated SGD methods. Due to the massive training datasets and neural network architectures used today, it has become imperative to design distributed SGD implementations, where gradient computation and aggregation are parallelized across multiple worker nodes. Although parallelism boosts the amount of data processed per iteration, it exposes SGD to unpredictable node slowdowns and communication delays stemming from variability in the computing infrastructure. Thus, there is a critical need to make distributed SGD fast, yet robust to system variability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Need to Optimize Convergence in terms of Error versus Wall-clock Time.</head><p>The convergence speed of distributed SGD is a product of two factors: 1) the error in the trained model versus the number of iterations, and 2) the number of iterations completed per second. Traditional single-node SGD analysis focuses on optimizing the first factor, because the second factor is generally a constant when SGD is run on a single dedicated server. In distributed SGD, which is often run on shared cloud infrastructure, the second factor depends on several aspects such as the number of worker nodes, their local computation and communication delays, and the protocol (synchronous, asynchronous or periodic) used to aggregate their gradients. Hence, in order to achieve the fastest convergence speed we need: 1) optimization techniques (e.g., variable learning rate) to maximize the error-convergence rate with respect to the number of iterations, and 2) scheduling techniques (e.g., straggler mitigation, infrequent communication) to maximize the number of iterations completed per second. These directions are inter-dependent and need to be explored together rather than in isolation. While many works have advanced the first direction, the second is less explored from a theoretical point of view, and the juxtaposition of both is an unexplored problem.</p></div>
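To make the two-factor decomposition above concrete, the following stylized Python sketch evaluates error versus wall-clock time as the composition of an error-versus-iterations curve and a throughput. The curve shape, rates, and times are hypothetical illustrations, not results from this paper:

```python
import math

def error_after_iters(k, rate=1.0, floor=0.0):
    # Stylized error-vs-iterations curve: O(1/sqrt(k)) decay toward a floor.
    return rate / math.sqrt(k) + floor

def error_at_time(t_wall, sec_per_iter, rate=1.0, floor=0.0):
    # True convergence: error as a function of wall-clock time, which folds in
    # the second factor (iterations completed per second).
    iters = max(1, int(t_wall / sec_per_iter))
    return error_after_iters(iters, rate, floor)

# Method A: better per-iteration convergence but slow iterations.
# Method B: worse per-iteration convergence but 4x faster iterations.
t = 100.0  # seconds of training
err_a = error_at_time(t, sec_per_iter=1.0, rate=1.0)
err_b = error_at_time(t, sec_per_iter=0.25, rate=1.5)
assert err_b < err_a  # B wins in wall-clock time despite a worse per-iteration rate
```

This is exactly why the two directions must be studied jointly: optimizing either factor in isolation can pick the slower method in wall-clock time.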
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Local-Update SGD to Reduce Communication Delays.</head><p>A popular distributed SGD implementation is the parameter server framework <ref type="bibr">(Dean et al., 2012;</ref><ref type="bibr">Cui et al., 2014;</ref><ref type="bibr">Li et al., 2014;</ref><ref type="bibr">Gupta et al., 2016;</ref><ref type="bibr">Mitliagkas et al., 2016)</ref>, where in each iteration worker nodes compute gradients on one mini-batch of data and a central parameter server aggregates these gradients (synchronously or asynchronously) and updates the parameter vector x. The constant communication between the parameter server and worker nodes in each iteration can be expensive and slow in bandwidth-limited computing environments. Recently proposed distributed SGD frameworks such as Elastic-Averaging <ref type="bibr">(Zhang et al., 2015;</ref><ref type="bibr">Chaudhari et al., 2017)</ref>, Federated Learning <ref type="bibr">(McMahan et al., 2016;</ref><ref type="bibr">Smith et al., 2017b)</ref> and decentralized SGD <ref type="bibr">(Lian et al., 2017;</ref><ref type="bibr">Jiang et al., 2017)</ref> save this communication cost by allowing worker nodes to perform local updates to the parameter x instead of just computing gradients. The resulting locally trained models (which differ due to the variability of training data across nodes) are periodically averaged through a central server, or via direct inter-worker communication. This local-update strategy has been shown to offer significant speedup in deep neural network training <ref type="bibr">(Lian et al., 2017;</ref><ref type="bibr">McMahan et al., 2016)</ref>.</p><p>Figure <ref type="figure">1</ref>. This work departs from the traditional view of considering error-convergence with respect to the number of iterations, and instead considers the true convergence in terms of error versus wall-clock time. Adaptive strategies that start with infrequent model-averaging and increase the communication frequency can achieve the best error-runtime trade-off.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Error-Runtime Trade-offs in Local-Update SGD.</head><p>While local updates reduce the communication delay incurred per iteration, discrepancies between the local models can result in inferior error-convergence. For example, consider periodic-averaging SGD (PASGD), where each of m worker nodes makes &#964; local updates, and the resulting models are averaged after every &#964; iterations <ref type="bibr">(Moritz et al., 2015;</ref><ref type="bibr">Su &amp; Chen, 2015;</ref><ref type="bibr">Chen &amp; Huo, 2016;</ref><ref type="bibr">Seide &amp; Agarwal, 2016;</ref><ref type="bibr">Zhang et al., 2016;</ref><ref type="bibr">Zhou &amp; Cong, 2017;</ref><ref type="bibr">Lin et al., 2018)</ref>. A larger value of &#964; leads to slower convergence with respect to the number of iterations, as illustrated in Figure <ref type="figure">1</ref>. 
However, if we look at the true convergence with respect to wall-clock time, then a larger &#964;, that is, less frequent averaging, saves communication delay and reduces the runtime per iteration. While some recent theoretical works <ref type="bibr">(Zhou &amp; Cong, 2017;</ref><ref type="bibr">Yu et al., 2018;</ref><ref type="bibr">Wang &amp; Joshi, 2018;</ref><ref type="bibr">Stich, 2018)</ref> study the dependence of the error-convergence on &#964; with respect to the number of iterations, achieving a provably-optimal speed-up in the true convergence with respect to wall-clock time is an open problem that we aim to address in this work.</p><p>Need for Adaptive Communication Strategies. In the error-runtime trade-off in Figure <ref type="figure">1</ref>, we observe a trade-off between the convergence speed and the error floor as the number of local updates &#964; is varied. A larger &#964; gives a faster initial drop in the training loss but results in a higher error floor. This calls for adaptive communication strategies that start with a larger &#964; and gradually decrease it as the model approaches convergence. Such an adaptive strategy offers a win-win in the error-runtime trade-off by achieving fast convergence as well as a low error floor. To the best of our knowledge, this is the first work to propose an adaptive communication frequency strategy.</p><p>Main Contributions. This paper focuses on periodic-averaging local-update SGD (PASGD) and makes the following main contributions:</p><p>1. We first analyze the runtime per iteration of periodic-averaging SGD (PASGD) by modeling the local computing times and communication delays as random variables, and quantify its runtime speed-up over fully synchronous SGD. A novel insight from this analysis is that the periodic-averaging strategy not only reduces the communication delay but also mitigates synchronization delays due to waiting for slow or straggling nodes.</p><p>2. 
By combining the runtime analysis with the error-convergence analysis of PASGD <ref type="bibr">(Wang &amp; Joshi, 2018)</ref>, we obtain the error-runtime trade-off for different values of &#964;. Using this combined error-runtime trade-off, we derive an expression for the optimal communication period, which can serve as a useful guideline in practice.</p><p>3. Based on these insights, we propose ADACOMM, an adaptive communication strategy that starts with infrequent averaging to improve convergence speed and then increases the communication frequency in order to achieve a low error floor.</p><p>4. We present a convergence analysis for PASGD with variable communication period &#964; and variable learning rate &#951;, generalizing previous work <ref type="bibr">(Wang &amp; Joshi, 2018)</ref>. This analysis shows that decaying &#964; provides convergence benefits similar to those of decaying the learning rate, the difference being that varying &#964; improves the true convergence with respect to wall-clock time. Adaptive communication can also be used in conjunction with existing learning rate schedules.</p><p>Although we focus on periodic simple-averaging of local models, the insights on error-runtime trade-offs and adaptive communication strategies directly extend to other communication-efficient SGD algorithms, including Federated Learning <ref type="bibr">(McMahan et al., 2016)</ref>, Elastic-Averaging <ref type="bibr">(Zhang et al., 2015)</ref> and decentralized averaging <ref type="bibr">(Jiang et al., 2017;</ref><ref type="bibr">Lian et al., 2017)</ref>, as well as synchronous/asynchronous distributed SGD with a central parameter server <ref type="bibr">(Dean et al., 2012;</ref><ref type="bibr">Cui et al., 2014;</ref><ref type="bibr">Dutta et al., 2018)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PROBLEM FRAMEWORK</head><p>Empirical Risk Minimization via Mini-batch SGD. Our objective is to minimize an objective function F(x), the empirical risk function, with respect to model parameters denoted by x &#8712; R^d. The training dataset is denoted by S = {s_1, . . . , s_N}, where s_i represents the i-th labeled data point. 
The objective function can be expressed as the empirical risk calculated using the training data and is given by</p><p>F(x) = (1/N) &#8721;_{i=1}^{N} f(x; s_i),</p><p>where f(x; s_i) is the composite loss function at the i-th data point. In classic mini-batch stochastic gradient descent (SGD) <ref type="bibr">(Dekel et al., 2012)</ref>, updates to the parameter vector x are performed as follows. If &#958;_k &#8834; S represents a randomly sampled mini-batch, then the update rule is</p><p>x_{k+1} = x_k &#8722; &#951; g(x_k; &#958;_k),</p><p>where &#951; denotes the learning rate and the stochastic gradient is defined as g(x; &#958;) = (1/|&#958;|) &#8721;_{s_i &#8712; &#958;} &#8711;f(x; s_i). For simplicity, we will write g(x_k) instead of g(x_k; &#958;_k) in the rest of the paper. A complete review of the convergence properties of serial SGD can be found in <ref type="bibr">(Bottou et al., 2018)</ref>.</p><p>Periodic-Averaging SGD (PASGD). We consider a distributed SGD framework with m worker nodes, where all workers can communicate with one another via a central server or via direct inter-worker communication. In periodic-averaging SGD, all workers start at the same initial point x_1. Each worker performs &#964; local mini-batch SGD updates according to (2), after which the local models are averaged by a fusion node or by performing an all-node broadcast. The workers then update their local models with the averaged model, as illustrated in Figure <ref type="figure">2</ref>. Thus, the overall update rule at the i-th worker is</p><p>x^{(i)}_{k+1} = (1/m) &#8721;_{j=1}^{m} [x^{(j)}_k &#8722; &#951; g(x^{(j)}_k)] if k mod &#964; = 0, and x^{(i)}_{k+1} = x^{(i)}_k &#8722; &#951; g(x^{(i)}_k) otherwise,</p><p>where x^{(i)}_k denotes the model parameters at the i-th worker after k iterations and &#964; is defined as the communication period. Note that the iteration index k counts local iterations, not the number of averaging steps.</p><p>Special Case (&#964; = 1): Fully Synchronous SGD. 
When &#964; = 1, that is, when the local models are synchronized after every iteration, periodic-averaging SGD is equivalent to fully synchronous SGD, which has the update rule</p><p>x_{k+1} = x_k &#8722; (&#951;/m) &#8721;_{i=1}^{m} g(x_k; &#958;^{(i)}_k).</p><p>The analysis of fully synchronous SGD is identical to that of serial SGD with an m-fold larger mini-batch size.</p><p>Local Computation Times and Communication Delay. In order to analyze the effect of &#964; on the expected runtime per iteration, we consider the following delay model. The time taken by the i-th worker to compute a mini-batch gradient at the k-th local step is modeled as a random variable Y_{i,k} &#8764; F_Y, assumed to be i.i.d. across workers and mini-batches. The communication delay is a random variable D for each all-node broadcast, as illustrated in Figure <ref type="figure">3</ref>. The value of the random variable D can depend on the number of workers as</p><p>D = D_0 s(m),</p><p>where D_0 represents the time taken for each inter-node communication, and s(m) describes how the delay scales with the number of workers, which depends on the implementation and system characteristics. For example, in the parameter server framework, the communication delay can be proportional to 2 log_2(m) by exploiting a reduction tree structure <ref type="bibr">(Iandola et al., 2016)</ref>. We assume that s(m) is known beforehand for the communication-efficient distributed SGD framework under consideration.</p><p>Convergence Criteria. In the error-convergence analysis, since the objective function is non-convex, we use the expected gradient norm as an indicator of convergence, following <ref type="bibr">(Ghadimi &amp; Lan, 2013;</ref><ref type="bibr">Bottou et al., 2018)</ref>. We say the algorithm achieves an &#949;-suboptimal solution if</p><p>E[min_{k &#8712; [1, K]} ||&#8711;F(x_k)||^2] &#8804; &#949;.</p><p>When &#949; is arbitrarily small, this condition guarantees that the algorithm converges to a stationary point.</p></div>
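As a toy illustration of the PASGD procedure described above (&#964; local SGD steps per worker followed by model averaging), the following minimal Python sketch runs m workers on a 1-D quadratic objective. The objective, noise model, and all constants are hypothetical stand-ins; the paper's experiments use deep neural networks:

```python
import random

def pasgd(m=4, tau=5, rounds=40, eta=0.1, noise=0.3, seed=0):
    """Toy PASGD: m workers, tau local SGD updates each, then averaging.
    Objective F(x) = 0.5*(x - 1)^2 with additive gradient noise (hypothetical)."""
    rng = random.Random(seed)
    x = [0.0] * m                       # all workers start at the same point x_1
    for _ in range(rounds):
        for i in range(m):              # tau local mini-batch SGD updates
            for _ in range(tau):
                g = (x[i] - 1.0) + rng.gauss(0.0, noise)  # stochastic gradient
                x[i] -= eta * g
        x = [sum(x) / m] * m            # averaging step when k mod tau == 0
    return x[0]

assert abs(pasgd() - 1.0) < 0.3         # ends near the minimizer x* = 1
```

Setting `tau=1` recovers the fully synchronous special case, in which the models never diverge between averaging steps.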
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">JOINTLY ANALYZING RUNTIME AND ERROR-CONVERGENCE</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Runtime Analysis</head><p>We now compare the runtime per iteration of periodic-averaging SGD with that of fully synchronous SGD to illustrate how increasing &#964; can lead to a large runtime speed-up. Another interesting effect of performing more local updates &#964; is that it mitigates the slowdown due to straggling worker nodes.</p><p>Runtime Per Iteration of Fully Synchronous SGD. Fully synchronous SGD is equivalent to periodic-averaging SGD with &#964; = 1. Each of the m workers computes the gradient of one mini-batch and updates the parameter vector x, which takes time Y_{i,1} at the i-th worker 1 . After all workers finish their local updates, an all-node broadcast is performed to synchronize and average the models. Thus, the total time to complete each iteration is given by</p><p>T_sync = Y_{m:m} + D,</p><p>where the Y_{i,1} are i.i.d. random variables with probability distribution F_Y and D is the communication delay. The term Y_{m:m} = max_{1 &#8804; i &#8804; m} Y_{i,1} denotes the highest order statistic of m i.i.d. random variables <ref type="bibr">(David &amp; Nagaraja, 2003)</ref>.</p><p>Runtime Per Iteration of Periodic-Averaging SGD (PASGD). In periodic-averaging SGD, each worker performs &#964; local updates before communicating with other workers. Let us denote the average local computation time at the i-th worker by</p><p>&#562;_i = (1/&#964;) &#8721;_{k=1}^{&#964;} Y_{i,k}.</p><p>Since the communication delay D is amortized over &#964; iterations, the average runtime per iteration is</p><p>T_{P-Avg} = &#562;_{m:m} + D/&#964;.</p><p>The value of the first term &#562;_{m:m}, and how it compares with Y_{m:m}, depends on the probability distribution F_Y of Y.</p><p>1 Instead of local updates, typical implementations of fully synchronous SGD have a central server that performs the update. Here we compare PASGD with fully synchronous SGD without a central parameter server.</p></div>
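The delay model above (every iteration of fully synchronous SGD waits for the slowest worker, while PASGD amortizes the broadcast delay D over &#964; iterations and averages each worker's computation times) can be compared with a small Monte Carlo sketch. The distribution choice and all constants are hypothetical:

```python
import random

def mean_runtime_per_iter(m=8, tau=10, y_mean=1.0, delay=1.0, trials=2000, seed=1):
    """Monte Carlo sketch of the runtime model: T_sync = max_i Y_i + D versus
    T_pavg = max_i Ybar_i + D/tau, with Y ~ Exp(1/y_mean). Constants hypothetical."""
    rng = random.Random(seed)
    t_sync = t_pavg = 0.0
    for _ in range(trials):
        # Fully synchronous SGD: wait for the slowest of m workers, then broadcast.
        t_sync += max(rng.expovariate(1.0 / y_mean) for _ in range(m)) + delay
        # PASGD: each worker contributes its average over tau local computation
        # times (lower variance), and the delay D is amortized over tau iterations.
        ybars = [sum(rng.expovariate(1.0 / y_mean) for _ in range(tau)) / tau
                 for _ in range(m)]
        t_pavg += max(ybars) + delay / tau
    return t_sync / trials, t_pavg / trials

sync, pavg = mean_runtime_per_iter()
assert pavg < sync  # PASGD has a smaller expected runtime per iteration
```

The gap comes from both effects discussed in this section: the amortized communication delay and the lighter tail of the maximum over the averaged (lower-variance) computation times.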
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Runtime Benefits of the Periodic Averaging Strategy</head><p>Speed-up over fully synchronous SGD. We evaluate the speed-up of periodic-averaging SGD over fully synchronous SGD for different Y and D to demonstrate how the relative value of computation versus communication delays affects the speed-up. Consider the simplest case where Y and D are constants, and define &#945; = D/Y, the communication/computation ratio. Besides systems aspects such as network bandwidth and computing capacity, for deep neural network training this ratio &#945; also depends on the size of the neural network model and the mini-batch size. See Figure <ref type="figure">8</ref> for a comparison of the communication/computation delays of common deep neural network architectures. In this case Y, Y_{m:m} and &#562;_{m:m} are all equal to Y, and the ratio of E[T_sync] to E[T_{P-Avg}] is</p><p>E[T_sync] / E[T_{P-Avg}] = (Y + D) / (Y + D/&#964;) = (1 + &#945;) / (1 + &#945;/&#964;).</p><p>Figure <ref type="figure">4</ref> shows the speed-up for different values of &#945; and &#964;. When D is comparable with Y (&#945; = 0.9), periodic-averaging SGD (PASGD) can be almost twice as fast as fully synchronous SGD.</p><p>Straggler Mitigation due to Local Updates. Suppose that Y is exponentially distributed with mean y and variance y^2. For fully synchronous SGD, the term E[Y_{m:m}] in (8) is equal to y &#8721;_{i=1}^{m} 1/i, which is approximately y log m. Thus, the expected runtime per iteration of fully synchronous SGD (8) increases logarithmically with the number of workers m. Let us compare this with the scaling of the runtime of periodic-averaging SGD (11). Here, &#562; in (9) is an Erlang random variable with mean y and variance y^2/&#964;. Since this variance is &#964; times smaller than that of Y, the maximum order statistic &#562;_{m:m} is significantly smaller than Y_{m:m}. Figure <ref type="figure">5</ref> shows the probability distributions of T_sync and T_{P-Avg} for exponentially distributed Y. Observe that T_{P-Avg} has a much lighter tail. 
This is because the effect of the variability in Y on T_{P-Avg} is reduced, the Y_{m:m} in (8) being replaced by &#562;_{m:m} (which has lower variance) in (11). </p></div>
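For the constant-delay case above, the speed-up ratio can be computed directly. A minimal sketch (the &#945; = 0.9 value mirrors the example in the text; other values are illustrative):

```python
def pasgd_speedup(alpha, tau):
    # With constant computation time Y and communication delay D = alpha * Y,
    # E[T_sync] = Y + D and E[T_pavg] = Y + D / tau, so the speed-up ratio is:
    return (1.0 + alpha) / (1.0 + alpha / tau)

# tau = 1 recovers fully synchronous SGD (no speed-up).
assert abs(pasgd_speedup(0.9, 1) - 1.0) < 1e-12
# When communication is comparable to computation (alpha = 0.9), infrequent
# averaging approaches the (1 + alpha) = 1.9x ceiling, as in Figure 4.
assert 1.7 < pasgd_speedup(0.9, 10) < 1.9
```

As &#964; grows, the speed-up saturates at 1 + &#945;, so most of the runtime benefit is already obtained at moderate &#964;.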
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Joint Analysis with Error-convergence</head><p>In this subsection, we combine the runtime analysis with the previous error-convergence analysis for PASGD <ref type="bibr">(Wang &amp; Joshi, 2018)</ref>. Due to space limitations, we state the necessary theoretical assumptions in the Appendix; the assumptions are similar to those in previous works <ref type="bibr">(Zhou &amp; Cong, 2017;</ref><ref type="bibr">Wang &amp; Joshi, 2018)</ref> on the convergence of local-update SGD algorithms.</p><p>Theorem 1 (Error-runtime Convergence of PASGD). For PASGD, under certain assumptions (stated in the Appendix), if the learning rate satisfies &#951;L + &#951;^2 L^2 &#964;(&#964; &#8722; 1) &#8804; 1, Y and D are constants, and all workers are initialized at the same point x_1, then after a total of T wall-clock time, the minimal expected squared gradient norm within the time interval T is bounded by</p><p>E[min_k ||&#8711;F(x_k)||^2] &#8804; (2[F(x_1) &#8722; F_inf] / (&#951;T)) (Y + D/&#964;) + &#951;L&#963;^2/m + &#951;^2 L^2 &#963;^2 (&#964; &#8722; 1),</p><p>where L is the Lipschitz constant of the objective function, F_inf is a lower bound on the objective value, and &#963;^2 is the variance bound of the mini-batch stochastic gradients.</p><p>The proof of Theorem 1 is presented in the Appendix. From the optimization error upper bound (13), one can directly observe the error-runtime trade-off for different communication periods: while a larger &#964; reduces the runtime per iteration and makes the first term in ( <ref type="formula">13</ref> ) smaller, it also adds additional noise and increases the last term. In Figure <ref type="figure">6</ref>, we plot the theoretical bounds for both fully synchronous SGD (&#964; = 1) and PASGD. Although PASGD with &#964; = 10 starts with a rapid drop, it eventually converges to a higher error floor. This theoretical result is also corroborated by the experiments in Section 5. Another direct outcome of Theorem 1 is the determination of the best communication period, which balances the first and last terms in (13). 
We will discuss the selection of the communication period later in Section 4.1.</p></div>
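The trade-off behind Theorem 1 can be sketched numerically, assuming a bound of the form described above: an optimization term that shrinks with wall-clock time T (and with &#964;, via the runtime per iteration) plus noise terms, the last of which grows with &#964;. The functional form and all constants below are illustrative assumptions, not the paper's exact bound:

```python
def pasgd_error_bound(T, tau, Y=1.0, D=1.0, eta=0.05, L=1.0, sigma2=1.0,
                      F_gap=1.0, m=4):
    """Hedged sketch of a Theorem-1-style error-runtime bound: the first term
    decreases with T and with tau (through the runtime per iteration Y + D/tau),
    while the last noise term increases with tau. Constants hypothetical."""
    opt_term = 2.0 * F_gap * (Y + D / tau) / (eta * T)
    noise_floor = eta * L * sigma2 / m + eta**2 * L**2 * sigma2 * (tau - 1)
    return opt_term + noise_floor

# Early in training (small T) a large tau wins; at large T, tau = 1 has the
# lower error floor -- the crossover that motivates adaptive communication.
assert pasgd_error_bound(T=50, tau=10) < pasgd_error_bound(T=50, tau=1)
assert pasgd_error_bound(T=5000, tau=1) < pasgd_error_bound(T=5000, tau=10)
```

Under these assumed constants the two curves cross, matching the qualitative behavior plotted in Figure 6.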
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">ADACOMM: PROPOSED ADAPTIVE COMMUNICATION STRATEGY</head><p>Inspired by the clear trade-off in the learning curves in Figure <ref type="figure">6</ref>, it would be better to have an adaptive communication strategy that starts with infrequent communication to improve convergence speed, and then increases the communication frequency to achieve a low error floor. In this section, we develop the proposed adaptive communication scheme.</p><p>The basic idea is to choose, at each point in wall-clock time, the communication period that minimizes the optimization error. One way to achieve this is to switch between the learning curves at their intersections. However, without prior knowledge of the various curves, it is difficult to determine the switching points. Instead, we divide the whole training procedure into uniform wall-clock time intervals of the same length T_0. At the beginning of each time interval, we select the value of &#964; that yields the fastest decay rate over the next T_0 of wall-clock time. If the interval length T_0 is small enough and the best choice of communication period for each interval can be precisely estimated, then this adaptive scheme achieves a win-win in the error-runtime trade-off, as illustrated in Figure <ref type="figure">7</ref>.</p><p>After setting the interval length, the next question is how to estimate the best communication period for each time interval. In Section 4.1 we use the error-runtime analysis of Section 3.3 to find the best &#964; for each interval.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Determining the Best Communication Period for Each Time Interval</head><p>From Theorem 1, it can be observed that there is an optimal value &#964;* that minimizes the optimization error bound at a given wall-clock time. In particular, consider the simplest setting where Y and D are constants. Then, by minimizing the upper bound ( <ref type="formula">13</ref> ) over &#964;, we obtain the following.</p><p>Theorem 2. For PASGD, under the same assumptions as Theorem 1, the optimization error upper bound in (13) at time T is minimized when the communication period is</p><p>&#964;* = [2D(F(x_1) &#8722; F_inf) / (&#951;^3 L^2 &#963;^2 T)]^{1/2}.</p><p>The proof is straightforward by setting the derivative of (13) to zero; we present the details in the Appendix. Suppose all workers start from the same initial point x_1 = x_{t=0}, where the subscript t denotes wall-clock time. Directly applying Theorem 2 to the first time interval, the best choice of communication period is</p><p>&#964;_0 = [2D(F(x_{t=0}) &#8722; F_inf) / (&#951;^3 L^2 &#963;^2 T_0)]^{1/2}.</p><p>Similarly, for the l-th time interval, workers can be viewed as restarting training from a new initial point x_{t=lT_0}. Applying Theorem 2 again, we have</p><p>&#964;_l = [2D(F(x_{t=lT_0}) &#8722; F_inf) / (&#951;^3 L^2 &#963;^2 T_0)]^{1/2}.</p><p>Comparing ( <ref type="formula">15</ref> ) and ( <ref type="formula">16</ref> ), it is easy to see that the generated communication period sequence decreases along with the objective value F(x_t) when the learning rate is fixed. This result is consistent with the intuition that the trade-off between error-convergence and communication-efficiency varies over time. Compared to the initial phase of training, the benefit of using a large communication period diminishes as the model approaches convergence. At this later stage, a lower error floor is preferable to a faster runtime.</p><p>Remark 1 (Connection to Decaying Learning Rate). Using a fixed learning rate in SGD leads to an error floor at convergence. 
To further reduce the error, practical SGD implementations generally decay the learning rate or increase the mini-batch size <ref type="bibr">(Smith et al., 2017a;</ref><ref type="bibr">Goyal et al., 2017)</ref>.</p><p>As we saw from the convergence analysis in Theorem 1, performing local updates adds additional noise to the stochastic gradients, resulting in a higher error floor at convergence. Decaying the communication period gradually reduces the variance of the gradients and yields a similar improvement in convergence. Thus, adaptive communication strategies are similar in spirit to decaying the learning rate or increasing the mini-batch size. The key difference is that here we are optimizing the true error-convergence with respect to wall-clock time rather than the number of iterations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Practical Considerations</head><p>Although ( <ref type="formula">15</ref> ) and ( <ref type="formula">16</ref> ) provide useful insights about how to adapt &#964; over time, it is difficult to use them directly in practice because the Lipschitz constant L and the gradient variance bound &#963;^2 are unknown. For deep neural networks, estimating these constants can be difficult and unreliable due to the highly non-convex and high-dimensional loss surface. As an alternative, we propose a simpler rule: we approximate F_inf by 0, and divide ( <ref type="formula">16</ref> ) by ( <ref type="formula">15</ref> ) to obtain the basic communication period update rule</p><p>&#964;_l = &#8968;[F(x_{t=lT_0}) / F(x_{t=0})]^{1/2} &#964;_0&#8969;,</p><p>where &#8968;a&#8969; denotes the ceiling function, which rounds a up to the nearest integer &#8805; a. Since the objective function values (i.e., the training loss) F(x_{t=lT_0}) and F(x_{t=0}) can be easily obtained during training, it only remains to determine the initial communication period &#964;_0. We obtain a heuristic estimate of &#964;_0 by a simple grid search over different &#964;, run for one or two epochs each.</p></div>
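A minimal sketch of the basic update rule above, assuming the ratio of (16) to (15) with F_inf &#8776; 0 reduces to scaling &#964;_0 by the square root of the training-loss ratio (consistent with Theorem 2, where &#964;* grows with the square root of the optimality gap). The loss values are hypothetical:

```python
import math

def next_comm_period(loss_now, loss_init, tau0):
    """Communication period update rule: scale the initial period tau_0 by the
    square root of the current-to-initial training loss ratio, rounded up."""
    return math.ceil(math.sqrt(loss_now / loss_init) * tau0)

# As the training loss decreases, the rule yields a non-increasing tau sequence.
tau0 = 20
losses = [2.3, 1.5, 0.8, 0.3, 0.1]  # hypothetical loss at the start of each interval
taus = [next_comm_period(f, losses[0], tau0) for f in losses]
assert taus[0] == tau0
assert taus == sorted(taus, reverse=True)
```

Only two scalars already tracked during training (the current and initial loss) are needed, which is what makes the rule practical.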
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Refinements to the Proposed Adaptive Strategy</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1">Faster Decay When Training Saturates</head><p>The communication period update rule (17) tends to give a decreasing sequence {&#964;_l}. Nonetheless, it is possible that the best value of &#964;_l for the next time interval is larger than the current one due to random noise in the training process. Moreover, when the training loss gets stuck on a plateau and decreases very slowly, (17) results in &#964;_l saturating at the same value for a long time. To address this issue, we borrow an idea used in classic SGD, where the learning rate is decayed by a factor &#947; when the training loss saturates for several epochs <ref type="bibr">(Goyal et al., 2017)</ref>. Similarly, in our scheme, the communication period is multiplied by &#947; &lt; 1 whenever the &#964;_l given by ( <ref type="formula">17</ref> ) is not strictly less than &#964;_{l-1}. To be specific, the communication period for the l-th time interval is set by rule (17) if that value is strictly less than &#964;_{l-1}, and to &#8968;&#947; &#964;_{l-1}&#8969; otherwise.</p><p>In the experiments, &#947; = 1/2 turns out to be a good choice. One can obtain a more aggressive decay in &#964;_l by either reducing the value of &#947; or introducing a slack variable s in the condition, such as requiring &#964;_l &lt; &#964;_{l-1} &#8722; s.</p></div>
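The refinement above can be sketched as follows, again assuming the square-root loss-ratio form of rule (17); the plateau loss values and &#964; settings are hypothetical:

```python
import math

def next_tau_refined(loss_now, loss_init, tau0, tau_prev, gamma=0.5):
    """Refined update: fall back to multiplicative decay by gamma when rule (17)
    fails to strictly decrease tau (e.g., when the training loss plateaus)."""
    candidate = math.ceil(math.sqrt(loss_now / loss_init) * tau0)
    if candidate < tau_prev:
        return candidate                       # rule (17) made progress
    return max(1, math.ceil(gamma * tau_prev)) # force faster decay, floor at 1

# On a loss plateau, rule (17) alone would keep tau stuck; the refinement halves it.
tau = 16
for loss in [2.0, 2.0, 2.0]:   # hypothetical plateau at the initial loss value
    tau = next_tau_refined(loss, 2.0, tau0=16, tau_prev=tau)
assert tau == 2                # 16 -> 8 -> 4 -> 2
```

The `max(1, ...)` floor keeps &#964; from dropping below fully synchronous SGD.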
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">Incorporating Adaptive Learning Rate</head><p>So far we have considered a fixed learning rate &#951; for the local SGD updates at the workers. We now present an adaptive communication strategy that adjusts &#964;_l for a given variable learning rate schedule, in order to obtain the best error-runtime trade-off. Suppose &#951;_l denotes the learning rate for the l-th time interval. Then, combining (15) and ( <ref type="formula">16</ref> ) again, we have</p><p>&#964;_l = &#8968;(&#951;_0/&#951;_l)^{3/2} [F(x_{t=lT_0}) / F(x_{t=0})]^{1/2} &#964;_0&#8969;.</p><p>Observe that when the learning rate becomes smaller, the communication period &#964;_l increases. This corresponds to the intuition that a small learning rate reduces the discrepancy between the local models, and hence is more tolerant of large communication periods.</p><p>Equation ( <ref type="formula">19</ref> ) states that the communication period should be proportional to (&#951;_0/&#951;_l)^{3/2}. However, in practice it is common to decay the learning rate by a factor of 10 after a given number of epochs. This dramatic change in the learning rate may push the communication period to an unreasonably large value. In the experiments, we observed that when applying (19), the communication period can increase to &#964; = 1000, which causes the training loss to diverge.</p><p>To avoid this issue, we propose the adaptive strategy given by ( <ref type="formula">20</ref> ) below, which can also be justified by theoretical analysis. Suppose that in the l-th time interval, the objective function has a local Lipschitz smoothness L_l. Then, using the approximation &#951;_l L_l &#8776; 1, which is common in the SGD literature <ref type="bibr">(Balles et al., 2016)</ref>, we derive the following adaptive strategy:</p><p>&#964;_l = &#8968;[&#951;_0 F(x_{t=lT_0}) / (&#951;_l F(x_{t=0}))]^{1/2} &#964;_0&#8969;.</p><p>Apart from coupling the communication period with the learning rate, when to decay the learning rate is another key design factor. 
In order to eliminate the noise introduced by local updates, we choose to first gradually decay the communication period to 1 and only then decay the learning rate as usual. For example, if the learning rate is scheduled to be decayed at the 80th epoch but the communication period &#964; is still larger than 1 at that time, then we continue to use the current learning rate until &#964; = 1.</p></div>
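The decay-ordering rule described above (postpone a scheduled learning-rate decay until &#964; has been reduced to 1) can be sketched as a tiny scheduler. The epoch milestones and factor of 10 follow the experimental setting in Section 5; the function names are my own illustration:

```python
def make_lr_scheduler(eta0, decay_epochs=(80, 120, 160, 200), factor=0.1):
    """Sketch: scheduled learning-rate decays are postponed while tau > 1
    and applied only once the communication period has reached tau = 1."""
    state = {"eta": eta0, "pending": 0}
    def step(epoch, tau):
        if epoch in decay_epochs:
            state["pending"] += 1       # a scheduled decay came due
        while state["pending"] and tau == 1:
            state["eta"] *= factor      # apply it once tau has reached 1
            state["pending"] -= 1
        return state["eta"]
    return step

step = make_lr_scheduler(0.2)
assert step(80, tau=4) == 0.2                 # decay due, but tau > 1: postponed
assert abs(step(85, tau=1) - 0.02) < 1e-12    # tau reached 1: decay applied
```

Tracking a pending counter (rather than dropping the decay) preserves the original schedule once the local-update noise has been eliminated.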
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Theoretical Guarantees for the Convergence of ADACOMM</head><p>In this subsection, we provide a convergence guarantee for the proposed adaptive communication scheme by extending the error analysis for PASGD. Without loss of generality, we analyze an arbitrary communication period sequence {&#964;_0, . . . , &#964;_R}, where R denotes the total number of communication rounds. It will be shown that a decreasing sequence of &#964; is beneficial to the error-convergence rate.</p><p>Theorem 3 (Convergence of the adaptive communication scheme). For PASGD with adaptive communication period and adaptive learning rate, suppose the learning rate remains the same within each local update period. If the conditions in (21) are satisfied as R &#8594; &#8734;, then the averaged model x is guaranteed to converge to a stationary point, as stated in (22), where s_r = &#8721;_{j=0}^{r-1} &#964;_j.</p><p>The proof details and a non-asymptotic result (similar to Theorem 1 but with variable &#964;) are provided in the Appendix. In order to understand the meaning of condition ( <ref type="formula">21</ref> ), first consider the case when &#964;_0 = &#8226;&#8226;&#8226; = &#964;_R is a constant. In this case, the convergence condition is identical to that of mini-batch SGD <ref type="bibr">(Bottou et al., 2018)</ref>:</p><p>&#8721;_{k=1}^{&#8734;} &#951;_k = &#8734; and &#8721;_{k=1}^{&#8734;} &#951;_k^2 &lt; &#8734;.</p><p>As long as the communication period sequence is bounded, it is straightforward to adapt the learning rate scheme of mini-batch SGD (23) to satisfy (21). In particular, when the communication period sequence is decreasing, the last two terms in (21) become easier to satisfy and place fewer constraints on the learning rate sequence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTAL RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Experimental Setting</head><p>Dataset. We evaluate our method on image classification tasks using the CIFAR10 and CIFAR100 datasets <ref type="bibr">(Krizhevsky, 2009)</ref>, each of which consists of 50,000 training images and 10,000 validation images, with 10 and 100 classes respectively. Each worker machine is assigned a partition of the training set, which is randomly shuffled after every epoch.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model.</head><p>We train the deep neural networks VGG-16 <ref type="bibr">(Simonyan &amp; Zisserman, 2014)</ref> and ResNet-50 <ref type="bibr">(He et al., 2016)</ref> from scratch 3 (the implementations of VGG-16 and ResNet-50 follow this GitHub repository: <ref type="url">https://github.com/meliketoy/wide-resnet.pytorch</ref>). These two neural networks have different architectures and parameter sizes, and thus periodic averaging performs differently on them. As shown in Figure <ref type="figure">8</ref>, for VGG-16 the communication time is about 4 times the computation time. Thus, compared to ResNet-50, VGG-16 requires a larger τ in order to reduce the runtime per iteration and achieve fast convergence. Similarly high communication/computation ratios are common in the literature; see <ref type="bibr">(Lin et al., 2018;</ref><ref type="bibr">Harlap et al., 2018)</ref>.</p><p>Hyperparameter Choice. The mini-batch size on each worker is 128, so the total mini-batch size per iteration is 512. The initial learning rates for VGG-16 and ResNet-50 are 0.2 and 0.4 respectively, and the weight decay for both networks is 0.0005. In the variable learning rate setting, we decay the learning rate by a factor of 10 after the 80th/120th/160th/200th epochs. We set the time interval length T_0 to 60 seconds (about 10 epochs for the initial communication period).</p><p>Metrics. We compare the performance of the proposed adaptive communication scheme with the following fixed-communication-period methods: (1) Baseline: fully synchronous SGD (τ = 1); (2) an extreme high-throughput case with τ = 100; (3) a manually tuned case where a moderate value of τ is selected after trial runs with different communication periods. Instead of training for a fixed number of epochs, we train all methods long enough to converge, and compare the training loss and test accuracy, both recorded every 100 iterations.</p></div>
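The variable learning rate schedule described above (decay by a factor of 10 after the 80th/120th/160th/200th epochs, starting from VGG-16's initial rate of 0.2) can be expressed as a small helper. The function name and the exact epoch-boundary convention are our assumptions.

```python
def lr_at_epoch(epoch, base_lr=0.2, milestones=(80, 120, 160, 200), factor=0.1):
    """Milestone learning-rate schedule from the hyperparameter setting
    above. base_lr matches VGG-16's initial rate; whether "after the
    80th epoch" means epoch > 80 or >= 80 is our assumed convention."""
    lr = base_lr
    for m in milestones:
        if epoch > m:      # decay has been applied once this milestone passes
            lr *= factor
    return lr
```

For example, the rate stays at 0.2 for the first 80 epochs and is divided by 10 at each subsequent milestone.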
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Adaptive Communication in PASGD</head><p>We first validate the effectiveness of ADACOMM, which applies the communication period update rule (18) combined with (20) to the original PASGD without momentum.</p><p>Figure <ref type="figure">9</ref> presents the results for VGG-16 with both fixed and variable learning rates. A large communication period τ initially yields a rapid drop in error, but the error eventually converges to a higher floor. By adapting τ, the proposed ADACOMM scheme strikes the best error-runtime trade-off in all settings. In Figure <ref type="figure">9a</ref>, while fully synchronous SGD takes 33.5 minutes to reach a training loss of 4 × 10^-3, ADACOMM takes 15.5 minutes, achieving more than a 2× speedup. Similarly, in Figure <ref type="figure">9b</ref>, ADACOMM takes 11.5 minutes to reach a training loss of 4.5 × 10^-2, a 3.3× speedup over fully synchronous SGD (38.0 minutes).</p><p>For ResNet-50, however, the communication overhead is no longer the bottleneck. For a fixed communication period, the negative effect of performing local updates becomes more pronounced and cancels the benefit of the low communication delay (see Figures <ref type="figure">10b</ref> and <ref type="figure">10c</ref>). It is therefore not surprising that fully synchronous SGD is nearly the best of all fixed-τ methods in the error-runtime plot. Even in this extreme case, adaptive communication remains competitive. When combined with learning rate decay, the adaptive scheme is about 1.3 times faster than fully synchronous SGD (see Figure <ref type="figure">10a</ref>; 15.0 versus 21.5 minutes to reach a training loss of 3 × 10^-2).</p><p>Table <ref type="table">1</ref> lists the test accuracies in the different settings; we report the best accuracy achieved within a time budget for each setting. The results show that the adaptive communication method generalizes better than fully synchronous SGD. In the variable learning rate case, the adaptive method even gives better test accuracy than PASGD with the best fixed τ.</p></div>
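The exact update rules (18) and (20) are defined earlier in the paper and not reproduced in this section. As a purely illustrative sketch of the decreasing-τ strategy, one hypothetical rule shrinks the communication period with the square root of the current training loss, so that training starts with infrequent averaging and approaches fully synchronous SGD (τ = 1) near convergence.

```python
import math

def next_tau(loss_now, loss_init, tau_init):
    """Hypothetical communication-period update consistent with the
    decreasing-tau strategy: tau shrinks as the training loss drops.
    This square-root scaling is an illustrative assumption, NOT the
    paper's rule (18)/(20)."""
    return max(1, math.ceil(tau_init * math.sqrt(loss_now / loss_init)))

# as the loss decreases, averaging becomes more frequent
schedule = [next_tau(l, 2.3, 20) for l in (2.3, 1.0, 0.3, 0.05)]
```

Such a rule would be re-evaluated at each wall-clock interval of length T_0 (60 seconds in the experiments above).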
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Adaptive Communication in Momentum SGD</head><p>The adaptive communication scheme was derived from the joint error-runtime analysis of PASGD without momentum. However, it can also be extended to other SGD variants; in this subsection, we show that the proposed method also works well for SGD with momentum.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1">Block Momentum in Periodic-averaging</head><p>Before presenting the empirical results, we describe how to introduce momentum in PASGD. A naive approach is to apply momentum independently to each local model, where each worker maintains its own momentum buffer holding the latest change in the parameter vector x. However, this does not account for the potentially dramatic change in x at each averaging step. When the local models are synchronized, the local momentum buffers still contain the update steps from before averaging, resulting in a large momentum term in the first SGD step of each local update period. When τ is large, this large momentum term can side-track the SGD descent direction, slowing convergence.</p><p>To address this issue, a block momentum scheme was proposed in <ref type="bibr">(Chen &amp; Huo, 2016)</ref> and applied to speech recognition tasks. The basic idea is to treat the local updates in each communication period as one big gradient step between two synchronized models, and to apply a global momentum to this big accumulated step. In terms of the momentum u_j, the update rule can be written as</p><p>u_{j+1} = β_glob u_j + Δ_j,   x_{(j+1)τ} = x_{jτ} − u_{j+1},</p><p>where Δ_j = (η/m) ∑_{i=1}^{m} ∑_{k=0}^{τ−1} g(x^{(i)}_{jτ+k}) represents the accumulated gradients in the j-th local update period and β_glob denotes the global momentum factor. Moreover, workers can also run momentum SGD on their local models, but the local momentum buffers are cleared at the beginning of each local update period; that is, we restart momentum SGD on the local models after every averaging step. The same strategy is suggested in Microsoft's CNTK framework <ref type="bibr">(Seide &amp; Agarwal, 2016)</ref>. In our experiments, we set the global momentum factor to 0.3 and the local momentum factor to 0.9, following <ref type="bibr">(Lin et al., 2018)</ref>. In the fully synchronous case, there is no need for block momentum and we simply follow common practice, setting the momentum factor to 0.9.</p></div>
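A minimal numeric sketch of one communication round of the block momentum scheme described above, assuming the accumulated local step is already averaged across workers; the variable names and the step sign convention are ours, not the paper's.

```python
import numpy as np

def block_momentum_step(x_sync, delta, u, beta_glob=0.3):
    """One communication round of block momentum: the tau local updates
    are treated as a single accumulated step `delta`, and the global
    momentum buffer `u` is applied only at the averaging step. Local
    momentum buffers (not modeled here) would be reset after this call."""
    u = beta_glob * u + delta     # global momentum over the accumulated step
    return x_sync - u, u          # synchronized model update

# two rounds with the same accumulated step delta = 0.1 per coordinate
x, u = np.zeros(3), np.zeros(3)
x, u = block_momentum_step(x, 0.1 * np.ones(3), u)   # u = 0.1,  x = -0.1
x, u = block_momentum_step(x, 0.1 * np.ones(3), u)   # u = 0.13, x = -0.23
```

Note how the second round moves further than the first: the global momentum amplifies consistent accumulated steps across communication rounds, which is exactly the effect the scheme seeks without letting stale local momentum side-track the first step of a new period.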
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.2">ADACOMM plus Block Momentum</head><p>In Figure <ref type="figure">11</ref>, we apply our adaptive communication strategy to PASGD with block momentum and observe significant performance gains on CIFAR10/100. In particular, the adaptive communication scheme has the fastest convergence with respect to wall-clock time throughout training. While fully synchronous SGD gets stuck on a plateau before the first learning rate decay, the training loss of the adaptive method decreases continuously until convergence.</p><p>For VGG-16 in Figure <ref type="figure">11b</ref>, ADACOMM is 3.5× faster (in terms of wall-clock time) than fully synchronous SGD in reaching a training loss of 3 × 10^-3. For ResNet-50 in Figure <ref type="figure">11a</ref>, ADACOMM takes 15.8 minutes to reach a training loss of 2 × 10^-2, which is 2× faster than fully synchronous SGD (32.6 minutes).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUDING REMARKS</head><p>The design of communication-efficient SGD algorithms that are robust to system variability is vital to scaling machine learning training on resource-limited computing nodes. This paper is one of the first to analyze the convergence of error with respect to wall-clock time instead of the number of iterations, by accounting for the effect of computation and communication delays on the runtime per iteration. We present a theoretical analysis of the error-runtime trade-off for periodic-averaging SGD (PASGD), where each node performs local updates and the models are averaged after every τ iterations. Based on this joint error-runtime analysis, we design the first (to the best of our knowledge) adaptive communication strategy, called ADACOMM, for distributed deep learning. Experimental results using VGGNet and ResNet show that the proposed method can achieve up to a 3× improvement in runtime while reaching the same error floor as fully synchronous SGD.</p><p>Going beyond periodic-averaging SGD, our idea of adapting the frequency of averaging distributed SGD updates can be easily extended to other SGD frameworks, including elastic averaging <ref type="bibr">(Zhang et al., 2015)</ref>, decentralized SGD (e.g., adapting network sparsity) <ref type="bibr">(Lian et al., 2017)</ref>, and parameter server-based training (e.g., adapting asynchrony).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A ADDITIONAL EXPERIMENTAL RESULTS</head><p>In the 8-worker case, communication among nodes is performed via the NVIDIA Collective Communications Library (NCCL). The mini-batch size on each node is 64, and the initial learning rate is set to 0.2 for both VGG-16 and ResNet-50. In Figure <ref type="figure">12a</ref>, while fully synchronous SGD takes 17.5 minutes to reach a training loss of 10^-2, ADACOMM takes only 6.0 minutes, achieving about a 2.9× speedup.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B INEFFICIENT LOCAL UPDATES</head><p>It is worth noting an interesting phenomenon in the convergence of periodic-averaging SGD (PASGD). When the learning rate is fixed, PASGD with a fine-tuned communication period has better test accuracy than both fully synchronous SGD and the adaptive method, while its training loss remains higher than that of the latter two methods (see Figure <ref type="figure">9</ref>, Figure <ref type="figure">10</ref>).</p><p>In particular, on the CIFAR100 dataset, we observe about a 5% improvement in test accuracy when τ = 5. To investigate this phenomenon, we evaluate the test accuracy of PASGD (τ = 15) at two frequencies: (1) every 135 iterations; (2) every 100 iterations. In the former case, the test accuracy is reported just after an averaging step. In the latter case, the test accuracy can come from either the synchronized (averaged) model or a local model, since 100 is not divisible by 15.</p><p>From Figure <ref type="figure">14</ref>, it is clear that the local models' accuracy is much lower than that of the synchronized model, even after the algorithm has converged. Thus, we conjecture that the improvement in test accuracy occurs only at the synchronized model. That is, after averaging, the test accuracy rapidly increases, but it decreases again in the following local steps due to the noise in the stochastic gradients. Such behavior may depend on the geometry of the loss surface of the specific neural network.</p><p>This observation also reveals that the local updates are inefficient, as they reduce the accuracy and make no progress. In this sense, it is necessary for PASGD to reduce the gradient variance by decaying either the learning rate or the communication period.</p></div>
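The two logging frequencies discussed above can be checked with a few lines: checkpoints at multiples of 135 always land just after an averaging step (since 135 = 9 × 15), while checkpoints every 100 iterations land sometimes on the synchronized model and sometimes on a local one. The helper name is ours.

```python
def model_state_at(iteration, tau=15):
    """Which model an evaluation sees in PASGD with tau = 15: iterations
    that are multiples of tau fall just after an averaging step; all
    other iterations see a (typically less accurate) local model."""
    return "synchronized" if iteration % tau == 0 else "local"

states_135 = [model_state_at(i) for i in range(135, 1000, 135)]  # all synced
states_100 = [model_state_at(i) for i in range(100, 1000, 100)]  # a mixture
```

This is consistent with the oscillation conjectured above: the 100-iteration logging frequency mixes synchronized checkpoints (e.g., iterations 300, 600, 900) with local ones, producing the accuracy fluctuations seen in Figure 14.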
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C ASSUMPTIONS FOR CONVERGENCE ANALYSIS</head><p>The convergence analysis is conducted under the following assumptions, which are similar to those made in previous work on the analysis of PASGD <ref type="bibr">(Zhou &amp; Cong, 2017;</ref><ref type="bibr">Yu et al., 2018;</ref><ref type="bibr">Wang &amp; Joshi, 2018;</ref><ref type="bibr">Stich, 2018)</ref>.</p><p>In particular, we make no assumptions on the convexity of the objective function. We also remove the uniform bound assumption on the norm of the stochastic gradients. Assumption 1 (Lipschitz smoothness &amp; lower bound on F). The objective function F(x) is differentiable and L-Lipschitz smooth, i.e., ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖. The function value is bounded below by a scalar F_inf.</p><p>Assumption 2 (Unbiased estimation). The stochastic gradient evaluated on a mini-batch ξ is an unbiased estimator of the full-batch gradient: E_{ξ|x}[g(x)] = ∇F(x). Assumption 3 (Bounded variance). The variance of the stochastic gradient evaluated on a mini-batch ξ is bounded as</p><p>E_{ξ|x} ‖g(x) − ∇F(x)‖² ≤ β‖∇F(x)‖² + σ²,</p><p>where β and σ² are non-negative constants that are inversely proportional to the mini-batch size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D PROOF OF THEOREM 2: ERROR-RUNTIME CONVERGENCE OF PASGD</head><p>First, let us recall the error analysis of PASGD. We adapt the theorem from <ref type="bibr">(Wang &amp; Joshi, 2018)</ref>.</p><p>Lemma 1 (Error-convergence of PASGD <ref type="bibr">(Wang &amp; Joshi, 2018)</ref>). For PASGD, under Assumptions 1 to 3, if the learning rate satisfies ηL + η²L²τ(τ − 1) ≤ 1 and all workers are initialized at the same point x_1, then after K iterations, we have</p><p>E[(1/K) ∑_{k=1}^{K} ‖∇F(x̄_k)‖²] ≤ 2[F(x_1) − F_inf]/(ηK) + ηLσ²/m + η²L²σ²(τ − 1), (26)</p><p>where L is the Lipschitz constant of the objective function, σ² is the variance bound of the mini-batch stochastic gradients, and x̄_k denotes the averaged model at the k-th iteration.</p><p>From the runtime analysis in Section 2, we know that the expected runtime per iteration of PASGD is E[T_P-Avg] = E[Y] + D/τ, where Y denotes the computation time per iteration and D the communication delay.</p><p>Accordingly, the total wall-clock time of training K iterations is T = K · E[T_P-Avg].</p><p>Then, directly substituting K = T/E[T_P-Avg] into (26), we complete the proof.</p></div>
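A quick numeric check of the runtime argument above, under the (hedged) reading that the communication delay D is amortized over τ local steps. With VGG-16's roughly 4:1 communication-to-computation ratio from Section 5.1, τ = 4 already more than halves the runtime per iteration; the function names and units are ours.

```python
def expected_runtime_per_iter(comp, comm_delay, tau):
    """Expected runtime per iteration of PASGD: computation happens every
    iteration, while the communication delay is paid once per tau local
    steps (comp = E[Y], comm_delay = D in the proof's notation)."""
    return comp + comm_delay / tau

def iters_in_time(T, comp, comm_delay, tau):
    """Iterations completed within a wall-clock budget T, i.e. the
    substitution K = T / E[T_P-Avg] used to conclude the proof."""
    return T / expected_runtime_per_iter(comp, comm_delay, tau)
```

For example, with comp = 1 and comm_delay = 4 (arbitrary time units), fully synchronous SGD (τ = 1) spends 5 units per iteration, while τ = 4 spends 2, so the same time budget buys 2.5× as many iterations.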
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E PROOF OF THEOREM 3: THE BEST COMMUNICATION PERIOD</head><p>Taking the derivative of the upper bound (<ref type="formula">14</ref>) with respect to the communication period, we obtain</p><p>When the derivative equals zero, the communication period is</p><p>Since the second derivative of (<ref type="formula">14</ref>) is 4</p><p>the optimal value obtained in (30) must be a global minimum.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F PROOF OF THEOREM 4: ERROR-CONVERGENCE OF ADAPTIVE COMMUNICATION SCHEME F.1 Notations</head><p>In order to facilitate the analysis, we first introduce some useful notation. Define matrices X_k, G_k ∈ R^{d×m} that concatenate all local models and gradients:</p><p>Besides, define the matrix J = 11ᵀ/(1ᵀ1), where 1 denotes the column vector [1, 1, . . . , 1]ᵀ. Unless otherwise stated, 1 is a column vector of size m, and the matrix J and the identity matrix I are of size m × m, where m is the number of workers.</p></div>
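The averaging matrix J defined above can be verified numerically: right-multiplying X ∈ R^{d×m} by J replaces every worker's column with the average model, and J is idempotent. The concrete sizes below are illustrative.

```python
import numpy as np

m = 4                                  # number of workers
one = np.ones((m, 1))                  # the column vector [1, 1, ..., 1]^T
J = (one @ one.T) / (one.T @ one)      # J = 11^T / (1^T 1), as defined above

X = np.arange(12.0).reshape(3, m)      # X in R^{d x m}: d = 3, one column per worker
X_avg = X @ J                          # every column becomes the averaged model
```

Right-multiplication by (I − J), used repeatedly in the proof, therefore extracts each worker's deviation from the averaged model.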
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F.2 Proof</head><p>Let us first focus on the j-th local update period, where j ∈ {0, 1, . . . , R}. Without loss of generality, suppose the local index of the j-th local update period starts from 1 and ends at τ_j. Then, for the k-th local step in this period, we have the following lemma.</p><p>Lemma 2 (Lemma 1 in <ref type="bibr">(Wang &amp; Joshi, 2018)</ref>). For PASGD, under Assumptions 1 to 3, at the k-th iteration, we have the following bound on the objective value:</p><p>where x̄_k denotes the averaged model at the k-th iteration.</p><p>Taking the total expectation and summing over all iterates in the j-th local update period, we obtain</p><p>Next, we provide an upper bound for the last term in (35). Note that</p><p>where (39) follows from the fact that all workers start from the same point at the beginning of each local update period, i.e., X_1(I − J) = 0. Accordingly, we have</p><p>where inequality (41) holds because the operator norm of (I − J) is at most 1. Furthermore, using the fact that (a + b)² ≤ 2a² + 2b², one can get</p><p>For the first term T_1, since the stochastic gradients are unbiased, all cross terms are zero. Thus, combining this with Assumption 3, we have</p><p>For the second term in (43), directly applying Jensen's inequality, we get</p><p>Substituting the bounds on T_1 and T_2 into (43),</p><p>Recalling the upper bound (35), we further derive the following bound:</p><p>∑_{k=1}^{τ_j} E‖X_k(I − J)‖²_F ≤ η_j² ∑_{k=1}^{τ_j} E‖∇F(X_k)‖²_F [2β(τ_j − 1) + τ_j(τ_j − 1)] + η_j² mσ² τ_j(τ_j − 1). (54)</p><p>Plugging (<ref type="formula">54</ref>) into (<ref type="formula">35</ref>),</p><p>Note that when the learning rate satisfies</p><p>we have</p><p>Suppose l_j = ∑_{r=0}^{j−1} τ_r + 1 is the first index of the j-th local update period. Without loss of generality, we substitute the local index by the global index:</p><p>E‖∇F(x̄_{l_j+k−1})‖² + η_j²Lσ²τ_j/(2m) + η_j³L²σ²τ_j(τ_j − 1)/2. (59)</p><p>Summing over all local periods from j = 0 to j = R, one obtains</p><p>η_j³ τ_j(τ_j − 1). (60)</p><p>After minor rearranging, it is easy to see that</p><p>In order for the upper bound (62) to converge to zero as R → ∞, a sufficient condition is</p><p>This completes the proof of Theorem 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F.4 Simplified Result</head><p>We can obtain a simplified result when the learning rate is fixed. Specifically, we have</p><p>If we choose the total number of iterations K = ∑_{j=0}^{R} τ_j, then</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>Note that in the error analysis, the subscripts of the communication period and learning rate represent the index of local update periods rather than the index of the T_0-length wall-clock time intervals considered in Sections 4.1-4.3.</p></note>
		</body>
		</text>
</TEI>
