<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Communication Efficient Asynchronous Stochastic Gradient Descent</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>05/19/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10651733</idno>
					<idno type="doi">10.1109/INFOCOM55648.2025.11044686</idno>
					
					<author>Youssef Ahmed</author><author>Arnob Ghosh</author><author>Chih-Chun Wang</author><author>Ness B Shroff</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In this paper, we address the challenges of asynchronous gradient descent in distributed learning environments, focusing in particular on stale gradients and the need for extensive communication resources. We develop a novel communication-efficient framework that incorporates a gradient evaluation algorithm to assess and utilize delayed gradients based on their quality, ensuring efficient and effective model updates while significantly reducing communication overhead. Our proposed algorithm requires agents to send only the norm of the gradient rather than the computed gradient itself. The server then accepts the gradient if the ratio between the norm of the gradient and the distance between the global model parameter and the local model parameter exceeds a certain threshold. With the proper choice of the threshold, we show that the convergence rate achieves the same order as synchronous stochastic gradient descent without depending on the staleness value, unlike most existing works. Given the computational complexity of the initial algorithm, we introduce a simplified variant that prioritizes practical applicability without compromising the convergence rates. Our simulations demonstrate that our proposed algorithms outperform existing state-of-the-art methods, offering improved convergence rates, stability, and accuracy with reduced resource consumption.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Distributed Stochastic Gradient Descent (SGD) serves as the cornerstone for distributed optimization of machine learning models, with its synchronous and asynchronous variants offering different trade-offs in terms of computational efficiency and convergence properties. For example, synchronous SGD requires that all the workers (i.e., the distributed machines) compute gradients on the same model parameters. Naturally, this requires a large amount of communication overhead for synchronization, especially for a large number of distributed machines. On the other hand, Asynchronous SGD (ASGD) relaxes this limitation and allows workers to compute gradients on different model parameters, thereby offering greater scalability and resource utilization. This makes ASGD suitable for large-scale distributed systems and heterogeneous computing environments where different workers may have varying speeds and capabilities. However, this advantage comes at the cost of introducing stale gradients, i.e., gradient updates computed using outdated versions of the model parameters. These stale gradients occur because some workers may use older parameters to calculate their gradients <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>. The stale gradient problem is a significant challenge in ASGD, as it degrades convergence properties and can lead to resource inefficiencies due to instability and increased computation requirements <ref type="bibr">[1]</ref>, <ref type="bibr">[3]</ref>-<ref type="bibr">[8]</ref>.</p><p>Stale gradients indeed affect the theoretical analysis of convergence rates of all the proposed ASGD algorithms <ref type="bibr">[4]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[9]</ref>-<ref type="bibr">[16]</ref>. In particular, all of the above papers rely on a bounded-staleness assumption, and their convergence rates depend on staleness parameters that can be large in practice. 
Moreover, the advent of federated learning has further complicated the stale gradient issue, introducing additional layers of complexity such as device heterogeneity and Non-IID data distributions across clients. Federated learning's promise of collaborative model training without compromising data privacy necessitates innovative solutions to the staleness challenge, ensuring that the global model converges effectively despite asynchronous updates from a diverse set of clients <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref>.</p><p>In addition to the convergence issue, the staleness poses another challenge, which is the potential for resource inflation due to instability <ref type="bibr">[3]</ref>. While asynchronous SGD is very efficient at utilizing computational resources by allowing parallel updates, this efficiency can be compromised due to increased demands placed on the system when instability arises. Specifically, as the objective value fluctuates due to stale gradients, it may initially decrease but then increase again, necessitating additional iterations to regain stability. In other words, using delayed gradients to update the model can sometimes deteriorate the objective value (i.e., increase the learning loss), requiring additional resources (e.g., computation power and communication rounds) to compensate for updating the parameters along the wrong direction of the gradient.</p><p>While the presence of delayed gradients in asynchronous SGD has traditionally been viewed as a hurdle for stable and efficient learning, recent insights suggest a more nuanced perspective. Specifically, under certain conditions, these delayed updates can introduce a form of momentum into the learning process, akin to the momentum term used in optimization algorithms to accelerate convergence <ref type="bibr">[19]</ref>. This phenomenon, explored in depth in works such as <ref type="bibr">[9]</ref>, hinges on the distribution of staleness across updates. 
When delays follow a specific statistical pattern, such as a geometric distribution, the cumulative effect of asynchronously applied gradients can mimic the behavior of explicit momentum, thereby enhancing the algorithm's performance. This counter-intuitive finding reveals that not all stale gradients are detrimental; indeed, empirical results suggest that under the right circumstances, they may be utilized to accelerate the learning process. Recently, <ref type="bibr">[20]</ref> proposed an algorithm that scales down the learning rate as the staleness increases. In particular, as the staleness value increases, the corresponding gradient contributes only minutely to the parameter update. <ref type="bibr">[20]</ref> showed that such an approach achieves a convergence rate devoid of any staleness parameter. However, that algorithm requires communication of all the computed gradients between the workers and the server, which can congest the available bandwidth even when the contribution of the gradients is small because of their high staleness. Perhaps more importantly, a gradient with a higher staleness value can contribute more towards convergence than a gradient with a lower staleness value; however, <ref type="bibr">[20]</ref> puts a lower weight on gradients with a higher staleness value. Consequently, its empirical convergence performance is not comparable to that of synchronous SGD. This raises the question: Can we develop an algorithm where stale gradients can be utilized effectively to improve performance in ASGD? In particular, can we develop a resource-efficient ASGD algorithm with provable convergence rates that is not hindered by the staleness?</p><p>In this paper, we address this question by developing a novel ASGD algorithm. 
The key component of our algorithm is that the server accepts a stale gradient only when the ratio between the norm of the stale gradient and the distance between the current model parameter and the outdated model parameter is above a certain threshold; we call such a gradient a 'high-quality gradient'. Further, the worker does not need to send all the computed gradients; sending the norm of the gradient suffices, which is far less resource-intensive. We show that by carefully designing the threshold, we can achieve the same order of convergence as synchronous SGD. Since we only accept those gradients that contribute towards convergence, our approach requires fewer communication resources. Our contributions are as follows: 1) We introduce an innovative quality metric that assesses the usefulness of delayed gradients. This metric determines whether to use a gradient for model updates based on its norm (rather than the entire gradient) and the amount of change between the delayed parameter and the recent parameter (rather than the staleness). Namely, instead of using time stamps to decide whether to "accept" a recently computed gradient, we directly examine the "quality" of the computed gradient. Our approach may accept a gradient even when its time stamp is older, provided it is of high quality. Further, when we accept a gradient, we do not down-scale the learning rate as existing approaches do; we fully utilize its value. By directly examining the quality of the update, not just its timeliness, this approach reduces the communication overhead, making the process more communication-efficient and staleness-independent.</p><p>2) Based on the quality metric, we develop a novel variation of the ASGD algorithm called Quality-Aware Asynchronous Stochastic Gradient Descent (QASY). This algorithm enhances the communication efficiency of ASGD by selectively utilizing useful delayed gradients for model updates. 
Additionally, QASY assesses the usefulness of gradients stored from previous iterations and incorporates only the beneficial ones in the model update. 3) We provide a detailed convergence analysis for QASY, showing that the convergence rate and learning rate are independent of the staleness value and similar to the convergence rate of synchronous SGD. In particular, we show that one can achieve a convergence rate of O(1/√T) (where T is the number of iterations) in asynchronous SGD devoid of any staleness parameter. Hence, we bridge the gap between synchronous SGD and asynchronous SGD in terms of convergence rate while using fewer communication resources. We also propose a simplified version of QASY, Simplified Quality-Aware Asynchronous Stochastic Gradient Descent (s-QASY), that has the same performance guarantees. It streamlines the process by focusing solely on the utilization of delayed gradients. This variant further reduces the computational overhead while retaining the core benefits of our gradient management strategy. 4) Through simulations, we validate the effectiveness of our proposed algorithms, comparing them with state-of-the-art methods in asynchronous gradient descent. Our results confirm the superiority of our approaches in terms of convergence speed, communication rounds, and learning accuracy, marking a significant advancement over existing methodologies.</p></div>
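Contribution 1 hinges on the worker first shipping only a scalar (the gradient's norm) instead of the full gradient vector. A back-of-envelope sketch of the per-message savings; the model dimension and float32 encoding here are illustrative assumptions, not values from the paper:

```python
import numpy as np

d = 1_000_000                                 # model dimension (illustrative)
g = np.ones(d, dtype=np.float32)              # a worker's computed gradient

full_payload = g.nbytes                       # shipping the entire gradient vector
norm_payload = np.float32(np.linalg.norm(g)).nbytes  # shipping only its norm

print(full_payload, norm_payload)             # 4000000 vs 4 bytes
```

Under these assumptions, a rejected gradient costs the uplink 4 bytes instead of 4 MB; only accepted gradients incur the full transfer.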
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Literature Review</head><p>The nature of the staleness has been one of the core points driving the convergence guarantees of ASGD. In the literature, several studies have analyzed ASGD under different structures for the staleness. In <ref type="bibr">[8]</ref> and <ref type="bibr">[21]</ref>, the authors assumed the staleness to be fixed among all workers, while in <ref type="bibr">[10]</ref>, <ref type="bibr">[13]</ref>, <ref type="bibr">[22]</ref>, the staleness is assumed to be changing but upper bounded. In <ref type="bibr">[23]</ref>, the delay is assumed to be random with bounded expectation, and in <ref type="bibr">[24]</ref>, the delay is assumed to grow polynomially. The following approaches have been proposed to deal with the delayed gradients:</p><p>&#8226; Adaptive Learning Rate Methods: <ref type="bibr">[8]</ref>, <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref> dynamically adjust the learning rate based on gradient staleness (or statically <ref type="bibr">[3]</ref>, <ref type="bibr">[27]</ref>, based on the staleness upper bound), aiming to counteract the negative effects of delayed gradients. These methods attempt to balance the step size in gradient descent to ensure stable and effective convergence. However, this approach can be resource-intensive and may result in excessive gradient discounting, potentially hindering the learning process. &#8226; Gradient Manipulation Techniques: In this line of research, the delayed gradients are processed further to mitigate the effect of the delay. For example, in <ref type="bibr">[7]</ref>, the authors aim to correct stale gradients by adjusting them based on the delay, often requiring the computation of additional parameters such as gradient norms and even approximations of the Hessian matrix. 
While in <ref type="bibr">[28]</ref>, the authors handled the delayed gradients by adopting SGD acceleration techniques such as variance reduction, stochastic coordinate sampling, and Nesterov's acceleration. Despite their novelty, these methods are often limited to scenarios where the objective function and its derivatives exhibit certain smoothness properties, which may not always be the case in practical applications.</p><p>Although these studies introduce a stable version of ASGD, the performance did not improve <ref type="bibr">[20]</ref>. Specifically, the convergence of most of these approaches is mainly governed by the staleness bound or average, as shown in Table I. Also, many of their experiments use only a few workers, which may not reflect the algorithms' scalability and efficiency in larger distributed systems <ref type="bibr">[3]</ref>, <ref type="bibr">[21]</ref>.</p><p>On the other hand, synchronous SGD allows all workers to contribute to the update of the global model simultaneously by synchronizing the gradient computations across all workers at each iteration. This synchronization step guarantees that the model parameters are updated using gradients computed from the same version of the model, maintaining uniformity in the learning process.</p><p>While synchronous SGD has been widely used in training large-scale deep neural networks, it leads to bottlenecks due to slow workers (stragglers) <ref type="bibr">[29]</ref>. It also requires very high communication bandwidth, since gradients and parameters must be exchanged between all workers iteratively. The communication overhead can become a limiting factor, as the ratio of communication time to training time per epoch increases linearly with the number of workers <ref type="bibr">[4]</ref>, leading to scalability issues beyond a certain number of nodes. 
This is particularly relevant in the context of emerging 6G networks, which necessitate highly efficient communication protocols to handle the increased demands of distributed learning applications <ref type="bibr">[30]</ref>, <ref type="bibr">[31]</ref>.</p><p>For generally non-convex smooth functions, the convergence is governed by the term 1/√T. In Table <ref type="table">I</ref>, we compare our proposed algorithms with the relevant approaches that have addressed the delayed gradient problem in both distributed and federated learning. As shown, we are able to provide a staleness-free convergence rate without sophisticated restrictions on the learning rate or the function properties.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. PROBLEM SETUP</head><p>In many machine learning and optimization tasks, we aim to find the optimal set of parameters x* ∈ R^d that minimizes a loss function <ref type="bibr">[19]</ref>, which can be formally stated as:</p><p>x* ∈ argmin_{x ∈ R^d} F(x), where F(x) = E_{ξ∼D}[f(x, ξ)].</p><p>Here, f(x, ξ) denotes the loss function evaluated at parameters x with respect to a random data point ξ, which is independently sampled from the data distribution D. The global loss function F(x) represents the expected loss over the data distribution. Various loss functions are employed depending on the specific task. For regression tasks, common loss functions include the Mean Squared Error (MSE), defined as f(x, ξ) = (y - ŷ(x))², and the Mean Absolute Error (MAE), given by f(x, ξ) = |y - ŷ(x)|, where y ∈ ξ is the actual value (ground truth) of the target variable and ŷ(x) is the predicted value of the target variable using parameter x. In multi-class classification settings, the Categorical Cross-Entropy Loss, f(x, ξ) = -Σ_{c=1}^{C} y_c log(ŷ_c(x)), is often employed, where y_c ∈ ξ is the actual binary label for class c, with y_c ∈ {0, 1} and only one class having label 1 (one-hot encoding), and ŷ_c(x) is the predicted probability that the target variable belongs to class c, computed at parameter x.</p><p>To solve this optimization problem efficiently, especially when dealing with large-scale data, we use a distributed learning system where multiple clients collaboratively optimize a global loss function under a central server's coordination. The system consists of:</p><p>&#8226; Server: A central entity coordinating training by aggregating gradients and updating global parameters. &#8226; Clients: I parallel workers that compute gradients using independently sampled IID batches from the dataset. 
In this distributed setup: 1) Each client i independently samples an IID batch ξ_i, of size B, from the entire dataset. 2) Based on the sampled batch ξ_i, each client i computes the average stochastic gradient g(x_i, ξ_i). Here, x_i represents the version of the global parameters that client i holds, which might not be the most recent one at the server due to the asynchronous nature of communication.</p><p>3) The server collects these gradients from the clients and aggregates them to update the global parameters. For simplicity of notation, we drop the batch argument from the gradient symbol and use g(x_{τ_i}) to represent the stochastic gradient computed by client i using parameter x_{τ_i}, where x_{τ_i} is the version of the global model parameters available to client i at its local iteration τ_i, corresponding to a certain global iteration n.</p><p>&#8226; Standard ASGD [11]: convergence O(τ/T + σ²/√T); upper-bounded staleness; staleness-dependent learning/weighting rate; L-smooth non-convex objective. &#8226; Standard Synchronous SGD [15]: convergence O(I/T + σ²/√T); zero staleness via full synchrony; staleness-independent; L-smooth non-convex. &#8226; Delay Compensation ASGD [7]: convergence O(σ*/√T); fixed staleness; staleness-independent; smooth, L-Lipschitz activation, locally μ-strongly convex, bounded first, second, and third derivatives. &#8226; Coherence-based ASGD [3]: convergence O(τ/√T + σ log T/(τ√T)); upper-bounded staleness; staleness-dependent; L-smooth and μ-weakly convex. &#8226; Staleness-adaptive ASGD [20]*: convergence O(I/T + σ/√T); not relying on any staleness assumption; staleness-dependent; L-smooth non-convex. &#8226; s-QASY (proposed): convergence O(σ²/√T); not relying on any staleness assumption; staleness-independent; L-smooth non-convex. &#8226; QASY (proposed): convergence O(σ²/√T); not relying on any staleness assumption; staleness-independent; L-smooth non-convex. * Although this bound appears to be independent of staleness, it is inherently related to the number of workers. Increasing the number of workers leads to greater staleness, implying that staleness is still implicitly present. Furthermore, utilizing all stale gradients with a discounted learning rate results in excessive communication overhead without a corresponding improvement in learning performance, as demonstrated in the simulations.</p><p>TABLE I: Comparison of different asynchronous optimization approaches, where T is the number of iterations, I is the number of workers, τ is the staleness upper bound, σ* is the variance of the delay-compensated gradient, and L denotes the Lipschitz and smoothness parameters of the objective function.</p><p>After the central server receives a gradient from client i and decides to update the global model parameters x_n at iteration n, the update is performed as x_{n+1} = x_n - η g(x_{τ_i}), where η denotes the global learning rate. The optimization process thus involves clients and the server working collaboratively to minimize the global loss function F(x). This distributed approach enhances computational efficiency and scalability, allowing the system to handle large datasets and complex models more effectively. The asynchronous nature of the system ensures that clients do not have to wait for the most recent global parameters, further improving the system's efficiency. Conversely, Fully Synchronous SGD (FSSGD) has been shown to consistently achieve higher accuracy due to its use of up-to-date gradients, aligning closely with the current state of the global model. However, this method comes at the cost of increased communication rounds for synchronization, necessitating more extensive resource utilization. While using delayed gradients relieves the stress of the communication overhead required by synchronous approaches, asynchronous methods also cause resource usage to inflate due to the instability introduced by delayed gradients. 
As a result, the need for a framework that estimates the quality of the gradients before using them to update the model is crucial. Since the algorithm's main goal is to improve the learning performance while maintaining reasonable resource consumption, we use, along with the convergence rate, the cumulative time (the training time measured in time units such as seconds or minutes) and the number of communication rounds between the workers and the server as performance metrics to evaluate our algorithms. One communication round is defined as the server communicating a parameter to the worker or the worker communicating a gradient to the server. <ref type="foot">1</ref></p></div>
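The asynchronous update rule above, x_{n+1} = x_n - η g(x_{τ_i}), can be sketched end-to-end on a toy quadratic objective. The round-robin completion order, dimension, iteration count, and step size below are illustrative assumptions, not settings from the paper:

```python
import numpy as np

def grad(x):
    """Stochastic-gradient stand-in: exact gradient of F(x) = 0.5 * ||x||^2."""
    return x

eta, n_iters, n_clients = 0.1, 200, 4
x = np.array([1.0, -2.0, 0.5])                      # global model x_n at the server
client_copy = [x.copy() for _ in range(n_clients)]  # possibly stale local copies

for n in range(n_iters):
    i = n % n_clients                  # the client that finishes its computation
    g = grad(client_copy[i])           # gradient at a stale parameter x_{tau_i}
    x = x - eta * g                    # server update: x_{n+1} = x_n - eta * g(x_{tau_i})
    client_copy[i] = x.copy()          # only the reporting client gets the new model

print(np.linalg.norm(x))               # far below the initial norm despite staleness
```

With four clients, each gradient is computed on a model roughly four updates old, yet the iterate still contracts toward the optimum; this is the benign face of staleness the paper exploits.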
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. GRADIENT QUALITY-AWARE ASYNCHRONOUS APPROACH</head><p>In this section, we introduce a quality-based selection approach for delayed gradients, designed to enhance the communication efficiency and reliability of gradient processing in an asynchronous setting. The delayed gradients sent by the workers to the server may be helpful in minimizing the objective function. On the other hand, some of the gradients can hinder convergence, leading to adversary-like behavior.</p><p>Intuition: To illustrate our intuition, consider minimizing the function f(x) = x² using gradient descent starting from the point x_n = 5. The traditional gradient descent update with a learning rate η = 0.1 results in x_{n+1} = 4. Alternatively, we can use a gradient computed at another point, y = 20. In this case the new point will be x_{n+1} = 5 - 0.1 × 2 × 20 = 1, which is closer to the optimal value than the point derived from using x_n. However, if we compute the gradient at the point z = -10, it results in x_{n+1} = 5 - 0.1 × 2 × (-10) = 7, which is farther from the optimal point than x_n. This example supports the insight that not all delayed gradients are bad; indeed, under certain circumstances, they can be harnessed to expedite convergence towards the minimum. Note that the distance between x_n and y is the same as the distance between x_n and z, indicating that the distance between the points is not the only factor that decides the usefulness of a delayed gradient, as considered in earlier works <ref type="bibr">[20]</ref>. Indeed, the norm of the delayed gradient, the smoothness, and the learning rate are also significant factors. For example, the norm of the gradient at the point y = 20 is much higher than that at the point z = -10. Key component of our algorithm: Using the above intuition, we are now ready to discuss the key novel aspect of our proposed algorithm compared to existing methods. 
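The arithmetic of the intuition example above can be checked directly; this minimal sketch reproduces the three updates from the text:

```python
def grad(x):
    return 2.0 * x  # gradient of f(x) = x**2

eta, x_n = 0.1, 5.0
fresh      = x_n - eta * grad(x_n)    # up-to-date gradient at x_n = 5  -> 4
stale_good = x_n - eta * grad(20.0)   # stale gradient from y = 20     -> 1 (closer to 0)
stale_bad  = x_n - eta * grad(-10.0)  # stale gradient from z = -10    -> 7 (farther from 0)
print(fresh, stale_good, stale_bad)
```

Both stale points are equidistant from x_n, yet one helps and one hurts, which is exactly why the proposed criterion looks at the gradient's norm rather than at distance or timestamps alone.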
We use the ratio between the norm of the gradient, ‖g(x_{τ_i})‖, and the distance between the stale model parameter and the current global model, ‖x_{τ_i} - x_n‖, as the factor that determines whether to accept the stale gradient. In particular, if the ratio is above a threshold θ, we accept the gradient; otherwise we reject it.</p><p>The above serves two purposes: i) The worker does not need to send the computed gradient immediately, saving a lot of communication resources; rather, the agent only needs to send the norm of the gradient. ii) By carefully choosing the threshold θ (Section IV), we can achieve the state-of-the-art convergence rate O(1/√T) devoid of any staleness parameter. Hence, we can achieve the same convergence rate as synchronous SGD while using fewer communication resources. Note that unlike <ref type="bibr">[8]</ref>, <ref type="bibr">[20]</ref>, we do not accept all the gradients and then scale down the learning rate based on the staleness value. Rather, we choose to accept those gradients that can provably contribute towards convergence, without scaling down the learning rate. Hence, we achieve a similar convergence rate without using large communication resources.</p><p>Algorithm: The QASY algorithm (Algorithm 1) starts by initializing the model parameters and setting up a mechanism to store gradients persistently. It operates in iterations, each beginning with receiving a notification from some worker i_n that a gradient has been computed, accompanied by the norm of that gradient (line 8). The server then decides whether to accept the gradient for updating the model or to reject it, thereby saving a communication round. 
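The server-side acceptance test can be sketched as follows; function and variable names are ours, and the zero-drift convention is an assumption consistent with the immediate acceptance of recent gradients:

```python
import numpy as np

def accept_stale_gradient(grad_norm, x_current, x_stale, theta):
    """Accept a delayed gradient iff ||g(x_stale)|| / ||x_stale - x_current|| >= theta.

    Only the scalar grad_norm has traveled over the network at this point;
    the full gradient is requested from the worker only after acceptance.
    """
    drift = np.linalg.norm(x_stale - x_current)
    if drift == 0.0:          # gradient computed on the current model: always accept
        return True
    return grad_norm / drift >= theta

x_n   = np.array([1.0, 1.0])
x_tau = np.array([0.0, 0.0])  # outdated copy held by the worker; drift = sqrt(2)
print(accept_stale_gradient(10.0, x_n, x_tau, theta=2.0))  # 10/sqrt(2) >= 2 -> True
print(accept_stale_gradient(1.0, x_n, x_tau, theta=2.0))   # 1/sqrt(2) < 2  -> False
```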
The handling of the gradient proceeds as follows:</p><p>1) Immediate Acceptance for Recent Gradients: If the received gradient g(x_{τ_{i_n}}) is recent (τ_{i_n} = n), it is immediately accepted without further evaluation (line 9).</p><p>2) Quality Check for Delayed Gradients: The norm of the delayed gradient g(x_{τ_{i_n}}) with τ_{i_n} &lt; n is assessed for quality: the algorithm checks whether the ratio ‖g(x_{τ_{i_n}})‖ / ‖x_{τ_{i_n}} - x_n‖ exceeds the threshold θ. If the condition is satisfied, the worker sends the gradient, as it is accepted and used for the model update (line 10). Otherwise, the worker is instructed to recompute the gradient based on the latest model parameters x_n sent by the server (lines 22-24).</p><p>The update process in QASY begins once the server receives an accepted gradient. The server updates the model into a temporary variable x_temp (line 12) and proceeds to the refinement step. In this refinement step, the server reviews gradients stored from previous iterations to identify any that are still relevant and impactful for updating the current model state (lines 13-17). This involves checking whether a stored stochastic gradient g(x_{τ_i}), when applied to x_temp, would still satisfy the quality criterion based on the norm and threshold θ. By updating the model with these refined gradients, the learning process becomes more effective and precise. This allows workers who contributed quality gradients to proceed with computations reflecting enhanced collective insights, rather than just the most recent updates. Additionally, reusing gradients conserves resources, since we do not need to recompute them.</p><p>After completing this refinement, the server finalizes the model update, resulting in the new model x_{n+1} (line 18). 
This updated model is then sent only to the worker that provided the accepted gradient (line 19), as detailed in Algorithm 1; hence, unlike the synchronized version, the algorithm does not send the updated model to every worker.</p><p>Despite the algorithm's effectiveness in handling gradients and improving learning convergence, QASY has a higher storage and computational cost, as it needs to store earlier gradients and use them for model updates. We therefore also propose a simpler variant, simplified Quality-Aware Asynchronous SGD (s-QASY), that does not need to store earlier gradients. In other words, s-QASY skips the refinement step in Algorithm 1 (lines 13-17), which reduces computational complexity. We show that both versions have a similar convergence rate, though QASY performs slightly better empirically.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1: QASY (server side, excerpt)</head><p>8: Wait for the next worker to finish computing the gradient; denote this worker as i_n. 9: Worker i_n sends ‖g(x_{τ_{i_n}})‖. 10: Worker i_n sends g(x_{τ_{i_n}}). 12: S ← S ∪ {(i_n, g(x_{τ_{i_n}}))} 13: x_temp ← x_temp - η g(x_{τ_{i_n}}) 14: for each (i, g(x_{τ_i})) ∈ S do 16: x_temp ← x_temp - η g(x_{τ_i}) 17: end if 18: end for 19: x_{n+1} ← x_temp 20: Store x_{n+1}. 21: Send x_{n+1} to worker i_n. 22: flag ← False 24: else 25: Send x_n to worker i_n. 26: end if 27: end while 28: end for</p></div>
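One full QASY server iteration, including the refinement pass over stored gradients, might look like the following sketch. The retention policy for stored gradients (applied ones are dropped; failed ones are kept for later re-checks) is our assumption, since the paper's listing is only partially recoverable:

```python
import numpy as np

def quality_ok(g, x_ref, x_stale, theta):
    """Ratio test from the paper: ||g|| / ||x_ref - x_stale|| >= theta."""
    drift = np.linalg.norm(x_ref - x_stale)
    return drift == 0.0 or np.linalg.norm(g) / drift >= theta

def qasy_step(x_n, eta, theta, incoming, store):
    """One QASY server iteration (sketch; names are ours).

    incoming: (x_stale, g) pair whose quality check has already passed.
    store:    list of (x_stale, g) pairs retained from earlier iterations.
    Returns x_{n+1} and the updated store.
    """
    x_stale, g = incoming
    x_temp = x_n - eta * g                 # provisional update (lines 12-13)
    kept = [incoming]                      # incoming gradient is stored for reuse
    for xs, gs in store:                   # refinement pass (lines 14-18)
        if quality_ok(gs, x_temp, xs, theta):
            x_temp = x_temp - eta * gs     # stored gradient is still high quality
        else:
            kept.append((xs, gs))          # assumption: keep it for later re-checks
    return x_temp, kept                    # x_{n+1} = x_temp (line 19)

x = np.array([1.0, 1.0])
g_in = np.array([2.0, 2.0])
x1, store = qasy_step(x, eta=0.1, theta=0.0, incoming=(x.copy(), g_in),
                      store=[(np.zeros(2), np.array([1.0, 1.0]))])
print(x1)   # both the incoming and the stored gradient were applied
```

With theta set to a large value, the stored gradient would instead survive in the store untouched, which is the s-QASY behavior of never reusing old gradients.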
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. CONVERGENCE ANALYSIS</head><p>We first prove the convergence of s-QASY and subsequently prove the convergence rate of QASY.</p><p>We begin by stating the assumptions required for the convergence analysis:</p><p>&#8226; Assumption 1 (Boundedness and Smoothness): F is bounded and L-smooth, i.e., for all u, v ∈ R^d, ‖∇F(u) - ∇F(v)‖ ≤ L‖u - v‖.</p><p>&#8226; Assumption 2 (Unbiasedness and Bounded Variance): For any x ∈ R^d, the gradient estimator g(x) is unbiased, E[g(x)] = ∇F(x), with variance bounded as E‖g(x) - ∇F(x)‖² ≤ σ²/B, where B is the batch size. Our assumptions align with those commonly adopted in the literature on distributed learning frameworks <ref type="bibr">[3]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[10]</ref>, <ref type="bibr">[11]</ref>. Specifically, we do not require the bounded staleness assumption, which further emphasizes the robustness and generality of our approach. We now establish an upper bound on the one-iteration cost reduction when multiple delayed gradients are used in each iteration. Specifically, in the next lemma, at any iteration n, we consider applying K gradients stored in the set U_n. This upper bound is general and applies to both the s-QASY and QASY algorithms. For the s-QASY algorithm, K = 1, indicating that only one delayed gradient is applied per iteration. For the QASY algorithm, K can vary at each iteration, allowing the application of multiple delayed gradients.</p><p>Lemma 1. For the sequence {x_n} generated by the update rule x_{n+1} = x_n - η Σ_{g(x_{τ_i}) ∈ U_n} g(x_{τ_i}), where U_n is the set of K accepted gradients, the one-iteration cost reduction can be upper bounded as in Equation (<ref type="formula">3</ref>).</p><p>The proof of this lemma can be found in Appendix VII-A. Note that Equation (<ref type="formula">3</ref>) lists the main components that determine the effectiveness of the delayed gradient. 
While the first and last terms are presumably unknown and cannot be controlled by the server's communication decisions, the two middle terms are known to the server and can be fully utilized to improve the reduction. In the next theorems, we design a threshold that ensures the applied delayed gradient is as useful as a freshly computed SGD vector.</p><p>Theorem 1. Given the sequence {x_n} generated by the s-QASY algorithm, under the previously stated assumptions and with the threshold parameter θ = √(2L)/√(1 - Lη) and B² &lt; T, the following inequality holds for the step size.</p><p>The proof of this theorem can be found in Appendix VII-B.</p><p>Next, we consider the implications of this result for the QASY algorithm, where the number of updates per iteration is random rather than fixed.</p><p>Theorem 2. Given the sequence {x_n} generated by the QASY algorithm with a threshold θ depending on T, and T &gt; B², where M is the maximum number of updates in any iteration, the following inequality holds.</p><p>The proof of this theorem can be found in Appendix VII-C.</p><p>Remark 1. The convergence rates and step sizes of both QASY and s-QASY do not depend on the gradients' staleness. In particular, we obtain the same O(1/√T) convergence rate as synchronous SGD using an asynchronous method.</p><p>[20] achieved a similar convergence rate without the staleness parameter; however, they used all the gradients. Thus, they require many communication rounds, as we observe in the empirical results. Even though the theoretical bound in <ref type="bibr">[20]</ref> is similar to ours, we outperform them in the empirical evaluations.</p><p>We observe that as T increases, the threshold decreases. Intuitively, a higher T indicates that we can recover from accepting wrong stale gradients; hence, we can accept more stale gradients, and the algorithm can be more aggressive. However, the maximum value of θ is L. 
Hence, even when T → ∞, the algorithm does not accept all stale gradients, unlike existing approaches to asynchronous SGD.</p><p>Note that although the learning rate (and thus the threshold) depends on knowledge of B and L, our empirical results do not rely on this information. In the next section, we present the results of the experiments used to assess the performance of our algorithms, QASY and s-QASY. We consider a fully connected feedforward neural network with the following structure:</p></div>
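<p>To make the server-side decision concrete, the acceptance test described above can be sketched as follows. The worker sends only the scalar norm of its gradient; the server accepts the gradient when the ratio of this norm to the distance between the global and local model parameters is at least θ. This is our illustrative sketch, not the authors' code; the function name and list-based vector representation are assumptions.</p>
<p><![CDATA[
```python
import math

def accept_delayed_gradient(grad_norm, x_global, x_local, theta):
    """Server-side quality test: accept a delayed gradient iff
    ||g(x_tau)|| / ||x_n - x_tau|| >= theta.
    The worker transmits only grad_norm (a single scalar), not the gradient."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_global, x_local)))
    if dist == 0.0:
        # Gradient was computed at the current global model: no staleness, accept.
        return True
    return grad_norm / dist >= theta
```
]]></p>
<p>Only on acceptance would the server request the full gradient vector, which is why the scheme saves communication rounds.</p>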
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. SIMULATION RESULTS</head><p>1) Input Layer: 784 neurons, corresponding to the 28×28-pixel input images. 2) Hidden Layers:</p><p>• First Hidden Layer: 128 neurons, ReLU activation.</p><p>• Second Hidden Layer: 64 neurons, ReLU activation.</p><p>3) Output Layer: 10 neurons, softmax activation for multi-class classification. This model, implemented using TensorFlow Keras, is designed for image classification tasks, leveraging ReLU activation for non-linearity and computational efficiency, and softmax for outputting class probabilities. The MNIST dataset <ref type="bibr">[32]</ref> was used in our experiments with the categorical cross-entropy loss.</p><p>Fig. 4: Training loss comparison of AGSGD, QASY, s-QASY, and CSGD with a batch size of 100 vs. cumulative time and communication rounds. (Panels: test accuracy and training loss, each plotted against cumulative time and against communication rounds.)</p><p>Baseline Algorithms: As the severity of staleness emerges at larger scale, we evaluated performance across I = 1000 workers, with fixed computational delays evenly spaced from 1 to 10 seconds. However, it would be unfair to compare FSSGD <ref type="bibr">[15]</ref> with our schemes, because of its excessive communication rounds. 
Therefore, we introduce a more practical synchronous baseline, Conservative Stochastic Gradient Descent (CSGD), which is identical to s-QASY except that θ = ∞. Thus, CSGD is a communication-efficient version of synchronous stochastic gradient descent that accepts only the most recent gradient and discards any delayed one (treating the ratio 1/0 as ∞, so a gradient computed at the current model always passes the test). We also consider θ = 0 in s-QASY, where the server aggressively updates the model with any received gradient; we call this Aggressive Stochastic Gradient Descent (AGSGD). In other words, AGSGD is the traditional asynchronous stochastic gradient descent algorithm that accepts any stale gradient.</p><p>Learning Rate: The learning rate is set to 0.001 for CSGD and s-QASY, the best practical value based on our preliminary experiments, while QASY's learning rate is chosen to be 10^−4. For AGSGD, we used a staleness-scaled learning rate, as proposed in <ref type="bibr">[20]</ref> and <ref type="bibr">[8]</ref>, where the learning rate for a gradient delayed by k iterations is scaled by a factor of 0.001^k, as this shows the most stable behaviour among the approaches we tested. Also, we use a batch size of 100.</p><p>Threshold: The performance of the QASY and s-QASY algorithms is mainly determined by the choice of the threshold θ. A lower threshold, similar to the behavior of the AGSGD algorithm, allows the acceptance of all gradients regardless of their staleness, while a higher threshold mirrors CSGD's strategy of considering only the most recent gradients. As the threshold is essentially a function of the hyperparameters (learning rate and smoothness), it can be optimized empirically using standard tuning techniques <ref type="bibr">[33]</ref>- <ref type="bibr">[35]</ref>. For instance, bi-level optimization <ref type="bibr">[33]</ref> iterates between updating the threshold in the outer loop and solving the training task in the inner loop. 
Alternatively, Bayesian optimization <ref type="bibr">[34]</ref> constructs a surrogate model (e.g., a Gaussian process) to guide the search toward promising hyperparameter configurations based on observed performance. Methods like random or grid search <ref type="bibr">[35]</ref> can also be effective in lower-dimensional settings or under tighter computational budgets.</p><p>Figure <ref type="figure">2</ref> illustrates the empirical relationship between the threshold θ and the test accuracy of the QASY and s-QASY algorithms. Notably, both QASY and s-QASY show marked improvements in test accuracy as the threshold increases from zero, suggesting that even a modest degree of filtering of delayed gradients can significantly stabilize training. Optimal accuracy is achieved within a threshold window estimated to span from 600 to 800, a trade-off region where the algorithms benefit from a prudent mix of gradient exploitation and staleness mitigation, which underpins their peak learning performance. Moreover, the performance of QASY and s-QASY closely aligns across thresholds, indicating that both algorithms exploit stale gradients to a similar extent. However, as the threshold increases beyond the optimal range, a gradual decline in test accuracy is observed, followed by a plateau. This suggests a saturation point beyond which rejecting stale gradients erodes the accuracy gains, with performance asymptotically approaching that of CSGD.</p><p>Performance: In Figures <ref type="figure">3</ref> and <ref type="figure">4</ref>, we compare our proposed QASY and s-QASY with the two baseline algorithms in terms of test accuracy and training loss against the actual training time and the number of communication rounds. 
To ensure a fair comparison, the training duration for all four algorithms is fixed at 10 seconds. This fixed training time allows us to evaluate the efficiency of each algorithm based on the number of communication rounds consumed. We use the best thresholds for QASY and s-QASY as obtained in Figure <ref type="figure">2</ref>. As shown in the figures, both QASY and s-QASY outperform AGSGD and CSGD in terms of training loss and test accuracy. Remarkably, our proposed algorithms reach test accuracies of 87% and 79%, respectively, outperforming their synchronous counterpart, CSGD, which only reaches 70%. Also note that AGSGD has the worst performance, as the high staleness causes most of the delayed gradients to lose their potential due to the excessive down-scaling of the learning rate. In addition to its poor performance, AGSGD consumes almost twice the communication rounds that the other algorithms use in their training. QASY performs better than s-QASY, though at a slightly higher computational cost, as discussed earlier.</p></div>
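<p>The architecture listed at the start of this section is small enough to write down compactly. The paper's implementation uses TensorFlow Keras; the following framework-free NumPy sketch of the same 784-128-64-10 forward pass is only an illustration, and the He-style Gaussian initialization is our assumption (the paper does not specify one).</p>
<p><![CDATA[
```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths from the paper: 784 -> 128 -> 64 -> 10
sizes = [784, 128, 64, 10]
# He-style Gaussian initialization (our assumption)
weights = [rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
           for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Forward pass of the 784-128-64-10 MLP: ReLU hidden layers, softmax output."""
    h = relu(x @ weights[0] + biases[0])
    h = relu(h @ weights[1] + biases[1])
    return softmax(h @ weights[2] + biases[2])

# A batch of 5 fake flattened 28x28 images
probs = forward(rng.random((5, 784)))
```
]]></p>
<p>In the experiments, such a model would be trained on MNIST with the categorical cross-entropy loss, as described above.</p>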
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSION AND FUTURE WORK</head><p>In this paper, we introduced QASY, a new method that makes learning in distributed settings, where many machines work together, more effective, especially when updates arrive at different times. QASY intelligently chooses which updates to apply, focusing on their quality. We also created a simpler variant, s-QASY, to make the process easier and less computationally demanding.</p><p>Additionally, this work opens the door to further exploration of how our methods can benefit the network as a whole, by reducing communication needs and enhancing overall efficiency. Future research could extend our approach to the federated learning setting, where data and resources vary widely across locations. Applying our algorithms in such environments could offer valuable insights into their adaptability and effectiveness in handling data and resource heterogeneity, potentially leading to more robust and scalable learning solutions. Moreover, future work could extend the convergence analysis to the strongly convex case, providing a more comprehensive understanding of our algorithms' performance under diverse conditions.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. APPENDIX</head><p>A. Proof of Lemma 1</p><p>Proof. By considering the update rule in (2), and using the L-smoothness of F, we have for any n:</p><p>F(x_{n+1}) − F(x_n) ≤ ⟨∇F(x_n), −η Σ_{g(x_i)∈U_n} g(x_i)⟩ + (Lη²/2)‖Σ_{g(x_i)∈U_n} g(x_i)‖². (7)</p><p>Decomposing the inner product: ⟨∇F(x_n), Σ_{g(x_i)∈U_n} g(x_i)⟩ = Σ_{g(x_i)∈U_n} ⟨∇F(x_n), g(x_i)⟩.</p><p>Using the identity ⟨a, b⟩ = (1/2)(‖a‖² + ‖b‖² − ‖a − b‖²), we get:</p><p>Substituting back into the inequality:</p><p>Bounding the squared norm of the sum of the gradients:</p><p>Using the inequality (a + b)² ≤ 2a² + 2b²:</p><p>Using L-smoothness:</p><p>Substituting the bound into the inequality, we get (3).</p></div>
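<p>The proof of Lemma 1 rests on two elementary facts: the polarization identity ⟨a, b⟩ = (1/2)(‖a‖² + ‖b‖² − ‖a − b‖²) and the bound ‖a + b‖² ≤ 2‖a‖² + 2‖b‖². As a quick numerical sanity check (ours, not part of the paper):</p>
<p><![CDATA[
```python
import random

random.seed(1)
a = [random.gauss(0.0, 1.0) for _ in range(50)]
b = [random.gauss(0.0, 1.0) for _ in range(50)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Polarization identity used to expand each inner product <grad F(x_n), g(x_i)>:
# <a, b> = (||a||^2 + ||b||^2 - ||a - b||^2) / 2
diff = [x - y for x, y in zip(a, b)]
assert abs(dot(a, b) - 0.5 * (dot(a, a) + dot(b, b) - dot(diff, diff))) < 1e-9

# ||a + b||^2 <= 2||a||^2 + 2||b||^2, used to bound the summed gradients
s = [x + y for x, y in zip(a, b)]
assert dot(s, s) <= 2.0 * dot(a, a) + 2.0 * dot(b, b)
```
]]></p>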
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Proof of Theorem 1</head><p>Proof. We obtain the one-iteration cost reduction of s-QASY by substituting K = 1 into Equation (3):</p><p>Since the accepted gradient g(x_{τ_{i_n}}) satisfies the condition:</p><p>we can ensure that (ηL/2)‖x_n − x_{τ_{i_n}}‖² − (η/2)(1 − ηL)‖g(x_{τ_{i_n}})‖² ≤ 0. Therefore, we can upper bound the expression by removing these two terms:</p><p>+ η‖∇F(x_{τ_{i_n}}) − g(x_{τ_{i_n}})‖². Taking the expectation given i_n, x_{τ_{i_n}}, and x_n, we have:</p><p>By taking the full expectation, we get:</p><p>Summing from n = 0 to T − 1 and dividing by T, we obtain:</p><p>Rearranging the terms and substituting the value of η, we get the desired inequality:</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Proof of Theorem 2</head><p>Proof. The QASY algorithm update rule can be written as:</p><p>where U_n is the set of gradients used to update the model at iteration n, with each gradient satisfying the quality condition ‖g(x_i)‖/‖x_n − x_i‖ ≥ θ. So, using Equation (3), we can write the one-iteration cost reduction as:</p><p>Taking the expectation, given U_n and {x_0, ..., x_n}, we get:</p><p>By taking the full expectation, we get:</p><p>With steps similar to those in the proof of Theorem 1, we obtain Equation (<ref type="formula">5</ref>).</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Since gradients and parameters are vectors with entries numbering in the millions, sending a single real number, such as the gradient norm, is not counted as a communication round.</p></note>
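<p>The pivotal step in the proof of Theorem 1, namely that any gradient passing the acceptance test makes the two server-controlled terms of the one-iteration bound non-positive, can also be checked numerically. The sketch below assumes the reading θ = √(2L)/√(1 − Lη) of the theorem's threshold; the constants L and η are illustrative choices of ours.</p>
<p><![CDATA[
```python
import math
import random

random.seed(0)
L, eta = 4.0, 0.05  # illustrative: eta below 1/L so that 1 - L*eta stays positive
theta = math.sqrt(2.0 * L) / math.sqrt(1.0 - L * eta)  # threshold as read from Theorem 1

for _ in range(1000):
    dist = random.uniform(0.01, 10.0)                        # ||x_n - x_tau||
    gnorm = random.uniform(theta * dist, 20.0 * theta * dist)  # passes the acceptance test
    # The two middle terms of the one-iteration bound (see Appendix VII-B):
    middle = (eta * L / 2.0) * dist**2 - (eta / 2.0) * (1.0 - eta * L) * gnorm**2
    assert middle <= 0.0  # the acceptance test guarantees these terms only help
```
]]></p>
<p>Algebraically, ‖g‖ ≥ θ‖Δ‖ gives (η/2)(1 − ηL)‖g‖² ≥ ηL‖Δ‖² ≥ (ηL/2)‖Δ‖², so the difference is never positive.</p>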
		</body>
		</text>
</TEI>
