Communication Efficient Asynchronous Stochastic Gradient Descent
In this paper, we address the challenges of asynchronous gradient descent in distributed learning environments, focusing in particular on stale gradients and the need for extensive communication resources. We develop a novel communication-efficient framework that incorporates a gradient evaluation algorithm to assess and utilize delayed gradients based on their quality, ensuring efficient and effective model updates while significantly reducing communication overhead. Our proposed algorithm requires agents to send only the norm of their gradients rather than the full computed gradient. The server then accepts a gradient only if the ratio between the gradient norm and the distance between the global and local model parameters exceeds a certain threshold. With a proper choice of this threshold, we show that the convergence rate achieves the same order as synchronous stochastic gradient descent and, unlike most existing works, does not depend on the staleness value. Given the computational complexity of this initial algorithm, we introduce a simplified variant that prioritizes practical applicability without compromising the convergence rate. Our simulations demonstrate that the proposed algorithms outperform existing state-of-the-art methods, offering improved convergence rates, stability, and accuracy, as well as reduced resource consumption.
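A minimal sketch of the screening rule described above, with hypothetical names (`server_accepts`, the toy SGD update) and an illustrative threshold; the paper's exact acceptance test, threshold choice, and aggregation step are not reproduced here:

```python
import numpy as np

def server_accepts(grad_norm, x_global, x_local, threshold):
    """Accept a (possibly stale) gradient when the ratio between its norm
    and the distance between the global and local model copies exceeds
    the threshold; only the scalar grad_norm is communicated up front."""
    distance = np.linalg.norm(x_global - x_local)
    if distance == 0.0:            # local copy is up to date: nothing is stale
        return True
    return grad_norm / distance > threshold

# Toy round: an agent computed g at its (slightly stale) local copy x_local.
rng = np.random.default_rng(0)
x_global = rng.normal(size=10)
x_local = x_global + 0.01 * rng.normal(size=10)
g = rng.normal(size=10)

if server_accepts(np.linalg.norm(g), x_global, x_local, threshold=1.0):
    x_global -= 0.1 * g            # the full gradient is sent only if accepted
```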
- PAR ID:
- 10651733
- Publisher / Repository:
- IEEE
- Date Published:
- Page Range / eLocation ID:
- 1 to 10
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
In this paper, we consider federated learning in wireless edge networks. Transmitting stochastic gradients (SG) or deep model parameters over a limited-bandwidth wireless channel can incur large training latency and excessive power consumption. Hence, data compression is often used to reduce the communication overhead. However, efficient communication requires the compression algorithm to satisfy the constraints imposed by the communication medium and take advantage of its characteristics, such as over-the-air computations inherent in wireless multiple-access channels (MAC), unreliable transmission and idle nodes in the edge network, limited transmission power, and preserving the privacy of data. To achieve these goals, we propose a novel framework based on Random Linear Coding (RLC) and develop efficient power management and channel usage techniques to manage the trade-offs between power consumption, communication bit-rate, and convergence rate of federated learning over wireless MAC. We show that the proposed encoding/decoding results in an unbiased compression of SG, hence guaranteeing the convergence of the training algorithm without requiring error feedback. Finally, through simulations, we show the superior performance of the proposed method over other existing techniques.
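As a generic illustration of why a random linear code can give an unbiased compression of a stochastic gradient (a sketch with an i.i.d. Gaussian code; the paper's actual RLC construction, power management, and over-the-air aggregation are not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 1000, 200                       # gradient dimension, codeword length

g = rng.normal(size=d)                 # stochastic gradient to be transmitted

A = rng.normal(size=(m, d))            # random code with i.i.d. N(0, 1) entries
codeword = A @ g                       # m numbers sent instead of d

# Decoding with the rescaled transpose is unbiased: E[(1/m) A.T @ A] = I,
# so E[g_hat] = g even though each realization is noisy.
g_hat = (A.T @ codeword) / m

print(np.linalg.norm(g_hat - g) / np.linalg.norm(g))   # relative decoding error
```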
-
In this paper, we consider hybrid parallelism, a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP), to scale distributed training of large recommendation models. We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication-efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization. The threshold is updated only once every few thousand iterations to reduce the computational overhead of compression. For communication-efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during the forward and backward propagation, respectively. This is done by identifying and updating only the most relevant neurons of the neural network for each training sample in the data. We evaluate DCT on publicly available natural language processing and recommender models and datasets, as well as recommendation systems used in production at Facebook. DCT reduces communication by at least 100× and 20× during DP and MP, respectively. The algorithm has been deployed in production, and it improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
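A toy version of the hard-thresholding step on a parameter gradient (the function names and the top-k rule used to pick the threshold are assumptions; the MP-side activation compression and the production pipeline are not shown):

```python
import numpy as np

def pick_threshold(vec, k):
    """Illustrative rule: choose tau so roughly the k largest-magnitude
    entries survive. In DCT the threshold is refreshed only once every
    few thousand iterations to keep compression cheap."""
    return np.partition(np.abs(vec), -k)[-k]

def hard_threshold(vec, tau):
    """Zero out every entry whose magnitude is below tau."""
    return np.where(np.abs(vec) >= tau, vec, 0.0)

rng = np.random.default_rng(0)
grad = rng.normal(size=10_000)

tau = pick_threshold(grad, k=100)
sparse_grad = hard_threshold(grad, tau)

# Only the surviving values and their indices need to be sent to the server.
indices = np.flatnonzero(sparse_grad)
print(len(indices), "of", grad.size, "entries communicated")
```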
-
Despite the recent success of Graph Neural Networks (GNNs), training GNNs on large graphs remains challenging. The limited resource capacities of existing servers, the dependency between nodes in a graph, and the privacy concerns due to centralized storage and model learning have spurred the need to design an effective distributed algorithm for GNN training. However, existing distributed GNN training methods impose either excessive communication costs or large memory overheads that hinder their scalability. To overcome these issues, we propose a communication-efficient distributed GNN training technique named LLCG. To reduce the communication and memory overhead, each local machine in LLCG first trains a GNN on its local data by ignoring the dependency between nodes on different machines, then sends the locally trained model to the server for periodic model averaging. However, ignoring node dependency could result in significant performance degradation. To address this degradation, we propose to apply a global correction on the server to refine the locally learned models. We rigorously analyze the convergence of distributed methods with periodic model averaging for training GNNs and show that naively applying periodic model averaging while ignoring the dependency between nodes suffers from an irreducible residual error. However, this residual error can be eliminated by utilizing the proposed global corrections, yielding a fast convergence rate. Extensive experiments on real-world datasets show that LLCG can significantly improve efficiency without hurting performance. One-sentence summary: we propose LLCG, a communication-efficient distributed algorithm for training GNNs.
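A schematic of periodic model averaging with a server-side refinement hook (the parameter dicts, the `global_correction` callable, and the broadcast step are illustrative placeholders; LLCG's actual server-side correction is defined in the paper):

```python
import numpy as np

def average_params(local_params):
    """Element-wise mean of each parameter tensor across machines."""
    keys = local_params[0].keys()
    return {k: np.mean([p[k] for p in local_params], axis=0) for k in keys}

def communication_round(local_params, global_correction):
    """Average the locally trained models, let the server refine the result,
    then broadcast the refined parameters back to every machine."""
    server_params = global_correction(average_params(local_params))
    return [dict(server_params) for _ in local_params]

# Toy usage: two machines, identity "correction" as a stand-in.
rng = np.random.default_rng(0)
machines = [{"W": rng.normal(size=(4, 4))} for _ in range(2)]
machines = communication_round(machines, global_correction=lambda p: p)
```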
-
This paper proposes a novel non-parametric multidimensional convex regression estimator which is designed to be robust to adversarial perturbations in the empirical measure. We minimize over convex functions the maximum (over Wasserstein perturbations of the empirical measure) of the absolute regression errors. The inner maximization is solved in closed form, resulting in a regularization penalty that involves the norm of the gradient. We show consistency of our estimator and a rate of convergence of order Õ(n^{-1/d}), matching the bounds of alternative estimators based on square-loss minimization. Contrary to all of the existing results, our convergence rates hold without imposing compactness on the underlying domain and with no a priori bounds on the underlying convex function or its gradient norm.
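Schematically, and based only on the description above (the Wasserstein order, the radius δ, and the exact penalty form are not specified here), the estimator solves a min-max problem of the form:

```latex
\hat f \in \arg\min_{f\ \mathrm{convex}}\
  \sup_{Q:\,W(Q,\hat P_n)\le \delta}\
  \mathbb{E}_{(X,Y)\sim Q}\big[\,|Y - f(X)|\,\big]
```

where, per the abstract, the inner supremum admits a closed form equal to the empirical absolute error plus a regularization penalty involving the gradient norm of f.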