Title: Adaptive Gradient Communication via Critical Learning Regime Identification
Distributed model training suffers from communication bottlenecks due to frequent model updates transmitted across compute nodes. To alleviate these bottlenecks, practitioners use gradient compression techniques such as sparsification, quantization, and low-rank updates. These techniques usually require choosing a static compression ratio, forcing users to balance the trade-off between model accuracy and per-iteration speedup. In this work, we show that the performance degradation caused by choosing a high compression ratio is not fundamental and that an adaptive compression strategy can reduce communication while maintaining final test accuracy. Inspired by recent findings on critical learning regimes, in which small gradient errors can have an irrecoverable impact on model performance, we propose ACCORDION, a simple yet effective adaptive compression algorithm. While ACCORDION maintains a high enough compression rate on average, it avoids detrimental impact by not compressing gradients too aggressively during critical learning regimes, which it detects with a simple gradient-norm-based criterion. Our extensive experimental study over a number of machine learning tasks in distributed environments indicates that ACCORDION maintains similar model accuracy to uncompressed training, yet achieves up to 5.5× better compression and up to 4.1× end-to-end speedup over static approaches. We show that ACCORDION also works for adjusting the batch size, another popular strategy for alleviating communication bottlenecks. Our code is available at https://github.com/uw-mad-dash/Accordion
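The abstract describes the switching criterion only at a high level, so here is a minimal sketch of how a gradient-norm-based choice between two compression settings could be wired into a PyTorch training loop. The class name, the relative-change rule, and all default values are assumptions made for illustration; the authors' actual implementation is in the linked repository.

```python
import torch

class AdaptiveCompressionSchedule:
    """Toy gradient-norm-based switch between two compression settings.

    The rule, names, and default values are illustrative assumptions,
    not ACCORDION's actual implementation.
    """

    def __init__(self, critical_keep=0.25, default_keep=0.01, rel_change=0.5):
        self.critical_keep = critical_keep  # fraction of coordinates kept in critical regimes
        self.default_keep = default_keep    # aggressive compression otherwise
        self.rel_change = rel_change        # relative norm change that flags a critical regime
        self.prev_norm = None

    def keep_fraction(self, model: torch.nn.Module) -> float:
        # L2 norm of the full gradient, measured after loss.backward() on this worker.
        grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
        norm = torch.cat(grads).norm().item()

        # Treat the very first measurement, or any rapid change in gradient norm,
        # as a critical regime and back off to milder compression.
        critical = (
            self.prev_norm is None
            or abs(self.prev_norm - norm) / max(self.prev_norm, 1e-12) > self.rel_change
        )
        self.prev_norm = norm
        return self.critical_keep if critical else self.default_keep
```

The returned fraction would then drive whichever compressor the pipeline already uses, for example the k in top-k sparsification or, analogously, the rank of a low-rank update.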
Award ID(s):
2003129
PAR ID:
10311453
Author(s) / Creator(s):
Editor(s):
Smola, A.; Dimakis, A.; Stoica, I.
Date Published:
Journal Name:
Proceedings of Machine Learning and Systems
Volume:
3
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning, especially in bandwidth-limited settings and high-dimensional models. Gradient quantization is an effective way of reducing the number of bits required to communicate each model update, albeit at the cost of having a higher error floor due to the higher variance of the stochastic gradients. In this work, we propose an adaptive quantization strategy called AdaQuantFL that aims to achieve communication efficiency as well as a low error floor by changing the number of quantization levels during the course of training. Experiments on training deep neural networks show that our method can converge using far fewer communicated bits compared to fixed quantization level setups, with little or no impact on training and test accuracy. (A sketch of level-adaptive quantization appears after this list.)
  2. Large-scale machine learning training, in particular distributed stochastic gradient descent, needs to be robust to inherent system variability such as node straggling and random communication delays. This work considers a distributed training framework where each worker node is allowed to perform local model updates and the resulting models are averaged periodically. We analyze the true speed of error convergence with respect to wall-clock time (instead of the number of iterations) and study how it is affected by the frequency of averaging. The main contribution is the design of ADACOMM, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor. Rigorous experiments on training deep neural networks show that ADACOMM can take 3x less time than fully synchronous SGD and still reach the same final training loss. (A sketch of an adaptive averaging period appears after this list.)
  3. Distributed machine learning is primarily motivated by the promise of increased computation power for accelerating training and mitigating privacy concerns. Unlike machine learning on a single device, distributed machine learning requires collaboration and communication among the devices. This creates several new challenges: (1) the heavy communication overhead can be a bottleneck that slows down the training, and (2) the unreliable communication and weaker control over the remote entities make the distributed system vulnerable to systematic failures and malicious attacks. This paper presents a variant of stochastic gradient descent (SGD) with improved communication efficiency and security in distributed environments. Our contributions include (1) a new technique called error reset to adapt both infrequent synchronization and message compression for communication reduction in both synchronous and asynchronous training, (2) new score-based approaches for validating the updates, and (3) integration with both error reset and score-based validation. The proposed system provides communication reduction, both synchronous and asynchronous training, Byzantine tolerance, and local privacy preservation. We evaluate our techniques both theoretically and empirically.
  4. Federated Learning (FL) has attracted increasing attention in recent years. A leading training algorithm in FL is local SGD, which updates the model parameters on each worker and averages model parameters across different workers only once in a while. Although it has fewer communication rounds than classical parallel SGD, local SGD still has large communication overhead in each communication round for large machine learning models, such as deep neural networks. To address this issue, we propose a new communication-efficient distributed SGD method, which can significantly reduce the communication cost via an error-compensated double compression mechanism. Under the non-convex setting, our theoretical results show that our approach has better communication complexity than existing methods and enjoys the same linear speedup with respect to the number of workers as full-precision local SGD. Moreover, we propose a communication-efficient distributed SGD with momentum, which also has better communication complexity than existing methods and enjoys a linear speedup with respect to the number of workers. Finally, extensive experiments are conducted to verify the performance of our proposed methods. (A sketch of error-compensated compression appears after this list.)
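For the first related record (AdaQuantFL), the key mechanism is varying the number of quantization levels across rounds. Below is a minimal sketch of a QSGD-style stochastic uniform quantizer together with an assumed level schedule; the schedule, thresholds, and function names are illustrative assumptions, not the paper's algorithm.

```python
import torch

def stochastic_quantize(x: torch.Tensor, levels: int) -> torch.Tensor:
    """QSGD-style unbiased uniform quantization of x onto `levels` levels per sign."""
    norm = x.norm()
    if norm == 0:
        return torch.zeros_like(x)
    scaled = x.abs() / norm * levels                       # each entry now lies in [0, levels]
    lower = scaled.floor()
    quantized = lower + torch.bernoulli(scaled - lower)    # stochastic rounding keeps E[q] = scaled
    return torch.sign(x) * norm * quantized / levels

def levels_for_round(base_levels: int, initial_loss: float, current_loss: float) -> int:
    # Assumed schedule: coarse quantization early, finer as the training loss drops,
    # so the error floor shrinks over the course of training.
    return max(2, round(base_levels * initial_loss / max(current_loss, 1e-12)))
```

With more levels, each coordinate costs more bits to communicate but the quantization variance (and hence the error floor) is lower, which is the trade-off the adaptive schedule is meant to manage.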
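For the second related record (ADACOMM), the adaptive quantity is the local-SGD averaging period rather than a compression ratio. The rule below is a hedged sketch of the idea of averaging rarely at first and increasingly often as the loss falls; the exact formula is an assumption for illustration, not the paper's schedule.

```python
import math

def next_averaging_period(tau0: int, initial_loss: float, current_loss: float) -> int:
    """Shrink the local-SGD averaging period as the training loss drops.

    Assumed rule for illustration: average rarely early on (large period, little
    communication), then more and more often as training approaches its error floor.
    """
    tau = math.ceil(tau0 * math.sqrt(current_loss / max(initial_loss, 1e-12)))
    return max(int(tau), 1)
```

Each worker would run `tau` local SGD steps between model-averaging rounds and recompute `tau` after every round.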
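For the fourth related record, the error-compensation idea is the piece that is easiest to illustrate in isolation. The sketch below pairs a generic top-k sparsifier with an error-feedback residual; the paper's actual mechanism (error-compensated double compression on both worker and server sides) is more involved, so treat this only as the basic building block it relies on.

```python
import torch

def compress_with_error_feedback(grad: torch.Tensor, residual: torch.Tensor, k: int):
    """Top-k sparsification with an error-feedback residual.

    Top-k is only a stand-in compressor here; the point is that the residual
    accumulates whatever compression discarded and is added back before the
    next compression step, which is the error-compensation idea.
    """
    corrected = (grad + residual).flatten()
    idx = corrected.abs().topk(min(k, corrected.numel())).indices
    compressed = torch.zeros_like(corrected)
    compressed[idx] = corrected[idx]
    new_residual = (corrected - compressed).view_as(grad)
    return compressed.view_as(grad), new_residual
```

The caller keeps `residual` (initialized to zeros with the same shape as the gradient) across iterations and sends only `compressed` over the network.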