Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-update SGD

Wang, Jianyu; Joshi, Gauri

Large-scale machine learning training, in particular distributed stochastic gradient descent (SGD), needs to be robust to inherent system variability such as node straggling and random communication delays. This work considers a distributed training framework where each worker node is allowed to perform local model updates and the resulting models are averaged periodically. We analyze the true speed of error convergence with respect to wall-clock time (instead of the number of iterations) and study how it is affected by the frequency of averaging. The main contribution is the design of ADACOMM, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor. Rigorous experiments on training deep neural networks show that ADACOMM can take 3x less time than fully synchronous SGD and still reach the same final training loss.
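
The idea of local updates with periodic averaging, and of averaging more often as training proceeds, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy objective f(x) = 0.5 * x^2, the function name `local_update_sgd`, its parameters, and in particular the halving schedule for the communication period `tau`, which is a stand-in and not the ADACOMM update rule from the paper.

```python
import random


def local_update_sgd(num_workers=4, total_steps=400, tau0=16,
                     lr=0.05, noise=0.5, seed=0):
    """Periodic-averaging SGD on the toy objective f(x) = 0.5 * x**2.

    Each worker keeps its own model copy and takes noisy gradient steps.
    Every `tau` local steps the copies are replaced by their average.
    `tau` starts at `tau0` and is halved after each quarter of the run
    (a hypothetical schedule, NOT the ADACOMM rule).

    Returns (final averaged model, list of (step, tau) per averaging round).
    """
    rng = random.Random(seed)
    models = [10.0] * num_workers      # all workers start from the same point
    tau = tau0
    steps_since_avg = 0
    rounds = []
    quarter = total_steps // 4

    for step in range(1, total_steps + 1):
        # One local SGD step per worker; the gradient of f is x, plus noise.
        models = [x - lr * (x + rng.gauss(0.0, noise)) for x in models]
        steps_since_avg += 1

        # Communication round: average the local models and broadcast.
        if steps_since_avg >= tau:
            avg = sum(models) / num_workers
            models = [avg] * num_workers
            rounds.append((step, tau))
            steps_since_avg = 0

        # Communicate more frequently later in training (illustrative only).
        if step % quarter == 0:
            tau = max(1, tau // 2)

    return sum(models) / num_workers, rounds


if __name__ == "__main__":
    final_x, rounds = local_update_sgd()
    print("final model:", final_x)
    print("averaging rounds:", len(rounds))
```

With the default arguments, averaging happens every 16 steps at first and every 2 steps by the final quarter, so communication cost is concentrated late in the run, when a low error floor matters most.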

Award ID(s): 1850029

PAR ID: 10137586

Author(s) / Creator(s): Wang, Jianyu; Joshi, Gauri

Date Published: 2019-04-01

Journal Name: Systems and Machine Learning (SysML) Conference

Sponsoring Org: National Science Foundation

Conference Paper: The DOI is not currently available.
