skip to main content


Title: Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models
We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways:(i) our objective formulation does not change the location of stationary points compared to the original optimization problem;(ii) we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (ie to the average of their parameters);(iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient since it broadcasts only parameters of the leader rather than all workers. We provide theoretical analysis of the batch version of the proposed algorithm, which we call Leader Gradient Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an asynchronous version of our algorithm and extend it to the multi-leader setting, where we form groups of workers, each represented by its own local leader (the best performer in a group), and update each worker with a corrective direction comprised of two attractive forces: one to the local, and one to the global leader (the best performer among all workers). The multi-leader setting is well-aligned with current hardware architecture, where local workers forming a group lie within a single computational node and …  more » « less
Award ID(s):
1838061
NSF-PAR ID:
10330131
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Advances in neural information processing systems
Volume:
32
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Efficient allocation of tasks to workers is a central problem in crowdsourcing. In this paper, we consider a special setting inspired by spatial crowdsourcing platforms where both workers and tasks arrive dynamically. Additionally, we assume all tasks are heterogeneous and each worker-task assignment brings a known reward. The natural challenge lies in how to incorporate the uncertainty in the arrivals from both workers and tasks into our online allocation policy such that the total expected reward is maximized. To formulate this, we assume the arrival patterns of worker “types” and task “types” can be predicted from historical data. Specifically, we consider a finite time horizon T and assume that in each time-step, a single worker and task are sampled (i.e., “arrive”) from two respective distributions independently, and that this sampling process repeats identically and independently for the entire T online time-steps. Our model, called Online Task Assignment with Two-Sided Arrival (OTA-TSA), is a significant generalization of the classical online task assignment where the set of tasks is assumed to be available offline. For the general version of OTA-TSA, we present an optimal non-adaptive algorithm which achieves an online competitive ratio of 0.295. For the special case of OTA-TSA where the reward is a function of just the worker type, we present an improved algorithm (which is adaptive) and achieves a competitive ratio of at least 0.343. On the hardness side, along with showing that the ratio obtained by our non-adaptive algorithm is the best possible among all non-adaptive algorithms, we further show that no (adaptive) algorithm can achieve a ratio better than 0.581 (unconditionally), even for the special case of OTA-TSA with homogenous tasks (i.e., all rewards are the same). At the heart of our analysis lies a new technical tool (which is a refined notion of the birth-death process), called the two-stage birth-death process, which may be of independent interest. Finally, we perform numerical experiments on two real-world datasets obtained from crowdsourcing platforms to complement our theoretical results. 
    more » « less
  2. Federated Learning (FL) has attracted increasing attention in recent years. A leading training algorithm in FL is local SGD, which updates the model parameter on each worker and averages model parameters across different workers only once in a while. Although it has fewer communication rounds than the classical parallel SGD, local SGD still has large communication overhead in each communication round for large machine learning models, such as deep neural networks. To address this issue, we propose a new communicationefficient distributed SGD method, which can significantly reduce the communication cost by the error-compensated double compression mechanism. Under the non-convex setting, our theoretical results show that our approach has better communication complexity than existing methods and enjoys the same linear speedup regarding the number of workers as the full-precision local SGD. Moreover, we propose a communication-efficient distributed SGD with momentum, which also has better communication complexity than existing methods and enjoys a linear speedup with respect to the number of workers. At last, extensive experiments are conducted to verify the performance of our proposed methods. 
    more » « less
  3. n this paper, we study the fundamental problem of leader election in the mobile telephone model: a recently introduced variation of the classical telephone model modified to better describe the local peer-to-peer-communication services implemented in many popular smartphone operating systems. In more detail, the mobile telephone model differs from the classical telephone model in three ways: (1) each devicecan participate in at most one connection per round; (2) the network topology can undergo a parameterized rate of change; and (3) devices can advertise a parameterized number of bits to their neighbors in each round before connection attempts are initiated. We begin by describing and analyzing a new leader election algorithm in this model that works under the harshest possible parameter assumptions: maximum rate of topology changes and no advertising bits. We then apply this result to resolve an open question from [Ghaffari, 2016] on the efficiency of PUSH-PULL rumor spreading under these conditions. We then turn our attention to the slightly easier case where devices can advertise a single bit in each round. We demonstrate a large gap in time complexity between these zero bit and one bit cases. In more detail, we describe and analyze a new algorithm that solves leader election with a time complexity that includes the parameter bounding topology changes. For all values of this parameter, this algorithm is faster than the previous result, with a gap that grows quickly as the parameter increases (indicating lower rates of change). We conclude by describing and analyzing a modified version of this algorithm that does not require the assumption that all devices start during the same round. This new version has a similar time complexity (the rounds required differ only by a polylogarithmic factor),but now requires slightly larger advertisement tags. 
    more » « less
  4. We study distributed stochastic gradient descent (SGD) in the master-worker architecture under Byzantine attacks. We consider the heterogeneous data model, where different workers may have different local datasets, and we do not make any probabilistic assumptions on data generation. At the core of our algorithm, we use the polynomial-time outlier-filtering procedure for robust mean estimation proposed by Steinhardt et al. (ITCS 2018) to filter-out corrupt gradients. In order to be able to apply their filtering procedure in our heterogeneous data setting where workers compute stochastic gradients, we derive a new matrix concentration result, which may be of independent interest. We provide convergence analyses for smooth strongly-convex and non-convex objectives and show that our convergence rates match that of vanilla SGD in the Byzantine-free setting. In order to bound the heterogeneity, we assume that the gradients at different workers have bounded deviation from each other, and we also provide concrete bounds on this deviation in the statistical heterogeneous data model. 
    more » « less
  5. We study distributed stochastic gradient descent (SGD) in the master-worker architecture under Byzantine at- tacks. We consider the heterogeneous data model, where different workers may have different local datasets, and we do not make any probabilistic assumptions on data generation. At the core of our algorithm, we use the polynomial-time outlier-filtering procedure for robust mean estimation proposed by Steinhardt et al. (ITCS 2018) to filter-out corrupt gradients. In order to be able to apply their filtering procedure in our heterogeneous data setting where workers compute stochastic gradients, we derive a new matrix concentration result, which may be of independent interest. We provide convergence analyses for smooth strongly- convex and non-convex objectives and show that our convergence rates match that of vanilla SGD in the Byzantine-free setting. In order to bound the heterogeneity, we assume that the gradients at different workers have bounded deviation from each other, and we also provide concrete bounds on this deviation in the statistical heterogeneous data model. 
    more » « less