skip to main content


Title: Hybrid Local SGD for Federated Learning with Heterogeneous Communications
Communication is a key bottleneck in federated learning where a large number of edge devices collaboratively learn a model under the orchestration of a central server without sharing their own training data. While local SGD has been proposed to reduce the number of FL rounds and become the algorithm of choice for FL, its total communication cost is still prohibitive when each device needs to communicate with the remote server repeatedly for many times over bandwidth-limited networks. In light of both device-to-device (D2D) and device-to-server (D2S) cooperation opportunities in modern communication networks, this paper proposes a new federated optimization algorithm dubbed hybrid local SGD (HL-SGD) in FL settings where devices are grouped into a set of disjoint clusters with high D2D communication bandwidth. HL-SGD subsumes previous proposed algorithms such as local SGD and gossip SGD and enables us to strike the best balance between model accuracy and runtime. We analyze the convergence of HL-SGD in the presence of heterogeneous data for general nonconvex settings. We also perform extensive experiments and show that the use of hybrid model aggregation via D2D and D2S communications in HL-SGD can largely speed up the training time of federated learning.  more » « less
Award ID(s):
2106761
NSF-PAR ID:
10427005
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
International Conference on Learning Representations
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Federated learning (FL) is a highly pursued machine learning technique that can train a model centrally while keeping data distributed. Distributed computation makes FL attractive for bandwidth limited applications especially in wireless communications. There can be a large number of distributed edge devices connected to a central parameter server (PS) and iteratively download/upload data from/to the PS. Due to limited bandwidth, only a subset of connected devices can be scheduled in each round. There are usually millions of parameters in the state-of-art machine learning models such as deep learning, resulting in a high computation complexity as well as a high communication burden on collecting/distributing data for training. To improve communication efficiency and make the training model converge faster, we propose a new scheduling policy and power allocation scheme using non-orthogonal multiple access (NOMA) settings to maximize the weighted sum data rate under practical constraints during the entire learning process. NOMA allows multiple users to transmit on the same channel simultaneously. The user scheduling problem is transformed into a maximum-weight independent set problem that can be solved using graph theory. Simulation results show that the proposed scheduling and power allocation scheme can help achieve a higher FL testing accuracy in NOMA based wireless networks than other existing schemes within the same learning time. 
    more » « less
  2. Federated learning (FL) has attracted increasing attention as a promising technique to drive a vast number of edge devices with artificial intelligence. However, it is very challenging to guarantee the efficiency of a FL system in practice due to the heterogeneous computation resources on different devices. To improve the efficiency of FL systems in the real world, asynchronous FL (AFL) and semi-asynchronous FL (SAFL) methods are proposed such that the server does not need to wait for stragglers. However, existing AFL and SAFL systems suffer from poor accuracy and low efficiency in realistic settings where the data is non-IID distributed across devices and the on-device resources are extremely heterogeneous. In this work, we propose FedSEA - a semi-asynchronous FL framework for extremely heterogeneous devices. We theoretically disclose that the unbalanced aggregation frequency is a root cause of accuracy drop in SAFL. Based on this analysis, we design a training configuration scheduler to balance the aggregation frequency of devices such that the accuracy can be improved. To improve the efficiency of the system in realistic settings where the devices have dynamic on-device resource availability, we design a scheduler that can efficiently predict the arriving time of local updates from devices and adjust the synchronization time point according to the devices' predicted arriving time. We also consider the extremely heterogeneous settings where there exist extremely lagging devices that take hundreds of times as long as the training time of the other devices. In the real world, there might be even some extreme stragglers which are not capable of training the global model. To enable these devices to join in training without impairing the systematic efficiency, Fed-SEA enables these extreme stragglers to conduct local training on much smaller models. Our experiments show that compared with status quo approaches, FedSEA improves the inference accuracy by 44.34% and reduces the systematic time cost and local training time cost by 87.02× and 792.9×. FedSEA also reduces the energy consumption of the devices with extremely limited resources by 752.9×. 
    more » « less
  3. Artificial Intelligence (AI) is moving towards the edge. Training an AI model for edge computing on a centralized server increases latency, and the privacy of edge users is jeopardized due to private data transfer through a less secure communication channels. Additionally, existing high-power computing systems are battling with memory and data transfer bottlenecks between the processor and memory. Federated Learning (FL) is a collaborative AI learning paradigm for distributed local devices that operates without transferring local data. Local participant devices share the updated network parameters with the central server instead of sending the original data. The central server updates the global AI model and deploys the model to the local clients. As the local data resides only on the edge, these devices need to be protected from cyberattacks. The Federated Intrusion Detection System (FIDS) could be a viable system to protect edge devices as opposed to a centralized protection system. However, on-device training of the model in resource constrained devices may suffer from excessive power drain, in addition to memory and area overhead. In this work we present a memristor based system for AI training on edge devices. Memristor devices are ideal candidates for processing in memory, as their dynamic resistance properties allow them to perform multiply-add operations in parallel in the analog domain with extreme efficiency. Alternatively, existing CMOS-based PIM systems are typically developed for edge inference based on pretrained weights, and are not equipped for on-chip training. We show the effectiveness of the system, where successful learning and recognition is achieved completely within edge devices. The classification accuracy of the memristor system shows negligible loss when compared a software implementation. To the best of our knowledge, this first demonstration of a memristor based federated learning system. We demonstrate the effectiveness of this system as an intrusion detection platform for edge devices, although given the flexibility of the learning algorithm, it could be used to enhance many types of on board leaning and classification applications. 
    more » « less
  4. Federated learning (FL) involves training a model over massive distributed devices, while keeping the training data localized and private. This form of collaborative learning exposes new tradeoffs among model convergence speed, model accuracy, balance across clients, and communication cost, with new challenges including: (1) straggler problem—where clients lag due to data or (computing and network) resource heterogeneity, and (2) communication bottleneck—where a large number of clients communicate their local updates to a central server and bottleneck the server. Many existing FL methods focus on optimizing along only one single dimension of the tradeoff space. Existing solutions use asynchronous model updating or tiering-based, synchronous mechanisms to tackle the straggler problem. However, asynchronous methods can easily create a communication bottleneck, while tiering may introduce biases that favor faster tiers with shorter response latencies. To address these issues, we present FedAT, a novel Federated learning system with Asynchronous Tiers under Non-i.i.d. training data. FedAT synergistically combines synchronous, intra-tier training and asynchronous, cross-tier training. By bridging the synchronous and asynchronous training through tiering, FedAT minimizes the straggler effect with improved convergence speed and test accuracy. FedAT uses a straggler-aware, weighted aggregation heuristic to steer and balance the training across clients for further accuracy improvement. FedAT compresses uplink and downlink communications using an efficient, polyline-encoding-based compression algorithm, which minimizes the communication cost. Results show that FedAT improves the prediction performance by up to 21.09% and reduces the communication cost by up to 8.5×, compared to state-of-the-art FL methods. 
    more » « less
  5. Federated Learning (FL) has attracted increasing attention in recent years. A leading training algorithm in FL is local SGD, which updates the model parameter on each worker and averages model parameters across different workers only once in a while. Although it has fewer communication rounds than the classical parallel SGD, local SGD still has large communication overhead in each communication round for large machine learning models, such as deep neural networks. To address this issue, we propose a new communicationefficient distributed SGD method, which can significantly reduce the communication cost by the error-compensated double compression mechanism. Under the non-convex setting, our theoretical results show that our approach has better communication complexity than existing methods and enjoys the same linear speedup regarding the number of workers as the full-precision local SGD. Moreover, we propose a communication-efficient distributed SGD with momentum, which also has better communication complexity than existing methods and enjoys a linear speedup with respect to the number of workers. At last, extensive experiments are conducted to verify the performance of our proposed methods. 
    more » « less