Title: Blink: Fast and Generic Collectives for Distributed ML
Abstract: Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever-increasing hardware heterogeneity. To address this issue, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for hybrid, and faster, data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8× faster model synchronization (AllReduce), and reduce end-to-end DNN training time for image classification tasks by up to 40%.
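For intuition about the tree-based approach, the short sketch below simulates a reduce-then-broadcast AllReduce over a single spanning tree of four GPUs in plain NumPy. The topology, function names, and in-process simulation are illustrative assumptions only; Blink itself packs multiple spanning trees over the detected interconnect topology and splits data across them.

import numpy as np

# Hypothetical 4-GPU spanning tree: parent of each node (the root has parent None).
PARENT = {0: None, 1: 0, 2: 0, 3: 1}
CHILDREN = {g: [c for c, p in PARENT.items() if p == g] for g in PARENT}

def tree_allreduce(grads):
    """grads: dict gpu_id -> np.ndarray; returns the summed gradient for every GPU."""
    partial = {g: grads[g].copy() for g in grads}
    root = next(g for g, p in PARENT.items() if p is None)

    def reduce_up(node):
        # Reduce phase: children push partial sums toward the root.
        for child in CHILDREN[node]:
            reduce_up(child)
            partial[node] += partial[child]

    def broadcast_down(node):
        # Broadcast phase: the root's result flows back down the same tree.
        for child in CHILDREN[node]:
            partial[child] = partial[node].copy()
            broadcast_down(child)

    reduce_up(root)
    broadcast_down(root)
    return partial

if __name__ == "__main__":
    grads = {g: np.full(4, float(g)) for g in range(4)}
    out = tree_allreduce(grads)
    assert all(np.allclose(v, 0.0 + 1.0 + 2.0 + 3.0) for v in out.values())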
Award ID(s): 1838733
NSF-PAR ID: 10175841
Author(s) / Creator(s):
Date Published:
Journal Name: Conference on Machine Learning and Systems (MLSys)
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. In this paper, we consider hybrid parallelism, a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP), to scale distributed training of large recommendation models. We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication-efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization. The threshold is updated only once every few thousand iterations to reduce the computational overhead of compression. For communication-efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during the forward and backward propagation, respectively. This is done by identifying and updating only the most relevant neurons of the neural network for each training sample in the data. We evaluate DCT on publicly available natural language processing and recommender models and datasets, as well as on recommendation systems used in production at Facebook. DCT reduces communication by at least 100× and 20× during DP and MP, respectively. The algorithm has been deployed in production, where it improves end-to-end training time for a state-of-the-art industrial recommender model by 37% without any loss in performance.
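The thresholding step described above can be pictured in a few lines. The sketch below keeps only the largest-magnitude gradient entries; the function names and the keep_ratio parameter are assumptions for illustration, not DCT's production implementation.

import numpy as np

def hard_threshold(grad, keep_ratio=0.01):
    """Keep roughly the top keep_ratio fraction of gradient entries by magnitude."""
    k = max(1, int(keep_ratio * grad.size))
    # DCT refreshes the threshold only every few thousand iterations; here it is
    # simply taken from the current gradient for brevity.
    thresh = np.partition(np.abs(grad).ravel(), -k)[-k]
    mask = np.abs(grad) >= thresh
    return np.nonzero(mask), grad[mask], grad.shape  # indices, values, shape

def decompress(indices, values, shape):
    """Rebuild a dense gradient with the dropped entries set to zero."""
    out = np.zeros(shape, dtype=values.dtype)
    out[indices] = values
    return out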
  2. Fully decentralized model training for on-road vehicles can leverage crowdsourced data without depending on central servers, infrastructure, or Internet coverage. However, under unreliable wireless communication and short contact durations, model sharing among peer vehicles may suffer severe losses and thus fail frequently. To address these challenges, we propose "RoADTrain", a route-assisted decentralized peer model training approach that carefully chooses vehicles with high chances of successful model sharing. It bounds the per-round communication time yet retains model performance under vehicle mobility and unreliable communication. Based on shared route information, a connected cluster of vehicles can estimate and embed link reliability and contact duration information into the communication topology. We decompose the topology into subgraphs supporting parallel communication, and identify a subset of them with the highest algebraic connectivity, which maximizes the speed of information flow in the cluster under high model-sharing success, thus accelerating model training in the cluster. We conduct extensive evaluation on driving decision-making models using the popular CARLA simulator. RoADTrain achieves comparable driving success rates and 1.2-4.5× faster convergence than representative decentralized learning methods that always succeed in model sharing (e.g., SGP), and significantly outperforms other benchmarks that consider losses by 17-27% in the hardest driving conditions. These results demonstrate that route sharing enables a shrewd selection of vehicles for model sharing, yielding better model performance and faster convergence despite wireless losses and mobility.
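As a rough illustration of the selection criterion, the sketch below ranks candidate communication subgraphs by algebraic connectivity (the second-smallest eigenvalue of the graph Laplacian) with networkx; the candidate graphs are made up, and this is not RoADTrain's code.

import networkx as nx

def best_subgraph(candidates):
    """Pick the candidate subgraph with the highest algebraic connectivity,
    i.e. the one expected to spread model updates fastest."""
    return max(candidates, key=nx.algebraic_connectivity)

if __name__ == "__main__":
    ring = nx.cycle_graph(6)      # sparsely connected candidate
    dense = nx.complete_graph(6)  # well-connected candidate
    assert best_subgraph([ring, dense]) is dense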
  3. Data-parallel frameworks have become essential for training machine learning models. The classic Bulk Synchronous Parallel (BSP) model updates the model parameters at pre-defined synchronization barriers. However, when a worker computes significantly more slowly than the others, waiting for the slow worker leads to excessive waste of computing resources. In this paper, we propose a novel proactive data-parallel (PDP) framework. PDP enables the parameter server to initiate the update of the model parameters; that is, updates can be performed at any time without pre-defined update points. PDP not only initiates updates but also determines when to update, and this global decision on update frequency accelerates training. We further propose asynchronous PDP to reduce the idle time caused by synchronizing parameter updates, and we theoretically prove its convergence. We implement a distributed PDP framework and evaluate it with several popular machine learning algorithms including Multilayer Perceptron, Convolutional Neural Network, K-means, and Gaussian Mixture Model. Our evaluation shows that PDP can achieve up to a 20× speedup over the BSP model and scale to large clusters.
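The contrast with barrier-based BSP can be made concrete with a small, hypothetical sketch in which the server buffers whatever gradients have arrived and folds them into the model whenever it decides to; the class and method names are assumptions, not the paper's API.

import numpy as np

class ProactiveParameterServer:
    def __init__(self, params, lr=0.1):
        self.params = np.asarray(params, dtype=float)
        self.lr = lr
        self.buffer = []  # gradients received since the last update

    def push(self, grad):
        """Workers push gradients whenever they finish a mini-batch."""
        self.buffer.append(np.asarray(grad, dtype=float))

    def maybe_update(self):
        """Server-initiated decision point: apply whatever has arrived so far,
        without waiting for a synchronization barrier."""
        if self.buffer:
            self.params -= self.lr * np.mean(self.buffer, axis=0)
            self.buffer.clear()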
  4. Federated learning (FL) is a widely pursued machine learning technique that can train a model centrally while keeping data distributed. Distributed computation makes FL attractive for bandwidth-limited applications, especially in wireless communications. A large number of distributed edge devices may be connected to a central parameter server (PS), iteratively downloading data from and uploading data to the PS. Due to limited bandwidth, only a subset of the connected devices can be scheduled in each round. State-of-the-art machine learning models such as deep learning models usually have millions of parameters, resulting in high computational complexity as well as a heavy communication burden for collecting and distributing training data. To improve communication efficiency and make the training model converge faster, we propose a new scheduling policy and power allocation scheme in a non-orthogonal multiple access (NOMA) setting to maximize the weighted sum data rate under practical constraints during the entire learning process. NOMA allows multiple users to transmit on the same channel simultaneously. The user scheduling problem is transformed into a maximum-weight independent set problem that can be solved using graph theory. Simulation results show that the proposed scheduling and power allocation scheme achieves higher FL testing accuracy in NOMA-based wireless networks than existing schemes within the same learning time.
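The scheduling formulation maps directly onto a classic graph problem. The hedged sketch below (illustrative device names and rates, not the paper's algorithm) selects a maximum-weight set of non-conflicting devices with networkx by finding a maximum-weight clique on the complement of a conflict graph.

import networkx as nx

def schedule(rates, conflicts):
    """rates: dict device -> weight (e.g. achievable data rate);
    conflicts: pairs of devices that cannot be scheduled in the same round."""
    g = nx.Graph()
    g.add_nodes_from(rates)
    g.add_edges_from(conflicts)
    comp = nx.complement(g)  # independent sets of g are cliques of its complement
    # max_weight_clique expects integer node weights.
    nx.set_node_attributes(comp, {d: int(r) for d, r in rates.items()}, "w")
    chosen, _ = nx.max_weight_clique(comp, weight="w")
    return chosen

if __name__ == "__main__":
    print(schedule({"d1": 5, "d2": 4, "d3": 3}, [("d1", "d2")]))  # e.g. ['d1', 'd3']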
  5. Decision trees and tree ensembles are popular supervised learning models for tabular data. Two recent research trends on tree models stand out: (1) bigger and deeper models with many trees, and (2) scalable distributed training frameworks. However, existing implementations on distributed systems are IO-bound, leaving CPU cores underutilized. They also find the best node-splitting conditions only approximately because of their row-based data partitioning scheme. In this paper, we target exact training of tree models by effectively utilizing the available CPU cores. The resulting system, called TreeServer, adopts a column-based data partitioning scheme to minimize communication and a node-centric, task-based engine to fully exploit CPU parallelism. Experiments show that TreeServer is up to 10× faster than models in Spark MLlib. We also showcase TreeServer's high training throughput by using it to build big "deep forest" models.
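To make the "exact split" point concrete, here is a small generic sketch (plain NumPy, not TreeServer's engine) that scans one feature column for the squared-error-minimizing threshold; with column-partitioned data, each worker can run such a scan exactly on the columns it owns.

import numpy as np

def best_split(feature, target):
    """Return (threshold, sse) minimizing total squared error of a two-way split
    of target by feature; the threshold is None if the column has no valid split."""
    order = np.argsort(feature)
    x, y = feature[order], target[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # cannot split between identical feature values
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[1]:
            best = ((x[i - 1] + x[i]) / 2.0, sse)
    return best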