In this paper, we consider hybrid parallelism—a paradigm that em- ploys both Data Parallelism (DP) and Model Parallelism (MP)—to scale distributed training of large recommendation models. We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization. The threshold is updated only once every few thousand iterations to reduce the computational overhead of compression. For communication efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during the forward and backward propagation, respectively. This is done by identifying and updating only the most relevant neurons of the neural network for each training sample in the data. We evaluate DCT on publicly available natural language processing and recommender models and datasets, as well as recommendation systems used in production at Facebook. DCT reduces communication by at least 100× and 20× during DP and MP, respectively. The algorithm has been deployed in production, and it improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
more »
« less
Blink: Fast and Generic Collectives for Distributed ML
Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever increasing hardware heterogeneity. To address this issue, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for hybrid, and faster, data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8× faster model synchronization (AllReduce), and reduce end-to-end DNN training time for image classification tasks by up to 40%.
more »
« less
- Award ID(s):
- 1838733
- PAR ID:
- 10175841
- Date Published:
- Journal Name:
- Conference on Machine Learning and Systems (MLSys)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD-matching or exceeding accuracy on standard language modeling and downstream benchmarks-while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.more » « less
-
Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD-matching or exceeding accuracy on standard language modeling and downstream benchmarks-while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.more » « less
-
Data parallel frameworks become essential for training machine learning models. The classic Bulk Synchronous Parallel (BSP) model updates the model parameters through pre-defined synchronization barriers. However, when a worker computes significantly slower than other workers, waiting for the slow worker will lead to excessive waste of computing resources. In this paper, we propose a novel proactive data-parallel (PDP) framework. PDP enables the parameter server to initiate the update of the model parameter. That is, we can perform the update at any time without pre-defined update points. PDP not only initiates the update but also determines when to update. The global decision on the frequency of updates will accelerate the training. We further propose asynchronous PDP to reduce the idle time caused by synchronizing parameter updates. We theoretically prove the convergence property of asynchronous PDP. We implement a distributed PDP framework and evaluate PDP with several popular machine learning algorithms including Multilayer Perceptron, Convolutional Neural Network, K-means, and Gaussian Mixture Model. Our evaluation shows that PDP can achieve up to 20X speedup over the BSP model and scale to large clusters.more » « less
-
Fully decentralized model training for on-road vehicles can leverage crowdsourced data while not depending on central servers, infrastructure or Internet coverage. However, under unreliable wireless communication and short contact duration, model sharing among peer vehicles may suffer severe losses thus fail frequently. To address these challenges, we propose “RoADTrain”, a route-assisted decentralized peer model training approach that carefully chooses vehicles with high chances of successful model sharing. It bounds the per round communication time yet retains model performance under vehicle mobility and unreliable communication. Based on shared route information, a connected cluster of vehicles can estimate and embed the link reliability and contact duration information into the communication topology. We decompose the topology into subgraphs supporting parallel communication, and identify a subset of them with the highest algebraic connectivity that can maximize the speed of the information flow in the cluster with high model sharing successes, thus accelerating model training in the cluster. We conduct extensive evaluation on driving decision making models using the popular CARLA simulator. RoADTrain achieves comparable driving success rates and 1.2−4.5× faster convergence than representative decentralized learning methods that always succeed in model sharing (e.g., SGP), and significantly outperforms other benchmarks that consider losses by 17−27% in the hardest driving conditions. These demonstrate that route sharing enables shrewd selection of vehicles for model sharing, thus better model performance and faster convergence against wireless losses and mobility.more » « less
An official website of the United States government

