Title: Blink: Fast and Generic Collectives for Distributed ML
Abstract: Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever-increasing hardware heterogeneity. To address this issue, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for hybrid, and faster, data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8× faster model synchronization (AllReduce), and reduce end-to-end DNN training time for image classification tasks by up to 40%.
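For intuition about the tree-based approach, the short sketch below simulates a reduce-then-broadcast AllReduce over a single spanning tree of four GPUs in plain NumPy. The topology, function names, and in-process simulation are illustrative assumptions only; Blink itself packs multiple spanning trees over the detected interconnect topology and splits data across them.

import numpy as np

# Hypothetical 4-GPU spanning tree: parent of each node (the root has parent None).
PARENT = {0: None, 1: 0, 2: 0, 3: 1}
CHILDREN = {g: [c for c, p in PARENT.items() if p == g] for g in PARENT}

def tree_allreduce(grads):
    """grads: dict gpu_id -> np.ndarray; returns the summed gradient for every GPU."""
    partial = {g: grads[g].copy() for g in grads}
    root = next(g for g, p in PARENT.items() if p is None)

    def reduce_up(node):
        # Reduce phase: children push partial sums toward the root.
        for child in CHILDREN[node]:
            reduce_up(child)
            partial[node] += partial[child]

    def broadcast_down(node):
        # Broadcast phase: the root's result flows back down the same tree.
        for child in CHILDREN[node]:
            partial[child] = partial[node].copy()
            broadcast_down(child)

    reduce_up(root)
    broadcast_down(root)
    return partial

if __name__ == "__main__":
    grads = {g: np.full(4, float(g)) for g in range(4)}
    out = tree_allreduce(grads)
    assert all(np.allclose(v, 0.0 + 1.0 + 2.0 + 3.0) for v in out.values())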
Award ID(s): 1838733
NSF-PAR ID: 10175841
Author(s) / Creator(s):
Date Published:
Journal Name: Conference on Machine Learning and Systems (MLSys)
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. In this paper, we consider hybrid parallelism, a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP), to scale distributed training of large recommendation models. We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication-efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization. The threshold is updated only once every few thousand iterations to reduce the computational overhead of compression. For communication-efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during the forward and backward propagation, respectively. This is done by identifying and updating only the most relevant neurons of the neural network for each training sample in the data. We evaluate DCT on publicly available natural language processing and recommender models and datasets, as well as on recommendation systems used in production at Facebook. DCT reduces communication by at least 100× and 20× during DP and MP, respectively. The algorithm has been deployed in production, where it improves end-to-end training time for a state-of-the-art industrial recommender model by 37% without any loss in performance.
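The thresholding step described above can be pictured in a few lines. The sketch below keeps only the largest-magnitude gradient entries; the function names and the keep_ratio parameter are assumptions for illustration, not DCT's production implementation.

import numpy as np

def hard_threshold(grad, keep_ratio=0.01):
    """Keep roughly the top keep_ratio fraction of gradient entries by magnitude."""
    k = max(1, int(keep_ratio * grad.size))
    # DCT refreshes the threshold only every few thousand iterations; here it is
    # simply taken from the current gradient for brevity.
    thresh = np.partition(np.abs(grad).ravel(), -k)[-k]
    mask = np.abs(grad) >= thresh
    return np.nonzero(mask), grad[mask], grad.shape  # indices, values, shape

def decompress(indices, values, shape):
    """Rebuild a dense gradient with the dropped entries set to zero."""
    out = np.zeros(shape, dtype=values.dtype)
    out[indices] = values
    return out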
  2. Fully decentralized model training for on-road vehicles can leverage crowdsourced data without depending on central servers, infrastructure, or Internet coverage. However, under unreliable wireless communication and short contact durations, model sharing among peer vehicles may suffer severe losses and thus fail frequently. To address these challenges, we propose "RoADTrain", a route-assisted decentralized peer model training approach that carefully chooses vehicles with high chances of successful model sharing. It bounds the per-round communication time yet retains model performance under vehicle mobility and unreliable communication. Based on shared route information, a connected cluster of vehicles can estimate and embed link reliability and contact duration information into the communication topology. We decompose the topology into subgraphs supporting parallel communication, and identify a subset of them with the highest algebraic connectivity, which maximizes the speed of information flow in the cluster under high model-sharing success, thus accelerating model training in the cluster. We conduct extensive evaluation on driving decision-making models using the popular CARLA simulator. RoADTrain achieves comparable driving success rates and 1.2-4.5× faster convergence than representative decentralized learning methods that always succeed in model sharing (e.g., SGP), and significantly outperforms other benchmarks that consider losses by 17-27% in the hardest driving conditions. These results demonstrate that route sharing enables a shrewd selection of vehicles for model sharing, yielding better model performance and faster convergence despite wireless losses and mobility.
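As a rough illustration of the selection criterion, the sketch below ranks candidate communication subgraphs by algebraic connectivity (the second-smallest eigenvalue of the graph Laplacian) with networkx; the candidate graphs are made up, and this is not RoADTrain's code.

import networkx as nx

def best_subgraph(candidates):
    """Pick the candidate subgraph with the highest algebraic connectivity,
    i.e. the one expected to spread model updates fastest."""
    return max(candidates, key=nx.algebraic_connectivity)

if __name__ == "__main__":
    ring = nx.cycle_graph(6)      # sparsely connected candidate
    dense = nx.complete_graph(6)  # well-connected candidate
    assert best_subgraph([ring, dense]) is dense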
  3. Data-parallel frameworks have become essential for training machine learning models. The classic Bulk Synchronous Parallel (BSP) model updates the model parameters at pre-defined synchronization barriers. However, when a worker computes significantly more slowly than the others, waiting for the slow worker leads to excessive waste of computing resources. In this paper, we propose a novel proactive data-parallel (PDP) framework. PDP enables the parameter server to initiate the update of the model parameters; that is, updates can be performed at any time without pre-defined update points. PDP not only initiates updates but also determines when to update, and this global decision on update frequency accelerates training. We further propose asynchronous PDP to reduce the idle time caused by synchronizing parameter updates, and we theoretically prove its convergence. We implement a distributed PDP framework and evaluate it with several popular machine learning algorithms including Multilayer Perceptron, Convolutional Neural Network, K-means, and Gaussian Mixture Model. Our evaluation shows that PDP can achieve up to a 20× speedup over the BSP model and scale to large clusters.
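The contrast with barrier-based BSP can be made concrete with a small, hypothetical sketch in which the server buffers whatever gradients have arrived and folds them into the model whenever it decides to; the class and method names are assumptions, not the paper's API.

import numpy as np

class ProactiveParameterServer:
    def __init__(self, params, lr=0.1):
        self.params = np.asarray(params, dtype=float)
        self.lr = lr
        self.buffer = []  # gradients received since the last update

    def push(self, grad):
        """Workers push gradients whenever they finish a mini-batch."""
        self.buffer.append(np.asarray(grad, dtype=float))

    def maybe_update(self):
        """Server-initiated decision point: apply whatever has arrived so far,
        without waiting for a synchronization barrier."""
        if self.buffer:
            self.params -= self.lr * np.mean(self.buffer, axis=0)
            self.buffer.clear()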
  4. Federated learning (FL) is a widely pursued machine learning technique that can train a model centrally while keeping data distributed. Distributed computation makes FL attractive for bandwidth-limited applications, especially in wireless communications. A large number of distributed edge devices may be connected to a central parameter server (PS), iteratively downloading data from and uploading data to the PS. Due to limited bandwidth, only a subset of the connected devices can be scheduled in each round. State-of-the-art machine learning models such as deep learning models usually have millions of parameters, resulting in high computational complexity as well as a heavy communication burden for collecting and distributing training data. To improve communication efficiency and make the training model converge faster, we propose a new scheduling policy and power allocation scheme in a non-orthogonal multiple access (NOMA) setting to maximize the weighted sum data rate under practical constraints during the entire learning process. NOMA allows multiple users to transmit on the same channel simultaneously. The user scheduling problem is transformed into a maximum-weight independent set problem that can be solved using graph theory. Simulation results show that the proposed scheduling and power allocation scheme achieves higher FL testing accuracy in NOMA-based wireless networks than existing schemes within the same learning time.
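The scheduling formulation maps directly onto a classic graph problem. The hedged sketch below (illustrative device names and rates, not the paper's algorithm) selects a maximum-weight set of non-conflicting devices with networkx by finding a maximum-weight clique on the complement of a conflict graph.

import networkx as nx

def schedule(rates, conflicts):
    """rates: dict device -> weight (e.g. achievable data rate);
    conflicts: pairs of devices that cannot be scheduled in the same round."""
    g = nx.Graph()
    g.add_nodes_from(rates)
    g.add_edges_from(conflicts)
    comp = nx.complement(g)  # independent sets of g are cliques of its complement
    # max_weight_clique expects integer node weights.
    nx.set_node_attributes(comp, {d: int(r) for d, r in rates.items()}, "w")
    chosen, _ = nx.max_weight_clique(comp, weight="w")
    return chosen

if __name__ == "__main__":
    print(schedule({"d1": 5, "d2": 4, "d3": 3}, [("d1", "d2")]))  # e.g. ['d1', 'd3']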
  5. Decision trees and tree ensembles are popular supervised learning models for tabular data. Two recent research trends on tree models stand out: (1) bigger and deeper models with many trees, and (2) scalable distributed training frameworks. However, existing implementations on distributed systems are IO-bound, leaving CPU cores underutilized. They also find the best node-splitting conditions only approximately because of their row-based data partitioning scheme. In this paper, we target exact training of tree models by effectively utilizing the available CPU cores. The resulting system, called TreeServer, adopts a column-based data partitioning scheme to minimize communication and a node-centric, task-based engine to fully exploit CPU parallelism. Experiments show that TreeServer is up to 10× faster than models in Spark MLlib. We also showcase TreeServer's high training throughput by using it to build big "deep forest" models.
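To make the "exact split" point concrete, here is a small generic sketch (plain NumPy, not TreeServer's engine) that scans one feature column for the squared-error-minimizing threshold; with column-partitioned data, each worker can run such a scan exactly on the columns it owns.

import numpy as np

def best_split(feature, target):
    """Return (threshold, sse) minimizing total squared error of a two-way split
    of target by feature; the threshold is None if the column has no valid split."""
    order = np.argsort(feature)
    x, y = feature[order], target[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # cannot split between identical feature values
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[1]:
            best = ((x[i - 1] + x[i]) / 2.0, sse)
    return best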