Efficient large-scale language model training on GPU clusters using Megatron-LM

Narayanan, Deepak; Shoeybi, Mohammad; Casper, Jared; LeGresley, Patrick; Patwary, Mostofa; Korthikanti, Vijay; Vainbrand, Dmitri; Kashinkunti, Prethvi; Bernauer, Julie; Catanzaro, Bryan; Phanishayee, Amar; Zaharia, Matei

doi:10.1145/3458817.3476209

Large language models have led to state-of-the-art accuracies across several tasks. However, training these models efficiently is challenging because: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to scaling issues at thousands of GPUs. In this paper, we show how tensor, pipeline, and data parallelism can be composed to scale to thousands of GPUs. We propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak).
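The throughput figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the aggregate rate and GPU count come from the abstract, while the per-GPU peak of 312 teraFLOP/s is an assumption (the dense FP16 tensor-core peak of an NVIDIA A100), since the abstract does not state which peak value it uses. The function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
AGGREGATE_FLOPS = 502e15  # 502 petaFLOP/s across the whole training job
NUM_GPUS = 3072

# Assumption (not stated in the abstract): theoretical peak of one GPU,
# taken here as 312 teraFLOP/s (NVIDIA A100, dense FP16 tensor cores).
ASSUMED_PEAK_FLOPS_PER_GPU = 312e12


def per_gpu_throughput(aggregate_flops: float, num_gpus: int) -> float:
    """Average achieved FLOP/s per GPU."""
    return aggregate_flops / num_gpus


def fraction_of_peak(achieved_flops: float, peak_flops: float) -> float:
    """Achieved throughput as a fraction of the theoretical peak."""
    return achieved_flops / peak_flops


achieved = per_gpu_throughput(AGGREGATE_FLOPS, NUM_GPUS)
utilization = fraction_of_peak(achieved, ASSUMED_PEAK_FLOPS_PER_GPU)

print(f"{achieved / 1e12:.1f} teraFLOP/s per GPU")
print(f"{utilization:.1%} of the assumed peak")
```

Under that assumed peak, the result is roughly 163 teraFLOP/s per GPU, which is consistent with the 52% figure quoted in the abstract.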

Award ID(s): 1651570

PAR ID: 10327313

Author(s) / Creator(s): Narayanan, Deepak; Shoeybi, Mohammad; Casper, Jared; LeGresley, Patrick; Patwary, Mostofa; Korthikanti, Vijay; Vainbrand, Dmitri; Kashinkunti, Prethvi; Bernauer, Julie; Catanzaro, Bryan; Phanishayee, Amar; Zaharia, Matei

Date Published: 2021-11-13

Journal Name: Supercomputing 2021

Page Range / eLocation ID: 1 to 15

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1145/3458817.3476209
