The size of Transformer models is growing at an
unprecedented rate. It has taken less than one
year since the release of GPT-3 (175B) to reach
trillion-parameter scale. Training such models
requires both substantial engineering efforts and
enormous computing resources, which are luxuries
most research teams cannot afford. In this
paper, we propose PipeTransformer, which
leverages automated elastic pipelining for efficient
distributed training of Transformer models.
In PipeTransformer, we design an adaptive
on-the-fly freeze algorithm that can identify and
freeze some layers gradually during training, and
an elastic pipelining system that can dynamically
allocate resources to train the remaining active
layers. More specifically, PipeTransformer
automatically excludes frozen layers from the
pipeline, packs active layers into fewer GPUs,
and forks more replicas to increase data-parallel
width. We evaluate PipeTransformer using
Vision Transformer (ViT) on ImageNet and
BERT on the SQuAD and GLUE datasets. Our results
show that compared to the state-of-the-art baseline,
PipeTransformer attains up to 2.83-fold
speedup without losing accuracy. We also
provide various performance analyses for a more
comprehensive understanding of our algorithmic
and system-wise design. Finally, we have modularized
our training system with flexible APIs
and made the source code publicly available at
https://DistML.ai.
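
As a minimal illustration of the two mechanisms the abstract describes, the PyTorch sketch below freezes a growing prefix of layers and re-packs the remaining active layers for the pipeline. It is not PipeTransformer's actual API: the fixed freeze schedule stands in for the paper's adaptive freeze algorithm, and the helper names (freeze_leading_layers, active_layers, freeze_every) are hypothetical.

    import torch.nn as nn

    def freeze_leading_layers(layers: nn.ModuleList, num_frozen: int) -> None:
        """Freeze the first `num_frozen` layers so they no longer need
        gradients, backward-pass activations, or parameter synchronization."""
        for layer in layers[:num_frozen]:
            for p in layer.parameters():
                p.requires_grad = False
            layer.eval()  # also fix dropout / normalization statistics

    def active_layers(layers: nn.ModuleList, num_frozen: int) -> nn.Sequential:
        """Return the still-trainable suffix; in PipeTransformer these are
        the layers re-packed into a shorter pipeline on fewer GPUs."""
        return nn.Sequential(*layers[num_frozen:])

    # Hypothetical usage with a toy 12-layer Transformer encoder.
    layers = nn.ModuleList(
        nn.TransformerEncoderLayer(d_model=256, nhead=8) for _ in range(12)
    )
    freeze_every = 5  # epochs between freeze steps (illustrative only)
    for epoch in range(30):
        num_frozen = min(epoch // freeze_every, len(layers) - 1)
        freeze_leading_layers(layers, num_frozen)
        pipeline_body = active_layers(layers, num_frozen)
        # ... rebuild the pipeline over fewer GPUs and train pipeline_body

Freezing a prefix shrinks the pipeline's memory and compute footprint, which is what lets the system pack the active suffix onto fewer GPUs and fork additional data-parallel replicas on the freed devices, as the abstract describes.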