Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension that is orthogonal to existing model-parallel approaches: thanks to the autoregressive property of Transformer-based language models, it is possible to perform pipeline parallelism within a single training sequence. This enables a finer-grained pipeline than previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline-parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic-programming-based algorithm to compute the optimal pipelining execution scheme for a given model and cluster configuration. We show that TeraPipe can speed up training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster of 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe
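As a rough illustration of the dynamic-programming idea above, the following is a simplified sketch, not the released TeraPipe code: it chooses how to cut one training sequence into contiguous token slices so that the estimated pipeline latency, sum_i t_i + (K - 1) * max_i t_i over K stages, is minimized. `slice_time` is a hypothetical per-slice cost model that a real system would obtain by profiling.

```python
# Simplified sketch of slicing one sequence into token slices for pipelining.
# Not the authors' implementation; slice_time(start, length) is a hypothetical
# cost model, and the latency formula is the usual synchronous-pipeline estimate.

from functools import lru_cache

def plan_token_slices(seq_len, num_stages, slice_time):
    """Return (estimated_latency, slice_lengths) for the best slicing found."""
    best_latency, best_cuts = float("inf"), None
    # Enumerate a cap on the largest allowed per-slice time; for each cap,
    # a DP minimizes the total slice time while respecting the cap.
    caps = sorted({slice_time(0, l) for l in range(1, seq_len + 1)})
    for cap in caps:
        @lru_cache(maxsize=None)
        def cover(start):
            # Cheapest way to cover tokens [start, seq_len) with slices under `cap`.
            if start == seq_len:
                return 0.0, ()
            best = (float("inf"), None)
            for length in range(1, seq_len - start + 1):
                t = slice_time(start, length)
                if t > cap:
                    break  # assumes cost grows with slice length for a fixed start
                rest, rest_cuts = cover(start + length)
                if rest_cuts is not None and t + rest < best[0]:
                    best = (t + rest, (length,) + rest_cuts)
            return best
        total, cuts = cover(0)
        if cuts is None:
            continue
        # Use the true maximum slice time of this slicing, not the cap.
        times, pos = [], 0
        for length in cuts:
            times.append(slice_time(pos, length))
            pos += length
        latency = total + (num_stages - 1) * max(times)
        if latency < best_latency:
            best_latency, best_cuts = latency, list(cuts)
    return best_latency, best_cuts

# Toy cost model: later tokens attend to more context and cost more (illustrative only).
cost = lambda start, length: 1e-4 * length * (start + length)
print(plan_token_slices(seq_len=64, num_stages=4, slice_time=cost))
```

A production planner would also account for activation memory and batch-level pipelining, which this sketch omits.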
PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
The size of Transformer models is growing at an unprecedented rate. It has taken less than one year since the release of GPT-3 (175B) to reach trillion-parameter scale. Training such models requires both substantial engineering effort and enormous computing resources, luxuries that most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. In PipeTransformer, we design an adaptive on-the-fly freeze algorithm that gradually identifies and freezes some layers during training, and an elastic pipelining system that dynamically allocates resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers onto fewer GPUs, and forks more replicas to increase the data-parallel width. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on the SQuAD and GLUE datasets. Our results show that, compared to the state-of-the-art baseline, PipeTransformer attains up to a 2.83-fold speedup without losing accuracy. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. Finally, we have modularized our training system with flexible APIs and made the source code publicly available at https://DistML.ai.
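To make the elastic repacking step concrete, here is a minimal sketch, not the PipeTransformer implementation, of what "exclude frozen layers, pack active layers onto fewer GPUs, and fork more replicas" could look like. `layer_costs`, `max_cost_per_gpu`, and the helper names are illustrative assumptions rather than the paper's API.

```python
# Hedged sketch of elastic repacking after some leading layers are frozen:
# drop frozen layers from the pipeline, find the smallest pipeline depth whose
# busiest stage still fits a per-GPU budget, and use the freed GPUs as extra
# data-parallel replicas.

def repack_pipeline(layer_costs, num_frozen, total_gpus, max_cost_per_gpu):
    """Return (stages, data_parallel_width) after freezing the first layers."""
    active = layer_costs[num_frozen:]          # frozen layers leave the pipeline
    for depth in range(1, total_gpus + 1):
        stages = partition_evenly(active, depth)
        if max(sum(s) for s in stages) <= max_cost_per_gpu:
            replicas = total_gpus // depth     # freed GPUs widen data parallelism
            return stages, replicas
    raise ValueError("active layers do not fit even with all GPUs in the pipeline")

def partition_evenly(costs, k):
    """Greedy contiguous split of `costs` into at most k stages with similar load."""
    target = sum(costs) / k
    stages, current = [], []
    for c in costs:
        current.append(c)
        if sum(current) >= target and len(stages) < k - 1:
            stages.append(current)
            current = []
    stages.append(current)
    return stages
```

For example, with 16 GPUs and roughly half the layers frozen, the pipeline depth might drop from 8 stages to 4, doubling the data-parallel width without exceeding the per-GPU budget.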
- PAR ID: 10272378
- Journal Name: International Conference on Machine Learning
- Sponsoring Org: National Science Foundation
More Like this
- Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers, using only academic GPU resources. The resulting hybrid model, which retains a quarter of the attention layers, achieves performance comparable to the original Transformer on chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch on trillions of tokens on both chat and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall, we show how, with limited computational resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best 8B-scale instruction-tuned linear RNN model. We also find that the distilled model extrapolates naturally to longer contexts, showing almost perfect accuracy on the needle-in-a-haystack test at 20x the distillation length. (A sketch of the attention-to-linear-RNN weight reuse appears after this list.)
- With the growing penetration and power level of renewable energy resources, the need for a compact and highly efficient solid-state transformer becomes more important. The aim of this paper is to design a compact solid-state transformer for microgrid applications. The proposed transformer has four ports integrated on a single common core, so it can interconnect different renewable energy resources and energy storage systems. The transformer operates at a 50 kHz switching frequency, and each port can handle 25 kW of rated power. The ports are chosen to represent a realistic industrial microgrid model consisting of the grid, an energy storage system, a photovoltaic system, and a load. The grid port is designed to operate at 4160 V AC, while the other three ports operate at 400 V. Moreover, the grid, energy storage, and photovoltaic ports are active ports with dual-active-bridge topologies, while the load port is a passive port with a full-bridge rectifier. An extensive and complete design and modeling of the entire solid-state transformer is presented. The proposed design is first validated with simulation results, and then the proposed transformer is implemented. Some preliminary experimental tests are also performed and the obtained results are reported.
- DNN training is extremely time-consuming, necessitating efficient multi-accelerator parallelization. Current approaches to parallelizing training primarily use intra-batch parallelization, where a single iteration of training is split over the available workers, but suffer from diminishing returns at higher worker counts. We present PipeDream, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible. Unlike traditional pipelining, DNN training is bi-directional, where a forward pass through the computation graph is followed by a backward pass that uses state and intermediate data computed during the forward pass. Naïve pipelining can thus result in mismatches in state versions used in the forward and backward passes, or excessive pipeline flushes and lower hardware efficiency. To address these challenges, PipeDream versions model parameters for numerically correct gradient computations, and schedules forward and backward passes of different minibatches concurrently on different workers with minimal pipeline stalls. PipeDream also automatically partitions DNN layers among workers to balance work and minimize communication. Extensive experimentation with a range of DNN tasks, models, and hardware configurations shows that PipeDream trains models to high accuracy up to 5.3X faster than commonly used intra-batch parallelism techniques. (A sketch of the interleaved forward/backward scheduling with weight versioning appears after this list.)
- In this paper, the design of a compact high-frequency four-port transformer for a Solid-State Transformer (SST) arrangement is presented. Unlike other SSTs, the four-port system integrates three active sources and a load port with galvanic isolation via a single transformer core. In addition, one of the three source ports is designed to operate at Medium Voltage (MV) 7.2 kV for direct connection to a 4.16 kV AC grid, while the other ports' nominal voltages are rated at 400 V. The transformer is designed to operate at 50 kHz and to supply 25 kW per port. Thus, the proposed system connects the MV grid, an Energy Storage System (ESS), PV, and a DC load to each other on a single common transformer core. Based on the system power demand and the availability of renewable energy resources, the utility and energy storage ports can either supply or draw power, while the PV port can only supply power, maintaining the required demand for the load. This work focuses mainly on the High-Frequency Transformer (HFT) design. An extensive study is carried out to obtain an optimal, compact, cost-effective, and high-efficiency model. Modeling, mathematical, and simulation results are derived and presented to demonstrate the viability of this design. (A small power-balance sketch for the four ports appears after this list.)
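Sketch referenced from the linear-RNN distillation abstract above: one hedged way to reuse attention projection weights when initializing a linear-RNN replacement block. The exact mapping in the paper may differ; `rnn_block` and its `in_proj`/`b_proj`/`c_proj`/`out_proj` attributes are hypothetical placeholders rather than a real library API, and the copies assume matching weight shapes.

```python
# Hedged sketch: initialize a linear-RNN block from the attention layer it
# replaces by copying the attention projections into analogous roles.

import torch
import torch.nn as nn

def init_linear_rnn_from_attention(attn: nn.Module, rnn_block: nn.Module) -> None:
    """Copy Q/K/V/O projection weights from `attn` into a linear-RNN block (shapes assumed to match)."""
    with torch.no_grad():
        rnn_block.in_proj.weight.copy_(attn.v_proj.weight)    # values  -> recurrent input
        rnn_block.b_proj.weight.copy_(attn.k_proj.weight)     # keys    -> input-dependent B
        rnn_block.c_proj.weight.copy_(attn.q_proj.weight)     # queries -> readout C
        rnn_block.out_proj.weight.copy_(attn.o_proj.weight)   # output projection
```

In the hybrid model described in that abstract, roughly a quarter of the attention layers would be left untouched, and the converted model would then be distilled against the original Transformer.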
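Sketch referenced from the PipeDream abstract above: a schematic per-stage schedule that interleaves forward and backward passes of different minibatches and stashes the weight version used by each forward pass so its backward pass sees the same weights. `weights.snapshot()`, `forward_fn`, and `backward_fn` are hypothetical placeholders, not PipeDream's API.

```python
# Minimal sketch of interleaved forward/backward scheduling with weight stashing.

from collections import deque

class StageScheduler:
    def __init__(self, stage_id: int, num_stages: int):
        # Earlier stages run more warm-up forward passes, which staggers the pipeline.
        self.warmup = num_stages - stage_id - 1
        self.stashed_weights = deque()  # weight versions awaiting their backward pass

    def run(self, minibatches, weights, forward_fn, backward_fn):
        pending = deque()               # minibatches whose backward pass is still due
        it = iter(minibatches)
        # Warm-up: fill the pipeline with forward passes only (assumes enough minibatches).
        for _ in range(self.warmup):
            mb = next(it)
            self.stashed_weights.append(weights.snapshot())  # weight stashing
            forward_fn(mb, weights)
            pending.append(mb)
        # Steady state: one forward pass, then one backward pass, per step.
        for mb in it:
            self.stashed_weights.append(weights.snapshot())
            forward_fn(mb, weights)
            pending.append(mb)
            backward_fn(pending.popleft(), self.stashed_weights.popleft())
        # Drain: finish the remaining backward passes.
        while pending:
            backward_fn(pending.popleft(), self.stashed_weights.popleft())
```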
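Sketch referenced from the four-port solid-state transformer abstract above: a small, illustrative power-balance check, not taken from the paper, that encodes the stated port directions and the 25 kW per-port rating while neglecting converter and transformer losses.

```python
# Illustrative feasibility check for a four-port operating point.
# Positive values mean power injected into the shared transformer.

PORT_RATING_KW = 25.0

def feasible_operating_point(grid_kw, ess_kw, pv_kw, load_kw, tol=1e-6):
    """Return True if the requested port powers respect ratings, directions, and balance."""
    if not (0.0 <= pv_kw <= PORT_RATING_KW):          # PV port can only supply
        return False
    if not (0.0 <= load_kw <= PORT_RATING_KW):        # load port only draws
        return False
    if abs(grid_kw) > PORT_RATING_KW or abs(ess_kw) > PORT_RATING_KW:
        return False                                  # bidirectional ports within rating
    # Neglecting losses, the injections must balance the load demand.
    return abs(grid_kw + ess_kw + pv_kw - load_kw) <= tol

# Example: PV covers 10 kW, the grid 10 kW, and the battery 5 kW of a 25 kW load.
assert feasible_operating_point(grid_kw=10, ess_kw=5, pv_kw=10, load_kw=25)
```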