<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Memory-Efficient Pipeline-Parallel DNN Training</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>05/22/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10327318</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of Machine Learning Research</title>
<idno>2640-3498</idno>
<biblScope unit="volume">139</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Deepak Narayanan</author><author>Amar Phanishayee</author><author>Kaiyu Shi</author><author>Xie Chen</author><author>Matei Zaharia</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models. However, parameters and activations for such large models often do not fit in the memory of a single accelerator device; this means that it is necessary to distribute training of large models over multiple accelerators. In this work, we propose PipeDream-2BW, a system that supports memory-efficient pipeline parallelism. PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with the double buffering of weights, to ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. In addition, PipeDream-2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as memory capacities of accelerators and interconnect topologies. PipeDream-2BW can accelerate the training of large GPT and BERT language models by up to 20x with similar final model accuracy.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In the quest to achieve higher accuracy across a range of tasks, DNN models have grown in size, often by scaling up the number of parameters in existing architectures <ref type="bibr">(Devlin et al., 2018;</ref><ref type="bibr">Radford et al., 2018;</ref><ref type="bibr">2019;</ref><ref type="bibr">Brown et al., 2020)</ref>. It is challenging to train large models with billions of parameters. Modern accelerators have limited memory, which means that the model parameters and intermediate outputs that need to be in accelerator memory during training might not fit on a single accelerator. One of the solutions researchers and practitioners have turned to is model-parallel training <ref type="bibr">(Dean et al., 2012;</ref><ref type="bibr">Chilimbi et al., 2014)</ref>, where a model is partitioned over multiple accelerator devices. How-Proceedings of the 38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s). ever, model parallelism, when traditionally deployed, can either lead to resource under-utilization <ref type="bibr">(Narayanan et al., 2019)</ref> or high communication overhead with good scaling only within a multi-GPU server <ref type="bibr">(Shoeybi et al., 2019)</ref>, and consequently an increase in training time and dollar cost.</p><p>Recent work has proposed pipelined model parallelism to accelerate model-parallel training.</p><p>For example, GPipe <ref type="bibr">(Huang et al., 2019)</ref> and PipeDream <ref type="bibr">(Harlap et al., 2018;</ref><ref type="bibr">Narayanan et al., 2019)</ref> push multiple inputs in sequence through a series of workers that each manage one model partition, allowing different workers to process different inputs in parallel. Na&#239;ve pipelining can harm model convergence due to inconsistent weight versions between the forward and backward passes of a particular input. Existing techniques trade off memory footprint and throughput in different ways to avoid this. GPipe maintains a single weight version, but has periodic pipeline flushes where the pipeline is drained of inputs to update weights (Figure <ref type="figure">1a</ref>); these flushes limit overall throughput as resources are idle. PipeDream does not periodically flush the pipeline but stores multiple weight versions, which increases throughput but also increases the memory footprint, making the training of large models infeasible due to memory constraints. Efficient training of large models requires an approach with both high throughput and low memory footprint.</p><p>Additionally, the performance of a pipeline-parallel system is dependent on how DNN model operators are partitioned over workers. This is challenging for three reasons:</p><p>&#8226; Memory Capacity Constraints: Parameters and intermediate activations associated with a model partition need to fit in the main device memory of the accelerator.</p><p>&#8226; Heterogeneous Network Interconnects: Training deployments today feature heterogeneous network topologies, with higher-bandwidth links between devices on the same server.</p><p>&#8226; Large Search Space for Operator Placement: As model sizes increase, splitting an operator graph becomes computationally expensive since the number of distinct partitionings is exponential in the model size.</p><p>In this paper, we introduce PipeDream-2BW, a system for efficient pipeline-parallel training of DNN models with billions of parameters. 
PipeDream-2BW achieves high throughput and low memory footprint using two key contributions. First, we propose double-buffered weight updates (2BW), a technique that reduces the memory footprint of training while avoiding pipeline flushes. We leverage the fact that every input's generated gradient does not need to be applied to weights immediately, and instead can be accumulated into a "coalesced" gradient to limit the number of weight versions maintained. Instead of flushing the pipeline before using newly updated weights, 2BW uses the new weights for inputs newly admitted into the pipeline, while using the previous weight version, called the shadow version, for already in-flight inputs. This double buffering of weights at each worker yields a pipelining scheme with higher throughput than GPipe (no pipeline flushes) and better memory efficiency than PipeDream (2 weight versions, versus worst case of d in PipeDream for a depth-d pipeline).</p><p>2BW introduces a constant weight delay term of 1, consistent across stages, while updating weights (weight update equation of W(t+1) = W(t) - &#957;&#183;&#8711;f(W(t-1))), which we show has empirically similar model convergence to vanilla weight updates (&#167;5.1). We also present a variant of 2BW (called PipeDream-Flush) that trades off throughput for even lower memory footprint and vanilla semantics (weight update equation of W(t+1) = W(t) - &#957;&#183;&#8711;f(W(t))).</p><p>Second, PipeDream-2BW provides a planning algorithm that yields effective parallelization schemes for many of today's large model architectures. PipeDream-2BW's planner partitions DNN operators over the available workers while taking into account the memory capacities of the accelerator devices, and addresses the three challenges highlighted earlier.</p><p>PipeDream-2BW's planner exploits the repetitive structure of large DNNs, e.g., transformer layers in BERT <ref type="bibr">(Devlin et al., 2018)</ref>, to explore the space of schedules where each stage in the pipeline is replicated equally. This choice drastically reduces the size of the search space explored compared to existing work like PipeDream and FlexFlow <ref type="bibr">(Jia et al., 2018)</ref>, while still providing effective model splits in practice. PipeDream-2BW's planner determines the size of each model partition, batch size, and whether to use memory-saving optimizations like activation recomputation <ref type="bibr">(Chen et al., 2016;</ref><ref type="bibr">Griewank &amp; Walther, 2000)</ref>. PipeDream-2BW's planner considers the impact of these decisions on both throughput and memory footprint, unlike PipeDream and FlexFlow. Finally, the planner tries to ensure expensive communication stays on high-speed intra-server interconnects.</p><p>We find that the Adam optimizer with 2BW has a similar training loss trajectory to vanilla Adam with the same batch size, with similar accuracy on downstream finetuning tasks. PipeDream-2BW achieves end-to-end speedups of 1.3&#215; to 20&#215; for various GPT models compared to an optimized model-parallel baseline. PipeDream-2BW is up to 3.2&#215; faster than GPipe, and is able to train large transformer models that vanilla PipeDream cannot fit in memory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>In this section, we provide a brief overview of related techniques for distributed training of DNN models.</p><p>Data Parallelism. Data parallelism is used to scale up model training. With data parallelism <ref type="bibr">(Xing et al., 2015)</ref>, every worker has a copy of the entire model and the input dataset is sharded across workers. Data parallelism cannot be used to train large models that do not fit on a single worker, but can be used on smaller model partitions.</p><p>Model Parallelism. Model parallelism is used traditionally to train large models that do not fit on a single worker. With model parallelism <ref type="bibr">(Dean et al., 2012;</ref><ref type="bibr">Chilimbi et al., 2014)</ref>, the weight parameters in a model are split over available workers, with intermediate activations and gradients communicated across workers. Inter-layer model parallelism underutilizes resources since at most a single worker is active at any point in time. Tensor (intra-layer) model parallelism <ref type="bibr">(Shoeybi et al., 2019)</ref> leads to expensive all-to-all communication in the critical path, limiting the number of model partitions to the number of GPUs in a single server. FlexFlow <ref type="bibr">(Jia et al., 2018)</ref> shows how to split a model graph using model and data parallelism, but still suffers from poor resource utilization when model parallelism is used.</p><p>Pipeline Parallelism. To address the shortcomings of model parallelism, recent work like PipeDream and GPipe have proposed pipeline parallelism. With pipeline parallelism, multiple inputs (instead of 1) are injected into a pipeline composed of inter-layer model partitions. This ensures that compute resources are better utilized. However, naive pipelining can lead to weight version mismatches between forward and backward passes for a particular input. Specifically, if weight updates are immediately applied to the latest weight version, then an input might see weight updates in the backward pass that it did not see in the forward pass, leading to incorrect gradient computations.</p><p>GPipe maintains a single version of the model's weights. An input batch is split into smaller microbatches. Weight gradients are accumulated and not applied immediately, and the pipeline is periodically flushed to ensure that multiple weight versions do not need to be maintained. GPipe provides weight update semantics similar to data parallelism. Figure <ref type="figure">1a</ref> shows a timeline of GPipe execution. The periodic pipeline flushes can be expensive, limiting throughput. One way to mitigate this overhead is to perform additional accumulation within the pipeline, but this is not always practical: a) at large scale factors, the minimum supported batch size is large (proportional to the scale factor), and large batch sizes affect convergence for all models (e.g., Megatron <ref type="bibr">(Shoeybi et al., 2019)</ref> uses a batch size of 1024 for BERT and 512 for GPT with 512 GPUs), b) GPipe needs to maintain activation stashes proportional to the batch size. After:</p><p>1 2 3 4 1 5 2 6 3 7 4 8 5 1 2 3 1 4 2 5 3 6 4 7 5 8 1 2 1 3 2 4 3 5 4 6 5 7 6 <ref type="table">1 1 2 2 3 3 4 4 5 5 6 6 7 7</ref> Backward Pass Forward Pass Time (b) PipeDream. PipeDream uses a weight stashing scheme to ensure that the same weight version is used in both the forward and backward passes for the same input (Figure <ref type="figure">1b</ref>). 
The total number of weight versions stashed is d in the worst case, where d is the pipeline depth, which is too high for large models. With PipeDream's default weight update semantics, weight updates at each stage have different delay terms, and no accumulation is performed within the pipeline.</p></div>
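<div xmlns="http://www.tei-c.org/ns/1.0"><p>For concreteness, the following is a minimal single-device PyTorch sketch (ours, not from the paper) of the weight update semantics that GPipe provides: gradients from m microbatches are accumulated and applied in one optimizer step, so only a single weight version is needed; in the pipelined setting, this step is where the pipeline flush occurs. Identifiers such as model and opt are illustrative.</p><p><code><![CDATA[
import torch

# One linear layer stands in for a model partition.
model = torch.nn.Linear(1024, 1024)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

m = 4                                 # microbatches per batch
batch = torch.randn(64, 1024)         # one input batch
microbatches = batch.chunk(m)

opt.zero_grad()
for mb in microbatches:
    loss = model(mb).pow(2).mean()
    loss.backward()                   # gradients accumulate across microbatches
opt.step()                            # single update per batch ("flush" point)
]]></code></p></div>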
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">PipeDream-2BW System Design</head><p>PipeDream-2BW uses memory-efficient pipeline parallelism to train large models that do not fit on a single accelerator. Its double-buffered weight update (2BW) and flush mechanisms ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. PipeDream-2BW splits models into stages over multiple workers, and replicates each stage an equal number of times (with dataparallel updates across replicas of the same stage). Such parallel pipelines work well for models where each layer is repeated a fixed number of times (e.g., transformer models).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Double-Buffered Weight Updates (2BW)</head><p>PipeDream-2BW uses a novel double-buffered weight update (2BW) scheme in conjunction with 1F1B scheduling <ref type="bibr">(Narayanan et al., 2019)</ref>, where each worker alternates between forward and backward passes for different inputs, to ensure that the same weight version is used in both the forward and the backward pass for a particular input (Figure <ref type="figure">2</ref>). 2BW has a lower memory footprint than PipeDream and GPipe, and also avoids GPipe's expensive pipeline flushes.</p><p>Gradients are computed at the granularity of smaller microbatches. For any input microbatch, PipeDream-2BW uses the same weight version for an input's forward and backward passes. Updates are accumulated over multiple microbatches before being applied at the granularity of a batch, limiting the number of weight versions generated and maintained. Figure <ref type="figure">2</ref> shows an example timeline of 2BW. PipeDream-2BW generates a new weight version once every m microbatches (m &#8805; d, the pipeline depth). For simplicity, we will initially assume that m = d (d = 4 in Figure <ref type="figure">2</ref>). A new weight version cannot be used immediately. In particular, in-flight inputs cannot use the newest weight version for their backward passes (for example, input 7 on worker 3 at t = 21), since the forward pass for these inputs was already initiated using an older weight version on a different stage. Thus, newly generated weight versions need to be buffered for future use. However, the total number of weight versions that need to be maintained is at most 2, since the weight version used to generate a new weight version can immediately be discarded (no future inputs that pass through that stage use the old weight version any longer). For example, in Figure <ref type="figure">2</ref>, each worker can discard W (0) i once they are done processing the backward pass for input 8 since all subsequent inputs use a later weight version for both their forward and backward passes.</p><p>The weight version a given input microbatch k (1-indexed) uses is max( (k -1)/m -1, 0), where m is the number of microbatches in a batch (4 in Figure <ref type="figure">2</ref>). This weight version is the same for both the forward and backward passes for input k. m can be any number &#8805; d; additional gradient accumulation (larger m) increases the global batch size.    input activations for microbatch size b for a pipeline stage.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Memory</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#119905; = 21 Time</head><p>In comparison, GPipe needs to checkpoint potentially a much larger number of input activations -proportional to the total number of microbatches accumulated within the pipeline before applying a weight update (m). With activation recomputation, GPipe's memory footprint with a per-GPU microbatch size b is</p><p>for even small b for most models <ref type="bibr">(Jain et al., 2018)</ref>, the memory savings from maintaining one fewer weight version is small. To achieve high throughput, GPipe must use a large value of m to amortize away the cost of pipeline flushes; at such high m, its memory footprint is higher than PipeDream-2BW. Additionally, due to its higher memory footprint, GPipe must always use activation recomputation. Activation recomputation, however, reduces throughput by about 33%, and should be avoided if possible.</p><p>Semantics. We can also formalize the semantics of 2BW. We denote W (t) as the weight version after t batches of size B. &#8711;f (W ) is the gradient averaged over the B samples in the batch. Vanilla minibatch SGD (f is the loss function, &#957; is the learning rate) then has the following weight update equation:</p><p>). 2BW's weight update semantics (with a delay term of 1 across all stages) are almost unchanged:</p><p>We show that this delay term does not affect model convergence significantly in &#167;5.1. Intuitively, the parameters of the model do not change significantly across single iterations, so W (t) &#8776; W (t-1) . The semantics with a replication factor greater than 1 is similar, with the batch size multiplied by the number of replicas (as with regular data parallelism). Other momentum-based optimizers such as Adam can be similarly analyzed (momentum term uses a weight gradient computed on a 1-stale weight version instead of latest version). Extra shadow variables are not needed. For example, m t in minibatch SGD with momentum can be computed as (ignoring bias corrections)</p><p>). The final weight update equation is then</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Weight Updates with Flushes (PipeDream-Flush)</head><p>We also propose a second memory-efficient pipeline schedule called PipeDream-Flush. It has lower memory footprint than 2BW and vanilla optimizer semantics, at the cost of lower throughput. This schedule reuses the 1F1B schedule from PipeDream <ref type="bibr">(Narayanan et al., 2019)</ref>, but maintains a single weight version and introduces periodic pipeline flushes to ensure consistent weight versions across weight updates. Timelines for PipeDream-Flush and GPipe with 2 pipeline stages are shown in Figure <ref type="figure">3</ref>.</p><p>Memory Footprint. With PipeDream-Flush, the total number of in-flight "active" input activations is less than or equal to the pipeline depth, giving it lower memory footprint than GPipe, which has to maintain input activations proportional to the number of microbatches over which gradients are averaged (m). PipeDream-Flush's memory footprint is also lower than PipeDream-2BW since it only needs to maintain a single weight version (versus 2 with PipeDream-2BW).</p><p>Semantics. Periodic pipeline flushes ensure that weight updates can be performed with gradients computed using the latest weight version. This results in weight updates of the form</p><p>). We compare 2BW's statistical efficiency (rate of model convergence) to the vanilla semantics of PipeDream-Flush, GPipe, and data parallelism, in &#167;5.1. Stage replicas can be placed on the same server so that expensive all-reduce updates are between GPUs on the same server with high-bandwidth interconnects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Equi-replicated</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Planner</head><p>PipeDream-2BW's planner determines how to split a model over the available compute devices by exhaustively searching over the reduced search space of all possible parallelpipeline configurations. The planner also determines whether memory-saving optimizations should be deployed, and the per-GPU microbatch size and degree of gradient accumulation, given a maximum safe global batch size verified to not compromise model convergence (e.g., determined from past hyperparameter sweeps without pipelining).</p><p>PipeDream-2BW's planner uses a cost model for the compute times and memory footprints of individual blocks in the model. Time and memory cost functions allow PipeDream-2BW to reason about the impact of pipeline width / depth and memory-saving optimizations (such as activation recomputation) on throughput and memory footprint. For example, a deeper configuration has additional memory capacity, allowing for a larger maximum per-GPU microbatch size; this can increase the arithmetic intensity (number of floating point operations performed per memory load) of kernels <ref type="bibr">(Jouppi et al., 2017)</ref>, and consequently throughput. Communication times for tensors can be estimated by dividing the size of the tensor by the respective bandwidth. Expensive communication (e.g., large tensors, or all-reduce communication needed to coalesce weight gradients across stage replicas) can be placed on high-bandwidth links within the server by orienting pipelines appropriately.</p><p>Profiling for cost modeling can be done in two ways: endto-end for each distinct configuration, or extrapolating from an individual block's measurements. End-to-end profiling is cheap (2 to 3 minutes per configuration), which means total profiling time is still a couple of hours (compared to the days to weeks needed for model training). Optimal configurations can be reused for a given server and model deployment. We describe how per-block time and memory measurements can be extrapolated in Appendix &#167;A -this is even cheaper, but provides less accurate cost estimates.</p><p>The highest-throughput-configuration is chosen that also fits within the memory capacity of the target accelerators.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Activation Recomputation</head><p>Activation recomputation is a common technique <ref type="bibr">(Huang et al., 2019;</ref><ref type="bibr">Chen et al., 2016;</ref><ref type="bibr">Griewank &amp; Walther, 2000)</ref> that trades off extra computation for a lower memory footprint. With activation recomputation, activation stashes are not left materialized on the device between forward and backward passes; instead, only input activations on each stage are stashed, and the remaining activations needed in the backward pass are recomputed when required by rerunning the forward pass. Activation recomputation trades off extra computation for a lower memory footprint.</p><p>Activation recomputation is useful for two reasons: it can enable larger per-GPU microbatch sizes to fit in memory, which can improve device throughput by increasing the arithmetic intensity of kernel. It can also enable the training of large models. Concretely, in some cases, the target accelerator device does not have sufficient memory capacity to store full activation stashes for all in-flight microbatches. This is especially true for deep pipelines, since the number of in-flight inputs is proportional to the depth of the pipeline <ref type="bibr">(Narayanan et al., 2019)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Partitioning Algorithm</head><p>Putting it all together, given a total memory capacity M , PipeDream-2BW's planner first determines the largest per-GPU microbatch size that fits on a given worker (and the corresponding throughput) with and without each memorysavings optimization deployed using a memory cost function. The partitioning algorithm also verifies that the resulting global batch size is lower than the maximum safe batch size B . Each memory-savings optimization can be integrated into PipeDream-2BW's planner by specifying a corresponding throughput and memory cost function.</p><p>PipeDream-2BW's planner then sweeps all (w, d) values to determine the best pipeline configuration for a given model and hardware deployment. Configurations with memory footprint higher than the memory capacity M of the device (modeled by the MEMORY(.) cost function) are discarded.</p><p>Gradient accumulation can be used to increase the batch size to B. The partitioning algorithm aims to pick a configuration that has a high compute-to-communication ratio, while accounting for the communication time across stages in the same pipeline and across replicated stages (modeled by the THROUGHPUT(.) cost function). The full algorithm is shown in Appendix &#167;A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Evaluation</head><p>In this section, we show that the Adam optimizer with 2BW has similar semantics to vanilla Adam, and that PipeDream-2BW and PipeDream-Flush are able to train large models faster than existing model-parallel approaches including Megatron <ref type="bibr">(Shoeybi et al., 2019)</ref>, and existing pipelining approaches like GPipe <ref type="bibr">(Huang et al., 2019)</ref>.</p><p>Hardware. We show results on two different hardware setups on AWS: eight 8&#215;V100 servers (64 GPUs) with NVLink and 16GB of per-GPU memory, and a single 8&#215;V100 server. We use p3.16xlarge instances.</p><p>Implementation. Our implementation uses PyTorch and is adapted from the Megatron repository (meg); we verified that single-worker performance with this implementation achieves about 45 TFLOPS on a 355M-parameter GPT model and is competitive with existing state-of-the-art open  source implementations from NVIDIA (nvi). All results shown are with mixed precision.</p><p>Models. We evaluate PipeDream-2BW on BERT <ref type="bibr">(Devlin et al., 2018)</ref> and GPT <ref type="bibr">(Radford et al., 2019)</ref>, large transformer-based language models used for a number of NLP applications. In particular, most of our experiments are performed with GPT models with 1.3, 2.2, and 3.9 billion parameters, with similar layer dimensions to those used in the Megatron paper <ref type="bibr">(Shoeybi et al., 2019)</ref>.</p><p>Baselines. We compare PipeDream-2BW to two types of baselines: (a) model parallelism without pipelining (tensor model parallelism used in Megatron, and inter-layer model parallelism); and (b) GPipe (we extend GPipe to use parallel pipelines, and refer to this enhanced version as GPipe in the rest of this paper), which performs pipeline parallelism. We do not compare to PipeDream or data parallelism for the entire model since they cannot fit the above models in memory when using 16-GB V100 GPUs. With 64 GPUs, we use data parallelism across stages to scale up training.</p><p>Main Takeaways. We make the following observations:</p><p>&#8226; Quality of Convergence: 2BW weight update semantics yield pre-trained models which produce comparable accuracy on downstream finetuning tasks to vanilla Adam (GPipe and PipeDream-Flush) with the same batch size.</p><p>&#8226; Comparison to Model Parallelism: PipeDream-2BW is able to train a 3.8 billion-parameter GPT model up to 20&#215; faster compared to non-pipelining approaches.</p><p>&#8226; Comparison to Other Pipelined Approaches: PipeDream-2BW is up to 3.2&#215; faster than GPipe.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Quality of Convergence of 2BW</head><p>We pre-trained 355M-parameter BERT and GPT models with vanilla Adam and Adam with 2BW; we then finetuned the resulting BERT models. We note that GPipe, PipeDream-Flush, and DP have identical semantics, and hence are equivalent baselines ("Vanilla"). To provide a fair comparison, we use the same hyperparameters, including batch size, used by Megatron <ref type="bibr">(Shoeybi et al., 2019)</ref> to train these BERT and GPT models. For BERT, we use a batch size of 1024, and for GPT, we use a batch size of 512. We use the Adam optimizer with standard hyperparameters (learning rate of 10 -4 with initial warmup and subsequent linear decay, maximum sequence length of 512), and mixed precision. We used the OpenWebText dataset (ope) for pretraining. Figure <ref type="figure">5</ref> shows the training and validation loss for the two models.</p><p>The training and validation losses for the 2BW runs track the vanilla runs almost identically after the first 100k iterations (when the model is changing more rapidly and the delay term matters more).</p><p>To further validate the quality of the pre-trained model, we finetuned the pre-trained vanilla and 2BW BERT models on downstream MNLI and RACE tasks <ref type="bibr">(Wang et al., 2019;</ref><ref type="bibr">Lai et al., 2017)</ref>. Both pre-training and fine-tuning were performed with the same hyperparameter and training setups, and we did not perform hyperparameter tuning for eitherour goal here is to show that 2BW has nearly identical semantics to the corresponding vanilla optimizer. As shown in Table <ref type="table">1</ref>, the accuracy on each of these tasks is similar after finetuning. We also evaluated the vanilla and 2BW GPT models on the Wikitext-103 test dataset and got similar test perplexities <ref type="bibr">(19.28 vs. 19.56)</ref>; test perplexities match exactly when is run for 20% fewer iterations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Throughput</head><p>Figure <ref type="figure">6</ref> shows the throughputs of various PipeDream-2BW, PipeDream-Flush, and baseline configurations using 8 and 64 V100s with a sequence length of 512 for various large GPT models. Results with BERT models are similar and included in Appendix &#167;B.1. We compare to two different forms of model parallelism, as well as GPipe. Data parallelism is not a viable baseline for these large models due to its high memory overhead. In these experiments, we use activation recomputation, and the largest per-GPU microbatch size that fits on the 16-GB V100 GPUs. We use the best configuration recommended by PipeDream-2BW's planner for all comparisons: 8-deep configurations for the model with 2.2 billion parameters, and 16-deep configurations for the model with 3.8 billion parameters. For each model, we show two different batch sizes to show the impact of batch    size on throughput for approaches that use periodic flushes.</p><p>Model Parallelism without Pipelining: We compare against two model parallelism approaches: tensor model parallelism used by Megatron <ref type="bibr">(Shoeybi et al., 2019)</ref> where each layer is divided among all model-parallel workers, and inter-layer model parallelism where layers are sharded over the workers but inputs are not pipelined. On a single node, PipeDream-2BW is faster than tensor MP by 1.3&#215;. This grows to 20&#215; on 64 GPUs for the model with 3.8 billion parameters, when the all-to-all communication used by tensor MP needs to be performed across servers, which is expensive using AWS instances (bandwidth across multi-GPU servers is much lower than the bandwidth within server).</p><p>Compared to inter-layer MP, pipelining with flushes increases throughput by up to 4.1&#215; for small batch sizes, and by up to 5.3&#215; for large batch sizes, on the 2.2-billion model. 2BW is up to 6.1&#215; faster than inter-layer MP.</p><p>GPipe: PipeDream-2BW outperforms corresponding GPipe configurations at the same global batch size by up to 3.2&#215; due to the lack of periodic pipeline flushes. GPipe natively has high memory footprint due to a large number of activation stashes: consequently, the maximum number of microbatches it can admit is small, leading to a larger pipeline bubble and 2.1&#215; worse throughput than PipeDream-Flush at low batch sizes, and 3&#215; at high batch sizes.</p><p>PipeDream-Flush and PipeDream-2BW:   faster. With more gradient accumulation (batch size of 2048), this speedup drops to 15%. However, high g is not always practical. Both PipeDream-Flush and PipeDream-2BW have weight updates with a batch size of b</p><p>where the total number of workers is w &#8226; d. For a large number of workers ( <ref type="formula">64</ref>), the batch size is high even with g = 1, m = d, making additional gradient accumulation infeasible (batch size cannot scale to &#8734; without affecting model convergence). Indeed, systems like Megatron <ref type="bibr">(Shoeybi et al., 2019)</ref>, that train large transformer models using 512 GPUs, show state-of-the-art results across tasks using a global batch size &#8804; 1024.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Memory Footprint</head><p>We measured the worst-case memory footprint of different systems on a GPT model, shown in Figure <ref type="figure">7</ref>. GPipe runs out of memory at a batch size of 64, due to a larger number of activation stashes from its all-forward-all-backward schedule, even with activation recomputation (worst case of m input activation stashes with activation recomputation, compared to d for PipeDream-Flush). PipeDream-Flush has a slightly higher memory footprint compared to interlayer model parallelism, since it needs to maintain activation stashes for more in-flight microbatches. PipeDream-2BW has a higher memory footprint than PipeDream-Flush due to an additional weight version (but still lower than GPipe's).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Planning Decisions</head><p>In this sub-section, we analyze the implications of pipeline depth and width on performance. We show experiments demonstrating the impact of activation recomputation on performance in Appendix &#167;B.2. Figure <ref type="figure">8</ref> shows the throughputs of two PipeDream-2BW configurations for different batch sizes. We highlight relevant takeaways below. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Inter-Stage Communication:</head><p>As the global batch size increases with gradient accumulation, throughput for each configuration increases due to less communication across stage replicas. This is especially true for configurations with communication across servers (w &gt; 8, d &lt; 8 for 8-GPU servers, e.g. d = 4) where inter-stage all-to-all communication is cross-node and more expensive.</p><p>Compute-Communication Ratio: Increasing the pipeline depth decreases the amount of computation in each pipeline stage while keeping the number of bytes communicated between stages constant. This makes the pipeline more communication-bound, decreasing throughput.</p><p>Maximum Per-GPU Microbatch Size: Increasing the pipeline depth increases the maximum microbatch size that fits in GPU memory. This leads to possibly higher arithmetic intensity and throughput. In Figure <ref type="figure">8</ref>, we show throughput for two microbatch sizes for the d = 8 configuration; the larger microbatch size (b = 32) has higher throughput. Smaller pipeline depths cannot fit large microbatch sizes.</p><p>Maximum Model Size: Deeper pipelines support the training of larger models. We show the empirically measured maximum model size that can be trained with 2BW using different values of d in Figure <ref type="figure">9</ref>.</p><p>These observations illustrate the complexity in picking a configuration. For example, increasing pipeline depth leads to two effects (decreased compute-communication ratio within the pipeline and increased arithmetic intensity) that have opposing effects on throughput. PipeDream-2BW's planner automates this process for each combination of model, batch size, and number of GPUs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">Maximum Model Size Supported</head><p>Figure <ref type="figure">9</ref> shows the empirically measured maximum model size supported by various pipeline depths while using 2BW.</p><p>As can be seen in the figure, deeper configurations provide additional memory capacity. PipeDream-2BW is able to train models of up to almost 30 billion parameters using 64 16-GB GPUs. As a point of comparison, Megatron-LM <ref type="bibr">(Shoeybi et al., 2019)</ref> was able to train a model with 8.3 billion parameters with 8 32-GB GPUs (2&#215; more memory).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Related Work and Discussion</head><p>In this section, we expand on work related to PipeDream-2BW, and place PipeDream-2BW's speedups in context.</p><p>Model Parallelism in Real Deployments. NVIDIA used a custom intra-layer model parallelism scheme in its Megatron system <ref type="bibr">(Shoeybi et al., 2019)</ref>  Pipeline Parallelism. We discussed the existing approaches to pipeline parallelism in &#167;2, and showed quantitative comparisons in &#167;5.2. PipeDream-2BW trains large models up to 3.2&#215; faster than GPipe at low batch sizes, due to a lack of periodic pipeline flushes, and lower memory footprint that allows more input microbatches to be pushed into the pipeline. PipeDream cannot train these large models. PipeDream-2BW's lower memory footprint does come with tradeoffs, however -PipeDream-2BW accumulates weight gradients over multiple microbatches, increasing the minimum batch size that PipeDream-2BW supports. Thus, for models that only support very small batch sizes, PipeDream-2BW, PipeDream-Flush, and GPipe, which perform gradient accumulation within the pipeline, may not be viable.</p><p>PipeMare <ref type="bibr">(Yang et al., 2019)</ref> uses asynchronous pipeline parallelism to provide high throughput (no pipeline flushes) with asynchronous weight update semantics. PipeMare offers two theoretically-motivated techniques to ensure good statistical efficiency. In contrast, PipeDream-2BW and all the baselines we compare against in the paper (traditional data parallel training, PipeDream, GPipe), use synchronous execution where the weights used for computation during forward propagation are the same as those used during backward propagation. PipeDream-2BW's double buffered weight updates use a 1-stale gradient update that does not require any hyperparameter tuning to generate comparable results. PipeMare also does not describe how computation should be partitioned among the available workers.</p><p>Memory-Saving Optimizations. A rich line of work attempts to decrease the memory footprint of DNN training. Gist <ref type="bibr">(Jain et al., 2018)</ref> employs lossless and lossy layerspecific encoding schemes to compress stashed activations. Systems such as Checkmate <ref type="bibr">(Jain et al., 2020)</ref> systematically determine when activation recomputation <ref type="bibr">(Chen et al., 2016;</ref><ref type="bibr">Griewank &amp; Walther, 2000)</ref> should be performed.</p><p>DeepSpeed <ref type="bibr">(Rajbhandari et al., 2019)</ref> partitions optimizer state over data-parallel replicas instead of replicating it, using a technique called ZeRO. Such orthogonal optimizations can be combined and incorporated in PipeDream-2BW.</p><p>Planning Algorithms. PipeDream, DAPPLE <ref type="bibr">(Fan et al., 2021)</ref>, and FlexFlow <ref type="bibr">(Jia et al., 2018)</ref> use planning algorithms to partition operator graphs over multiple accelerators to maximize throughput. Unfortunately, these planners do not exploit the repetitive nature of modern transformerbased models. For example, PipeDream's planner explores O(n 3 m 2 ) configurations (assuming n layers in the model and m workers). Furthermore, these planners do not consider the effect of memory-saving optimizations, which are critical for training large models efficiently (e.g., always applying activation recomputation can make the system 1.33&#215; slower). 
PipeDream-2BW's planner, on the other hand, performs an exhaustive search of a much reduced search space since it only considers parallel pipelines (the number of possible (w, d) pairs with m workers is O(m&#178;)). Given this small number of explored configurations, PipeDream-2BW's planner takes a fraction of a second with a closed-form cost model; PipeDream's partitioning algorithm with the same cost model takes about 30 minutes for large models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>In this work, we proposed and implemented PipeDream-2BW, a system for memory-efficient pipeline-parallel training that achieves high throughput, low memory footprint, and data parallelism-like semantics through a novel weight update double buffering strategy called 2BW. PipeDream-2BW also uses a planner to determine how to partition a model's operator graph over training resources in a memoryaware way. PipeDream-2BW accelerates the training of models with billions of trainable parameters by up to 20&#215; compared to model-parallel baselines, and by up to 3.2&#215; compared to GPipe, on commodity hardware.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Planner, Additional Details</head><p>For every possible configuration of width and depth, PipeDream-2BW's planner explores the benefit of pipelining and each space-saving optimization. For example, with activation recomputation as a target memory-savings optimization, PipeDream-2BW considers three possible executions:</p><p>&#8226; Model and data parallelism without pipelining (with the largest per-GPU microbatch size that fits in memory).</p><p>&#8226; Hybrid parallelism with pipelining and without activation recomputation (all required weight versions and activation stashes in memory for in-flight microbatches).</p><p>&#8226; Hybrid parallelism with pipelining and recomputation.</p><p>PipeDream-2BW's planner estimates the throughput and memory footprint of each of these possible executions using a cost model. PipeDream-2BW's planner then tries to find the configuration with highest throughput that also fits in main device memory of the accelerators used (memory capacity provided as input). In this section, we show one such cost model for throughput and memory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1. Closed-Form Cost Functions</head><p>In our experiments, we used profile-based cost functions that run configurations end-to-end for a couple of hundred iterations. However, performance of different parallel configurations can also be estimated using closed-form expressions that use more fine-grained profile information (e.g., time and memory footprint of each transformer block). We present one such cost model here.</p><p>A.1.1. THROUGHPUT(.) COST FUNCTION</p><p>The throughput of various hybrid-parallel setups with and without pipelining can be modeled using the times of forward and backward passes obtained from a simple profiling step. Let b be the largest per-GPU microbatch size without additional weight and activation versions, and b be the largest per-GPU microbatch size that can fit on the device when multiple versions are needed (b &#8804; b). As before, w and d are the pipeline width and depth. </p><p>With pipelining, computation of different stages can be overlapped. A microbatch of size b can then be processed every t seconds, where t is given by the expression:</p><p>With activation recomputation, the number of floating point operations increases, since forward passes need to be repeated to recompute the activation stashes needed in the backward pass. We use a constant multiplier c extra to represent this. c extra = 4/3 is a reasonable value for this constant, since the backward pass typically takes twice as long as the forward pass. c extra can also be measured empirically. Arithmetic intensity might also increase, which is captured by T comp i (.) being a function of the microbatch size b. Communication time remains unchanged from before. Every b inputs can now be processed in time t, where t is given by,</p><p>The throughput in samples per second of each of these setups is then the corresponding per-GPU microbatch size (b or b ) divided by t.</p><p>Estimating T comp (.). The time to communicate weight gradients across stage replicas can be computed similarly given a bandwidth function bwdth width (w), and the number of bytes communicated during all-reduce. The number of byes communicated in an all-reduction can either be explicitly measured, or estimated using a closed-form expression <ref type="bibr">(Narayanan et al., 2019)</ref>.</p><p>bwdth depth (d) and bwdth width (w) represent the bandwidths for inter-stage and intra-stage communication. These bandwidth functions can respect hierarchical network topologies. For example, if w is less than the number of workers in a single server, communication can be performed entirely within a server, using the higher intra-server bandwidth. With Activation Recomputation. With activation recomputation, the total number of activation versions in GPU memory at any point in time is 1. This means that the PipeDream-2BW memory footprint with d stages is:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. Partitioning Algorithm</head><p>We show pseudocode for the full partitioning algorithm in Algorithm 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Evaluation, Additional Graphs</head><p>In this section, we present additional results we could not fit in the main paper due to space.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1. Throughput and Memory Footprint with BERT Models</head><p>We also ran PipeDream-2BW on two BERT models: one with 2.2 billion parameters, and another with 3.8 billion parameters. Figure <ref type="figure">10</ref> compares PipeDream-2BW's throughput against the same baselines as before, and Figure <ref type="figure">11</ref> compares PipeDream-2BW's memory footprint for these BERT models. We see that results are similar to GPT. One point of difference is that GPipe does not run out of memory at the batch size of 64 (for GPT, only a batch size of 32 fits in memory, leading to a larger pipeline bubble); however, GPipe still has higher memory footprint compared to all other baselines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. Impact of Activation Recomputation</head><p>Figure <ref type="figure">12</ref> shows the effect of activation recomputation on throughput for various GPT models. For a given per-</p></div></body>
		</text>
</TEI>
