<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches</title></titleStmt>
			<publicationStmt>
				<publisher>USENIX</publisher>
				<date>04/17/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10469312</idno>
					<idno type="doi"></idno>
					
					<author>Aashaka Shah</author><author>Vijay Chidambaram</author><author>Meghan Cowan</author><author>Saeed Maleki</author><author>Madan Musuvathi</author><author>Todd Mytkowicz</author><author>Jacob Nelson</author><author>Olli Saarikivi</author><author>Rachee Singh</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Machine learning models are increasingly being trained across multiple GPUs and servers. In this setting, data is transferred between GPUs using communication collectives such as ALLTOALL and ALLREDUCE, which can become a significant bottleneck in training large models. Thus, it is important to use efficient algorithms for collective communication. We develop TACCL, a tool that enables algorithm designers to guide a synthesizer into automatically generating algorithms for a given hardware configuration and communication collective. TACCL uses a novel communication sketch abstraction to get crucial information from the designer to significantly reduce the search space and guide the synthesizer towards better algorithms. TACCL also uses a novel encoding of the problem that allows it to scale beyond single-node topologies. We use TACCL to synthesize algorithms for three collectives and two hardware topologies: DGX-2 and NDv2. We demonstrate that the algorithms synthesized by TACCL outperform the Nvidia Collective Communication Library (NCCL) by up to 6.7x. We also show that TACCL can speed up end-to-end training of Transformer-XL and BERT models by 11%–2.3x for different batch sizes.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Machine-learning models have been dramatically increasing in size over the past few years. For example, the language model MT-NLG has 530 billion parameters <ref type="bibr">[31]</ref> and the Switch-C mixture-of-experts model has 1.6 trillion parameters <ref type="bibr">[18]</ref>. Model sizes are expected to further grow to increase model accuracy and perform more complex tasks. These models are too large for the resources of a single GPU and have to be distributed across multiple servers, each with several GPUs, using different parallelism strategies like data, model, pipeline, and expert parallelism <ref type="bibr">[18,</ref><ref type="bibr">27,</ref><ref type="bibr">43]</ref> for training and inference. Intermediate data and parameters of the model at each GPU are accumulated, shuffled, and transferred over the network between other GPUs for distributed machine learning, depending on the type of parallelism strategy used.</p><p>The inter-GPU communication bottleneck. Recent work has shown that GPU idle time spent waiting for network communication can be significant in practice <ref type="bibr">[2,</ref><ref type="bibr">19,</ref><ref type="bibr">26,</ref><ref type="bibr">28]</ref>. For instance, BERT <ref type="bibr">[15]</ref> and DeepLight <ref type="bibr">[14]</ref> spent 11% and 63% of time, respectively, with GPUs idle on a 100 Gbps Ethernet cluster of P100 GPUs <ref type="bibr">[2]</ref>. Newer generations of faster GPUs will only make this problem worse. This inefficient use of GPUs shows that there is significant model performance to be gained by optimizing inter-GPU communication.</p><p>Collective communication primitives and algorithms. Efficient communication between GPUs is the key to enabling fast distributed ML training and inference. Modern GPU systems use message passing interface (MPI)-based collective communication primitives, such as ALLREDUCE, ALL-GATHER, and ALLTOALL to perform inter-GPU communication (Figure <ref type="figure">2</ref> in &#167;2). Collective algorithms implement collective communication primitives. They route data along various paths in the network and schedule the necessary computation (e.g., a sum in ALLREDUCE) while optimizing for latency and bandwidth characteristics of each link in the network. For example, a common collective algorithm for ALLGATHER (all GPUs gather data from all GPUs) is a Ring algorithm, in which all GPUs are logically arranged in a ring and each GPU receives data from its predecessor in the ring and sends a previously received data to its successor. Inefficiencies in collective communication algorithms can cause poor network utilization, causing GPUs to remain idle until inter-GPU transfers complete <ref type="bibr">[53]</ref>, and thus reducing the overall efficiency of distributed training and inference.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Challenges in designing GPU communication algorithms.</head><p>Designing algorithms for efficient collective communication on GPU topologies is challenging. First, these algorithms have to strike the right balance between latency and bandwidth optimality. For instance, the commonly used Ring algorithm for ALLREDUCE is not efficient for small input sizes as it has a high latency. Second, GPU communication algorithms have to manage the heterogeneity of connectivity in the underlying topology. For instance, GPUs within a machine (also referred to as a node) are usually connected using fast NVLinks <ref type="bibr">[38]</ref> (up to 300 GBps aggregate bidirectional bandwidth per GPU) while GPUs across nodes are connected using slow Infini-</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background and Motivation</head><p>Collective communication in distributed ML workloads. Multi-GPU ML workloads typically communicate using MPIstyle collectives like ALLGATHER, ALLTOALL, and ALLRE-DUCE shown in Figure <ref type="figure">2</ref>. These primitives capture the application's intent behind the communication, thus allowing collective communication libraries to optimize for specific hardware configurations. In ALLGATHER, every GPU receives the data buffers of all other GPUs (left diagram in Figure <ref type="figure">2</ref>). In ALL-TOALL, every GPU receives different parts, or chunks, of the data buffers present on all GPUs. This effectively transposes the data chunk from buffer index to GPU index as can be seen in center diagram in Figure <ref type="figure">2</ref>. In ALLREDUCE, every GPU ends up with a data buffer that has the results of performing a point-wise computation (e.g., sum in right diagram in Figure <ref type="figure">2</ref>) over the same data index of all GPUs.</p><p>The parallelism strategy for the distributed ML workload determines which collective communication primitive is used. Data parallelism and some tensor model parallelisms <ref type="bibr">[43]</ref> make use of the ALLREDUCE collective to aggregate gradients and intermediate data respectively from multiple GPUs. Expert parallelism <ref type="bibr">[18,</ref><ref type="bibr">27]</ref> and common deep learning recommendation models (DLRM) <ref type="bibr">[32]</ref> make use of the ALL-TOALL collective to shuffle intermediate data between experts and embedding lookup data between GPUs respectively. DL-RMs <ref type="bibr">[32]</ref> also make use of the ALLGATHER collective and another REDUCESCATTER collective to perform embedding lookups from embedding tables sharded over multiple GPUs. Existing approaches to collective algorithms. Collective algorithms must be designed considering the target input sizes and the heterogeneity of the target topology. However, most collective communication libraries used for distributed ML today, including the state-of-the-art NCCL, use pre-defined templates of collective algorithms superimposed onto a target topology. For example, for collectives like ALLGATHER and REDUCESCATTER, NCCL identifies rings in the target topology and uses the Ring algorithm. For n GPUs, this algorithm requires n -1 link transfer steps per data chunk and is not ideal for smaller data sizes where link transfer latencies dominate. Further, this algorithm treats the slow inter-node and fast intra-node links similarly, scheduling equal number of data transfers across both. The communication is thus bottlenecked on the slower inter-node links, when it could have benefitted by sending more node-local data (i.e. data of GPUs local to the node) over the faster intra-node links instead.</p><p>For the ALLTOALL collective, NCCL implements the collective algorithm as peer-to-peer data transfers between all pairs of GPUs. This algorithm is topology-agnostic and often inefficient. For the ALLREDUCE collective, NCCL chooses between two algorithms -Double-Binary-Tree <ref type="bibr">[34]</ref> and Ring. This decision is made according to the communication input size and number of nodes, but might not be most accurate, as it is based on hardcoded latency and bandwidth profiling done previously by Nvidia on their machines.</p><p>Designing efficient collective algorithms requires careful analysis of the topology and its performance with different buffer sizes. Recent work <ref type="bibr">[9,</ref><ref type="bibr">51]</ref> has shown that synthesis is a promising approach for generating collective algorithms for different topologies and to achieve bandwidth and latency optimality. However, scaling these approaches to multi-node (i.e. multi-machine) distributed GPU topologies has been a challenge. We measured the synthesis time for ALLGATHER and ALLTOALL collectives on topologies of two Azure NDv2 nodes and two Nvidia DGX2 nodes (Figure <ref type="figure">5</ref>) using SCCL <ref type="bibr">[9,</ref><ref type="bibr">30]</ref>. We modified the codebase to include both topologies and attempted to synthesize the collectives with a 24-hour time limit set for each synthesis query. Given a 24-hour time limit, SCCL's pareto-optimal solver strategy did not finish synthesis for any combination of collective and topology. The only algorithm that SCCL could synthesize within the time limit was a latency optimal algorithm for ALLGATHER on two NDv2 nodes.</p><p>Low-effort inputs from algorithm designers. The search space of possible algorithms to implement a collective is intractably large and cannot be explored via brute-force. Deciding whether or not to route data chunks from n GPUs over l links in a topology has O(2 n&#215;l ) combinations. As we scale to multi-node topologies, n as well as l will also scale, increasing the exponent quadratically. The search space explodes further if we consider the problem of ordering data sends at each link along with deciding routing for the data. We argue that high-level inputs from a human algorithm designer help reduce the search space to make algorithm synthesis more tractable. In the most extreme case, the designer would hand-write the entire algorithm. However, handcrafting data routing and scheduling over links to implement a collective is complex and requires many design choices. Instead, designers only provide input in the form of a communication sketch around which TACCL synthesizes an algorithm. Our goal is to ensure that providing inputs is a low-effort activity, but can discard large parts of the search space to achieve improvements in running-time of the synthesis engine.</p><p>Synthesis technique. TACCL synthesizes a collective algorithm by deciding the route that each data chunk in the collective should take in the topology as well as the ordering of chunks at every link. Even with communication sketches which reduces the search space for the synthesizer, this decision problem is NP-hard and the complexity increases exponentially with number of GPUs. To make the problem more tractable, we first relax the synthesis problem to solve just the routing of all data chunks and then heuristically order chunks sent over the same links according to bandwidth constraints. TACCL's synthesizer design along with communication sketches help TACCL synthesize efficient collectives for multi-node topologies.</p><p>CPU memory over potentially shared PCIe links. Further, virtualization obscures the true PCIe topology (all 8 GPUs and the NIC appear directly connected to one CPU) and NUMA node and GPU IDs are not assigned consistently from VM to VM. This means that, without additional information, software cannot avoid contention over shared PCIe links, creating interference and high variance in performance.</p><p>To determine the PCIe topology, TACCL's profiler sends bandwidth and latency probes between the two CPUs, between pairs of GPUs, and between CPUs and the NIC. It answers the following questions:</p><p>&#8226; Which CPU is nearest to the NIC? We answer this using the latency of loopback operations between the NIC and each CPU.</p><p>&#8226; Which GPUs share a PCIe switch? We find all pairs of GPUs that get low bandwidth in a simultaneous copy to the CPU, indicating contention.</p><p>&#8226; Which GPUs share a PCIe switch with the NIC? We find which GPUs get low GPU to CPU bandwidth while the CPU is doing a loopback with the NIC. The CPU in this case is the one that is closer to the NIC. With this profiling information we were able to deduce the PCIe topology (Figure <ref type="figure">5b</ref>). Each CPU has two PCIe switches connecting to two GPUs each, and the Infiniband NIC is connected to one of these switches. Additionally, by running the profiler on every new NDv2 VM TACCL is able to select one of the NVLink topology's four automorphisms and set the CUDA_VISIBLE_DEVICES environment variable such that the NIC is always placed close to GPU 0.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">TACCL Synthesizer</head><p>Once the user has written a communication sketch, they are ready to call TACCL's synthesizer. This section describes the synthesis process TACCL uses, as well as additional hyperparameters available to the user.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Problem Formulation</head><p>GPUs participating in a communication collective partition their initial data into C equal chunks where C is a hyperparameter selected by the user. TACCL's synthesizer routes and schedules these chunks. Given a communication sketch and a collective, the synthesizer decides chunk transfer schedules across every link in the network, such that each chunk reaches its destination GPUs as specified by the collective.</p><p>TACCL encodes this problem as a mixed integer linear program (MILP) with binary and continuous decision variables. The encoding has a continuous variable called start_time for every chunk and GPU to indicate when a chunk is available at a GPU. A binary variable is_sent for all chunk and link pairs denotes if a chunk is sent over a link. Another continuous variable send_time indicates when a chunk is sent over a link.</p><p>The encoding has bandwidth and correctness constraints to ensure the correctness of a chunk transfer schedule. The objective of the MILP is to minimize time which is a continuous variable indicating the maximum time among all chunks that must reach their destination GPUs. Details of these variables and constraints are in Appendix B.</p><p>Additionally, TACCL's synthesizer also decides if it should merge some chunks and transfer them contiguously as one large buffer over a link. Sending n chunks contiguously in one send instruction over a link requires paying only one &#945; latency cost whereas sending n chunks one after the other requires paying n &#215; &#945; latency costs. Note that this does not change the &#946; bandwidth cost. However, sending n chunks separately over a link enables TACCL to order them such that subsequent dependent sends from the destination of the link could be scheduled earlier. TACCL's synthesizer navigates this tradeoff to minimize the time. TACCL uses this feature only for IB transfers due to their high &#945; cost and ignores it for NVLinks due to their lower latency.</p><p>MILP problems in general are NP-hard. Luckily, there are solvers such as Gurobi <ref type="bibr">[20]</ref> that apply heuristics to solve MILPs in a feasible way. However, this requires careful consideration regarding the number of variables and constraints in the formulation. In TACCL's formulation, transferring chunks over a link cannot overlap and an ordering among them is required. Therefore, potentially a binary decision is needed for every pair of chunks that may traverse a link. If we assume there are C chunks for a collective problem, there are O(C 2 ) such decisions per link. Moreover, as the number of nodes increase, the number of links increase linearly (larger topology) and the number of chunks for a collective increases linearly (ALLGATHER) or even quadratically (ALLTOALL). This large set of variables and constraints leads to infeasible solver time and memory requirements.</p><p>To solve this problem, we divide the synthesis into three parts. First, the synthesizer solves an optimization problem to determine the path used by every chunk without fixing any ordering among chunks, then it heuristically orders the chunks over every link, and finally, it solves another optimization problem to determine chunk contiguity. Complete formal descriptions of each step are in Appendix B.</p><p>Step 1: Routing solves a MILP for finding the path of each chunk independent of other chunks, allowing chunks sent over a link to overlap. The objective of this MILP is to minimize the time, which we constrain to be the maximum of two sets of variables. (1) for each link, the number of chunks that traverse that link multiplied by the transfer time of a chunk over that link. (2) for the path of each chunk, the summation of transfer times of the chunk along every link in the path. Note that this is only a lower bound on the time since we do not consider link contention or chunk ordering. TACCL also constrains each chunk's path to be via GPU ranks that are on the shortest paths from their sources to their destinations using the links the user decided to include in the logical topology. If the communication sketch specifies an algorithm symmetry, TACCL adds the constraints for the symmetric sends. Replacing switches with switch-hyperedges is also applied in this step. For each switch-hyperedge, a user-provided policy on the number of unique connections to/form a switch is applied (see Section 5.2).</p><p>TACCL uses Gurobi <ref type="bibr">[20]</ref> to solve this MILP and the solution gives every chunk a start_time for each GPU along its path. Clearly this step solves chunk routing, but only partially solves the chunk scheduling and contiguity problem and requires follow-up steps (explained next) to account for ordering the chunks sent over a link as well as minimizing &#945; costs of sends. However, by using this technique, TACCL's synthesizer is able to reduce binary variables needed from O(C 2 ) to O(C) per link.</p><p>Step 2: Heuristic Ordering decides the chunk ordering sent on each link based on a heuristic. Note that this step is not an MILP and solely solved by a greedy algorithm. Regardless of when each chunk becomes available at a GPU, this step assigns a total order on the chunks sent over a link l = (src, dst). This is decided by two heuristic functions. (1) chunks which need to traverse the longest path from src to their final GPU, have higher priority. (2) In case there is tie in (1), chunks which have traversed the shortest path from their initial GPU to src, have higher priority. This ordering will be used in Step 3 to assign the final schedules.</p><p>Step 3: Contiguity and Exact Scheduling solves an MILP problem to decide which chunks to send contiguously and gives the exact schedule. The path to be taken by chunks and their ordering over links have already been determined by the previous steps which are added as constraints to this MILP. The start_time and send_time variables are reassigned in this step by considering both the &#945; and &#946; costs for each transfer. In this step, the synthesizer allows either sending one chunk at a time or sending multiple chunks contiguously. This offers a trade-off between (1) sending the chunks that are available at the same time for a link according to the ordering in step 2 so that the subsequent sends can be scheduled earlier or (2) sending the chunks contiguously in one send instruction to save the latency cost. The objective of this MILP is to minimize the total time by enforcing all constraints which in TACCL solved by Gurobi <ref type="bibr">[20]</ref>. The solution gives the exact schedule for each chunk. The details of these constraints and their formulation are in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Synthesizer Hyperparameters</head><p>TACCL's synthesizer has some additional parameters that control the synthesis process. These are provided by the user to the synthesizer (see Figure <ref type="figure">1</ref>) through the communication sketch. Details of each parameter is described in Appendix A.</p><p>Buffer Size. TACCL needs the size of input/output buffers of a collective for the &#945;-&#946; cost model. In ML workloads the input/output buffer size is a known fixed value.</p><p>Chunk Partitioning. The data buffer at each GPU at the start of the collective can be partitioned into multiple equal chunks. Each chunk is considered as an atomic scheduling unit by the synthesizer and different chunks of the same data buffer can be routed over different links. The semantics of a collective forces a minimum number of chunks such as ALLTOALL which needs at least as many chunks as the number of GPU for each buffer. On one hand, using the minimum number of chunks is often times ideal for finding latency-optimal algorithms. On the other hand, providing a higher number of chunks allows the synthesizer to better utilize the links that might be idle otherwise which is better for finding bandwidthoptimal algorithms. Switch-Hyperedge Policy. TACCL can enforce policies for the number of connections established over a set of links in a switch-hyperedge by counting links utilized for data transfer and setting this count as a part of the MILP objective. The uc-max policy will maximize the number of connections, which performs best for small data sizes, while uc-min will minimize the number of connections, which works well when the data size is large and congestion is a concern.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Synthesizing combining collectives</head><p>TACCL synthesizes combining collectives (i.e., collectives that combine chunks like REDUCESCATTER and ALLRE-DUCE) by utilizing synthesis of non-combining collectives, similar to the technique used by SCCL <ref type="bibr">[9]</ref>. REDUCESCATTER can be implemented as an "inverse" of ALLGATHER-a send from a source GPU in ALLGATHER is instead received and reduced on the source GPU. However, simply inverting the sends does not work -a GPU may simultaneous send on different links in an ALLGATHER, but it cannot reduce all receives together in the inverse case. We thus order the inverse sends using heuristic ordering followed by contiguity encoding in order to synthesize REDUCESCATTER. ALLREDUCE is synthesized directly by concatenating REDUCESCATTER with an ALLGATHER algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">TACCL runtime</head><p>The input to TACCL runtime 2 is a TACCL-EF program, which is an XML format for representing collective algorithms. TACCL-EF programs operate on three buffers: input, output and scratch. For each buffer, the program specifies the number of chunks it will be sliced into such that all chunks are equal size. Every step of the algorithm is expressed in terms of these chunks.</p><p>The program is divided into a set of GPU programs made up of threadblocks. Each threadblock is made up of a series of steps that are executed sequentially, with each step specifying an instruction and operands as indices into the input/output/scratch buffers. The current instruction set includes sends, receives (with optional reduction), and local copies. To simplify the implementation of TACCL runtime, each threadblock can send to and receive from at most one GPU. Additionally, threadblocks within a GPU can synchronize by indicating that one step depends on another step, which will cause the interpreter to wait until the dependency has completed before executing the dependent step.</p><p>The TACCL runtime extends NCCL and it is backward compatible with its API. Therefore, integrating TACCL runtime into machine learning frameworks such as PyTorch is a single line change wherein that change swaps the third-party NCCL library for TACCL runtime. This allows TACCL to dynamically swap in collective algorithms generated for any training/inference workload using torch.distributed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Lowering to TACCL runtime</head><p>To target TACCL-EF, abstract algorithms are lowered to the executable format. The sets of sends operating on abstract chunks that comprise the steps of the algorithm are transformed into pairs of send and receive operations operating on concrete buffer indices. Furthermore, these operations are placed sequentially into threadblocks and any necessary dependencies recorded between them.</p><p>Buffer allocation. Input and output buffers are preallocated by the user and passed to the collective. Scratch buffers are allocated by the TACCL runtime per TACCL-EF. Chunks are indices in the input, output and scratch buffers. For chunks that are common for both the input and the output buffers (e.g. as in ALLGATHER) a local copy from input to the output buffer is performed at the end.</p><p>Instruction generation. The operations of the abstract algorithm are split into two instructions for the sender and receiver GPU, and chunks are translated into buffer references and indices according to the buffer allocation.</p><p>Dependency insertion. TACCL transforms a synthesized algorithm into the asynchronous execution model of TACCL-EF and dependencies for each buffer index are inserted to 2 Link to code: <ref type="url">https://github.com/microsoft/msccl</ref> ensure that the data dependencies present in the abstract algorithm are honored.</p><p>Threadblock allocation. Instructions are grouped such that all of them are either sending to at most one GPU and/or receiving from at most another GPU (possibly different). Order of the instructions inside a group should follow the order of the abstract algorithm. TACCL allocates a threadblock for each group of instructions.</p><p>Instances. NCCL and consequently TACCL runtime cannot saturate the bandwidth of a link in a topology using a single threadblock. Thus, TACCL generates multiple instances of the algorithm to maximize the performance. This is done by subdividing each chunk into n subchunks that follow the same path as the parent chunk. All groups of instructions and their threadblocks are duplicated n times and executed in parallel. &#167;7.2 explores the performance implications of choices of n.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Evaluation</head><p>We evaluate algorithms obtained with TACCL for ALL-GATHER, ALLTOALL, and ALLREDUCE collectives on a cluster of 32 GPUs comprised of two Nvidia DGX-2 nodes or upto four Azure NDv2 nodes. To compare performances, algorithm bandwidth <ref type="bibr">[33]</ref> measurement is used which is calculated by input buffer size divided by execution time. We synthesize TACCL algorithms by exploring different communication sketches and compare them against the popular Nvidia Collective Communication Library (NCCL) (v.2.8.4-1). This section analyzes how different communication sketches impact the performance of the algorithms synthesized by TACCL. In particular, we perform ablation studies by varying the inter-node connections in the logical topology, changing synthesizer hyperparameters, and changing the number of instances used when lowering to TACCL-EF. To evaluate how TACCL's speedups translate to end-to-end performance, we use algorithms generated by TACCL in two large language models, Transformer-XL and BERT. Finally, we discuss the synthesis time required by TACCL to generate these algorithms.</p><p>We believe our focus on up to 32 GPUs covers a large section of important use cases: in an internal cluster of DGX-2 nodes at Microsoft, the sum of GPUs in jobs of at most 32 was 93.7% of all jobs in the second half of 2021. We also use algorithms synthesized by TACCL for ALL-TOALL and ALLREDUCE collectives for training an internal Microsoft's mixture-of-experts workload on two NDv2 nodes. The ALLTOALL and ALLREDUCE sizes required for this model are &#8776; 6MB and &#8776; 256MB, respectively. TACCL improves the end-to-end throughput of this model by 17%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Standalone Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4">Synthesis Time</head><p>Table <ref type="table">2</ref> shows the total time it takes for TACCL to synthesize algorithms for different collectives using some of the communication sketches mentioned in Section 7.1. In most cases synthesis takes from seconds to a few minutes, making it amenable to a human-in-the-loop approach. When synthesizing an ALLTOALL collective using some communication sketches, TACCL's contiguity encoding may take more time in proving the optimality of a feasible solution. We put a time limit of 30 minutes on the contiguity encoding in these cases. The contiguity encoding for sketch ndv2-sk-1 reaches this timeout, but a feasible solution was already found in 4min 14s. We have also been able to synthesize an ALLGATHER for 80 GPUs (10 NDv2 nodes) in under 8 minutes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Related Work</head><p>The MPI standard provides a set of collective communication algorithms that enable efficient distributed computations of interconnected nodes <ref type="bibr">[16]</ref>. The HPC community has focused on the efficient implementation of these MPI collective algorithms <ref type="bibr">[40,</ref><ref type="bibr">50]</ref> and demonstrated how to build optimized algorithms for specific interconnects, like mesh, hypercube, or fat-tree <ref type="bibr">[7,</ref><ref type="bibr">8,</ref><ref type="bibr">41]</ref>. In contrast to TACCL, these prior works assume homogeneous interconnects and are often only focused on bandwidth optimality. Hybrid algorithms <ref type="bibr">[7,</ref><ref type="bibr">10]</ref> combine bandwidth-and latency-optimal algorithms based on input sizes, but only qfor mesh networks.</p><p>NCCL <ref type="bibr">[37]</ref> is a GPU implementation of a subset of the standard MPI collectives, optimized for NVLINK and Infiniband interconnects. While NCCL uses the topology of GPU connections and NIC placement along with buffer size to decide between two main types of communication algorithms -Ring and Tree, it is agnostic to the exact performance profile of the links, and thus (as we show) is often multiple times slower than TACCL's topology aware collectives.</p><p>Recent works like SCCL <ref type="bibr">[9]</ref>, Blink <ref type="bibr">[51]</ref>, and Plink <ref type="bibr">[29]</ref> specialize algorithms for the underlying topology. SCCL solves an integer programming encoding based on discrete-time values in the form of steps and rounds of the algorithm in order to achieve the pareto-frontier of latency-and bandwidth-optimal algorithms. SCCL is able to synthesize a novel pareto-optimal ALLGATHER algorithm for an Nvidia DGX1 node, but its restrictive formulation constrains it to only only synthesize algorithms for single-node topologies. TACCL on the other hand synthesizes collective algorithms for multi-node topologies. Blink uses a heuristic spanning-tree packing algorithm to maximize bandwidth utilization within a node and a hierarchical approach across. Blink has good performance over NCCL in the case when NCCL cannot create rings spanning all GPUs inside a node. TACCL, on the other hand, outperforms NCCL when using the entire node of GPUs. Plink constructs a logical topology based on bandwidth and latency probes of the physical topology to avoid oversubscribed and congested links and searches for a reasonable clustering of nodes for a two-level hierarchical reduction strategy. Plink builds that hierarchical reduction from known primitives and does not search over the space of possible algorithms.</p><p>There are also hierarchical approaches to implement collectives <ref type="bibr">[12,</ref><ref type="bibr">29,</ref><ref type="bibr">42,</ref><ref type="bibr">51]</ref>. For example, Horovod <ref type="bibr">[42]</ref> implements an ALLREDUCE by a local ReduceScatter, a global ALLREDUCE, and then a local ALLGATHER. These methods do not search over possible algorithms, but instead pick from a known set of decompositions. Concurrent to our work, Ningning et al. <ref type="bibr">[52]</ref> use syntax guided synthesis to combine base MPI primitives among a subset of nodes to hierarchically generate larger MPI primitives for the entire network. In contrast, TACCL uses a fine grained approach for algorithm synthesis while using communication sketches for scalability. Combining these two complementary approaches is an interesting opportunity for future work.</p><p>Program sketching <ref type="bibr">[24,</ref><ref type="bibr">47,</ref><ref type="bibr">49]</ref> is a popular technique that has been applied to a variety of problems from synthesizing stencil computations <ref type="bibr">[48]</ref>, converting hand drawings to images <ref type="bibr">[17]</ref> to social media recommendations <ref type="bibr">[11]</ref>. Our work builds on this body of work to use sketching to effectively search a large space of communication algorithms.</p><p>Lastly, network flow problems have used linear programming to solve routing and scheduling problems for traffic engineering <ref type="bibr">[22,</ref><ref type="bibr">23,</ref><ref type="bibr">25,</ref><ref type="bibr">44,</ref><ref type="bibr">46]</ref> and topology engineering <ref type="bibr">[45]</ref>. These techniques, however, cannot be used for generating collective algorithms since communication collectives do not follow all flow properties. Non-source GPUs in a collective can send the same chunk over different links in parallel while having received that chunk only once, which violates an important flow-conservation property used extensively in network flow problem literature. TACCL on the other hand makes use of communication sketches and an encoding relaxation technique to solve a continuous-time integer linear programming that faithfully models communication collectives.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Conclusion and Future Work</head><p>TACCL is a topology and input-size aware collective communication library for multi-node distributed machine learning training and inference. TACCL uses user-provided communication sketches to guide synthesis of collective algorithms. Using a three-step technique of relaxed routing, heuristic ordering, and contiguity and exact scheduling, TACCL generates efficient collectives for multi-node topologies. We also make some brief observations about TACCL below: Scalability. TACCL can synthesize algorithms for largescale nodes -we have been able to synthesize an ALLGATHER algorithm for 8 Azure NDv2 nodes using TACCL in under 5 minutes. As compared to NCCL, this algorithm has upto 1.7&#215; higher algorithm bandwidth for different data sizes. We also evaluated TACCL's synthesis for 8 Nvidia DGX-2 nodes (128 GPUs) and found a solution in around 11 hours. While TACCL scales to multi-node topologies, the synthesis technique is still based on solving an NP-hard problem that grows exponentially with a quadratic power with scale. As a future work, we would like to scale TACCL further by hierarchically composing synthesized algorithms.</p><p>Generality across different topologies. Apart from hierarchical topologies like Nvidia DGX-2 and Azure NDv2, TACCL can also be applied to non-hierarchical topologies like a 2D-Torus. We were able to synthesize an ALLGATHER algorithm for a 2D 6&#215;8 Torus using TACCL. We made use of the symmetry attribute in communication sketches to explore synthesis for this topology. However, the amount of exploration we can do with different communication sketches may be more limited in these cases than for hierarchical topologies.</p><p>Exploring communication sketches. Communication sketches have proven effective in narrowing the search space of algorithms. Interestingly, different communication sketches can optimize different ranges of input sizes. Communication sketches reflect the intuition of developers, and by intelligently exploring the space of communication sketches we can obtain a range of collective algorithms with different performance characteristics. Learning an automated controller for exploring communication sketches is an interesting direction for collective algorithm synthesis in the future.</p><p>To conclude, TACCL uses the abstraction of communication sketches and a novel problem formulation to generate efficient algorithms for collectives like ALLGATHER, ALL-TOALL, and ALLREDUCE. The algorithms thus generated are up-to 6.7&#215; faster than the state-of-the-art NCCL and result in 11% -2.4&#215; faster end-to-end training time. along with a communication sketch provided by a humanin-the-loop. A communication sketch comprises of a logical topology, switch-hyperedge strategy, symmetry information, input size, and other hyperparameters. Listing 1 gives an example of how users can provide a communication sketch input to the TACCL synthesizer. Here, we show an example of the communication sketch dgx2-sk-1 used in the evaluation to synthesize an ALLGATHER algorithm for 2 Nvidia DGX-2 nodes (each node has 16 GPUs and 8 NICs, every two GPUs in the node share a NIC).</p><p>The sketch annotates the NVSwitch in each node and sets a uc-min switch-hyperedge strategy. Further, the inter-node sketch fixes the sender and receiver GPUs in a node for internode data transfers. In our example, the odd-numbered GPUs sharing a NIC are chosen as senders and the even-numbered GPUs are chosen as receivers for inter-node communication. The user also annotates how the inter-node relay GPUs would split the inter-node bandwidth using a beta_split attribute. Since only a single GPU per NIC is chosen in our example to perform inter-node send and similarly receive, the bandwidth is not split. Optionally, the user can also map chunks to sender GPUs so that only mapped GPUs are used for inter-node transfers for the chunk. The chunk_to_relay_map attribute defines the parameters for the mapping function. The communication sketch also allows users to play with rotational symmetry for data routing. Given a symmetry offset and a group size, a chunk transfer over a link is set to be equivalent to a rotationally symmetric chunk over a rotationally symmetric link. In our example of the symmetry_offset attribute, using <ref type="bibr">[2,</ref><ref type="bibr">16]</ref> fixes an intra-node symmetry with an offset of two, and using <ref type="bibr">[16,</ref><ref type="bibr">32]</ref> fixes a symmetric data transfer pattern between the two DGX-2 nodes. Hyperparameters like input data partitioning and input size can also be provided via the communication sketch.</p><p>Listing 1: Example sketch dgx2-sk-1 for ALLGATHER { // sketch for intra-node policy "intranode_sketch": { "strategy": "switch", "switches": [[0, <ref type="bibr">1,</ref><ref type="bibr">2,</ref><ref type="bibr">3,</ref><ref type="bibr">4,</ref><ref type="bibr">5,</ref><ref type="bibr">6,</ref><ref type="bibr">7,</ref><ref type="bibr">8,</ref><ref type="bibr">9,</ref><ref type="bibr">10,</ref><ref type="bibr">11,</ref><ref type="bibr">12,</ref><ref type="bibr">13,</ref><ref type="bibr">14,</ref><ref type="bibr">15]</ref>], "switch_hyperedge_strategy": ["uc-min"] }, // sketch for communication policy between any two nodes "internode_sketch": { "strategy": "relay", "internode_conn": {"1" : [0], "3" : <ref type="bibr">[2]</ref>, "5" : <ref type="bibr">[4]</ref>, "7" : <ref type="bibr">[6]</ref>, "9" : <ref type="bibr">[8]</ref>, "11" : <ref type="bibr">[10]</ref>, "13" : <ref type="bibr">[12]</ref>, "15" : [14]}, // "i": [j1, j2] implies GPU i in a node will only send data to GPU j1 and j2 of another node "beta_split": {"1": 1, "3": 1, "5": 1, "7" :</p><p>1, "9" : 1, "11" : 1, "13" : 1, "15" : 1}, // "i": n implies inter-node sends from a GPU i of a node will use 1/n-th of the inter-node bandwidth "chunk_to_relay_map": [2,1] // maps chunk to a sender relay GPU. [r1,r2] means chunk c will be send to another node via GPU (rp//r1)*r1 + r2, where rp is the precondition GPU for chunk c }, // enforces rotational symmetry. // [(o,g), ..]: o is the rotational offset and g is the group size for the rotational symmetry. // : eg. send(c,src,r) == send( (c + o)%g, (src + o)%g, (r + o)%g) "symmetry_offsets": [ <ref type="bibr">[2,</ref><ref type="bibr">16]</ref>, <ref type="bibr">[16,</ref><ref type="bibr">32]</ref>], "hyperparameters": { "input_chunkup": 2, // Data at each GPU is partitioned into 2 chunks that can be independently routed "input_size": "1M" } }</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B TACCL Synthesizer in Detail</head><p>As explained in Section 5, TACCL's synthesizer has routing, heuristic ordering, and contiguity and exact scheduling stages. We provide a detailed description of each of these stages in this section. We first formally introduce some terms that we will use later. Let C denote the set of chunks that are required to be routed in the algorithm for collective coll.</p><p>Let R denote the set of GPU ranks involved in coll. Let coll.precondition and coll.postcondition denote the precondition and post-condition of the collective respectively.The tuple (c, r) &#8712; coll.precondition, c &#8712; C , r &#8712; R , if chunk c is present at rank r at the start of the collective. Similarly, the (c, r) &#8712; coll.postcondition if chunk c has to be present at rank r at the end of the collective. Further, let L denote the set of links, such that (r1, r2) &#8712; L,r1 &#8712; R , r2 &#8712; R if there exists a link from rank r1 to rank r2 in the logical topology determined by the topology and communication sketch. Let S send r denote the set of switched destinations for rank r, such that dst &#8712; S send r if link (r, dst) is a part of a switch-hyperedge. Similarly, S recv r denotes the set of switched sources for rank r, such that src &#8712; S recv r if link (src, r) is a part of a switch-hyperedge. &#945;(r1, r2), &#946;(r1, r2) are the alpha and beta costs respectively of the link (r1, r2) &#8712; L. The term lat(r1, r2) is the sum of &#945;(r1, r2) and &#946;(r1, r2) cost of the link, which denotes the total transfer cost of a single chunk over link (r1, r2). Table <ref type="table">3</ref> lists the variables that the TACCL's synthesizer solves for. We will describe each variable in detail in this section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 Routing</head><p>The main aim of the routing stage is to give us the path that every chunk takes in the collective. Our objective is to minimize the time (denoted by continuous variable time) it takes to reach the post-condition of the collective.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Minimize time</head><p>(</p><p>The time taken for the collective algorithm is the latest time at which a chunk becomes available on a rank that is in the post-condition of the collective. We use a continuous variable start <ref type="bibr">[c, r]</ref> to denote the time that chunk c becomes available on rank r, and end up with the following constraints for time time &#8805; start[c, r] &#8704;(c, r) &#8712; coll.postcondition <ref type="bibr">(2)</ref> For chunks on ranks that belong to the collective's precondition, we set the start time to zero.</p><p>We also add correctness constraints in our formulation for routing -chunks are sent from a GPU rank only after they have been received on that rank. We introduce a continuous variable send[c, src, r] to denote the time of sending chunk c from rank src to rank r and add the following constraint to our formulation:</p><p>We use a binary variable is_sent[c, src, r] to indicate if chunk c is sent over the link (src, r) in our algorithm. We note that the routing stage does not strictly respect bandwidth constraints of any link -the generated solution may send two chunks simultaneously over a link at the time cost of one chunk. The chunk start time on a rank will be determined only by the chunk send time on the source, independent of other chunk transfers on the link (eq. 5). LHS&#8594;RHS in the equation signifies an indicator constraint, i.e., if LHS is 1, RHS will hold.</p><p>is_sent</p><p>Instead of bandwidth constraints, this encoding uses relaxed bandwidth constraints. They are expressed by aggregating the link transfer time of all chunks sent over a link and using it to to lower bound the total time of the algorithm (eq. 6). For switched connections, the total time is lower bounded by the sum of link transfer times of all chunks sent over all switched outgoing links from a source, and also by the sum of link transfer times for chunks received from all incoming links to a destination (eq. 7 and eq. 8).</p><p>Based on the communication sketch, we also add constraints for uc-max and uc-min strategies for switchhyperedges to maximize and minimize the number of links utilized in a switch respectively. We introduce a new binary variable is_util[src, r] for links (src, r) that are a part of a switch-hyperedge. This variable is 1 if any chunk is sent over link (src, r), and 0 otherwise.(eq. 9 and eq. 10). According to the switch-hyperedge strategy, we add this variable, weighted with a small constant &#947;, to the objective function (eq. 11). &#947; is negative for uc-max and positive for uc-min.</p><p>We also add symmetry constraints according to the symmetry offsets provided by user in the communication sketch. For a chunk c and link (src, r), we identify a rotationally symmetric chunk &#265; and link ( &#349; rc, r) and add the following constraints:  Further, for chunks that start on one node and have a final destination on another node, we add inter-node transfer constraints which specify that at least one inter-node link will be used to transfer that chunk.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Ordering Heuristics</head><p>We start the heuristic ordering by determining the paths each chunk takes using the solution of the path encoding. We then consider the first link in every path as a candidate for scheduling a chunk transfer. Using heuristics like chunk-with-shortestpath-until-now-first and chunk-with-longest-path-from-nowfirst, we select a path (and thus a chunk) which should be scheduled in this round. We keep a running estimate of link time, which is the earliest time at which a chunk can be scheduled over the link. We also keep a running estimate of chunk time, which is the earliest time at which the next link transfer can be scheduled for a chunk. At the start, the link time for every link is 0 and the chunk time for every chunk is 0. When a path is chosen in the first round, the chunk associated with the path is scheduled to traverse the first link in the path. The link time of that link increases by link latency and chunk time of that chunk increases by link latency. The link candidate from the selected path is also updated to be the next link in the path.</p><p>For the next rounds, we decide which path's candidate link to schedule next using the tracked link and chunk times along with the scheduling heuristics. This keeps going until we have scheduled a data transfer over all the links in all the paths. We find that the best heuristics differ for architectures with NVLinks and those with NVSwitches, in terms of whether to start selecting links to schedule in the same order as the paths or in the opposite order of the paths. The heuristic ordering has the following three outputs:</p><p>&#8226; chunk_order(r 1 , r 2 ), an ordered list of chunks transferred along each link (r 1 , r 2 ). If chunk c 1 is present before chunk c 2 in chunk_order(r 1 , r 2 ), it denotes that c 1 is scheduled to be sent before c 2 over link (r 1 , r 2 ).</p><p>&#8226; switch_send_order(r), an ordering on the chunks sent from a switch source r to any of the switch destinations dsts. If (c 1 , dst 1 ) is present before tuple (c 2 , dst 2 ) in switch_send_order(r), it means that a send of c 1 over link (r, dst 1 ) should be scheduled before a send of chunk c 2 over link (r, dst 2 ).</p><p>&#8226; switch_recv_order(r), an ordering on the chunks received on a switch destination r from any of the switch sources srcs. If (c 1 , src 1 ) is present before tuple (c 2 , src 2 ) in switch_recv_order(r), it means that a receive of c 1 over link (src 1 , r) should be scheduled before a receive of chunk c 2 over link (src 2 , r).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 Contiguity and Exact Scheduling</head><p>Finally, we describe the formulation for the contiguity and exact scheduling stage. Given the link and switch ordering from the heuristic ordering stage, the aim of this stage is to find the sweet spot in the trade-off between lower link latency by sending multiple data chunks contiguously as a big data chunk and reduced pipelining benefits due to the big data-chunk transfer. We provide the main set of constraints in our formulation below, leaving out other less important constraints.</p><p>Our objective is still to minimize the time of the collective and constraints eq. 1-eq. 4 must still hold in this formulation. We add a new binary variable is_together(c 1 , c 2 , r) for all chunks c 1 and c 2 that are sent over the same link to rank r. If is_together(c 1 , c 2 , r) is 1, chunks c 1 and c 2 are sent as a single data-chunk over a link to rank r. </p><p>We also add strict bandwidth constraints for this formulation, allowing only one data chunk per link transfer time</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_0"><p>BackendThe synthesizer described above generates an abstract algorithm that specifies the order in which the nodes communicate the various chunks. The goal of the backend is to implement this abstract algorithm. To do so, we extend NCCL<ref type="bibr">[37]</ref> with an interpreter which we call TACCL runtime. While any communication algorithm can be trivially implemented using NCCL's point-to-point sends and receives, TACCL runtime enables us to execute the entire algorithm in a single kernel launch, eliminating multiple launch overheads. In addition, by reusing NCCL transport mechanisms, TACCL runtime is able to support all of NCCL's communication backends such as IB, Ethernet, NVLink, and PCIe.</p></note>
		</body>
		</text>
</TEI>
