<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>11/11/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10508744</idno>
					<idno type="doi">10.1145/3581784</idno>
					
					<author>Hongkuan Zhou</author><author>Da Zheng</author><author>Xiang Song</author><author>George Karypis</author><author>Viktor Prasanna</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Memory-based Temporal Graph Neural Networks are powerful tools in dynamic graph representation learning and have demonstrated superior performance in many real-world applications. However, their node memory favors smaller batch sizes to capture more dependencies in graph events and needs to be maintained synchronously across all trainers. As a result, existing frameworks suffer from accuracy loss when scaling to multiple GPUs. Even worse, the tremendous overhead of synchronizing the node memory makes it impractical to deploy the solution in GPU clusters. In this work, we propose DistTGL — an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters. DistTGL has three improvements over existing solutions: an enhanced TGNN model, a novel training algorithm, and an optimized system. In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17× in training throughput.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Many real world graphs contain important time domain information. For example, in recommender systems, user interests and global trends both change over time; in fraud detection, the time between two consecutive transactions often marks out suspicious activities. In spatial-temporal applications such as trac and weather prediction, temporal and spatial information are equally important. Recently, along with the success of Graph Neural Networks (GNNs) in static graph representation learning, researchers have designed Temporal Graph Neural Networks (TGNNs) <ref type="bibr">[9,</ref><ref type="bibr">11,</ref><ref type="bibr">14,</ref><ref type="bibr">15,</ref><ref type="bibr">18,</ref><ref type="bibr">23]</ref> to exploit temporal information in dynamic graphs. On various dynamic graphs including social network graphs, trac graphs, and knowledge graphs, TGNNs have demonstrated superior accuracy on various downstream tasks such as temporal link prediction and dynamic node classication, substantially outperforming static GNNs and other traditional methods <ref type="bibr">[14,</ref><ref type="bibr">23]</ref>. Depending on whether the timestamps of graph events are discrete or continuous, dynamic graphs can be classied into Discrete Time Dynamic Graphs (DTDGs) and Continuous Time Dynamic Graphs (CTDGs). In this work, we focus on the more general and challenging TGNNs on CTDGs.</p><p>On dynamic graphs, the number of related events on each node increases as time evolves. When this number is large, neither temporal attention-based aggregation nor historical neighbor sampling methods allow TGNNs to capture the entire history. To compensate for the lost history, researchers have designed Memory-based Temporal Graph Neural Networks (M-TGNNs) <ref type="bibr">[9,</ref><ref type="bibr">14,</ref><ref type="bibr">18,</ref><ref type="bibr">20]</ref> that maintain node-level memory vectors to summarize independent node history. The node memory in M-TGNNs not only allows the aggregator to gather information from fewer historical neighbors but also enlarges the receptive eld because the node memory vectors already contain information multiple hops away. As a result, stateof-the-art M-TGNN TGN <ref type="bibr">[14]</ref> only requires a single GNN layer  with some recent neighbors as supporting nodes. In the benchmark in TGL <ref type="bibr">[30]</ref>, M-TGNNs ll out the top ranks both in accuracy and training time.</p><p>Despite the success of M-TGNNs, it is dicult to deploy them to large-scale production applications due to their poor scalability. The auxiliary node memory creates temporal dependencies and requires training mini-batches to be small and scheduled in chronological order. Specically, there are two major challenges in exploiting data parallelism in M-TGNN training. First, simply increasing the batch size loses the temporal dependency information between events and leads to information loss (please refer to Section 2.1.1 for more details). Figure <ref type="figure">2</ref>(a) shows that the accuracy decreases as the batch size increases on the GDELT dataset. On smaller datasets, this decrease in accuracy is usually observed for relatively small batch sizes around 10 2 -10 3 edges <ref type="bibr">[14]</ref>, which are not big enough to appreciate the speedup provided by multi-GPU data parallelism. Second, all the trainers need to access and maintain a unied version of node memory, leading to enormous amount of remote trac in distributed systems. Unlike static GNN training, the remote accesses to the node memory (typically hundreds of megabytes per minibatch) have strict temporal dependencies. Due to these excess and interdependent remote accesses, distributed training is often slower than single-machine training. Figure <ref type="figure">2</ref>(b) shows the case when the node memory is distributed to all machines where each machine owns a unique equally-sized portion. Furthermore, the remedy to cross-machine trac in static GNN training <ref type="bibr">[2,</ref><ref type="bibr">26,</ref><ref type="bibr">27]</ref> -graph partitioning technique METIS <ref type="bibr">[8]</ref>, is not applicable to dynamic graphs. As a result, on both small-and large-scale datasets, the training time of the state-of-the-art M-TGNN framework <ref type="bibr">[30]</ref> using 8 GPUs on a single node is 10 100&#8677; slower than state-of-theart distributed static GNNs <ref type="bibr">[25,</ref><ref type="bibr">27]</ref>, with an unsatisfactory 2-3&#8677; speedup over a single GPU.</p><p>In this work, we propose DistTGL -an ecient and scalable solution to train M-TGNNs on distributed GPU clusters. DistTGL improves the existing M-TGNN training solutions from three perspectives:</p><p>&#8226; Model: We enhance the node memory in M-TGNNs by adding additional static node memory, which improves both the accuracy and convergence rate. &#8226; Algorithm: We design a novel training algorithm to overcome the challenges of accuracy loss and communication overhead in distributed scenarios.</p><p>&#8226; System: We build an optimized system adopting prefetching and pipelining techniques to minimize the mini-batch generation overhead. serialize the memory operations on the node memory and eciently execute them by an independent daemon process, avoiding complex and expensive synchronizations. &#8226; In experiments, DistTGL achieves near-linear speedup when scaling to multiple GPUs in convergence rate, outperforming state-of-the-art single machine method <ref type="bibr">[30]</ref> by more than 10&#8677; (see Figure <ref type="figure">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">BACKGROUND</head><p>Given a dynamic graph, TGNNs aim at embedding the contextual, structural, and temporal information of a given node at a given timestamp into a low-dimensional vector. M-TGNNs rely on the node memory and temporal graph attention to generate these vectors. We rst explain the basic propagation rules in M-TGNNs. For the rest of this paper, unless stated otherwise, we denote scalar as lower case letter G, vector as bold lower case letter x, and matrix as bold upper case letter ^. We denote row-wise concatenation of vectors (or matrices) using double vertical bar within curly brackets {x ||~}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Memory-Based Temporal Graph Neural Networks</head><p>M-TGNNs <ref type="bibr">[9,</ref><ref type="bibr">14,</ref><ref type="bibr">18,</ref><ref type="bibr">20]</ref> maintain dynamic node-level vectors to track the node history. TGN <ref type="bibr">[14]</ref> proposes a general framework for dierent M-TGNN variants and supports dierent types of graph events. Here, we introduce TGN on the most common dynamic graphs with graph events of edges appearing. For a dynamic graph G(V, E), its graph events could be represented </p><p>)</p><p>where (&#8226;) is the time encoding <ref type="bibr">[23]</ref>, C D is the timestamp when s D is last updated, and e DE is the edge feature. Then, we use an update function UPDT to update the node memory of node D and node E,</p><p>The update function can be implemented using any sequence model. In TGN-attn <ref type="bibr">[14]</ref>, UPDT(&#8226;) is implemented as GRU cells. Since the UPDT function is only called when a related graph event occurs, the lengths of the hidden status of dierent nodes in the graph are dierent. In backward propagation, the learnable parameters ] and b are trained within each GRU cell (the gradients do not ow back to previous GRU cells, like in the Back-Propagation-Through-Time algorithm).</p><p>After updating the node memory, a one-layer temporal attention layer <ref type="bibr">[23]</ref> gathers and aggregates information from the node memory of the most recent neighbors Y F , F 2 N E to compute the dynamic node embedding h E for node E. If dynamic or static node features are available, they can be combined with the node memory.</p><p>)</p><p>where t is the time dierences of the current timestamp with the last updated time of the node memory of F 2 N E , and K EF is the matrix of edge features connecting nodes E and F 2 N E . Most TGNNs are self-supervised using the temporal edges as ground truth information, where the updates to node memory are delayed by one iteration due to the information leak problem <ref type="bibr">[14]</ref>. Specically, the mails are cached for the supporting nodes, and the output embeddings are computed using Equation 4-7 before their node memory is updated using Equation 3. This reversed computation order needs to be implemented both in training and at inference to avoid the information leak problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1">Batched M-TGNN Training.</head><p>Since the training of M-TGNNs needs to be synchronized with the node memory, the training samples need to be scheduled chronologically. Theoretically, the node memory of a node needs to be immediately updated after a relevant graph event occurs on that node so that later dependent nodes can use this up-to-date node memory in the message passing process. Without changing the algorithm, we can process consecutive graph events that do not have overlapping nodes in batches by updating their node memory in parallel. However, this limits the batch size to no more than a few graph events on most dynamic graphs. In practice, the tiny batch size is computationally infeasible on modern hardware, such as GPU, intended for highly paralleled programs. To solve this problem, M-TGNNs process the incoming graph events in larger xed-size batches and update the node memory for the nodes that have new mails once per batch to reduce the computation time. Let {m D } be the set of mails generated at node D in a batch of graph events, s D is then updated using a COMB(&#8226;) function Note that the mails {m D } is not using the up-to-date node memory (since it is not computed yet) but using the outdated node memory at the last batch of graph events. In TGN-attn, the COMB(&#8226;) function simply outputs the most recent mail. This batching approach both updates the node memory in batch and computes the attention-based message passing in batch. The batched update to node memory causes two types of inaccuracy in the node memory -staleness and information loss (Figure <ref type="figure">3</ref>). The staleness in the node memory refers to the problem where the node memory is not up-to-date due to the reversed computation order to avoid the information leak problem. The information loss in the node memory refers to the node memory not being updated by the mails that are ltered out by the COMB(&#8226;) function as well as the inaccuracy of the mails due to using outdated node memory. When the batch size is increased, both the staleness and information loss in the node memory increase, resulting in lower accuracy <ref type="bibr">[14]</ref>. Besides these two types of inaccuracy, another common inaccuracy in sequence models is the inaccuracy due to not re-computing the hidden embeddings when the weights are updated, which generally does not aect the performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Related Works</head><p>Dynamic graph representation learning plays an important role in many real-world problems. Many discrete TGNNs <ref type="bibr">[6,</ref><ref type="bibr">7,</ref><ref type="bibr">11,</ref><ref type="bibr">15]</ref>, continuous TGNNs <ref type="bibr">[14,</ref><ref type="bibr">18,</ref><ref type="bibr">20,</ref><ref type="bibr">23]</ref>, and non-GNN methods <ref type="bibr">[17,</ref><ref type="bibr">21]</ref> are proposed to learn node embeddings on dynamic graphs. There are many existing works that accelerate the message passing scheme in GNNs on a single node <ref type="bibr">[5,</ref><ref type="bibr">19]</ref> and on distributed GPU clusters <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">[25]</ref><ref type="bibr">[26]</ref><ref type="bibr">[27]</ref>. In discrete TGNNs, the propagation within a graph snapshot is the same as static GNNs where these existing methods can be directly applied to. There are also some existing works that specialize in discrete TGNNs on a single GPU <ref type="bibr">[24,</ref><ref type="bibr">28]</ref> and distributed systems <ref type="bibr">[3]</ref>. However, these methods do not apply to continuous M-TGNNs due to the unique propagation rule of M-TGNNs. Accelerating continuous M-TGNNs is challenging due to the aforementioned antithesis between training speed and accuracy. Distributed M-TGNN training is even more challenging due to the high volume of data synchronization. There are a few works that accelerate M-TGNNs training. TGL <ref type="bibr">[30]</ref> proposes a general framework for single-node multiple-GPU continuous TGNNs. However, TGL does not support distributed GPU clusters. The speedup of TGL on multiple GPUs in a single machine is also unsatisfactory, only achieving 2 3&#8677; speedup on 8 GPUs. EDGE <ref type="bibr">[4]</ref> proposes to speedup the training by replacing the dynamic node memory of active nodes with static learnable node memory, gambling on the chance that active nodes have stable embeddings. To the best of our knowledge, there is no existing work for M-TGNN training that achieves near-linear scalability on For simplicity and easier understanding, we draw the reads and writes to the node memory at the beginning and end of each training iteration. In our optimized system, they have performed asynchronously with the training iterations and are fully overlapped with the GPU computation. Please refer to Figure <ref type="figure">7</ref> for more details on the three parallel training strategies.</p><p>single-node multiple-GPU, or operates on distributed GPU clusters.</p><p>For the inference task, TGOpt <ref type="bibr">[22]</ref> proposes to accelerate TGNN inference by de-duplication, memorization, and pre-computation.</p><p>Another work <ref type="bibr">[29]</ref> proposes a system-architecture co-design that accelerates M-TGNN inference on FPGAs. Unfortunately, these techniques do not apply to M-TGNN training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">DISTTGL</head><p>We propose DistTGL -an ecient and scalable solution to train M-TGNNs on distributed GPU clusters. DistTGL achieves scalability through improvements from three perspectives: model, algorithm, and system. From the model perspective, we introduce the static node memory that explicitly separates the time irrelevant node information. From the algorithm perspective, we propose two novel parallel training strategies and a method to determine the best combination of these strategies on any given dataset and hardware conguration. From the system perspective, we design an ecient system to reduce and overlap mini-batch generation overhead with GPU training. We introduce these improvements in the three following subsections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">M-TGNN Model with Static Node Memory</head><p>M-TGNNs rely on node memory to summarize the node history.</p><p>Previous work <ref type="bibr">[4]</ref> argues that the node memory of nodes with active interactions is static. While this may be true on some evolving graphs like citation graphs, it fails on the dynamic graphs where high-frequency information is important, such as in fraud detection <ref type="bibr">[16]</ref>. Figure <ref type="figure">5</ref> shows the comparison of the accuracy in the temporal link prediction task that predicts destination nodes from source nodes using static and dynamic node memory. We do not observe any noticeable inclination on higher degree nodes favors static node memory or vice versa. We also observe similar results on the other datasets used in this work.</p><p>We believe that a general TGNN model should be able to capture both the dynamic and static node information of all nodes. In DistTGL, we separate the static and dynamic node memory and capture them explicitly. DistTGL keeps the original GRU node memory on all nodes to capture the dynamic node information and implements an additional mechanism to capture the static node information. There are two major benets brought by this additional static node history. First, it enhances the capability of M-TGNNs to capture node history with burst interactions. Due to the batching of updating the node memory, if a node interacts with others many times in a short time period, it is inevitable that the COMB(&#8226;) function used in the dynamic node memory would lter out most of these interactions, resulting in a loss of high-frequency information. The static node memory, combined with the time encoding <ref type="bibr">[23]</ref>     </p><p>(b) Epoch Parallelism  Itr. 0 Itr. 1 Itr. 2</p><p>Figure <ref type="figure">7</ref>: Overview of mini-batch parallelism, epoch parallelism, and memory parallelism on three trainer processes. The "R" and "W" denote read and write operations to the shared node memory. In epoch parallelism, the arrows denote cross-process communication to send mini-batch data. In memory parallelism, the arrows denote cross-process communication to send the updated node memory.</p><p>the temporal attention aggregator, could boost the performance in such cases. Second, the static node memory explicitly separates the information irrelevant to batch sizes, which improves the performance of data parallelized training. Since the static node memory is irrelevant with time, all graph events can be used to supervise the training process, allowing it to capture all static information regardless of batching. In this work, since most dynamic graphs do not have node features, we use learnable node embeddings pre-trained with the same task as the static node memory due to its simplicity. The pre-training of these embeddings can be easily done in any well-optimized distributed static GNN frameworks <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">[25]</ref><ref type="bibr">[26]</ref><ref type="bibr">[27]</ref>.</p><p>Note that the static node memory is similar to learnable weights in the M-TGNN models and does not include any information in the test set. On the other hand, the dynamic node memory contains information in the test set and would cause information leaks if not handled properly. DistTGL also supports other kinds of learnable or non-learnable static node memory, such as co-trained embedding tables or even node embeddings generated by static GNNs. Figure <ref type="figure">6</ref> shows the two datasets which have the most signicant improvement with pre-trained static node memory. On a single GPU, our improved model achieves remarkably better accuracy on both datasets and a smoother convergence curve on the Flights dataset (we do not show the curves for multi-GPU for a clearer visualization). On the MOOC dataset, our model with static node memory also improves the scalability in convergence on multiple-GPU using epoch parallelism (which will be introduced later in Section 3.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Parallel Training Algorithm</head><p>A straightforward approach to train M-TGNNs in parallel is to process the graph events in large global batches and distribute them to multiple trainers, which is used by TGL <ref type="bibr">[30]</ref> in the setting of multiple GPUs on a single node. We refer to this approach as the mini-batch parallelism, which relaxes the inter-batch dependencies in node memory. However, the key to achieving good accuracy in multi-GPU M-TGNN training is to maintain the temporal dependency when the graph events are processed in large batches. To solve this problem, we propose two novel parallel training strategies -epoch parallelism and memory parallelism. Epoch parallelism relaxes the dependencies in the node memory due to weight updates and trains dierent epochs simultaneously on dierent trainers. Memory parallelism trades space for accuracy by maintaining multiple copies of the node memory at dierent timestamps. In the rest of this section, we rst introduce the three types of parallelism and their advantages and disadvantages. Then, we discuss how to design an optimal training algorithm given any task specications and hardware congurations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.2.1</head><p>Mini-Batch Parallelism. Mini-batch parallelism simply trains a large global batch on multiple trainers in parallel. On = GPUs, a global batch of graph events is evenly divided into = local batches where each GPU is responsible for computing the output embeddings of one local batch. Figure <ref type="figure">7</ref>(a) shows the case when a global batch is divided into three local batches on three trainers. Since the global mini-batches are generated in chronological order, we also split them into local mini-batches chronologically and ignore the intra-dependency within each global mini-batch. Specically, these = trainers rst fetch the node memory and cached mails of the assigned root nodes and their supporting nodes. Then, they compute the forward and backward propagation and update the model weights. Before they use the computed node memory to update the node memory and cached mails, they need to make sure all trainers have nished the fetch operations to avoid Write-After-Read (WAR) hazard. Note that ideally, the node memory and cached mails should be updated for both the root and supporting nodes so that we do not need to re-compute Equation 3 when these supporting nodes are referenced again in later batches. However, to ensure the model weights can receive enough feedback in the backward propagation, we do not update the node memory and cached mails of the supporting nodes and re-compute them when they are referenced later. Because the fetch and update of the node memory are done simultaneously in all trainers, the node embeddings generated for later graph events in the global batch cannot perceive the earlier graph events, incurring both staleness and information loss in the node memory. In addition, mini-batch parallelism requires all trainers to maintain the same copy of node memory, which leads to enormous communication overhead on distributed systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">Epoch</head><p>Parallelism. Epoch parallelism leverages data parallelism by training dierent epochs simultaneously using only one copy of the node memory. In the vanilla M-TGNN training, selfsupervised by temporal edges on a single GPU, we rst sample some negative destination nodes for the root nodes in mini-batch 8. We then collect the supporting nodes for all positive and negative root nodes and fetch their node memory and cached mails. In the later epochs, for the same root nodes in mini-batch 8, we sample dierent sets of negative destination nodes and follow the same procedure to get their node memory and cached mails. To train on the same mini-batches in dierent epochs in parallel on = trainers, we ignore the dierence in node memory due to weight updates in the last = 1 epochs. Thus, we can prepare one set of inputs of the positive nodes and = sets of inputs of the negative nodes and train them in parallel. Note that these mini-batches need to be scheduled in dierent iterations so that the gradients of positive nodes are not simply multiplied by =. This scheduling increases the variance of the gradients of the sampled mini-batches, as the same set of positive nodes is learned for = consecutive iterations. The left part of Figure <ref type="figure">7</ref>(b) shows the case when applying epoch parallelism to three trainers. In each iteration, trainer P0 fetches the node memory and cached mails for one positive mini-batch and three negative mini-batches. After P0 nishes one iteration, it writes to the node memory and sends the prepared mini-batches (one positive mini-batch and the two unused negative mini-batches) to P1. P1 receives the mini-batches from P0 and sends them (one positive mini-batch and the unused one negative mini-batch) to P2 after the computation. Note that only P0 needs to write back the updated node memory to the global copy of node memory in the main memory. Although the node memory of this mini-batch in P1 and P2 is updated using a more recent version of the weights, writing them to the global copy would lead to Read-After-Write (RAW) hazards with later training iterations. We also tried a nergrained updating policy which updates nodes that do not have this RAW hazard in P1 and P2. However, it does not outperform the original policy. To reduce the cross-trainer communication, we further optimize the algorithm by reordering the mini-bathes so that each trainer works on the same positive samples (with dierent negative samples) for = consecutive iterations (see the right part in Figure <ref type="figure">7</ref>(b)). However, epoch parallelism still requires all trainers to access the same node memory, which is impractical on distributed systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3">Memory</head><p>Parallelism. Memory parallelism trades space for time by training dierent time segments of the dynamic graph simultaneously using separate copies of node memory. The left part in Figure <ref type="figure">7</ref>(c) shows the case when applying memory parallelism on a dynamic graph with 6 mini-batches with three trainers and three copies of node memory. Each trainer is only responsible for one-third of the whole dynamic graph, i.e., a time segment of two consecutive mini-batches. In every iteration, each trainer needs to fetch its own node memory and cached mails. The design on the left requires the intermediate node memory to be transferred across the processes after the trainers nish their time segments. For example, P0 needs to send the node memory of all the nodes in the graph to P1 after iteration 1, which is expensive in distributed systems. To solve this problem, we reorder the mini-batches across the trainer (see the right part in Figure <ref type="figure">7(c</ref>)) so that each trainer trains sequentially on all the segments using its own node memory. Since each trainer owns its individual node memory, there is no synchronization of the node memory across the trainers, making it the only suitable strategy for distributed systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.4">Optimal Training Algorithm.</head><p>The aforementioned three parallelization strategies all have their own unique characteristics. We summarize their advantages and disadvantages in Table <ref type="table">1</ref>. To achieve optimal training performance, we provide heuristic guidelines for DistTGL users to combine these strategies to pick their advantages and oset their disadvantages. Consider a distributed </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Gradient descent variance</head><p>same as single-GPU more than single-GPU same as single-GPU system with ? machines and @ GPUs per machine. Let 8 &#8677; 9 &#8677;: = ? &#8677;@ be a training conguration where 8 represents how many GPUs to compute each mini-batch, : represents how many copies of node memory to maintain, and 9 represents how many epochs to train in parallel for each copy of node memory. We determine the optimal choice of (8, 9, :) from task requirements and hardware congurations. There are two constraints from the hardware side. First, we need to have : ? as memory parallelism is the only strategy that does not synchronize node memory across the trainers. Then, the main memory of each machine should be able to hold :/? copies of node memory and cached mails, or at least hold sucient cache if using the disk-based memory caching storage option.</p><p>Under these constraints, we rst determine 8 according to the largest batch size. Figure <ref type="figure">8</ref> shows that when the batch size increases, fewer graph events would be captured in the node memory, especially for high-degree nodes. DistTGL users can set a threshold for the amount of missing information so that DistTGL would reversely nd out the largest batch size. For applications where highfrequency information is crucial, we can set a stricter threshold for high-degree nodes. Based on this batch size, 8 can be determined according to the GPU specications. For 9 and :, we always prefer to apply memory parallelism since it leads to better convergence, which we have also veried from experiments (see Figure <ref type="figure">9</ref>.(b)). In summary, we rst determine 8 based on task requirements, then : based on hardware specication, and lastly 9 is xed by ? &#8677; @/8 &#8677; :. For example, on a distributed system with 4 machines and 8 GPUs each machine, we determine the largest batch size is 3200 edges. The GPU saturates when batch size is larger than 1600 edges. So we rst set local batch size to be 1600 edges and 8 = 2. The main memory of each machine can hold two copies of the node memory. Then we set : = 32/2/2 = 8. Finally, 9 is xed to be 32/2/8 = 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Distributed Training System</head><p>Designing a scalable distributed training system for M-TGNNs is not trivial. Even for the most straightforward mini-batch parallelism, previous work <ref type="bibr">[30]</ref> only achieves 2-3&#8677; speedup using 8 GPUs on a single node due to excessive overheads in mini-batch generation. We solve this issue by prefetching the mini-batches in a separate process and pipelining the sub-tasks (loading from disk, slicing features, slicing node memory, writing back to node memory) within one mini-batch generation. Figure <ref type="figure">4</ref> shows an overview of DistTGL serializing the memory operations and executing them asynchronously on separate processes. Here we focus on describing the most important design that handles the reads and writes to the node memory. As memory parallelism works on separate copies of node memory which has no dependency and can be easily parallelized, we consider the case for each 8 &#8677; 9 trainer group that shares the same copy of the node memory. Since : ?, each trainer group must have all the processes on the same physical machine. Within each 8 &#8677; 9 group, the operations can be serialized as a spin lock acting on each 8 sub-group. For example, for 8 &#8677; 9 = 2 &#8677; 2, we have the memory access sequence</p><p>where R 8 and W 8 denote read and write requests from trainer 8, and there is no ordering for the requests within each bracket.</p><p>In DistTGL, instead of implementing an expensive cross-process lock mechanism, we launch an additional memory daemon process for each group of 8 &#8677; 9 trainer processes to handle the read and write requests for all the trainers in that group. Let 1B be the local batch size, 3 be the number of sampled supporting nodes for each root node, and 3 mem be the dimension of the node memory. The memory process allocates the following buers, which are shared with the trainers: Algorithm 1 shows the pseudo-code of the memory daemon process. Each trainer process issues the read and write requests by copying the inputs to the shared buers and setting the elements of its rank in read_status and write_status to be 1. The memory daemon process executes these requests in serialized order, puts the read results to the buers, and resets the status. Note that the rst read request of each epoch is not issued, as the results are always all zero matrices right after the initialization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPERIMENTS</head><p>We perform detailed experiments to evaluate the performance of DistTGL. We implement DistTGL using PyTorch <ref type="bibr">[12]</ref> 1.11.0 and DGL <ref type="bibr">[19]</ref> 0.8.2.  Datasets. Table <ref type="table">2</ref> shows the statistics of the ve datasets for the evaluation. The task on each dataset is</p><p>&#8226; Wikipedia <ref type="bibr">[10]</ref> is a bipartite user-internet page graph where one graph event represents one user modies the one Wikipedia page. The edge features are extracted from the text that the users update the pages with. The task on this dataset is temporal link prediction. &#8226; Reddit <ref type="bibr">[10]</ref> is a bipartite user-reddit graph where one graph event represents one user posts to one sub-reddit. The edge features are extracted from the text of the post. The task on this dataset is temporal link prediction. &#8226; MOOC <ref type="bibr">[10]</ref> is a bipartite user-course action graph where one graph event represents one user interacting with one class item (i.e., watching a video, answering a question). The task on this dataset is temporal link prediction. &#8226; Flights <ref type="bibr">[13]</ref> is a trac graph where each node represents one airport, and each edge represents one ight between the two airports. The task on this dataset is temporal link prediction.</p><p>&#8226; GDELT <ref type="bibr">[30]</ref> is a knowledge graph tracking events happening all over the world where each node represents one actor, and each edge represents one event. Since the temporal link prediction task used in TGL <ref type="bibr">[30]</ref> is too simple, we use the 130dimensional CAMEO code as edge features and set the task to be a 56-class 6-label dynamic edge classication problem that predicts the rest of the 56-dimensional edge features.</p><p>For the temporal link prediction task, to reduce the variance in the validation and test accuracy, we randomly sample 49 negative destination nodes (for bipartite graphs, we only sample from the other graph partition) and report the Mean Reciprocal Rank (MRR) of the true destination nodes. For the dynamic edge classication task, we report the F1-Micro score.</p><p>4.0.1 Model. We use the most ecient one-layer TGN-attn <ref type="bibr">[14]</ref> model enhanced with the static node memory introduced in Section 3.1. We follow the original work to set the dimension of node memory to 100 and the number of most recent neighbors to 10 for each node. We pre-train the static node history with the same GNN architecture but only with static information using DGL <ref type="bibr">[19]</ref>. On the Wikipedia, Reddit, MOOC, and Flights datasets, we pre-train 10 epochs with stochastically selected mini-batches. On the GDELT dataset, we only pre-train 1 epoch. The pre-training of all datasets takes less than 30 seconds on a single machine. For the Wikipedia, Reddit, MOOC, and Flights datasets, we set the local batch size to be the largest available batch size 600 <ref type="bibr">[30]</ref>. For the GDELT dataset, the local batch size is set to 3200, limited by the GPU capacity. We set the learning rate to be linear with the global batch size. To ensure fairness, we keep the total number of traversed edges to be the same in multi-GPU training. The number of training iterations for G GPUs will be 1/G compared to a single GPU. On the Wikipedia, Reddit, MOOC, and Flights datasets, we traverse the training events 100 times (100 epochs on a single GPU). On the larger GDELT dataset, we traverse the training events 10 times (10 epochs on a single GPU). On the Wikipedia, Reddit, MOOC, and Flights datasets, we perform evaluation after every training epoch using the node memory in the rst memory process. On the GDELT dataset, due to the slow evaluation process (as DistTGL only accelerates training), we perform validation and testing every 2000 training iterations on a randomly selected chunk of 1000 consecutive mini-batches in the validation and the test set, starting with all-zero node memory and mails.  SSDs, and 100Gbps Ethernet connection. We create the instances in the same group of rack to make sure the cross-machine latency is minimized. We sample the mini-batch in advance and store them on the two NVMe SSDs in RAID0 mode to maximize the throughput. The positive edges in the mini-batches are reused in every epoch.</p><p>For the negative edges, we observe that in the temporal link prediction task, a small number of groups of negative edges are enough. So we prepare 10 groups of negative edges and randomly use them in the total 100 epochs. We assign 6 CPU threads for each trainer and memory process so that the total 96 physical threads can serve the needs for maximum memory parallelism of : = 8 on a single machine. To further overlap the mini-batch generation with the GPU computation, we pre-fetch the pre-sampled static information from disks 9 iterations in advance. However, the dynamic node memory still needs to be obtained following the serialized order in the memory process. For all methods, the node memory and cached mails are stored in the main memory and transferred between CPU and GPU in every training iteration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Convergence</head><p>We rst evaluate the convergence of DistTGL by comparing the validation accuracy after dierent numbers of training iterations and the testing accuracy for the nal model. We start with the performance of epoch parallelism on the Wikipedia, Reddit, Flights, and MOOC datasets, as the largest batch sizes on these datasets do not allow mini-batch parallelism. Figure <ref type="figure">9</ref>(a) shows the convergence curves of applying 1 (as the baseline), 2, 4, and 8 epoch parallelism. When 9 = 2, we observe more than 2&#8677; speedup for the number of training iterations before reaching 70%, 80%, and 90% of the best validation accuracy on all four datasets, especially on the Flights datasets where the nal test accuracy is even higher than the baseline. We believe that the superlinear scalability is due to the larger global negative batch size, where we observe similar convergence speed improvement when we increase the number of negative samples during training for the baseline. Unfortunately, increasing the number of negative samples cannot be used to speedup the convergence as the computation complexity is linear with the number of root nodes. When 9 = 4, epoch parallelism still manages to achieve linear speedup except on the Flights dataset with the most number of unique edges <ref type="bibr">[13]</ref>. When 9 = 8, epoch parallelism leads to signicant test accuracy drop and non-linear speedup. The sub-linear scalability for epoch parallelism when 9 is large is expected as it trains on the same positive nodes consecutively in multiple iterations, leading to increased variance in the mini-batch gradients. Then, on the same four datasets, we x 9 &#8677; : = 8 and evaluate the convergence with dierent memory parallelism. Figure <ref type="figure">9(b)</ref> shows the convergence curves of dierent epoch and memory parallelism. Compared with epoch parallelism (1 &#8677; 8 &#8677; 1), memory parallelism achieves both better validation accuracy and notably better test accuracy due to better gradient estimation in each minibatch. In general, the larger the memory parallelism : is, the better the test MRR. The training conguration with the largest : = 8 achieves linear speedup in convergence compared with the single GPU baseline with only an average of 0.004 drop in test MRR. Figure <ref type="figure">10</ref> shows the test MRR and the number of training iterations to reach the best validation MRR of dierent training congurations when 8 = 1 and 9 &#8677; : &#63743; 32. The experiment results agree with our strategy for optimal training conguration, where we prioritize memory parallelism over epoch parallelism within the hardware limit.</p><p>For the GDELT dataset, we verify that the largest batch size without accuracy loss is larger than the capacity of one machine (see Figure <ref type="figure">2(a)</ref>), which also agrees with previous work <ref type="bibr">[30]</ref>. Hence we follow our optimal training conguration choosing policy and prioritize mini-batch parallelism. Figure <ref type="figure">11</ref> shows the convergence of DistTGL on the GDELT datasets. The single GPU baseline 1&#8677;1&#8677;1 converges very slowly. Increasing the learning rate can speedup the convergence to some extent but will also lower the accuracy. By contrast, mini-batch parallelism 8 &#8677; 1 &#8677; 1 enjoys the benet of larger batch size and achieves super-linear speedup. To further speedup on more trainers, we need to use memory parallelism to solve the massive communication overhead across machines. On multiple machines, the combination of memory parallelism and mini-batch parallelism achieves satisfying convergence speedup with the highest test accuracy. We also test the performance of memory and epoch parallelism on the GDELT dataset. Memory parallelism achieves similar convergence as mini-batch parallelism while epoch parallelism has a slightly worse performance than the other two.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Training Throughput</head><p>We evaluate the training throughput of DistTGL on up to four 8-GPU machines. We do not test on more machines as the training time on the largest GDELT dataset is already less than 30 minutes on four machines while it only takes a few minutes to train on the smaller datasets. Figure <ref type="figure">12</ref>(a) shows the training throughput and the speedup compared with the single GPU baseline of the optimal training conguration on 2, 4, and 8 GPUs on a single machine, 16 GPUs on two machines, and 32 GPUs on four machines. On 8/32 GPUs on 1/4 machines, DistTGL achieves close to linear speedup averaging 7.27/25.08&#8677;, respectively. In terms of absolute throughput, the training throughput on the Reddit and Flights datasets is around 10% slower than the other datasets due to the larger amount of writes to the node memory and cached mails. Since DistTGL only applies memory parallelism across machines, the memory operations are evenly distributed to each machine. There is no cross-machine trac besides the synchronization of model weights, leading to a balanced workloads in each trainer. Due to the small TGNN models with only a few megabytes of weights, DistTGL also achieves near-linear speedup scaling on distributed systems.</p><p>We also compare the performance of DistTGL with the vanilla single GPU implementation TGN <ref type="bibr">[14]</ref> and its optimized version TGL-TGN <ref type="bibr">[30]</ref> that supports single-machine multiple-GPU. does not nish training in 10 hours. DistTGL with the optimal training congurations (memory parallelism on the Wikipedia dataset and a combination of mini-batch and memory parallelism on the GDELT dataset) signicantly outperform TGN and TGL. On 2, 4, and 8 GPUs, DistTGL achieves an average of 1.24&#8677;, 1.91&#8677;, and 2.93&#8677; improvement, respectively, compared with TGL. The 1 &#8677; 1 &#8677; 1 single GPU implementation of DistTGL is also faster than TGL due to our system optimization that overlaps the read and write operations from and to node memory. On the GDELT dataset, memory parallelism does not scale linearly on 8 GPUs due to the limitation of the bandwidth between CPU and RAM, whereas the scalability is notably better on multiple machines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSION</head><p>In this work, we propose DistTGL, an M-TGNN training framework for large-scale distributed M-TGNN training. DistTGL addressed the accuracy loss issue and communication overhead challenges by adopting three improvements of an enhanced model, a novel training algorithm, and an optimized system. Compared with state-ofthe-art TGNN framework TGL <ref type="bibr">[30]</ref>, DistTGL not only outperforms TGL both in convergence rate and training throughput on a single machine but also extends M-TGNN training to distributed systems. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ARTIFACT IDENTIFICATION</head><p>This work proposes DistTGL, an ecient and scalable solution to train memory-based Temporal Graph Neural Networks on distributed GPU clusters. For computational artifacts, we implement DistTGL using DGL and PyTorch. DistTGL support distributed GPU clusters with x86-64 CPUs, Nvidia GPUs, and most mainstream Linux distributions. We evaluate DistTGL on ve open-sourced dynamic graph datasets. We ensure a fair comparison between DistTGL and the baseline models. We plan to open source DistTGL with all ve datasets in converted format used in this work on Github so that all readers can reproduce the results in this work on similar hardware.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>REPRODUCIBILITY OF EXPERIMENTS</head><p>We implement DistTGL using DGL and PyTorch. We run DistTGL on AWS EC2 clusters using the g4dn.metal instances allocated in the same region to ensure fast interconnection. On moden GPU clusters, it should require less than 100 hours to run all experiments in this work, where most time are spent to run the baselines due to their lower throughputs. Since DistTGL needs to support multi-machine training, we need to control the randomness across machines. However, we do not hand pick random seeds to initialize the model weights. All accuracy results are using the default random seed 0 on all machines. Since the random seed is xed, the accuracy results can be reproduced exactly using the same version of PyTorch and DGL library. All throughput results are measured as an average of 3 epochs skipping the rst 5 epochs. After running, the user can pass a ag to DistTGL so the results are saved as CSV les. We directly use these CSV les to plot the gures using the tikzpicture library in Latex.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ARTIFACT DEPENDENCIES REQUIREMENTS</head><p>Hardware resources: four machines similar to g4dn.metal (&gt;50 CPU cores, &gt;300GB main memory, 8 Nvidia GPUs with &gt;10GB VRAM, &gt;1TB NVME SSD, 100Gbps Ethernet connection across machines). Note that our code does not have any AWS EC2 specic dependencies and can be run on machines outside of the AWS EC2 cluster. OS: Ubuntu 18 or 20. Ubuntu 22 may also work, but not tested. Software libraries: We use anaconda to manage the software libraries. We provided an yml le to create a virtual environment that has all dependencies needed in this work. Dataset: We use publicly available dataset from the previous work with DOI number 10.14778/3529337.3529342.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ARTIFACT INSTALLATION DEPLOYMENT PROCESS</head><p>Download code using this (updated) private FigShare link: <ref type="url">https://gshare.com/s/1abccf9c3d535037ac16</ref> and follow the Readmd.md le in the zip le.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>OTHER NOTES</head><p>08/16/2023: We have updated the content in the Figshare link (<ref type="url">https://gshare.com/s/1abccf9c3d535037ac16</ref>). This version includes a script to download the datasets, more details needed to run the code, and a proper Readme le. We will also publish this version to Github soon (before the conference). Please try again with this new version if time permit.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://github.com/amazon-science/disttgl</p></note>
		</body>
		</text>
</TEI>
