<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>03/09/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10166524</idno>
					<idno type="doi">10.1145/3373376.3378465</idno>
					<title level='j'>ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Mark Hildebrand</author><author>Jawad Khan</author><author>Sanjeev Trika</author><author>Jason Lowe-Power</author><author>Venkatesh Akella</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Memory capacity is a key bottleneck for training large scale neural networks. Intel® Optane DC PMM (persistent memory modules) which are available as NVDIMMs are a disruptive technology that promises significantly higher read bandwidth than traditional SSDs at a lower cost per bit than traditional DRAM. In this work we show how to take advantage of this new memory technology to minimize the amount of DRAM required without compromising performance significantly. Specifically, we take advantage of the static nature of the underlying computational graphs in deep neural network applications to develop a profile guided optimization based on Integer Linear Programming (ILP) called AutoTM to optimally assign and move live tensors to either DRAM or NVDIMMs. Our approach can replace 50% to 80% of a system's DRAM with PMM while only losing a geometric mean 27.7% performance. This is a significant improvement over first-touch NUMA, which loses 71.9% of performance. The proposed ILP based synchronous scheduling technique also provides 2x performance over using DRAM as a hardware-controlled cache for very large networks.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Deep Neural Networks (DNNs) have been dramatically successful over the past decade across many domains including computer vision <ref type="bibr">[29]</ref>, machine translation and language modeling <ref type="bibr">[40]</ref>, recommendation systems <ref type="bibr">[30]</ref>, speech <ref type="bibr">[46]</ref> and image synthesis <ref type="bibr">[47]</ref>, and real-time strategy game control <ref type="bibr">[43]</ref>. This success has in turn led practitioners to pursue larger, more expressive models. Today, state of the art models in language modeling and translation have 100s of billions of parameters <ref type="bibr">[38]</ref> which requires 100s of GB of active working memory for training. For instance, large models such as BigGAN <ref type="bibr">[6]</ref> found significant benefits from increasing both model size and training batch size, and Facebook's recent DLRM recommendation system <ref type="bibr">[30]</ref> contains orders of magnitude more parameters than conventional networks. Additionally, to reach beyond human-level accuracy these models are expected to grow even larger with possibly 100&#215; more parameters <ref type="bibr">[22]</ref>. The large memory footprints of these models limits training to systems with large amounts of DRAM which incur high costs.</p><p>As the memory capacity demands of DNN training are growing, new high density memory devices are finally being produced. Specifically, Intel&#174; Optane&#8482; DC Persistent Memory Modules (PMM) <ref type="bibr">[15,</ref><ref type="bibr">25]</ref> can now be purchased and are up to 2.1&#215; lower price per capacity than DRAM. These devices are on the main memory bus, allowing applications direct access via load and store instructions and can be used as working memory. 
Thus, in this paper we ask the question "what are the design tradeoffs of using PMM in training large DNN models, and more specifically, can PMM be used as a DRAM replacement when training large DNN models?" Figure <ref type="figure">1</ref> shows the training performance for three different memory systems: an all PMM system (lowest cost), an all DRAM system (highest cost), and a heterogeneous system (moderate cost). The all PMM bar shows that naively replacing DRAM with PMM results in poor performance (about 5&#215; slowdown) for training large DNN models. The first-touch NUMA <ref type="bibr">[27]</ref> bar shows that current system support for heterogeneous memory is lacking, providing only a small benefit over the all PMM case. However, AutoTM provides 3.7&#215; speedup over the PMM case and is within 20% of the all DRAM system. Thus, we find that a small fraction of DRAM reduces the performance gap between PMM and DRAM, but only if we use smart data movement.</p><p>Use of heterogeneous memory to reduce DRAM usage has been studied in the past. Facebook has used SSDs to reduce the DRAM footprint of databases <ref type="bibr">[13]</ref>. Bandana <ref type="bibr">[14]</ref> uses SSD based persistent memory to store deep learning embedding tables <ref type="bibr">[10]</ref> with DRAM as a small software cache. In the context of machine learning, vDNN <ref type="bibr">[36]</ref>, moDNN <ref type="bibr">[8]</ref>, and SuperNeurons <ref type="bibr">[44]</ref> develop system-specific heuristics to tackle heterogeneous memory management between the GPU and CPU to overcome the low memory capacity of GPUs. Furthermore, future HPC systems will be increasingly heterogeneous with DRAM, PMM, and HBM <ref type="bibr">[35]</ref>, so we need a solution that is general and automatic.</p><p>In this paper we introduce AutoTM, a framework to automatically move DNN training data (tensors) between heterogeneous memory devices. 
AutoTM enables training models with 100s of billions of parameters and/or with large batch sizes efficiently on a single machine. We exploit the static nature of DNN training computation graphs to develop an Integer Linear Programming (ILP) <ref type="bibr">[37]</ref> formulation which takes a profile driven approach to automatically optimize the location and movement of intermediate tensors between DRAM and PMM given a DRAM capacity constraint.</p><p>We evaluate the effectiveness of AutoTM on a real system with Optane PMM by implementing our approach in the nGraph compiler <ref type="bibr">[11]</ref>. Our experiments show that naive use of PMM is not effective; intelligent placement of data across PMM and DRAM is required. Furthermore, using initial public pricing information, we evaluate the cost-performance benefits of DRAM-PMM based systems. We show that ratios of 8 : 1 or 4 : 1 of PMM to DRAM can be more cost effective than only DRAM or only PMM. We also compare our approach to the existing hardware DRAM cache implemented in current Intel platforms <ref type="bibr">[25]</ref> and find AutoTM offers up to 2&#215; performance improvement over hardware-managed caching.</p><p>Finally, we demonstrate that AutoTM can be further generalized beyond PMM-DRAM heterogeneity by applying AutoTM to CPU-GPU systems. The approach taken by AutoTM uses minimal problem specific heuristics and is thus a general approach toward memory management for many different heterogeneous systems.</p><p>The paper is organized as follows. In Section 2 we present a quick overview of training deep neural networks and Intel's Optane DC PMM. In Section 3 we will present the details of AutoTM and in Section 4 we will describe implementation details, followed by our evaluation methodology in Section 5, and the main results in Section 6. We will present extensions to AutoTM in Section 7 and conclude with related work and directions for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Deep Learning Training</head><p>Deep neural networks (DNNs) are often trained using a backward propagation algorithm <ref type="bibr">[28]</ref> and an optimizer such as stochastic gradient descent. Popular deep learning frameworks such as Tensorflow <ref type="bibr">[1]</ref> and nGraph <ref type="bibr">[11]</ref> implement DNNs as a computation graph where each vertex or node in the computation graph represent some computational kernel. Common kernels include convolutions (CONV), pooling (POOL), matrix multiplication, and recurrent cells such as LSTM or GRU. Each kernel has its own characteristics such as number of inputs, number of outputs, computation time, and computational complexity. Directed edges in the computation graph between kernels denote data or control dependencies between kernels. An edge representing a data dependency is associated with a tensor, which we consider to be a contiguous region of memory with a known size. All operations were performed using AVX512 load and stores.</p><p>Copies between DRAM and PMM were done using streaming load and store intrinsics. Figure <ref type="figure">2</ref> shows a simple example computation graph with 5 kernels and 3 tensors. Nodes in the graph are compute kernels, each with zero or more inputs and outputs. The inputs and outputs of a kernel are immutable tensors. Each tensor is annotated with its producing kernel, each user of the tensor, and the last user of the tensor. After its last use, a tensor's memory may be freed for future tensors.</p><p>We focus on the case where the computation graph describing the training iteration is static. That is, the computation graph contains no data-dependent control behavior and the sizes of all intermediate data is known statically at compile time. While many DNN graphs can be expressed statically, there are some networks that exhibit data-dependent behavior <ref type="bibr">[38]</ref>. 
We leave extending AutoTM to dynamic computation graphs as future work.</p></div>
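The tensor annotations described above (producer, users, last user) can be derived in a single pass over a topologically ordered kernel list. The sketch below is illustrative Python, not the nGraph implementation; the kernel and tensor names are hypothetical.

```python
# Sketch: deriving producer and last-user metadata for tensors in a static
# computation graph, walked in topological order. Names are illustrative.
def tensor_liveness(kernels):
    """kernels: topologically ordered list of (name, inputs, outputs)."""
    producer, last_user = {}, {}
    for name, inputs, outputs in kernels:
        for t in outputs:
            producer[t] = name
        for t in inputs:
            last_user[t] = name  # overwritten until only the final user remains
    return producer, last_user

graph = [
    ("k1", [], ["t1"]),
    ("k2", ["t1"], ["t2"]),
    ("k3", ["t1", "t2"], ["t3"]),
]
prod, last = tensor_liveness(graph)
# t1 is produced by k1; after its last user k3 runs, its memory may be freed.
```

Because the graph is static, this bookkeeping is exact at compile time, which is what makes the ILP formulation in Section 3 possible.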
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Intel Optane DC PMM</head><p>3D XPoint is a resistive memory technology developed by Intel <ref type="bibr">[19]</ref> that was initially introduced in the form a SSD called Optane SSD. Recently, this technology as been made available in the form of a standard byte-addressable DDR4 DIMM on the CPU memory bus, just like DRAM DIMMs <ref type="bibr">[25]</ref> with the new Cascade lake based chipsets and is called Optane DC PMM (different from Optane SSD). Optane DC PMMs are higher capacity than DRAM modules with up to 512 GB per module available today.</p><p>There are two operating modes for Optane DC PMM. In 2 Level Mode (2LM or cached) PMM act as system memory with DRAM as a direct mapped cache. This operating mode allows for transparent use of the PMM at the overhead of maintaining a DRAM cache. App Direct Mode allows users manage the PMM directly. The PMM are mounted on a system as direct access file systems. Files on the PMM devices are then memory mapped into an application. When using a direct access aware file system, loads and stores to addresses in this memory mapped file go directly to the underlying PMM. Note that in App Direct mode, the total available memory is the sum of DRAM and PMM while in 2LM only the PMM capacity is counted. In this work, we focus on using the PMM in App Direct mode, and make comparisons between our optimized data movement and 2LM.</p><p>Figure <ref type="figure">3</ref> shows the read, write, and copy bandwidth of DRAM and PMM on our test system with six interleaved 128 GB PMMs. The read, write, and copy operations were implemented by splitting a region of memory into contiguous blocks and assigning a thread to each chunk. 
AVX-512 streaming loads and stores were used to implement the copy operation as they provide significantly higher throughput between DRAM and PMM.</p><p>From Figure <ref type="figure">3</ref>, we make the following observations about PMM: bandwidth is significantly lower than DRAM, read bandwidth scales with the number of threads, write bandwidth peaks at a low number of threads and diminishes with a higher number of threads, copy bandwidth from DRAM to PMM scales with the number of threads, copy bandwidth is chiefly limited by PMM write bandwidth, and there is significant read/write asymmetry. These findings agree with the performance evaluation of Optane PMM by other researchers <ref type="bibr">[25]</ref>.</p><p>The read/write asymmetry has implications for the performance of kernels with inputs and outputs in PMM or DRAM. Figure <ref type="figure">4</ref> demonstrates the performance impact on a single CONV kernel. We observe that when the input to the CONV kernel is in PMM and the output is in DRAM, the performance of the kernel is comparable to when both input and output are in DRAM. However, in the cases where the output is in PMM, the kernel runs over two times slower. Any system seeking an optimal runtime with a memory constraint must take these relative timings into consideration when making decisions about where to place data. In the next section, we will describe the details of AutoTM and how it manages these performance characteristics.  </p></div>
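In App Direct mode, an application obtains load/store access to PMM by memory-mapping a file on the DAX-mounted filesystem. The sketch below shows the pattern in Python; an ordinary temporary file stands in for a file under a PMM mount point (a path such as /mnt/pmem0 is system-specific and hypothetical here), so the mechanics run anywhere.

```python
import mmap
import os
import tempfile

# Sketch: memory-mapping a file as working memory, the App Direct pattern.
# On a real system the file would live on a DAX-mounted PMM filesystem
# (e.g. under a hypothetical /mnt/pmem0); a temp file stands in here.
size = 4096
fd, path = tempfile.mkstemp()
os.ftruncate(fd, size)           # size the backing file
buf = mmap.mmap(fd, size)        # loads/stores to buf hit the mapped file
buf[0:5] = b"hello"              # a plain store into the mapping
buf.flush()                      # on real PMM this pairs with fences
data = bytes(buf[0:5])
buf.close()
os.close(fd)
os.unlink(path)
```

With a DAX-aware filesystem there is no page cache in the way: the stores above would reach the PMM media directly, which is what lets AutoTM treat PMM as a second working-memory pool.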
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">AutoTM</head><p>An overview of the proposed framework is shown in Figure <ref type="figure">5</ref>. A DNN model is given to nGraph, which optimizes the network DAG according to the selected backend (e.g. CPU, GPU etc.). As part of the compilation process, our system inspects the nGraph DAG data structure to extract (1) the order and types nodes in the graph, (2) the tensors produced and consumed by each node, and (3) the specific kernels chosen by nGraph.</p><p>We then perform profiling on every kernel in the computation graph by varying its inputs and outputs between the different memory pools (i.e., DRAM or PMM) and recording the execution time of the kernel in each configuration. Since this step is potentially time consuming and DNNs typically contain many identical kernels, we keep a software cache of profiled kernels. By keeping a profile cache, profiling for a given DNN only needs to be performed once. Profiling and DAG information is then fed into a Memory Optimizer (described in Section 3.1) along with a DRAM capacity constraint, that mutates the nGraph data structure with the tensor assignments and data movement nodes.</p><p>A user of this system only needs a nGraph function, which is a collection of "Node" and "Tensor" data structures describing computations and data flow of the compute graph. These functions can be created by using one of the nGraph front ends, or directly using C++. Profiling, optimization, and code generation all happen as part of the nGraph compilation process and is transparent to the user.</p><p>In the rest of this section, we first give a high level introduction to the memory optimizer. Then we present the details of the optimizer's underlying ILP formulation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Memory Optimizer</head><p>The goal of the Memory Optimizer is to minimize execution time by optimizing intermediate tensor movement and placement. The inputs to the optimizer are (1) the types of kernels in the computation graph in topological order, (2) the set of all valid tensor input/output locations for each kernel as well as profiled execution time for each configuration, (3) the sizes of all intermediate tensors, as well as their producers, users, and final users, (4) synchronous copy bandwidths between DRAM and PMM, and (5) a DRAM limit. The output of the optimizer describes the location and date movement schedules for all intermediate tensors that will minimize the global execution time of the graph.</p><p>Since the Memory Optimizer is implemented as an ILP, we need to model tensor location and movement using integer or binary variables and constraints <ref type="bibr">[32]</ref>. For each tensor t, we create a separate network flow graph G t = (V t , E t ) that traces the tensor's location during its lifetime. Examples of such graphs are given in Figure <ref type="figure">6a</ref> and 6b. The structure of these graphs allows us to customize the semantics of possible tensor locations and movements.</p><p>Using this graph structure, we investigate two separate formulations, static and synchronous. The static formulation (Figure <ref type="figure">6a</ref>) allows no tensor movement between memory pools. A tensor is assigned to either DRAM or PMM and remains there through its lifetime. The synchronous formulation (Figure <ref type="figure">6b</ref>) allows tensors to be moved between memory pools but blocks program execution to perform this movement. 
We further generalize the ILP formulation to an asynchronous formulation that allows overlap between computation and data movement in Section 7.</p><p>Network flow constraints <ref type="bibr">[16]</ref> are placed on each tensor flow graph G_t so that flow out of the source vertex is 1, flow into the sink vertex is 1, and flow is conserved for each intermediate node. The solution to this network flow describes the movement of the tensor. For example, the bold path in Figure <ref type="figure">6b</ref> implies the following schedule for tensor t_1: (1) t_1 is created by kernel k_1 in DRAM, (2) it remains in DRAM until the execution of kernel k_2, (3) after k_2, it is synchronously moved into PMM, (4) it is prefetched into DRAM right before k_4, (5) it is moved out of DRAM after k_4, (6) it resides in PMM for the execution of kernel k_5, and (7) after k_5, it is no longer needed and can be freed from all memory pools.</p></div>
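The network flow constraints above can be checked directly from an edge list: unit flow out of the source, unit flow into the sink, and conservation at every intermediate vertex. The sketch below uses a made-up miniature flow graph, not the paper's exact G_t structure.

```python
from collections import defaultdict

def flow_is_valid(edges, flow, source, sink):
    """Check network flow constraints: source emits 1 unit, sink absorbs 1,
    and every intermediate vertex conserves flow.
    edges: list of (u, v); flow: dict mapping (u, v) to 0 or 1."""
    net = defaultdict(int)  # outflow minus inflow per vertex
    for (u, v) in edges:
        f = flow.get((u, v), 0)
        net[u] += f
        net[v] -= f
    ok_source = net[source] == 1
    ok_sink = net[sink] == -1
    ok_inner = all(net[v] == 0 for v in net if v not in (source, sink))
    return ok_source and ok_sink and ok_inner

# Toy per-tensor graph: start in DRAM, move to PMM, then end.
edges = [("src", "DRAM1"), ("src", "PMM1"), ("DRAM1", "PMM2"),
         ("PMM1", "PMM2"), ("PMM2", "sink")]
path = {("src", "DRAM1"): 1, ("DRAM1", "PMM2"): 1, ("PMM2", "sink"): 1}
```

In the ILP, each edge indicator is a binary variable and these three conditions become linear equality constraints; a feasible solution is exactly a source-to-sink path, i.e. one movement schedule for the tensor.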
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Objective Function</head><p>We wish to minimize the execution time of the computation graph under a DRAM constraint. In our framework, computation kernels are executed sequentially. Therefore, in the static formulation where there is no tensor movement, the objective function (expected execution time) is</p><p>where K is the set of all kernels k in the computation graph and &#961; k is the expected execution time for kernel k. Note that &#961; k depends on the locations input and output tensor for kernel k. The selection of input and output tensor locations is not trivial because of dependencies between kernels. For example, if tensor t 3 in Figure <ref type="figure">2</ref> is assigned to PMM, then kernel k 2 must produce t 3 into PMM and kernels k 3 and k 4 must reference t 3 in PMM, which has a performance impact.</p><p>Given the lower performance of PMM relative to DRAM, the cost of moving a tensor from DRAM to PMM may be amortized by a resulting faster kernel execution. In the synchronous formulation, tensor movement that blocks computation graph execution and may only happen between kernel executions. The objective function then becomes  where T is the set of all intermediate tensors t in the computation graph and M sync t is the total amount of time spent moving tensor t. Note that a tensor t may be moved multiple times during its lifetime, so M sync t represents the sum of movement times of all individual moves of t.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">DRAM Variables</head><p>As noted above, the execution time of a kernel depends on the locations of its input and output tensors. We must also keep track of all live tensors in DRAM to establish a constraint on the amount of DRAM used. Thus, we need machinery to describe for each kernel k whether the input and output tensors of k are in DRAM or PMEM and which tensors are in DRAM during the execution of k.</p><p>For each kernel k &#8712; K and for each tensor t &#8712; T where t is an input or output of k, we introduce a binary variable</p><p>In practice, this variable is implemented as t DRAM t ,k = 1 if and only if any of the incoming edges to the DRAM node in the component in the network flow graph G t for k are taken.</p><p>To determine tensor liveness, we introduce binary variables</p><p>for each kernel k &#8712; K and for each tensor t &#8712; T where t is an output or output of k. These variables describe whether a tensor is written into DRAM after the execution of a kernel, and if it remains in DRAM until the next time it is used. In practice, this is implemented as t DRAM t ,k = 1 if and only if the outgoing DRAM to DRAM edge is taken from the DRAM node in the component in network flow graph G t for k.</p><p>We make these two distinct class of variables to handle the case in the synchronous formulation where a tensor is prefetched from PMM to DRAM as an input to some kernel k and then moved back to PMM immediately after k.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">DRAM Constraints</head><p>Our main goal here is to establish a constraint on the amount of DRAM used by the computation graph. We must ensure that the sum of sizes of all live tensors in DRAM at any point is less than some limit L DRAM We use the DRAM variables discussed in the previous section. First, define a helper function ref(k, t) = k &#8242; where k, k &#8242; &#8712; K and t &#8712; T with k &#8242; defined as latest executing kernel earlier or equal to k in the topological order of the computation graph such that there exists DRAM node in G t for kernel k &#8242; . For example, in Figure <ref type="figure">6a</ref></p><p>We want to ensure that at the execution time for each kernel k &#8712; K, the cumulative size of all live tensors resident in DRAM is with some limit L DRAM . Using the ref function, we add the following constraint for each k &#8712; K:</p><p>where |t | is the allocation size of tensor t in bytes, IO(k) is the set of input and output tensors for k, and L(k) is the set of all non-input and non-output tensors that are "live" during the execution of k. We assign a separate limit L DRAM,k for each kernel k initialized to L DRAM to address the memory fragmentation issue discussed in Section 4.2</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Kernel Configurations and Kernel Timing</head><p>For each kernel k &#8712; K, we use an integer variable &#961; k for the expected execution time of k given the locations of its input and output tensors. First, we define a configuration c as a valid assignment of each of a kernel's input and output tensors into DRAM or PMM. For example, a kernel with one input and one output tensor may have up to four configurations, consisting of all combinations of its input and output in DRAM or PMM. The definition of &#961; k is then</p><p>where C(k) is the set of all valid configurations c for kernel k, n k ,c is the profiled execution time of kernel k in configuration c, and d k ,c is a one-hot indicator with d k ,c = 1 if and only if kernel k's input and output tensors are in configuration c.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6">Tensor Movement Timing</head><p>The movement cost of a tensor t is the size of the tensor |t | divided by bandwidth between memory pools. Since bandwidth may be asymmetric, we measure and apply each separately. For each tensor t &#8712; T , the total synchronous movement time</p><p>is the sum of the number of taken edges in G t from DRAM to PMM multiplied by the DRAM to PMM bandwidth and the number of taken synchronous edges from PMM to DRAM multiplied by the PMM to DRAM bandwidth.</p><p>In our case where tensors are immutable, we may apply an optimization of only producing or moving a tensor into PMM once. Any future movements of this tensor into DRAM references the data that is already stored in PMM. Further movements from DRAM to PMM become no-ops.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Implementation Details</head><p>In this section, we describe some of the implementation details which are not directly part of the ILP formulation. The memory optimizer itself was implemented in the Julia <ref type="bibr">[5]</ref> programming language using the JuMP <ref type="bibr">[12]</ref> package for ILP modeling. Gurobi <ref type="bibr">[18]</ref> was used as the backend ILP solver. We chose nGraph <ref type="bibr">[11]</ref> over other popular machine learning frameworks based on static computation graphs as our backend because it is optimized for the Intel hardware and is relatively easy to modify. However, AutoTM is a general technique that can be integrated into other frameworks with similar underlying semantics. 1   </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">nGraph Compiler Backend</head><p>The nGraph compiler is an optimizing graph compiler and runtime developed by Nervana Systems/Intel for deep learning (DL) applications aiming to provide an intermediate representation (IR) between DL frameworks and hardware backends. An nGraph IR is a directed acyclic graph (DAG) of stateless operations nodes, each node with zero or more inputs, outputs, and constant attributes. Inputs and outputs of each node are multidimensional arrays called tensors with 1 All of the AutoTM code can be found on GitHub at <ref type="url">https://github.com/ darchr/AutoTM</ref>. an arbitrary layout. The backend kernel used to implement a node is chosen based on the attributes of the node as well as the sizes, data types, and layouts of each of its inputs and outputs. nGraph will also apply generic and backend specific whole graph optimizations such as kernel fusion and algebraic simplification.</p><p>Memory location for intermediate tensors is performed using ahead-of-time heap allocation by traversing the function DAG and maintaining a list of live tensors. When tensors are last used, the memory space occupied by those tensors is freed and used for future tensors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Managing Memory Fragmentation</head><p>The ILP formulation presented thus far assumes perfect memory management, which means that if the sum of sizes of live tensors is under the memory limit, then all tensors will fit within memory. In practice, this is not always the case. The process of allocating and freeing tensors may fragment memory resulting in a larger memory requirement.</p><p>To manage this, we use an iterative process of reducing the DRAM limit for kernels where the the following limit is exceed and rerunning the ILP.</p><p>1. We initialize the kernel-wise DRAM limits L DRAM,k to the L DRAM . 2. We solve the ILP using the current values of L DRAM, K .</p><p>nGraph translates the resulting schedule and then executes its memory allocator pass. 3. We collect the set of kernels K frag where the total amount of memory allocated exceeds L DRAM due to fragmentation. If this set is empty, we are done. 4. Otherwise, we apply an update L DRAM,k = 0.98L DRAM,k for all k &#8712; K frag and go back to step (2). Thus, the ILP solver may have to run multiple times before a valid solution is found. In practice, this process is usually only done 1 to 2 times with a maximum of 5 as discussed in Section 6.6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Data Movement Implementation</head><p>Synchronous movement operations are integrated as new move nodes in the nGraph compiler, which are automatically inserted into the nGraph computation graph following memory optimization. The implementation of these move nodes uses a multithreaded memory copy with AVX-512 streaming load and store intrinsics followed by a fence.</p><p>Operation scheduling in nGraph consists of a simple topological sort of the nodes in the computation graph, beginning with the input parameters. This creates unnecessary memory usage with move nodes as they are scheduled ad hoc, resulting in tensor lifetimes that are longer than necessary. Thus, we extended the nGraph scheduler so that if a tensor is moved from DRAM to PMM after some kernel k, we ensure that this movement occurs immediately after the execution of k. Conversely, if a tensor is moved from PMM to DRAM Session 10A: Tensor computation and data orchestration -Playing musical chairs! ASPLOS <ref type="bibr">'20, March 16-20, 2020</ref>, Lausanne, Switzerland to be used for kernel k, we ensure this occurs immediately before the execution of k.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">System</head><p>Our experimental Optane DC system was a prototype dual socket Xeon Cascade-Lake server. Each socket had 6 &#215; 32 GB of DRAM and 6 &#215; 128 GB Intel Optane DIMMs. Each CPU had 24 hyperthreaded physical cores. In total, the system had 384 GB of DRAM and 1.5 TB NVDIMM storage.</p><p>NUMA policy was set to local by default. Unless specified otherwise, all experiments were conducted on a single socket with one thread per physical core. Each workload was run until execution time per iteration (traversal of the computation graph) was constant. Since these workloads contain no data dependent behavior, performance will be constant after the first couple of iterations. Checks were used to ensure no IEEE NaN or subnormal numbers occurred, which can have a significant impact on timing <ref type="bibr">[3]</ref>.</p><p>Our approach does not change the underlying computations performed during training; it is a transparent backend implementation optimization. Thus, the performance of our benchmarks across a few training iterations is sufficient to obtain performance metrics.</p><p>We chose to evaluate AutoTM with a multicore CPU platform because Optane PMMs are only available for CPU platforms. However, the ILP formulation of AutoTM should apply to any heterogeneous memory system. We explore one other example with CPU and GPU DRAM in Section 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">DNN Benchmarks</head><p>We choose a selection of state of the art Deep Neural Networks for benchmarking our approach. A summary of the benchmarks and batch sizes used is given in Table <ref type="table">1</ref>. Conventional CNNs for the Optane DC system were Inception v4 <ref type="bibr">[41]</ref>, Resnet 200 <ref type="bibr">[21]</ref>, DenseNet 264 <ref type="bibr">[24]</ref>, and Vgg19 <ref type="bibr">[39]</ref>. All but Vgg19 have complex dataflow patterns to stress test AutoTM. The batch sizes were chosen to provide a memory footprint of over 100 GB for each workload. These batch sizes, while larger than what is typically used, mimic future large networks while still fitting within the DRAM of a single CPU socket of our test system.</p><p>We also compare our approach against the native 2LM mode, which is a hardware solution to data management that uses PMM transparently with CPU DRAM as a cache. Since we can not change the physical amount of DRAM used by 2LM, we used very large neural networks that exceed the CPU DRAM and require the use of PMM to train. These very large networks include Vgg416 <ref type="bibr">[36]</ref> (constructed by adding 20 additional convolution layers to each convolution block in Vgg16) and Inception v4 with a batch size of 6144. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Experiments</head><p>We want to determine whether PMM is cost effective for training DNNs, and how AutoTM compares against existing solutions to use PMM.</p><p>For the conventional benchmarks, we consider the impact of performance with different ratios between PMM and DRAM. These ratios are given in the form a : b where a is the amount of PMM relative to b the amount of DRAM used to train the network. A ratio of 1 : 1 indicates that a network was trained with half PMM and half DRAM. For a network requiring 128GB total to train would have a split of 64 GB PMM and 64 GB DRAM. Setting a ratio such as this may lead to a larger total memory footprint in total due to memory fragmentation in both PMM and DRAM. However, in practice the total memory footprint expansion is minimal with an observed maximum observed value of 3.83% occuring in the static formulation for Inception v4. A ratio of 0 : 1 denotes a system where only DRAM is used while 1 : 0 is a system using only PMM.</p><p>We use a baseline of a first-touch NUMA allocation policy with DRAM as a near node and PMM as a far node for the conventional benchmarks. The NUMA policy was encoded in our framework by assigning intermediate tensors to DRAM as they are created until the modeled memory capacity of DRAM is reached. Future tensors can only reside in DRAM if existing tensors are freed.</p><p>For the large benchmarks, we compare our approach to the 2LM hardware managed DRAM cache to determine the effectiveness of AutoTM relative to an existing approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Conventional Networks</head><p>Figure <ref type="figure">7</ref> shows the speedup provided by our scheduling over training solely with PMM (ratio 0 : 1). The horizontal axis is the ratio of PMM to DRAM used to train the network. We observe that when PMM is used as a direct substitute for DRAM, performance is poor, with a 3x to 8x increase in training time (red horizontal line). However, with a minimal amount of DRAM, such as an 8 : 1 PMM to DRAM ratio, we are able to dramatically improve performance without changing the overall memory footprint of the application. Further, adding more DRAM only marginally increases performance. This performance gain does not occur with conventional first-touch NUMA. This is because first-touch NUMA <ref type="bibr">[27]</ref> allocates tensors into DRAM as they are used by the computation graph until the DRAM capacity is reached. When training DNNs, tensors produced early in the forward pass are used during the backward pass and thus must be live for the majority of the graph's computation <ref type="bibr">[36]</ref>. With first-touch NUMA, these long-lived tensors are assigned to DRAM, forcing later short-lived tensors into PMM.</p><p>AutoTM, on the other hand, is aware of the performance implications of these long-lived tensors. The general strategy AutoTM takes is to prioritize short-lived tensors for DRAM placement (Section 6.4). These short-lived tensors mainly include intermediate tensors generated during the backward pass. By prioritizing short-lived tensors, AutoTM ensures that more tensors overall may reside in DRAM.</p><p>Vgg is an outlier due to its extremely large second convolution layer. With small DRAM sizes, some or all of the input and output tensors of this large layer must be placed in PMM, incurring a performance penalty.
Once these tensors can be placed in DRAM, we see a significant performance improvement, visible in the jump from the 4 : 1 ratio to the 1 : 1 ratio. Another interesting feature of this network is that the synchronous formulation performs slightly worse than the static formulation at an 8 : 1 ratio. This is caused by the interaction between the insertion of move nodes and the defragmentation procedure.</p></div>
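The contrast between first-touch NUMA and AutoTM's observed preference for short-lived tensors can be illustrated with a toy placement simulator. This is our own sketch, not the paper's framework: tensor sizes and lifetimes are invented, and the greedy shortest-lifetime-first policy is only a stand-in for what the ILP converges to.

```python
# Tensor = (alloc_step, free_step, size). One long-lived forward-pass tensor
# plus several short-lived backward-pass tensors, with DRAM capacity for
# roughly one tensor at a time.

def first_touch(tensors, dram_cap):
    """Assign each tensor to DRAM at allocation time if it currently fits."""
    placement, live = {}, []          # live = indices of tensors in DRAM
    for i, (a, f, s) in enumerate(tensors):
        live = [j for j in live if tensors[j][1] > a]   # drop freed tensors
        if sum(tensors[j][2] for j in live) + s <= dram_cap:
            placement[i] = "DRAM"
            live.append(i)
        else:
            placement[i] = "PMM"
    return placement

def shortest_lifetime_first(tensors, dram_cap):
    """Greedy stand-in for the ILP's strategy: offer DRAM to short-lived
    tensors first, checking capacity at every step of their lifetime."""
    order = sorted(range(len(tensors)),
                   key=lambda i: tensors[i][1] - tensors[i][0])
    placement = {i: "PMM" for i in order}
    for i in order:
        a, f, s = tensors[i]
        fits = all(
            sum(tensors[j][2] for j in placement
                if placement[j] == "DRAM"
                and tensors[j][0] <= t < tensors[j][1]) + s <= dram_cap
            for t in range(a, f))
        if fits:
            placement[i] = "DRAM"
    return placement

tensors = [(0, 10, 60), (1, 3, 50), (3, 5, 50), (5, 7, 50)]
print(first_touch(tensors, 100))             # long-lived tensor hogs DRAM
print(shortest_lifetime_first(tensors, 100)) # short-lived tensors win DRAM
```

First-touch strands the long-lived tensor in DRAM and pushes every later tensor to PMM; the lifetime-aware policy keeps all three short-lived tensors in DRAM instead, which is the behavior Section 6.1 attributes to AutoTM.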
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Comparison to a hardware DRAM cache</head><p>We use very large networks to compare AutoTM to a hardware-controlled DRAM cache (2LM mode). The results for the large benchmarks Vgg416 and the large-batch-size Inception v4 are shown in Figure <ref type="figure">8</ref>. For our analysis, we use the price of the cheapest PMM at $7.85 per GB and the cheapest DRAM at $16.61 per GB, giving PMM a cost-per-GB advantage over DRAM of about 2.1x. In Figure <ref type="figure">9</ref> we only include the cost of the memory actually used. Since Optane DC is a new technology, prices are still adjusting; as the technology matures, the price will likely decrease, improving its cost effectiveness.</p><p>Figure <ref type="figure">9</ref> shows the relative performance of AutoTM for our workloads (bars, left axis) as well as the cost of memory used by the application relative to the all-DRAM case (dollar signs, right axis). The use of PMM can be cost effective if the performance lost by replacing some DRAM with PMM is less than the cost reduction. We observe that using only PMM is not cost effective: the performance loss caused by the slower devices is not offset by the lower price. However, for PMM to DRAM ratios of 4 : 1 and 1 : 1, AutoTM can provide a cost-performance benefit. This benefit may shrink when the whole system is taken into account, but the cost of memory is usually the dominant cost in large systems.</p></div>
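The break-even condition above can be sketched numerically from the quoted prices. This is a back-of-the-envelope aid, not the paper's analysis; the 0.8/0.7 relative-performance figures in the usage note are illustrative, not measured:

```python
# Prices quoted in Section 6.2.
PMM_PER_GB, DRAM_PER_GB = 7.85, 16.61

def relative_cost(pmm_gb, dram_gb):
    """Memory cost relative to an all-DRAM system of the same capacity."""
    total_gb = pmm_gb + dram_gb
    return (pmm_gb * PMM_PER_GB + dram_gb * DRAM_PER_GB) / (total_gb * DRAM_PER_GB)

def cost_effective(relative_performance, pmm_gb, dram_gb):
    """PMM wins when relative performance exceeds relative memory cost."""
    return relative_performance > relative_cost(pmm_gb, dram_gb)

# A 1:1 split costs ~0.74x an all-DRAM system, so any slowdown smaller
# than ~26% is a net cost-performance win at these prices.
print(round(relative_cost(64, 64), 3))   # 0.736
```

For example, `cost_effective(0.8, 64, 64)` is true (a 20% slowdown beats the ~26% cost saving) while `cost_effective(0.7, 64, 64)` is false.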
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Understanding the ILP Solution</head><p>In this section, we provide some insight into how and why AutoTM works using Figure <ref type="figure">10</ref>. Figure <ref type="figure">10a</ref> shows the slowdown of the static and synchronous formulations relative to all DRAM. With a small amount of DRAM, performance improves rapidly. This trend continues until a critical threshold, after which adding DRAM yields diminishing returns.</p><p>To understand this behavior, we look at the input and output memory locations for each kernel as well as the amount of data moved. Figure <ref type="figure">10b</ref> shows the percentage, by memory footprint, of kernel input and output tensors in DRAM. We see a trend of assigning as many kernel inputs and outputs as possible to DRAM, with a slight priority on output tensors. This is consistent with the lower write bandwidth of PMM. Furthermore, the point where almost 100% of input/output tensors are in DRAM corresponds to the critical point in the performance graphs. This implies a general strategy of maximizing kernel read and write memory accesses in DRAM, falling back to data movement to PMM when DRAM capacity is constrained.</p><p>This idea is reinforced by Figure <ref type="figure">10c</ref>, which shows the total amount of memory moved between DRAM and PMM in the synchronous formulation. With a DRAM limit near zero, no data movement occurs since no data may be moved into DRAM. A small DRAM allowance, however, is followed by a dramatic increase in data movement, again with an emphasis on moving data from DRAM to PMM. Once the DRAM limit allows almost all tensor inputs/outputs to reside in DRAM, the amount of data movement decreases. The region of gradual slowdown seen in the performance plot is caused primarily by data movement rather than by kernel slowdown from memory accesses to PMM.</p></div>
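The output-tensor priority visible in Figure 10b falls out of any optimizer once PMM's asymmetric read/write bandwidth is modeled. A toy exhaustive stand-in for the ILP shows this (the per-byte penalties and tensor access counts below are invented for illustration; the real formulation is far richer):

```python
# Each tensor is (size, reads, writes). PMM placement costs
# size * (reads * READ_PENALTY + writes * WRITE_PENALTY); DRAM is free but
# capacity-limited. Writes are penalized more, reflecting PMM's lower write
# bandwidth.
from itertools import product

READ_PENALTY, WRITE_PENALTY = 1.0, 2.5   # assumed, not profiled

def best_placement(tensors, dram_cap):
    """Brute-force the assignment minimizing modeled PMM traffic."""
    best, best_cost = None, float("inf")
    for assign in product((0, 1), repeat=len(tensors)):   # 1 = DRAM
        if sum(t[0] for t, a in zip(tensors, assign) if a) > dram_cap:
            continue
        cost = sum(t[0] * (t[1] * READ_PENALTY + t[2] * WRITE_PENALTY)
                   for t, a in zip(tensors, assign) if not a)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best

# Two same-size tensors, one read-heavy and one write-heavy, with DRAM room
# for only one: the optimizer keeps the *written* tensor in DRAM, mirroring
# the output-tensor priority seen in Figure 10b.
print(best_placement([(10, 2, 0), (10, 0, 2)], 10))   # (0, 1)
```

With symmetric penalties the two assignments would tie; the write penalty alone breaks the tie toward outputs.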
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5">Kernel Profiling Accuracy</head><p>To evaluate the accuracy of our profile-based approach, we show the error between the expected runtime and the measured runtime in Figure <ref type="figure">11</ref>. The worst-case error occurs in the static formulation for DenseNet 264 (19%). This error is likely due to CPU caching. During profiling, move nodes are placed at the inputs of kernels under test to allow the inputs and outputs of the kernel to be varied between DRAM and PMM. (Kernels cannot be profiled directly due to levels of indirection used in nGraph.) Because move nodes are implemented using streaming instructions, no data is resident in CPU caches following these instructions. Hence, our profiling step essentially measures the cold performance of these kernels. This results in an overestimation of runtime for the static formulation, since no move nodes are used. Vgg19 is less affected due to its very large intermediate layers.</p><p>The measured runtime for the synchronous formulation closely follows the expected runtime because of the move nodes placed in the computation graph. The error in the 1 : 0 all-PMM case exists for similar reasons.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6">ILP Solution Times</head><p>It is important that the memory optimizer run in a reasonable amount of time. Although ILP is inherently NP-hard, recent solvers can find solutions to many problems quickly. Table <ref type="table">3</ref> shows the total time spent optimizing the ILP, with the number of retries due to memory fragmentation shown in parentheses. Solution time increases with model complexity. Since the optimized computation graph will run for days or weeks to fully train the DNN, this optimization overhead is amortized. The worst case is the static formulation for DenseNet, which takes a little less than an hour to solve fully.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Extending AutoTM</head><p>In this section, we discuss two extensions to AutoTM: allowing asynchronous data movement and performing kernel implementation selection. We explain why these extensions were not included in the original formulation and demonstrate their viability on a CPU-GPU platform. These extensions and the GPU implementation of AutoTM show that it is a general and flexible framework for managing heterogeneous memory.</p><p>The first extension we investigate is asynchronous offloading and prefetching of intermediate tensors between memory pools. This allows data movement to be overlapped with computation, improving the throughput of the application as a whole. We implemented asynchronous data movement on the PMM system but found it performed poorly on existing CPU-only systems for a number of reasons: neither a dedicated copy thread nor DMA provided sufficient performance to mitigate the overhead of these approaches. However, a PCIe-connected GPU offers a high-speed asynchronous data copy API, which is ideal for implementing this extension.</p><p>The second extension to the formulation is kernel implementation selection. The underlying library used by nGraph to perform forward and backward convolutions on the GPU backend is cuDNN <ref type="bibr">[9]</ref>, a deep learning library from Nvidia. This library exposes several different implementations for each convolution, each with performance and memory footprint tradeoffs. Generally, faster implementations require more memory. In a memory-starved case, this larger memory footprint may require more offloading of previous tensors, resulting in a global slowdown. Since nGraph does not expose any kernel selection options for the CPU backend, we implement this extension on the GPU instead.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">ILP Formulation Modifications</head><p>Since AutoTM is implemented using an ILP formulation, we can extend it to be aware of the performance and memory footprint of these different kernels and globally optimize tensor movement and implementation selection. Here, we provide a high level overview of the additions to the ILP formulation to express asynchronous data movement and kernel implementation selection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.1">Objective Function:</head><p>In our formulation, we allow an arbitrary number of tensors to be moved between GPU and CPU DRAM concurrently with the execution of a kernel k. The per-kernel contribution to the objective becomes max(&#961;_k, max_{t &#8712; ASYNC(k)} M^async_{t,k}), </p><p>where ASYNC(k) = {t &#8712; T : t can be moved concurrently with k} and M^async_{t,k} is the amount of time (if any) spent moving tensor t during the execution of k. The max operation is implemented using standard ILP techniques.</p></div>
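The "standard ILP techniques" for a max inside a minimization can be sketched as follows (illustrative Python, not the paper's Julia implementation; the variable names rho_k and M_t1_k are ours): introduce a fresh variable z with one lower-bound constraint per term, and let the minimizing objective drive z down onto the true maximum.

```python
def linearize_max(terms, z="z"):
    """Return constraint strings encoding z = max(terms) under minimization."""
    return [f"{z} >= {t}" for t in terms]

# For a kernel k overlapped with moves of tensors t1 and t2, the term
# max(rho_k, M_t1_k, M_t2_k) becomes one variable and three constraints:
print(linearize_max(["rho_k", "M_t1_k", "M_t2_k"]))
# ['z >= rho_k', 'z >= M_t1_k', 'z >= M_t2_k']

# Numeric sanity check: the least z satisfying every constraint is the max.
vals = {"rho_k": 5, "M_t1_k": 7, "M_t2_k": 3}
z_min = min(z for z in range(100) if all(z >= v for v in vals.values()))
print(z_min)  # 7
```

This works because z only appears with lower bounds, so minimization makes at least one constraint tight.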
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.2">Tensor Graphs:</head><p>We must extend the tensor flow graphs G_t to encode points of asynchronous tensor movement. We identify kernels that can be overlapped with data movement and add a component to each tensor's graph (like those shown in Figure <ref type="figure">6b</ref>) for each kernel with which the tensor can be moved concurrently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.3">Asynchronous Data Movement:</head><p>Asynchronous move times for tensor t must be generated for each kernel k across which t may be moved. This comes directly from the extended tensor graph, </p><p>where e_{P&#8594;D} (e_{D&#8594;P}) is the binary edge variable in E_t corresponding to the asynchronous movement of t from PMM to DRAM (DRAM to PMM) across kernel k.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.4">Selecting Kernel Implementations:</head><p>Let I(k) = {1, 2, . . . , n_k} be an enumeration of the implementations for kernel k. We generate one-hot binary variables v_{i,k} for all i &#8712; I(k), where v_{i,k} = 1 implies implementation i is to be used for kernel k.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.5">DRAM Constraints:</head><p>Constraining DRAM is similar to the static and synchronous formulations, but the constraint now also includes the kernel memory footprints, </p><p>where s_{k,i} is the memory footprint of implementation i of kernel k.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.6">Kernel Timing:</head><p>The expected runtime of a kernel now depends on which implementation of the kernel is chosen. Building on the example given in Section 3.5, assume that k has two implementations (i.e., I(k) = {1, 2}). The expected execution time &#961;_k of kernel k is modeled by weighting the profiled runtimes by the selection variables, where n_{k,c,i} is the profiled runtime of implementation i of kernel k in IO configuration c. This approach does not account for the performance impact of memory conflicts between data movement and the computation kernel. However, the maximum memory bandwidth of our GPU is 616 GB/s while the maximum bandwidth of PCIe is 16 GB/s, so the impact of asynchronous data movement is likely low.</p></div>
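The implementation-selection tradeoff of Sections 7.1.4-7.1.6 can be sketched for a single kernel in isolation. The runtimes and workspace sizes below are made up, and the real formulation selects implementations jointly with tensor placement in one ILP rather than greedily per kernel:

```python
def pick_implementation(impls, free_dram):
    """impls: list of (runtime, workspace_bytes) per implementation i of a
    kernel k. Return the index of the fastest implementation whose workspace
    s_{k,i} fits in the remaining DRAM (the one-hot v_{i,k} the ILP would
    set to 1 for this isolated kernel)."""
    feasible = [(runtime, i) for i, (runtime, ws) in enumerate(impls)
                if ws <= free_dram]
    if not feasible:
        raise ValueError("no implementation fits")
    return min(feasible)[1]

# cuDNN-style menu (invented numbers): faster convolutions need more
# workspace memory.
impls = [(1.0, 0), (0.6, 2 << 20), (0.4, 64 << 20)]
print(pick_implementation(impls, 128 << 20))  # ample DRAM -> fastest, index 2
print(pick_implementation(impls, 4 << 20))    # tight DRAM -> middle, index 1
```

The global ILP can do better than this greedy view: it may deliberately pick a slower, smaller implementation when the freed workspace avoids offloading a tensor elsewhere in the graph.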
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Implementation</head><p>We modified the GPU backend of nGraph to support synchronous and asynchronous tensor movement as well as to allow kernel selection for forward and backward convolution kernels. All GPU kernels are profiled with inputs and outputs in GPU memory. When implementation selection is available, all possible implementations of a kernel are profiled as well. Asynchronous movement was implemented using two CUDA <ref type="bibr">[31]</ref> streams: one for computation and the other for data movement via cudaMemcpyAsync. These streams are synchronized before and after an asynchronous movement/computation overlap to ensure data integrity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3">Methodology</head><p>Our system used an Nvidia RTX 2080 Ti with 11 GB of GDDR6, CUDA 10.1, and cuDNN 7.6. The host system was an Intel Core i9-9900X with 64 GB of DDR4 DRAM. We use the same convolutional neural networks as earlier; the networks and batch sizes used are given in Table <ref type="table">1</ref>. We compare the results of AutoTM with the performance of cudaMallocManaged, a memory virtualization layer offered by Nvidia that automatically migrates data from the CPU to the GPU in the event of a page fault and moves unused pages from GPU DRAM back to CPU DRAM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4">GPU Results</head><p>The results for the GPU experiments are given in Figure <ref type="figure">12</ref>. For networks that fit on the GPU, our approach has no overhead, as the ILP optimizer realizes no data movement is needed. As the intermediate working set increases, we observe a several-fold improvement with AutoTM over cudaMallocManaged due to the lack of runtime overhead in our approach and its awareness of the algorithm. AutoTM provides considerable speedup when data movement between the CPU and GPU is required. The asynchronous extension outperforms the synchronous formulation through its ability to overlap data movement and computation. However, the asynchronous extension is limited to overlapping tensor movement with a single kernel at a time. Since the RTX 2080 Ti executes kernels faster than data movement, time must be spent synchronizing the two CUDA streams.</p><p>The synchronization overhead of overlapping tensor movement with a single kernel can be seen by comparing the achieved performance with the theoretical best performance, calculated by assuming infinite GPU DRAM capacity and using the fastest possible implementations for all kernels. As the memory requirement for training increases, AutoTM achieves a lower fraction of this best performance due to synchronization. We did not compare our results directly against vDNN <ref type="bibr">[36]</ref> for two reasons. First, the RTX 2080 Ti GPU is much faster than the Titan X used in that work, so we cannot compare results directly. Second, the code for vDNN is not available, making direct testing on our GPU difficult. However, while vDNN leverages the same characteristics as AutoTM (communication overlapping, kernel selection, and liveness analysis), AutoTM uses mathematical optimization rather than heuristics, providing a more general solution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Related Work</head><p>As an emerging technology, Intel Optane DC has been explored in several recent works, including in-depth performance analysis <ref type="bibr">[25]</ref>, large graph analytics <ref type="bibr">[15]</ref>, and database I/O primitives <ref type="bibr">[42]</ref>. Research into using Optane PMM for virtual machines demonstrates that only a small amount of DRAM is needed <ref type="bibr">[23]</ref>. Flash-based SSDs have also been used to reduce the DRAM footprint in database <ref type="bibr">[13]</ref> and ML <ref type="bibr">[14]</ref> workloads. These approaches use a software-managed DRAM cache to mitigate the slow performance and block-level read/write granularity of NVM SSDs. Operating system support for managing heterogeneous memory <ref type="bibr">[2,</ref><ref type="bibr">45]</ref> and support for transparent unified memory between GPU and CPU <ref type="bibr">[26,</ref><ref type="bibr">33]</ref> have been studied extensively in the past. However, to the best of our knowledge, the proposed work is the first to explore the design space and cost-performance tradeoffs of large-scale DNN training on systems with DRAM and PMM.</p><p>Previous works such as vDNN <ref type="bibr">[36]</ref> exploit heterogeneous memory between GPUs and CPUs by recognizing that the structure of DNN training computation graphs has a pattern where intermediate tensors produced by early layers are not consumed until much later in the graph execution. The authors of vDNN exploit this to develop heuristics for moving these tensors between GPU and CPU DRAM during training to free GPU memory. SuperNeurons <ref type="bibr">[44]</ref> and moDNN <ref type="bibr">[8]</ref> build on vDNN. SuperNeurons introduces a runtime manager for offloading and prefetching tensors between GPU and CPU memory as well as a cost-aware method of recomputing forward-pass layers during the backward pass to reduce memory.
Similar to our approach, moDNN allows tensors to be offloaded and uses profiled kernel runtimes and expected transfer times to determine how it will overlap computation and communication. AutoTM differs from these previous approaches in that we use mathematical optimization rather than problem-specific heuristics. AutoTM also generalizes the location of data across DRAM and PMM instead of requiring data to be in DRAM for computation.</p><p>Integer Linear Programming and profile-guided optimization have been used widely to address similar problems in the research literature. For example, work in the embedded systems space <ref type="bibr">[4]</ref> uses ILP to optimize the allocation of heap and stack data between fast SRAM and slow DRAM. ILP has also been used in register allocation <ref type="bibr">[17]</ref>, automatic program parallelization <ref type="bibr">[20]</ref>, and the optimization of instruction set customization and spatial architecture scheduling <ref type="bibr">[32]</ref>. Profile-guided optimization has been used for dynamic binary parallelization <ref type="bibr">[48]</ref>, process placement on SMP clusters <ref type="bibr">[7]</ref>, and online autotuning of CPU and GPU algorithm selection <ref type="bibr">[34]</ref>. AutoTM builds on these ideas to address the new problem of data movement in heterogeneous memory systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Conclusions</head><p>We present AutoTM, an ILP formulation for modeling and optimizing data location and movement in static computation graphs such as those used for training and inference of DNNs. AutoTM uses profile data to optimally assign kernel inputs and outputs to different memory pools and to schedule data movement between the two pools to minimize execution time under a memory constraint. With AutoTM, we obtain a 2x performance improvement over hardware DRAM caching solutions. We further find that Intel Optane PMM can reduce the DRAM footprint of DNN training by 50% to 80% without significant loss in performance. Given the lower cost of Optane PMM, this can yield a cost-performance benefit in systems with mixed DRAM and PMM over a system with only DRAM.</p><p>AutoTM uses minimal problem-specific heuristics, making it generally applicable to different systems and networks. We demonstrate this flexibility by extending AutoTM to GPUs, and believe it can be extended to other heterogeneous systems, such as those with multiple GPUs or multilevel memory systems with HBM, DRAM, and PMM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10">Acknowledgements</head><p>This work is supported in part by the Intel Corporation and by the National Science Foundation under Grant No. CNS-1850566.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>https://www.lenovo.com/us/en/p/7X05A01TNA/customize?dcscGuid= f3dd16d0-96dd-4deb-9c48-9c6cec9578ba (accessed August 14, 2019)</p></note>
		</body>
		</text>
</TEI>
