<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>CachedArrays: Optimizing Data Movement for Heterogeneous Memory Systems</title></titleStmt>
			<publicationStmt>
				<publisher>38th IEEE International Parallel and Distributed Processing Symposium (IPDPS)</publisher>
				<date>05/27/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10519584</idno>
					<idno type="doi"></idno>
					
					<author>Mark Hildebrand</author><author>Jason Lowe-Power</author><author>Venkatesh Akella</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We propose a new framework called CachedArrays and a set of APIs to address the data tiering problem in large scale heterogeneous and disaggregated memory systems. The proposed framework operates at a variable size object granularity and allows the programmer to specify semantic hints about future use of data via a Policy API, which are used by a Data Manager to choose when and where to place a particular data object using a data management API, thus bridging the semantic gap between the programmer and the platform-specific hardware details, and optimizing overall performance. We evaluate the proposed framework on a real hardware platform with terabytes of memory consisting of NVRAM and DRAM on large scale ML training workloads such CNNs that exhibit different data access and usage patterns. We show that CachedArrays outperforms hardware caches, and can exploit many of the algorithmic-specific optimizations of prior works.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>From the dawn of computing, memory has been a significant bottleneck in our quest to improve performance. An idea memory subsystem provides high bandwidth, low latency, low cost per bit, and high-capacity. However, these competing demands cannot be satisfied with a single memory technology. With the advent of emerging interconnect standards like Compute Express Link (CXL) <ref type="bibr">[1]</ref>, system architects are looking to satisfy these conflicting requirements by creating a memory subsystem using multiple different technologies. Heterogeneous memory naturally introduces the need for data tiering <ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref> or moving the data that is being accessed by a program to a memory pool that suits the nature of the memory access pattern. Traditionally hardware-managed multilevel caches and operating system (OS) page migration methods have been used to address this problem. However, for emerging ML workloads with working sets in the range of terabytes and complex, application-specific data use patterns, these techniques are not effective.</p><p>Transparent hardware-managed caching can cause performance degradation when the memory sizes are hundreds of gigabytes or more <ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref>. Implementations of hardwaremanaged DRAM caches such as Intel's Cascade Lake systems are inefficient due to cache-line-level metadata tracking and write amplification, which results in poor bandwidth utilization that is particularly detrimental for heterogeneous systems using phase change memory (PCM) based technologies such as Intel's Optane DC <ref type="bibr">[6]</ref>. Moreover, these systems can benefit from bespoke optimizations to make data movement more efficient <ref type="bibr">[6]</ref>. For example, SAGE <ref type="bibr">[7]</ref> proposes new data structure design and new algorithms for graph computations, AutoTM <ref type="bibr">[8]</ref> uses static analysis using ILP-based techniques to optimize data movement for a certain class of ML applications such as Convolutional Neural Networks (CNNs) For GPU systems, vDNN <ref type="bibr">[9]</ref> and its derivatives <ref type="bibr">[10]</ref>, <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref> exploit the unique characteristics of the backprop algorithm to augment limited GPU memory with memory on the host.</p><p>These solutions are often ad hoc for a particular application, and they often require significant rewrite of the applications. To efficiently use memory and move data in a disaggregated system, we need a framework which is easily modified for new applications, algorithms, and platforms that defines exactly what the programmer should specify and what the underlying hardware can do. We believe such a framework requires a separation of concerns between the application's data access, the policy used to direct the data movement, and the underlying data movement mechanism. Thus, in this paper we present CachedArrays as an example framework which enables mostly automatic application hints for future data with a customizable data movement policy, and an independent data manager to handle data movement between different memories. Furthermore, we believe that often one-size-doesnot-fit all, and the framework should operate at a programspecific level of granularity (as opposed to the 64-byte cache block) so that it is easier for the programmer to convey semantic information and more efficient in terms of metadata tracking. This renders the overall framework more transparent to the programmer (more like hardware-managed caches), and shields the programmer from low-level details of data movement, mitigating the need for a significant application rewrite while still realizing performance benefits.</p><p>We propose the design, implementation, and evaluation of a framework called CachedArrays (see Figure <ref type="figure">1</ref>) that realizes these requirements. As described in Section III, CachedArrays the architectural abstractions of a policy and a data manager. The policy is constructed using the data management API and using the semantic intent by the programmer conveyed to via a policy API. We describe the requirements for a data movement framework and the details of software architecture of CachedArrays and an API suitable for implementing large scale deep neural networks in Section III. In Section IV we discuss the details of the implementation of CachedArrays in Julia programming language. Since disaggregated memory</p><p>TABLE I: Summary of related work in data management in heterogeneous memory systems. Work Abstraction Layer Granularity Programmability Mechanism Sage [7] Algorithm Data Structure Application Specific Manual Partitioning vDNN <ref type="bibr">[9]</ref>, ZeRO-Offload <ref type="bibr">[13]</ref> Application Tensor Application Specific Manual Partitioning AutoTM <ref type="bibr">[8]</ref>, Sentinel <ref type="bibr">[3]</ref>, <ref type="bibr">[14]</ref>, DLRM <ref type="bibr">[15]</ref> Compiler Tensor Transparent Profile Guided Optimization Nimble <ref type="bibr">[16]</ref>, KLOC <ref type="bibr">[17]</ref>, Thermostat <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref>, <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>, <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref> Operating System Page Transparent Virtual Memory Memory Mode/2LM (in Intel NVRAM) Hardware Cache Block Transparent HW Managed Cache Kona [27], t&#228;k&#333; [28] Hardware Cache Block Transparent SW Runtime based on Cache Coherence CachedArrays (This work) Application Variable Sized Object Transparent (with annotations) Type System/Runtime systems with CXL are currently nascent, we instead evaluate the proposed framework on heterogeneous memory machine with both Optane Non-Volatile Random Access Memory (NVRAM) and DRAM-based memories. We show that for training Convolutional Neural Networks (CNNs) CachedArrays improves performance 1.4&#215; to 2.03&#215; over baseline hardware managed cache (called 2LM) on Intel's platform.</p><p>Interestingly, we show that some of the proposed optimizations (such as eager memory freeing) can improve the performance of the hardware managed caches as well.</p><p>The main contributions of this work are as follows:</p><p>&#8226; We show that by separating the concerns of communicating the application's future data use pattern (hints/annotations), when and where to move data (policy), and the underlying data movement mechanism (data manager) leads to a performant and generic data management system.</p><p>&#8226; We present CachedArrays, a concrete implementation of these ideas in Julia, enabling applications with arraylike data structures to transparently take advantage of heterogeneous memory with optional data usage hints. Detailed evaluation of using CachedArrays on a real heterogeneous memory (NVRAM+DRAM) platform on training large deep learning models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>With the advent of heterogeneous memory platforms and the growing memory footprint of ML workloads there has been significant interest in understanding and optimizing data movement <ref type="bibr">[12]</ref>, <ref type="bibr">[29]</ref>, <ref type="bibr">[30]</ref>, <ref type="bibr">[31]</ref>, <ref type="bibr">[19]</ref>, <ref type="bibr">[32]</ref> at all levels across the software/hardware stack. Table <ref type="table">I</ref> summarizes works most closely related to this paper.</p><p>Some prior works have proposed data structures and algorithms specifically for DRAM/NVRAM memory systems to overcome this challenge and show significant performance improvements using their bespoke solutions <ref type="bibr">[33]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[34]</ref>, <ref type="bibr">[35]</ref>, <ref type="bibr">[36]</ref>. In our work, we look at the optimizations these prior works have used and develop a general approach which can be applied to many different applications.</p><p>There have been many works focused on optimizing deep learning workloads in the CPU/GPU environments to overcome the memory bottleneck issue for GPUs. vDNN <ref type="bibr">[9]</ref>, ZeRO-Offload <ref type="bibr">[13]</ref> and other related works in this line <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[10]</ref>, <ref type="bibr">[37]</ref> exploited the algorithmic characteristics of these workloads, such as backpropagation, to decide on data placement and movement. These solutions call for deep algorithmic changes, requiring the programmer to manually determine the reuse pattern and figure out a suitable data movement scheme to actually change the algorithm, accordingly. Other research works tried to take such decisions at the compiler level including AutoTM <ref type="bibr">[8]</ref> which used ILP to optimize data movement and Sentinel <ref type="bibr">[3]</ref> which used runtime profiling information. However, these solutions suffer from scalability and generalization if the workloads' reuse patterns are sparse or less straightforward to detect. Examples of such workloads include Deep-Learning Recommendation Models (DLRMs) which have sparsely accessed embedding tables. Hildebrand et al. <ref type="bibr">[15]</ref> show that DLRM workloads with sparse accesses also benefit from software data movement policies, but the policy must be flexible enough to adapt to the workload. The CachedArrays framework addresses these shortcomings by minimizing the required algorithmic changes by programmers and enabling a specialized placement and movement policy based on characteristics of the program and the reuse pattern.</p><p>Many works proposed intelligent data migration between slow and fast memories in page granularity to perform data tiering at the OS level, using runtime characteristics such as reuse pattern and page hotness <ref type="bibr">[22]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[38]</ref>, <ref type="bibr">[21]</ref>, <ref type="bibr">[23]</ref>. However, like hardware-based techniques, these works do not take into account future information about the data use and the semantic information from the application.</p><p>Finally, other works considered hardware/software redesign for data management in memory systems including Kona which removes the overhead of conventional virtual memory from the critical path of tracking applications <ref type="bibr">[27]</ref> and t&#228;k&#333; which enables application to be informed of the memory access behavior and customize actions based on callbacks <ref type="bibr">[28]</ref>. Both Kona and t&#228;k&#333; provide hardware and software components for better data management at the runtime. However, none of them provide support for semantic information sharing from application to the hardware, to form a suitable data management policy. T&#228;k&#333; overlaps with what we introduce as the data manager in Section III-C; however, t&#228;k&#333;isreactionary to the cache (e.g., its callbacks can be triggered on a miss or an eviction) whereas CachedArrays is proactive and triggers the evictions and insertions before the application accesses the data. Finally, t&#228;k&#333; has not been applied to massive 100 GiB DRAM caches and has many of the same downsides as hardware-managed DRAM caches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. CACHEDARRAYS DETAILS</head><p>In this section we discuss the requirements for a highperforming data movement system which does not require deep programmatic changes, the software architecture of our CachedArrays design, and end with an example end-to-end design for data movement for training large-scale convolutional neural networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Requirements for a Data Movement Framework</head><p>We observe the following optimizations are important for high-performance data management on heterogeneous memory platforms.</p><p>1) Initially allocate data only in one specific device (e.g., fast memory). Hardware caching potentially requires an initial movement from backing memory to the cache, which increases data movement. 2) Elide useless writebacks from one device to another when the data is deallocated. Section V shows this optimization significantly reduces NVRAM writes compared to the hardware managed cache. 3) Move data at a large granularity instead of at the blocklevel which more efficiently uses memory devices <ref type="bibr">[6]</ref>, <ref type="bibr">[4]</ref>. Section V shows that in CNN applications with a simple policy, CachedArrays has higher average memory bus utilization than the hardware managed cache. 4) Avoid polluting the fast memory with data which is rarely referenced, retaining only relevant data in the faster memory. These optimizations are not possible with fully transparent data management mechanisms such as NUMA and hardware caching. Thus, to generalize access to these optimizations for more applications, we need a data movement management framework such that the semantic information from the application can be used to drive the data movement without requiring the application to directly interact with the data movement mechanism.</p><p>Only the application (or runtime, compiler, programmer) has the semantic information to know the future use of the data which is required for the optimizations above. Specifically, we have found the following semantic information to be the most important to communicate from the application-level to the data manager.</p><p>&#8226; "I am going to read this object"</p><p>&#8226; "I am going to write this object" &#8226; "I am not going to access this object for a while" &#8226; "I am never going to access this object again"</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Data Management API Framework</head><p>Our main idea is to separate the three concerns of the data management system: the application's access to data, the data movement mechanism (the data manager), and the data movement policy. By separating these three components, we enable the data movement to use semantic information from the applications without requiring the application to directly interact with the data movement mechanism. Figure <ref type="figure">1</ref> presents Fig. <ref type="figure">1</ref>: High level idea of a generic heterogeneous memory management system. a high-level overview of a generic heterogeneous memory management system.</p><p>The data access from the application is separated from the other components by a level of indirection (the object).</p><p>The semantic information about the data use is separated from the other components via the policy API which the application uses to communicate information about the data use to the policy. Finally, the data movement mechanism is separated from the other components via the data management API. An expert programmer or language runtime designer can implement a policy for data movement which directly interacts with the data manager layer to move data without requiring application changes due to object abstraction.</p><p>In the rest of this section, we will describe each component of this system in detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Data Manager</head><p>The data manager exposes an API so that the policy can direct the data movement. This data management API includes functions to move data, allocate space, query the current state of the system, and update the current state of the system. There are many partial implementations of this API (e.g., t&#228;k&#333; <ref type="bibr">[28]</ref>, libmemkind, SMDK). However, they lack fundamental features such as the ability to transparently move data between devices and associate data on multiple devices with a single logical object. Thus, we build CachedArrays from the ground up, implementing allocators supporting these operations on top of raw memory obtained either through a single large malloc or a memory map from a direct access (DAX) file system.</p><p>Objects and Regions: One key observation in prior work to enable the optimizations discussed in Section III-A is that data movement should be done on the "object" level within a program. The object level is where the programmer, compiler, and runtime have specific knowledge of the semantics of the data within their program which can be used to drive data movement and placement considerations. For instance, the programmer knows whether an array will be accessed sparsely or densely, which can affect caching decisions. Examples of "objects" include standard data structures (e.g., standard containers like std::vector in C++) and tensors in CNNbased workloads. In this paper, we focus on CNN-based workloads with relatively large (&gt; 100s of KiB) tensors which are both densely and sparsely accessed.</p><p>To support moving objects between different memory pools and construction of higher order constructs like two-level caches we use the idea of regions. A region is a contiguous slice of virtual memory that either holds the current data for an object (which we call the primary region) or copies of the data for an object (where we call each copy a secondary). Two regions are said to be linked if they both are associated with the same object. An object's secondary regions may be valid if the primary is clean and read only,o rstale if the primary has been updated and without propagating this change to all its secondaries. Our initial prototype of CachedArrays requires its underlying memory heaps to be preallocated from the operating system prior to execution (i.e., no dynamic memory allocation from the OS through system calls like mmap). Dynamic memory could be added through growing or shrinking the base heap on each device or though multiple heaps per device that are mapped and freed as required. CachedArrays inherently supports object reallocation which mitigates fragmentation in either case.</p><p>Data Access: In our implementation of CachedArrays,w e focus on applications which use the kernel programming model (i.e., most computation is done in kernels). The kernels operate on objects by either reading an object, writing an object, and allocating and freeing temporary objects. This model fits well with deep learning frameworks (e.g., Tensorflow <ref type="bibr">[39]</ref>, PyTorch <ref type="bibr">[40]</ref>, OneAPI [41], etc.) and many high-performance computing applications.</p><p>Because of the kernel programming model, the extra level of indirection for accessing the underlying data of objects can be easily hidden. When the kernel is executed, the runtime can replace the object reference with the current primary region pointer. Unlike OS-based page tables and other "general purpose" indirection tables, the indirection in CachedArrays is essentially zero overhead since it happens once for a longrunning kernel. One limitation of our current implementation is that an object's primary cannot change during the execution of a kernel. However, we have not observed any examples of prior optimizations which allow for changing the data location while it is actively being accessed.</p><p>Data management API: The data manager in CachedArrays supports allocation and deallocation of regions, linking and unlinking of regions, high-performance memory copying between regions, and reassignment of an object's primary region.</p><p>There are three broad categories of functions: those working on objects, those working on regions, and those working on devices. The former consists of just two functions: getprimary, to obtain the primary region for an object and setprimary,t o update an object's primary region.</p><p>Functions in the second category include allocate and free methods for each supported device, as well as a fast memory copy (copyto) between and within devices. Regions can be linked or unlinked using link and unlink respectively. Query functions (sizeof, getlinked, and in) provide methods for obtaining the size of a region, finding a linked region's on a specified device (if one exists), and querying the device to which a region belongs. Regions can be marked and queried as dirty or clean, which helps maintain consistency when an object is associated with multiple regions. Finally, parent provides a mechanism of going from a region to its object.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Data Movement Policy</head><p>The goal of the policy is to coordinate the assignment of objects to regions to improve application performance. The application provides hints to policy regarding how and when objects are used. The policy reacts to this information by using the data management API previously described to manipulate the state of objects and regions. This separation of data management from policy is important because <ref type="bibr">(1)</ref> it allows the application to influence data placement without needing to understand the low-level details associated with the data management API and (2) allows the policy to be tuned on a per-application basis. In this section, we describe one data movement policy interface which is well suited for very large footprint CNN workloads running on a Cascade Lake heterogeneous memory system with a large NVRAM (Intel Optane DC) and smaller DRAM.</p><p>The policy we implement for this class of workloads exposes the functions outlined in Table <ref type="table">II</ref>. If the application is aware that an object will be used in the near future (for example, when it is about to execute a compute kernel), it can notify the policy with will_use. More specific hints can be provided with will_read or will_write if this is known. This allows the policy to perform preemptive action on the corresponding objects and track object usage. When an object will not be used for some time, it can be marked as archived. Finally, if the application knows that an object will never be used again, it can use retire. Only improper use of retire will affect the correctness of the application. All other functions only influence performance of the application.</p><p>How the policy reacts to these hints is an implementation detail of the policy. For example, the policy may respond to all instances of will_use by prefetching the corresponding object into fast memory. In order to drive these decisions, the policy needs to be aware of the relevant characteristics of its memory devices. For a DRAM/NVRAM system, these include: TABLE II: The policy for CachedArrays exposes an objectbased API for applications to provide hints regarding memory location and movement. These hints can either be placed manually by the programmer or inserted by the compiler.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Operation Description</head><p>will_use/read/write</p><p>Will read or write in the near future.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>archive</head><p>Will not be used for some time.</p><p>retire Will never use again.</p><p>1 function evict(policy, object) 2 x = DM.getprimary(object) 3 if DM.in(x, FAST) 4 y = DM.getlinked(x, SLOW) 5 sz = DM.sizeof(x) 6 allocated = false 7 if isnothing(y) 8 y = DM.allocate(SLOW, sz) 9 allocated = true 10 end 11 if DM.isdirty(x)| |allocated 12 DM.copyto(y, x) 13 end 14 DM.setprimary(object, y) 15 if !allocated 16 DM.unlink(x, y) 17 end 18 DM.free(x) 19 end 20 end Listing 1: An example of building an eviction function from the CachedArrays data manager interface (prefixed by DM).</p><p>To understand how the policy interacts with the data manager, we present the implementation of two kinds of operations the policy may take: evicting and object from fast to slow memory and prefetching an object from slow to fast memory. These may be executed in response to the archived or the will_use hints from the application, respectively. Note that the policy maintains the following invariant: if an object has a region in fast memory, then this region will be the primary region for that object.</p><p>Listing 1 shows an example implementation of the evict operation. If x is already linked with another region in slow memory, only a simple copy and free of x fast is needed. Otherwise, x slow needs to be allocated (see lines 4-10). Lines 11-13 show a potential optimization: if we can track whether or not x has been modified while in fast memory then we may be able to elide the expensive copy operation. Line 14 updates the object's primary region y. If a linked region existed in slow memory, x and y need to be unlinked (line 16). This does not need to be performed if the slow region was just allocated because the slow and fast regions were then never linked to begin with. Finally, x is freed (line 18). As a more complicated example, Listing 2 shows an implementation of prefetch (e.g., in response to will_use). In line 3, the primary region for the object is retrieved. Line 4 checks the device that region resides in: if the region is already in fast memory, there is nothing to be done. Line 5 tries to allocate a similar sized region in fast memory. If the operation fails (because fast memory is full) and the operation is forced, then the policy will forcibly free memory from the fast memory. This is done by selecting an initial region</p><p>1 function prefetch( 2 policy, object, force::Bool = false) 3 x = DM.getprimary(object) 4 if DM.in(x, SLOW) 5 sz = DM.sizeof(object) 6 y = DM.allocate(FAST, sz) 7 if isnothing(y)&amp; &amp;force 8 start = find_region(policy) 9 DM.evictfrom( 10 FAST, start, sz 11 ) do region 12 evict(policy, DM.parent(region)) 13 end 14 y = DM.allocate(FAST, sz)::Region 15 else 16 return 17 end 18 DM.copyto(y, x) 19 DM.link(x, y) 20 DM.setprimary(object, y) 21 end 22 return 23 end Listing 2: Building a prefetching function out of the data movement API.</p><p>start via some heuristic like LRU (line 8) and using the function evictfrom (lines 9-11) though the data manager to free a contiguous block of memory from the fast memory pool of the appropriate size. The evictfrom function uses a callback mechanism that allows the policy to use the previously described evict to preserve existing objects. Following the eviction, the fast memory allocation is performed again (line 12). Data is copied (line 16), the two regions are linked as siblings (line 17) and the fast memory region is assigned as the primary region (line 18).</p><p>We implement and evaluate multiple different policy implementations which take different actions for the same application hints in Section V.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. End-to-End Example</head><p>We use these APIs to implement an end-to-end example of using our split policy-data management interface to optimize data movement for very large-scale CNN training. Our underlying system is a heterogeneous system with DRAM and NVRAM. The important characteristics of the workload that we can take advantage of are the following.</p><p>&#8226; Each iteration of training is broken into two phases: a forward pass that computes the output of the network and a backward pass that computes the gradients of the model's trainable parameters.</p><p>&#8226; During the forward pass many intermediate activations are generated that are not consumed until the backward pass.</p><p>&#8226; These activations are generally used and freed in a first-in last-out manner. During the forward pass, the convolution reads an input activation tensor, weights, and bias and writes to an output activation tensor. The backward pass kernels require these intermediate activations, which thus must be kept in memory until needed.</p><p>For each compute kernel, we call will_read on read-only parameters and will_write on written parameters, providing the policy with the opportunity to prefetch data. Following kernel execution on the forward pass, archive is called on the weights, bias, and previous activations. A reasonable policy implementation will not eagerly evict data upon an archive annotation and will instead prioritize the annotated objects for future eviction if memory pressure is experienced. This means that if the memory required to train the network fits entirely in fast memory, then there is no down side to using archive.</p><p>Finally, retire is handled a little more carefully. For simple linear networks like VGG, intermediate activations can be retired on the backwards pass on a layer-by-layer basis. More complicated networks like ResNet or DenseNet require more precise annotations for which we leverage the Julia language, though we could also leverage APIs in PyTorch or Tensorflow to accomplish this task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. IMPLEMENTATION AND EVALUATION METHODOLOGY</head><p>We implemented a software prototype of our data manager in Julia <ref type="bibr">[42]</ref>, a high-level compiled language targeting technical computing. The main idea behind our implementation is to create an array datatype called a CachedArray in Julia backed by objects managed by the data manager. Policy hints take the form of function calls that are forwarded from a CachedArray to the manager's policy, which then uses the data management API to manipulate the regions backing the objects. We use Julia's ChainRules <ref type="bibr">[43]</ref> package and the Zygote <ref type="bibr">[44]</ref> automatic differentiation framework to implement deep learning model training. Semantic annotations are applied when compiling the model with Zygote and therefore don't necessarily require modifications to the original source code. PyTorch 2.0 can provide similar capabilities through the custom compilation of compute graphs.</p><p>Next, we describe the general methodology used to evaluate the performance of CachedArrays. Unless otherwise specified, our case studies were performed as the sole user of a 2-socket 56 core (112 thread) Intel Xeon Platinum 8276L running Ubuntu 21.10 with 192 GiB (6x32 GiB) DRAM and 1.5 TB (6x256 GiB) Optane DC NVRAM per socket. Experiments were conducted on a single socket, with one thread per core.</p><p>To implement deep learning primitives, we wrote a Julia wrapper around Intel's oneDNN [45] library. Memory for kernel input and output tensors comes from the Julia side.</p><p>We investigate several different optimizations in the policy to understand their relative impact. These include: Memory Optimizations (M): We retire arrays as soon as possible rather than relying solely on Julia's garbage collector (GC), fulfilling requirement 2 in Section III-A. By doing this, we reduce the amount of data kept alive longer than necessary. If this intermediate data is kept alive, then it must be written to NVRAM if evicted to maintain correctness, resulting in unnecessary slow NVRAM writes. Disabling this optimization means we need to rely on the GC for resource management which involves explicitly triggering collection when memory pressure is detected. Local Temporary Allocations (L): As discussed in Section III-A (requirement 1), CachedArrays has been designed</p><p>TABLE III: Large and small CNN models used as benchmarks. Small networks fit entirely in DRAM. Large Networks Small Networks Model Batchsize Footprint Model Batchsize DenseNet 264 1536 526 GB DenseNet 264 504 ResNet 200 2048 529 GB ResNet 200 640 VGG 416 256 520 GB VGG 116 320 to support unlinked regions in fast memory. In combination with memory optimizations, this is a potent tool for reducing NVRAM traffic. Our policy can be modified to disable local allocations, in which case a newly created array must first be allocated in NVRAM and then copied into DRAM before use. The purpose of disabling this optimization is to more directly compare with a naive 2LM implementation. Since 2LM operates as a DRAM cache, each physical cache line must ultimately have a copy in NVRAM. By effectively generating a compulsory miss on first access with CachedArrays,w ec a n more closely model the behavior of 2LM. Prefetching (P): We explore two different policy strategies for responding to will_read annotations: one that always prefetches into DRAM and one that never does. The relatively high read bandwidth of NVRAM suggests that kernel execution time may be less sensitive to the location of readonly arguments. Further, prefetching requires making room in DRAM for the prefetched arguments, which could result in other arrays being evicted to NVRAM. By moving data at the block granularity we fulfill requirement 3 in Section III-A.</p><p>The combination of optimizations and operating modes investigated is as follows:</p><p>&#8226; 2LM: &#8709; -2LM with no memory optimizations.</p><p>&#8226; 2LM: M -2LM with memory freeing optimizations.</p><p>&#8226; CA: &#8709; -CachedArrays with no memory optimizations or prefetching. All arrays begin in NVRAM and are moved into DRAM before use, like in a true cache.</p><p>&#8226; CA: L -No memory optimization or data prefetching. Unlike CA: &#8709;, arrays can be allocated in DRAM only.</p><p>&#8226; CA: LM -Memory optimizations but no data prefetching.</p><p>&#8226; CA: LMP -Memory optimizations and prefetching.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Large Networks: Comparison With 2LM</head><p>The Xeon server used to run these experiments can be configured in two modes, allowing multiple different use cases for NVRAM. The memory mode (2LM) configures NVRAM as main memory with DRAM serving as a transparent, directmapped hardware managed cache <ref type="bibr">[4]</ref>. App direct allows NVRAM to be used directly. In this mode, NVRAM can either be configured as an extra NUMA node to be used automatically by the OS, or mounted as a direct access (DAX) file system. In the latter case, files memory mapped from the NVRAM-based DAX file system are directly mapped into the program's address space with reads and writes sent directly to the NVRAM devices. Our CachedArrays based policies uses this last option.</p><p>To compare against 2LM, the overall memory footprint of the models used for benchmarking must greatly exceed the size of DRAM cache on a single socket which we accomplish though a combination of deep networks and large batchsize. For large traditional networks, we used ResNet 200 <ref type="bibr">[46]</ref> and DenseNet 264 <ref type="bibr">[47]</ref>, two networks with complex data flow. As an extra comparison point, we implemented VGG 416 <ref type="bibr">[9]</ref>, which is essentially a greatly extended VGG 16. A summary is provided in Table <ref type="table">III</ref> along with the approximate minimum memory footprint required for a single iteration of training. The batchsizes for the small networks were chosen such that the memory requirement for training was between 170 GB and 180 GB: small enough to fit entirely in DRAM.</p><p>All runs in 2LM were conducted with a maximum memory size of 1300 GB. When we run in app direct mode, CachedArrays is configured with the same hardware limits of 180 GB DRAM and 1300 GB NVRAM. After each training iteration (forward + backward pass), the GC was invoked to clean up all temporary memory, leaving only the model weights and computed gradients. The local heap was then defragmented before the next run to help keep behavior similar across iterations (defragmentation overhead is negligible compared to the iteration time). Each model was run for four iterations and performance metrics were checked to ensure that behavior for each iteration was consistent. Input data was randomly generated using a normal distribution.</p><p>For each experiment, we used hardware performance counters to capture read and write traffic to DRAM and NVRAM. For the 2LM based runs, the same hardware counters were used to also capture DRAM cache statistics including cache hits, clean cache misses, and dirty cache misses. All memory heaps used by CachedArrays are pre-allocated before running our experiments, ensuring that the OS assigns physical pages to all virtual pages within the heap. This in itself provides a large speedup over Julia's default allocator. For large allocations, normal allocators typically memory map new memory from the OS, which must be zero-initialized. Due to this overhead, we use 2LM with the CachedArrays allocator as the baseline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Small Networks -Sensitivity Analysis</head><p>Since we control data movement from software, we can vary the amount of DRAM available to CachedArrays to see how our simple policy holds up as we decrease the DRAM allowance. To that end, we use the small networks and batch sizes provided in Table <ref type="table">III</ref>. These are chosen such that the memory footprint required for training is between 170 and 180 GB and thus can fit within the DRAM of a single socket of our benchmark machine. We then vary the DRAM budget from the full 180 GB down to 0 GB (NVRAM only). As for the large benchmarks, we run each model for four iterations, defragmenting heaps between iterations and checking that variance between iterations was minimal. For this experiment, we use VGG 116 instead of VGG 416 in order to maintain a higher batch size. All of these runs were conducted in the CA: LM mode as this performs well across all networks.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. R ESULTS</head><p>Figure <ref type="figure">2</ref> shows the absolute runtime for a single iteration for the large CNN benchmarks. First, we explore the performance of 2LM: &#8709; (hardware DRAM cache with no optimizations) and 2LM: M (DRAM cache with memory optimizations). For CachedArrays, CA: L (supporting local-only allocation) is faster than CA: &#8709;, and applying memory optimizations further improves performance. Prefetching hurts performance for DenseNet and ResNet, but improves performance for VGG.</p><p>a) Why does semantic information improve performance of the hardware DRAM cache?: Figure <ref type="figure">2</ref> shows that adding the memory freeing optimizations improve the performance for 2LM as well as CachedArrays. To understand why, we investigate the physical addresses and cache behavior for ResNet 200. Figure <ref type="figure">3</ref> shows the occupancy of the memory heaps for a single iteration of ResNet under the two 2LM operating regimes. These runs only have a single memory heap that is implicitly managed by the hardware DRAM cache. Without any memory optimizations (2LM: &#8709;), memory usage &#8709; &#8709; 0 2 4 6 8 4.88 4.88 4.76 4.81 4.19 4.65 3.84 2.86 3.45 1.99 1.99 2.34 1.87 0.88 2.13 0.68 0.52 0.44 1.42 0.68 1.04 1.13 0.35 0.37 (a) DenseNet 264 &#8709; &#8709; 0 5 10 6.53 6.52 6.06 5.8 5.59 6.28 4.49 3.31 4.04 2.21 2.32 2.69 2.2 1 2.76 1.28 0.68 0.37 1.69 0.82 1.1 1.21 0.36 0.37 (b) ResNet 200 &#8709; &#8709; 0 5 10 15 20 11.3 11.3 8.45 8.44 8.93 10.87 5.83 4.65 5.41 3.85 3.85 4.2 1.99 0.8 4.71 3.17 1.95 0.36 1.53 0.56 1.05 1.06 0.34 0.34 (c) VGG 416 keeps increasing until the garbage collector is run (around time 220s) which causes the monotonically increasing behavior of that curve. Note that both PyTorch and Tensorflow must also run their own garbage collectors to free GPU memory. In contrast, when the annotated run (2LM: M) begins its backward pass, (around time 100s) it proactively frees memory produced on the forward pass. This results in more reuse of the physical pages backing that memory, leading to fewer cache misses and less data movement. Supporting this idea is Figure <ref type="figure">4</ref>, which shows the average DRAM cache hit, clean miss, and dirty miss rates for the two 2LM runs. The annotated run has an 18% higher hit rate and 50% lower dirty miss rate, both of which improve 2LM performance <ref type="bibr">[4]</ref>. By adding semantic information, we achieve better utilization of the data movement mechanism.</p><p>b) Why does CachedArrays outperform 2LM?: To understand the performance difference between 2LM and CachedArrays, we first focus on the total amount of memory moved for a single iteration of training. Figure <ref type="figure">5</ref> shows the total amount of DRAM and NVRAM traffic for a single iteration of training for two of our large networks, further broken into reads and</p><p>&#8709; &#8709; 0 0.1 0.2 0.3 0.063 0.047 0.085 0.06 0.036 0.025 0.179 0.254 0.222 0.192 0.273 0.299 (a) ResNet 200. &#8709; &#8709; 0 0.1 0.2 0.3 0.046 0.022 0.072 0.058 0.037 0.012 0.223 0.256 0.174 0.169 0.206 0.25 (b) VGG 416. Fig. 6: Average utilization of the DRAM bus. writes. With no optimizations, CachedArrays is slower than memory-optimized 2LM and in the case of VGG is even slower than unoptimized 2LM. For DenseNet and ResNet, CA: &#8709; generates similar read and write traffic to 2LM: &#8709;, though with generally fewer NVRAM writes. The saving of NVRAM writes occurs because even though memory optimizations are not applied, we still run the garbage collector after every iteration of training. In 2LM, this optimization does not help because physical addresses used on the backwards pass are dirty with respect to the DRAM cache.</p><p>Even though this still applies to VGG, CA: &#8709; is still slower than 2LM: &#8709;. This is where traffic shaping comes into play. Prior work shows that large sequential accesses provide the highest bandwidth for NVRAM <ref type="bibr">[6]</ref>, <ref type="bibr">[4]</ref>. In 2LM, NVRAM traffic is haphazard and results from conflict misses in the DRAM cache. With CachedArrays, NVRAM traffic is the result of explicit, well-shaped memory copies.</p><p>Figure <ref type="figure">6</ref> shows the average utilization of the memory bus for a single iteration of training for ResNet 200 and VGG 416. For ResNet, CA: &#8709; achieves a higher average utilization than 2LM: &#8709; while the situation is reversed for VGG. The memory movement engine in CachedArrays is highly multi-threaded, specifically targeting large memory sizes which works well for ResNet because the large batch size of 2048 results in large memory transfers. However, a much smaller batch size of 256 is used for VGG, leading to smaller data transfers and more parallelization overhead. However, bus utilization must be considered in the context of overall memory moved. As optimizations are applied via CachedArrays, bus utilization tends to increase and overall traffic generated tends to decrease, both resulting in better performance.</p><p>Impact of Local Allocation: As shown in Figure <ref type="figure">5</ref>, adding the local allocation optimization to CachedArrays significantly decreases the NVRAM read and DRAM write traffic due to the elision of the initial memory copy. The performance difference between these two is largely due to the decreased time spent synchronously moving data.</p><p>Impact of Memory Optimizations: Memory optimizations decrease memory pressure by freeing memory as soon as possible. This avoids many unnecessary writes to NVRAM, which can be seen in Figure <ref type="figure">5</ref>. In particular, observe the difference in NVRAM reads and writes in Figure <ref type="figure">5a</ref> between CA: L and CA: LM for DenseNet. For CA: L, the number of NVRAM writes exceeds the number of NVRAM reads, implying that unnecessary data is being moved to NVRAM. This is the result of intermediate allocations not being freed as soon as possible. When applying memory optimizations (CA: LM), the number of NVRAM writes for DenseNet drops from about 1100 GB to about 350 GB, with NVRAM reads exceeding NVRAM writes. The other networks experience similar decreases in NVRAM writes when applying memory optimizations. The local allocation and memory optimizations reduce the amount of NVRAM writes down to a bare-minimum.</p><p>Impact of Prefetching: Enabling data prefetching harms performance for DenseNet and ResNet. As can be seen in Figure <ref type="figure">5</ref>, prefetching decreases NVRAM read traffic and increases DRAM read traffic because arrays are moved from NVRAM to DRAM where there are referenced multiple times to compute the backwards pass. However, some operations are not sensitive to the bandwidth of their read-only arguments. Hence, this prefetch wastes time and potentially evicts other arrays. In the case of VGG, on the other hand, prefetching does slightly improve performance since it significantly decreases NVRAM read traffic by a factor of 5.4&#215;. Thus, there is no "one size fits all" approach to memory management.</p><p>In summary, full CachedArrays results in less DRAM and NVRAM traffic overall, because it is aware that data freed on the backwards pass is semantically dead, and thus does not need to be written back to NVRAM. Hardware caches do not have this semantic insight and thus must always act conservatively. Furthermore, Figure <ref type="figure">6</ref> shows the average DRAM bus utilization though out an iteration of training. CachedArrays maintains a higher average utilization while also moving less total data. CachedArrays both maintains the memory semantics required to elide unnecessary dirty writebacks and uses traffic shaping to achieve high bandwidth.</p><p>c) Sensitivity to DRAM capacity: The runtime for a single iteration of training for the small DenseNet is shown in Figure <ref type="figure">7</ref>. Running with only NVRAM results in a 3-4&#215; performance penalty (other models show similar results). However, with even a small amount DRAM, CachedArrays is able to get most of that performance back.</p><p>Figure <ref type="figure">7</ref> also shows what the performance would be if CachedArrays had perfectly asynchronous data movement (as opposed to purely synchronous) and could overlap movement with execution. Asynchronous data movement could be implemented with a separate thread pool or with an accelerator <ref type="bibr">[48]</ref>. For DenseNet and ResNet, this projected performance varies only slightly as the DRAM budget decreases. VGG, on the other hand, still experiences a slow-down due to more reads being generated to NVRAM. These results are consistent with the large network results where DenseNet and ResNet had lower performance with prefetching unlike VGG. The kernels composing VGG are more sensitive to read bandwidth. d) Why is a small amount of DRAM enough?: Contrary to the behavior of pure DRAM, DRAM to NVRAM copy bandwidth actually decreases with increasing parallelism <ref type="bibr">[6]</ref>, <ref type="bibr">[4]</ref>. Furthermore, the copy kernel in CachedArrays uses nontemporal stores to NVRAM, which are crucial for best performance. OneDNN kernels are not optimized for writing to NVRAM and thus the performance with all NVRAM is slow. When a small amount of DRAM is used, the output parameters of the computation kernels can be placed in DRAM and any movement of data from DRAM to NVRAM goes through code-paths optimized to get the best write performance out of NVRAM. An interesting area for future research would be to explore computation kernel implementations that specialize based on the memory location of its arguments (much like existing specialization via just-in-time compilation based on the dimensions of the arguments).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CachedArrays EXTENSIONS</head><p>In this section, we will discuss three directions for extensions of CachedArrays: (1) supporting sparse data structures and more complex deep learning workloads, (2) supporting other heterogeneous memory devices, and (3) supporting application frameworks beyond Julia. The approach taken by CachedArrays is not limited to static computation graphs or the well understood data use patterns of CNN workloads like many previous works <ref type="bibr">[9]</ref>, <ref type="bibr">[8]</ref>. The CachedArrays policy responds to runtime annotations, and can apply to applications exhibiting dynamic memory use such as Transformers, RNNs, and Mixtures of Experts <ref type="bibr">[49]</ref>. For example, Hildebrand et al. applied a dynamic policy to a DLRM workload <ref type="bibr">[50]</ref> and found that flexibility in the data movement policy is required to meet different memory access patterns as the locality of the data changes based on user input <ref type="bibr">[15]</ref>. Further, we could augment the policy with real-time kernel performance information, allowing the policy to explore and adapt its strategy.</p><p>Additionally, our framework is agnostic to the compute/interconnect framework surrounding the memory. Thus, CachedArrays applicable to other heterogeneous memory devices such as local/remote memory (e.g., CXL) and CPU/GPU memory. In this paper, we focused on a realistic platform available for experimentation, but the CachedArrays framework is not limited to DRAM+NVRAM systems. In fact, the flexibility enabled by separating the data movement policy from the data movement mechanism means that when migrating an application to a new heterogeneous memory platform, the userdefined policy does not have to be modified. The only change necessary is for the platform developer to provide the interface needed to implement the policy.</p><p>Finally, while we implemented an initial prototype of CachedArrays in Julia, the approach is not limited to Julia. Frameworks such as PyTorch <ref type="bibr">[40]</ref> and Tensorflow <ref type="bibr">[51]</ref> have very similar interfaces that can be exploited to implement CachedArrays. For instance, PyTorch has a torch.Tensor class that is very similar to Julia's Array type and it supports on-the-fly compiler transformations with torch.compile.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. CONCLUSIONS</head><p>In this paper, we presented CachedArrays, a framework for managing data movement in heterogeneous memory systems. CachedArrays is a software framework that allows the user to define a data movement policy that is applied at runtime. We implemented CachedArrays in Julia and evaluated it on a DRAM+NVRAM system and found significant performance improvements compared to the hardware-implemented cache. Futhermore, although not evaluated in this paper, CachedArrays can be extended to other heterogeneous memory systems, frameworks beyond Julia, and applications beyond CNNs.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>&#8226; Writes to NVRAM are slow and low bandwidth.&#8226; Reads to NVRAM are not much slower than DRAM.&#8226; DRAM bandwidth is high.&#8226; The capacity of NVRAM is large.&#8226; The capacity of DRAM is constrained. To understand why the policy must be aware of hardware characteristics, we can consider the policy's reaction to will_read versus will_write. Because the read bandwidth of NVRAM is relatively high while the write bandwidth is low, the policy may choose to take no action in response to will_read but move objects into DRAM if needed in response to will_write. If the properties of the underlying memory change, the policy may need to be modified.</p></note>
		</body>
		</text>
</TEI>
