<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>PEARL: Enabling Portable, Productive, and High-Performance Deep Reinforcement Learning using Heterogeneous Platforms</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>05/07/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10550223</idno>
					<idno type="doi">10.1145/3649153.3649193</idno>
					
					<author>Yuan Meng</author><author>Michael Kinsner</author><author>Deshanand Singh</author><author>Mahesh Iyer</author><author>Viktor Prasanna</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Deep Reinforcement Learning (DRL) is vital in various AI applications. DRL algorithms comprise diverse compute kernels, which may not be simultaneously optimized using a homogeneous architecture. However, even with available heterogeneous architectures, optimizing DRL performance remains a challenge due to the complexity of hardware and programming models employed in modern data centers. To address this, we introduce PEARL, a toolkit for composing parallel DRL systems on heterogeneous platforms consisting of general-purpose processors (CPUs) and accelerators (GPUs, FPGAs). Our innovations include: 1. A general training protocol agnostic of the underlying hardware, enabling portable implementations across various platforms. 2. Incorporation of DRL-specific optimizations on runtime scheduling and resource allocation, facilitating parallelized training and enhancing the overall system performance. 3. Automatic optimization of DRL task-to-device assignments through throughput estimation. 4. A high-level API for productive development using the toolkit. We showcase our toolkit through experimentation with two widely used DRL algorithms, DQN and DDPG, on two diverse heterogeneous platforms. The generated implementations outperform state-of-the-art libraries for CPU-GPU platforms by up to 2.2× throughput improvement, and achieve 2.4× higher performance portability across platforms.
CCS CONCEPTS • Computing methodologies → Parallel computing methodologies; Reinforcement learning.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Deep Reinforcement Learning (DRL) is extensively applied in various domains, including robotics, surveillance, etc. <ref type="bibr">[7,</ref><ref type="bibr">25]</ref>. Most DRL algorithms involve three collaborative compute kernels: policy execution, training, and dataset management. In policy execution, parallel Actors gather data through inference on the policy, interact with the environment, and deposit the data into a Prioritized Replay Buffer for dataset storage. In training, a centralized Learner samples data from the Prioritized Replay Buffer to update the policy model. The dataset management within the Prioritized Replay Buffer is facilitated by a sum tree data structure storing data priorities <ref type="bibr">[27]</ref>. DRL training is highly time-consuming. Because the distinct compute kernels in DRL may not be efficiently optimized using a homogeneous architecture, there has been a growing trend toward using heterogeneous architectures to accelerate DRL algorithms <ref type="bibr">[9,</ref><ref type="bibr">14,</ref><ref type="bibr">16]</ref>. However, even with access to heterogeneous resources, DRL application developers still face several challenges: (a) Sub-optimal performance: DRL's distinct components require careful placement and scheduling onto heterogeneous devices based on both computational and hardware characteristics. Sub-optimal placement and scheduling can lead to under-utilization of heterogeneous resources, resulting in missed opportunities for performance improvement. (b) Lack of portability across platforms: The optimal DRL primitive-to-hardware assignments can change based on varying algorithms and platforms. Consistently achieving high-performance implementations requires portable solutions that can map and distribute DRL onto various devices, but existing frameworks lack such flexibility. (c) 
Low development productivity: The growing diversity of heterogeneous resources in data centers <ref type="bibr">[1,</ref><ref type="bibr">24,</ref><ref type="bibr">26]</ref> has increased the need for hardware optimizations and bridging between different programming models. This significantly increases the required learning effort and programming time for application developers. In this work, we address the above challenges by proposing PEARL, a toolkit that enhances the performance, productivity, and portability <ref type="bibr">[20]</ref> of DRL system development on heterogeneous platforms. PEARL provides DRL application developers with tools and familiar</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">BACKGROUND 2.1 Deep Reinforcement Learning</head><p>A generalized DRL training process comprises four primitives: Actors, Learner, Replay Manager (RM), and Experience (Exp) Memory. These primitives work and interact as follows: Actors: Each Actor maintains a Deep Neural Network (DNN) policy network, inferring an action based on an input environment state. Each Actor operates on an instance of the environment simulator, applying the inferred action. The environment responds and generates a tuple {state, action, new state, reward}, constituting an experience (i.e., a data point) for training. Multiple copies of the Actor repeat this process to collect experiences, which populate a training dataset called the Replay Buffer. Replay Buffer: Unlike pre-labeled datasets in supervised learning, the Replay Buffer in DRL is continuously filled by online interactions of Actors with the environment, and its data points are dynamically changing as the policy evolves. In state-of-the-art DRL, the Prioritized Replay Buffer is widely used for managing data with probabilities proportional to the current policy loss to enhance training quality <ref type="bibr">[11,</ref><ref type="bibr">23]</ref>. It incorporates a Replay Manager (RM) associating a priority (i.e., probability of being sampled) with each experience in the Experience (Exp) Memory. During data sampling, a data point (i.e., experience) x_i is selected based on the probability distribution Pr(i) = P(i) / Σ_i P(i), i ∈ [0, replay buffer size), where P(i) represents the priority of data point i. 
This selection is achieved by identifying the minimum index i for which the prefix sum of priorities up to i is greater than or equal to x, where x is a uniformly generated random target prefix sum value between 0 and the total priority sum <ref type="bibr">[23]</ref>:</p><p>index = min { i : Σ_{j≤i} P(j) ≥ x }, i ∈ [0, replay buffer size)    (1)</p><p>To enable rapid sampling and scalable update operations for large Exp Memory, priorities are managed using a sum tree data structure <ref type="bibr">[23,</ref><ref type="bibr">28]</ref>. Replay sampling and replay update operations on an n-ary sum tree are defined in <ref type="bibr">[27]</ref>.</p><p>Learner: In each training iteration, a batch of indices is sampled via the RM to obtain experiences by reading from the Exp Memory. Then, the Learner performs training using stochastic gradient descent (SGD, <ref type="bibr">[22]</ref>) on the policy network. During the computation of the loss function in SGD, an updated priority is produced and written back to the Replay Buffer via the RM. Policy network parameters are updated and sent to the Actors to ensure that experience collection employs the latest policy.</p><p>DRL Workload Characterization: The characteristics of Deep Reinforcement Learning (DRL) primitives exhibit variations not only among themselves but also across different learning functions, policy models, hyperparameters, etc. Consequently, relying on a fixed architectural solution proves inadequate for optimizing hardware utilization and achieving high-throughput DRL across the diverse spectrum of algorithms and applications. As examples, in Figure <ref type="figure">1</ref>, we illustrate the throughput performance of key compute primitives (replay sampling, replay update, and learner) for two algorithms (DQN <ref type="bibr">[18]</ref>, DDPG <ref type="bibr">[15]</ref>) and policy models (MLP, CNN), on the roofline models for a CPU, GPU, and FPGA. 
In this example (Figure <ref type="figure">1</ref>: DRL Primitives Workload Analysis), primitives such as small MLP policies (commonly used in classical control and robotics benchmarks <ref type="bibr">[10]</ref>) and replay operations exhibit low arithmetic intensities and high-latency memory accesses, making them memory-bound and challenging to optimize on multi-core or data-parallel architectures (CPU and/or GPU). The performance of these primitives can benefit from a near-memory design using a spatial architecture (FPGA). Learner functions with higher arithmetic intensity and data reuse, such as CNN policies used in vision-based applications <ref type="bibr">[18]</ref>, justify the data-parallel resources provided by GPUs. Still, the characteristics of DRL performance may vary due to significant differences among replay and learner configurations based on applications, as well as diverse rooflines resulting from device bandwidth and compute capabilities.</p></div>
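The roofline reasoning above can be made concrete with a small numeric sketch. The helper below is our own illustration (the device numbers used in the test are hypothetical placeholders, not measurements from the paper): attainable performance is the minimum of peak compute and arithmetic intensity times memory bandwidth, and a primitive is memory-bound when its intensity falls left of the ridge point.

```python
# Hedged sketch of roofline classification for DRL primitives.
# All device numbers fed to these helpers are illustrative, not the paper's.
def attainable_gflops(ai, peak_gflops, mem_bw_gbs):
    """Roofline: attainable perf = min(peak compute, AI x memory bandwidth)."""
    return min(peak_gflops, ai * mem_bw_gbs)

def bound_type(ai, peak_gflops, mem_bw_gbs):
    """A kernel is memory-bound if its arithmetic intensity (FLOPs/byte)
    sits below the ridge point where the roofline flattens."""
    ridge = peak_gflops / mem_bw_gbs
    return "memory-bound" if ai < ridge else "compute-bound"
```

Replay operations and small MLP training land well below typical ridge points, which is why the paper steers them toward near-memory FPGA designs.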
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Target Platforms</head><p>Today's data centers comprise highly heterogeneous machines combining a variety of processors, accelerators, and memory <ref type="bibr">[1,</ref><ref type="bibr">3,</ref><ref type="bibr">4]</ref>. Based on the DRL workload characterization in Section 2.1, there is a compelling need for dynamic mapping of DRL algorithms onto such heterogeneous platforms to consistently achieve high performance. Our toolkit is motivated by this need, addressing the optimization challenges and emphasizing performance, portability, and productivity in design automation for DRL application users. PEARL is designed to adapt to a wide range of heterogeneous computing platforms with interconnected CPUs and accelerators like GPUs and FPGAs. Developing applications on such platforms typically demands expertise in designing hardware and bridging between different programming models, which imposes a learning curve that hinders the productivity of application developers. PEARL's strength lies in its ability to support high-performance DRL across diverse heterogeneous hardware, while abstracting away complex hardware details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Related Work</head><p>A number of works have implemented DRL on parallel and distributed systems. RLlib introduces high-level abstractions for distributed reinforcement learning, built on top of the Ray library <ref type="bibr">[14]</ref>. Other works, such as <ref type="bibr">[12,</ref><ref type="bibr">27]</ref>, implement parallel DRL algorithms by employing multiple parallel Actor threads and a centralized Learner thread, utilizing deep learning libraries like TensorFlow and LibTorch. These works leverage CPU and GPU data-parallel resources for training, but do not efficiently optimize memory-bound primitives (such as small-model training and replay operations) on specialized hardware. In recent years, some research works have focused on hardware acceleration for DRL algorithms. For instance, <ref type="bibr">[9]</ref> and <ref type="bibr">[16]</ref> present FPGA implementations for specific algorithms, Asynchronous Advantage Actor-Critic (A3C) and Proximal Policy Optimization (PPO). <ref type="bibr">[17,</ref><ref type="bibr">28]</ref> introduced an FPGA-based accelerator design for the Replay Buffer and mapped several DRL algorithms onto an FPGA-based heterogeneous platform. However, they only target a specific heterogeneous device setup and lack performance portability across different heterogeneous platforms. Moreover, these works map each primitive onto a single device; in the case of the Learner being the bottleneck, they lack the flexibility to improve its runtime performance using different devices. Our work bridges these gaps by developing a generalized protocol that makes DRL development portable to different heterogeneous platforms, accompanied by runtime heterogeneous resource management to fully saturate the heterogeneous compute power.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RUNTIME SYSTEM &amp; TRAINING PROTOCOL 3.1 System Design</head><p>The implementation generated by PEARL is based on a parallel DRL system managed by a Host Runtime Thread. Figure <ref type="figure">2</ref> shows the setup of such a system. Multiple Actor threads generate new data points (experiences) and periodically synchronize weights from the Learner. They send the experiences to the Host Runtime Thread through Data Collection Queues (DCQs). The Host Runtime Thread interacts with the RM through an RM Request Queue (RRQ), where  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">DRL Heterogeneous Training Protocol</head><p>To perform training on a given heterogeneous system, we propose a general DRL heterogeneous training protocol (Figure <ref type="figure">3</ref>). The training protocol can be ported to various heterogeneous devices since the interactions among processors and accelerators are defined at the application layer (i.e., DRL logical components), and are not bound to a specific type of accelerator. We show the essential data exchange and handshake signals between modular components as 1–8 in Figure <ref type="figure">3</ref>. We provide a runtime code template that manages the thread pools and the accelerators, allowing the "plug and play" of heterogeneous devices for DRL primitives. It is a Python program executed on the Host Runtime Thread, which utilizes a loop whose iterations follow this protocol.</p></div>
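One iteration of such a host runtime loop can be sketched as below. This is our own minimal illustration of the protocol's data flow, not PEARL's actual template: the `rm.handle`/`learner.train` interfaces, and the idea of passing request tuples through the RM Request Queue, are assumptions layered on the paper's DCQ/RRQ description.

```python
import queue

def host_runtime_step(dcqs, rrq, rm, learner, collection_buffer, batch_size):
    """One protocol iteration of the Host Runtime Thread (hedged sketch;
    rm/learner interfaces here are our assumptions, not PEARL's API)."""
    # 1. Drain experiences from the Actors' Data Collection Queues (DCQs).
    for dcq in dcqs:
        while not dcq.empty():
            collection_buffer.append(dcq.get_nowait())
    # 2. Request a sampled batch from the Replay Manager via the RM Request Queue.
    rrq.put(("sample", batch_size))
    indices, experiences = rm.handle(rrq.get_nowait())
    # 3. Train on the sampled batch; the Learner emits updated priorities.
    new_priorities = learner.train(experiences)
    # 4. Write the updated priorities back through the RM.
    rrq.put(("update", indices, new_priorities))
    rm.handle(rrq.get_nowait())
    return indices, new_priorities
```

Because the loop only exchanges application-level requests, any device (CPU pool, GPU, FPGA) can sit behind `rm` or `learner`, which is the portability point the protocol makes.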
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">Replay-Collision-Free Scheduling.</head><p>Our protocol features a novel scheduling optimization to encourage concurrency while maintaining algorithm correctness. We adopt a strategy of deferring the immediate insertion of experiences into the Replay Buffer when experiences are received from Actor threads. We maintain a data collection buffer to cache experiences generated by the Actors, and only insert them when the buffer is full. Upon experience insertion, we schedule the batched insertion operations after the sampling process concludes. This optimization has two advantages. Firstly, this approach permits us to compare the insertion index against the sampled indices, effectively mitigating the potential contamination of data when the Learner and Actors concurrently modify the same indices of the Replay memory. We refer to this procedure as "collision-free data collection", shown in Figure <ref type="figure">3</ref>. Secondly, by sequencing data insertion after the sampling phase, we align its execution concurrently with the training process. This hides the time overheads of the priority retrieval and update operations initiated by experience insertion in the training pipeline. </p></div>
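The deferred, collision-checked insertion can be sketched as follows. This is a simplified illustration under our own assumptions (a flat circular buffer, a `write_ptr` cursor, and strictly fewer sampled indices than buffer capacity); the paper's actual mechanism schedules batched insertions against the sampled index set in the same spirit.

```python
def collision_free_insert(replay, collection_buffer, sampled_indices,
                          capacity, write_ptr):
    """Batched insertion deferred until after sampling (hedged sketch).
    Each insertion index is checked against the just-sampled indices so the
    Actors never overwrite a slot the Learner is reading this iteration.
    Assumes len(sampled_indices) < capacity, so the skip loop terminates."""
    inserted = []
    for exp in collection_buffer:
        idx = write_ptr % capacity
        while idx in sampled_indices:   # skip slots in use by the Learner
            write_ptr += 1
            idx = write_ptr % capacity
        replay[idx] = exp
        inserted.append(idx)
        write_ptr += 1
    collection_buffer.clear()           # cache drained; Actors refill it
    return inserted, write_ptr
```

Running the insertion after sampling also lets it overlap with training, which is the second advantage described above.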
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Runtime System Optimizations</head><p>Algorithm 1 (Dynamic Heterogeneous Resource Allocation): 3: if sorted[0] == T_learner and sorted[0] &gt; 2 × T_actor then 4: freed = pool_actors.size()/2; pool_actors.size() −= freed; 5: activate pool_trainCPU; pool_trainCPU.size() += freed; 6: pool_trainCPU.submit(train(b++), sync-host()) 7: learner.submit(train(BS − b)) 8: else if sorted[0] == T_trainCPU then 9: pool_trainCPU.submit(train(b), sync-host()) 10: learner.submit(train(BS − b)) 11: else if sorted[0] == T_sync-host then 12: pool_trainCPU.size() −−; 13: pool_actors.size() ++; PEARL's runtime system design integrates a few optimizations that increase the effective utilization of heterogeneous resources and hide communication overheads. 3.3.1 Dynamic Heterogeneous Resource Allocation. To efficiently map DRL onto a heterogeneous platform, we first utilize the predicted result from our performance model (Section 5.2) to initially determine the mapping of each primitive onto a single accelerator at compile time. Even when optimally mapped to the most suitable accelerator, the Learner can remain the system's bottleneck. Meanwhile, if the Actors' data generation rate is significantly higher than the Learner's data consumption rate, the sample efficiency of DRL [6] can be negatively affected due to squandering of experience information. To further fine-tune Learner acceleration using heterogeneous hardware and avoid severe Actors-Learner load imbalance, we develop a mechanism that supports dynamic re-allocation of CPU threads to process sub-batch training. This mechanism is iteratively executed within the host runtime thread, as outlined in Algorithm 1. 
In instances where the amortized Learner latency dominates compared to the Actors by a large factor, it activates a pool of threads for training on the CPU (pool_trainCPU) and re-assigns Actor threads into CPU-training threads (which function as a parallel sub-module of the Learner). Accordingly, the runtime thread also performs gradient synchronization to aggregate the gradients from the learner accelerator and the CPU-training threads. This logic is only activated if CPU threads are involved in training, and the overhead from waiting for intermediate gradients is profiled and recorded in the variable storing host synchronization time T_sync-host. When the host gradient synchronization time emerges as a bottleneck, the number of CPU training threads is reduced to alleviate its overhead. This optimization strategy helps fully exploit the heterogeneity offered by both processors and accelerators, facilitating parallelized policy training and ensuring workload balance. 3.3.2 Communication Overhead Reduction. Our scheduling allows concurrent execution of the Actor threads (data collection) and the sampling → policy training (Learner) → experience update (RM) process.</p><p>We also overlap Learner computation with replay operations. This is achieved by host-device (or on-chip) streaming communication queues between the RM and the Learner, so that training using each data point starts asynchronously as soon as the Learner receives it (rather than waiting for the full batched sampling). Additionally, we use double buffering to alleviate the weight transfer overheads between the processor and the Learner accelerator. Two buffers, each the size of the complete policy weights, are allocated in the host memory (shared by the Actor threads and the runtime thread).</p><p>In each iteration i, the CPU threads read from buffer i%2 while the Learner writes into buffer 1 − i%2.</p></div>
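The double-buffering index arithmetic can be illustrated with a short sketch. This is our own single-threaded simulation of the scheme (the `produce_weights` callback stands in for the Learner's weight publication); the real system runs the reader and writer concurrently.

```python
def run_double_buffered(iters, produce_weights):
    """Double buffering of policy weights (hedged sketch). In iteration i the
    Actor-side readers consume buffer i%2 while the Learner publishes freshly
    updated weights into buffer 1-i%2, so reads never observe a torn write."""
    buffers = [produce_weights(-1), produce_weights(-1)]  # both preloaded
    reads = []
    for i in range(iters):
        reads.append(buffers[i % 2])             # Actors read stable weights
        buffers[1 - i % 2] = produce_weights(i)  # Learner writes other buffer
    return reads
```

Note the one-iteration staleness this introduces: in iteration i the Actors see the weights published in iteration i-1, which is the price paid for hiding the transfer latency.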
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">PARAMETERIZED LIBRARY OF PRIMITIVES 4.1 Replay Manager (RM)</head><p>The RM performs three replay operations on a sum tree, where leaf nodes store the priorities for all experiences, and a parent node stores the sum of the priorities of its children: (1) Priority sampling: Based on Equation <ref type="formula">1</ref>, sampled indices are obtained by traversing the tree, performing a prefix sum from root to leaf. The computations are explained in <ref type="bibr">[27]</ref>. (2) Priority retrieval: Given the indices of the experiences, it outputs the priorities stored at the corresponding leaf nodes.</p><p>(3) Priority update: the inputs are the indices of experiences and the changes to their priorities; it applies the changes to the priorities (and sums of priorities) stored in parent nodes at all tree levels. Note that insertion of priorities is realized with priority retrievals followed by priority updates.</p></div>
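The three operations can be sketched on a binary (fanout-2) instance of the sum tree; the paper's trees are n-ary, and the power-of-two capacity below is our simplifying assumption.

```python
class SumTree:
    """Binary sum tree over leaf priorities (hedged sketch of the RM's data
    structure): each parent stores the sum of its children, so prefix-sum
    sampling, retrieval, and updates all cost O(tree depth)."""
    def __init__(self, capacity):
        self.capacity = capacity              # number of leaves (power of two)
        self.nodes = [0.0] * (2 * capacity)   # nodes[1] is the root
    def update(self, idx, priority):          # priority update: fix ancestors
        i = self.capacity + idx
        delta = priority - self.nodes[i]
        while i >= 1:
            self.nodes[i] += delta
            i //= 2
    def retrieve(self, idx):                  # priority retrieval at a leaf
        return self.nodes[self.capacity + idx]
    def sample(self, x):                      # min index with prefix sum >= x
        i = 1
        while i < self.capacity:              # descend, subtracting left sums
            left = 2 * i
            if x <= self.nodes[left]:
                i = left
            else:
                x -= self.nodes[left]
                i = left + 1
        return i - self.capacity
```

Insertion of a priority is, as described above, a retrieval (to compute the delta) followed by an update along the same root-to-leaf path.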
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1">RM on CPU and GPU.</head><p>The computations in replay operations can be viewed as a sequence of operations traversing all levels of the sum tree from the root to a leaf. Our RM implementations on CPU and GPU are parameterized with the tree depth, fanout, BS, and W, where BS is the batch size of the replay operation requests, and W is the number of workers (degree of parallelism) allocated. Each worker is responsible for sampling or updating BS/W priorities. All workers share concurrent accesses to the sum tree. We use a mutex to ensure the correctness of parallel priority updates that potentially collide on the same node. For the FPGA, we develop an accelerator template (parameterized with the tree depth and fanout) that can be re-configured to support a range of fanouts and tree sizes. We adopt a design of multiple pipeline stages processing a stream of operation requests, as shown in Figure <ref type="figure">4</ref>. Each pipeline stage is a hardware module responsible for operating on a certain tree level and exclusively stores all the nodes on that level. Different replay operation requests in a batch are concurrently processed by different pipeline stages.</p><p>The requests fed into the accelerator have a unified operation code, as shown at the top of Figure <ref type="figure">4</ref>. The requests are decoded at each pipeline stage, and the corresponding operations are executed in an online manner. We apply a memoization technique in the updaters by using a dedicated register to store the sampled indices at each tree level, so that the replay update does not need to backtrace through the tree re-computing these indices.</p></div>
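The W-worker CPU scheme with mutex-guarded updates can be sketched as below. This is our own illustration (a flat-array binary sum tree and a single coarse lock); the actual implementation may use finer-grained locking, but the correctness argument is the same: colliding writes to a shared ancestor node must be serialized.

```python
import threading

def parallel_priority_update(tree_sums, updates, num_workers):
    """W workers apply a batch of priority deltas concurrently (hedged
    sketch). A mutex serializes writes along root-to-leaf paths, mirroring
    how the CPU RM guards updates that collide on the same sum-tree node.
    `tree_sums` is a flat binary sum tree; leaves start at len(tree_sums)//2."""
    lock = threading.Lock()
    chunk = (len(updates) + num_workers - 1) // num_workers  # BS/W per worker
    def worker(part):
        for idx, delta in part:
            with lock:                      # guard shared ancestor nodes
                i = len(tree_sums) // 2 + idx
                while i >= 1:
                    tree_sums[i] += delta
                    i //= 2
    threads = [threading.Thread(target=worker,
                                args=(updates[k * chunk:(k + 1) * chunk],))
               for k in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With a single lock the workers only overlap on the (cheap) batching logic; a per-level or per-subtree lock would recover more parallelism at the cost of deadlock-avoidance care.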
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Learner</head><p>The Learner takes a batch of experiences, and performs SGD constituting forward propagation (FP), loss function (LOSS), backward propagation (BP), and gradient aggregation (GA).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4.2.1</head><p>Learner on CPU and GPU. We use PyTorch <ref type="bibr">[2,</ref><ref type="bibr">19]</ref> to implement DNN training on CPUs and GPUs. On the GPU, PyTorch utilizes the cuDNN <ref type="bibr">[19]</ref> or Xe Matrix Extensions <ref type="bibr">[2]</ref> backend to exploit SIMD parallelism. We also support using multiple streams, with each stream independently processing the FP, LOSS, BP, and GA on a sub-batch of experiences. Compared to bulk processing a full batch of data, this helps overlap the data transfer and computation time between sub-batches of data. The GPU-based Learner code is parameterized to specify the number of streams. On FPGA, we design a Learner Module that supports both pipeline parallelism across different neural network layers and data parallelism among sub-batches of data. As an example, we show the design for an N-layer MLP in Figure <ref type="figure">5</ref>. Each pipeline stage uses buffers to store intermediate activations, and uses an array of multiplier-accumulator units to compute the matrix-vector multiplication for a given input. The number of multiplier-accumulator units allocated to each layer is controlled by a unique unroll factor U, which is tuned to ensure load balancing for best performance (Section 5.1). To realize data streaming between modules, they are connected by on-chip FIFO pipes. To support data parallelism, we make N_P copies of such pipelines. Each pipeline generates the gradients for a sub-batch of experiences, which are accumulated before being sent back to the host.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">SYSTEM COMPOSER</head><p>Given the user-specified Replay Manager (RM) and Learner metadata in the Optimizer Construction Program as inputs, the goals of the system composer are to (A) determine the best-performing accelerator configuration within each device for all the primitives, and (B) determine an optimal primitive-to-device assignment that maximizes system performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Accelerator Setup and Performance Estimation</head><p>To realize goal (A), we customize the parameterized accelerators described in Section 4 to suit the user-input RM and Learner specifications. Based on the customized accelerators, we obtain the expected latency of executing each primitive in one DRL iteration on each of the available devices, and store these latency numbers in a Primitive Device Assignment Matrix for further analysis of system performance in goal (B). The Primitive Device Assignment Matrix is an N×3 table. The N rows denote the N available devices; each column refers to either one primitive or a combination of both primitives to be assigned to one device. Each entry T_xy in the table denotes the latency of performing one iteration of a given primitive (or a combination of 2 primitives) x on device y (for the RM, the latency includes the times of the sampling, updates, and insertions). We explain how the table entries are populated based on accelerator setups as follows: Primitive Setup on a CPU/GPU: For the primitives that can be mapped to the CPU, i.e., RM and Actors, we allocate their numbers of threads initially based on the ratio of their single-iteration latencies for processing/producing one experience, in order to match their throughputs. Note that based on this setup, if the RM ends up being mapped to an accelerator that provides faster RM processing, the Actors will be initially set to occupy all available threads, and will be further dynamically adjusted based on the runtime data processing speeds of the Actors and Learner (Section 3.3.1). For the RM on a GPU, the degree of parallelism is set to BS. The sum tree is stored in the GPU global memory. For the Learner on a GPU, we search for the best-performing number of streams in the range [1, BS] by recording their per-SGD-step latencies. 
Accelerator Configuration on an FPGA: The RM and the Learner can both be mapped to the same FPGA device only if the total buffer size required by the RM and Learner modules is smaller than the total amount of SRAM resources. This is to avoid efficiency losses in accesses to off-chip memory. For the RM, the number of Autorun kernels in the pipeline is configured to match the tree depth, and the buffer sizes are configured based on their corresponding tree levels. For the Learner, the number of pipelines N_P is set to the largest value within resource capacity. The amount of compute resources allocated to each pipeline stage, U, is tuned such that all pipeline stages are load balanced (for maximal effective hardware utilization):</p><p>T_stage(1) = T_stage(2) = ... = T_stage(N)    (2)</p><p>We obtain the latency of accelerators on the FPGA through performance modeling:</p><p>T_RM = (BS + L_RM) × T_stage,RM + T_comm    (3)</p><p>T_Learner = (BS/N_P + L_Learner) × T_stage,Learner + T_comm    (4)</p><p>T_RM+Learner = T_RM + T_Learner    (5)</p><p>In equations 3-5, the pipeline latencies are calculated by multiplying the single pipeline stage latency by the batch size BS plus the pipeline fill/drain overhead L (L equals the sum tree depth in the RM and the N layer propagations in the Learner, respectively). T_comm refers to the communication time of taking inputs from the device i executing other primitives. These terms are filled in (Algorithm 2, Equation 6) depending on whether the communication is within the same device (e.g., through DDR) or across different devices (e.g., through PCIe). </p></div>
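The load-balancing condition of Equation 2 amounts to choosing unroll factors U_l roughly proportional to each layer's work. The greedy integer allocation below is our own sketch (work measured in hypothetical MAC counts; the real tuner also respects DSP/BRAM resource limits):

```python
def balance_unroll_factors(layer_macs, total_units):
    """Split `total_units` multiply-accumulate units across pipeline stages so
    per-stage latency (layer work / units) is balanced, in the spirit of
    Equation 2. Hedged greedy sketch under integer-unit assumptions."""
    total = sum(layer_macs)
    units = [max(1, (m * total_units) // total) for m in layer_macs]
    while sum(units) > total_units:     # rounding over-allocated: trim fastest
        candidates = [l for l in range(len(units)) if units[l] > 1]
        if not candidates:
            break
        i = max(candidates, key=lambda l: units[l] / layer_macs[l])
        units[i] -= 1
    while sum(units) < total_units:     # spare units go to the slowest stage
        i = max(range(len(units)), key=lambda l: layer_macs[l] / units[l])
        units[i] += 1
    return units
```

With balanced stages, the slowest stage's latency sets T_stage in Equations 3-4, so equalizing work-per-unit directly maximizes effective hardware utilization.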
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Heterogeneous System Composition</head><p>T_itr(i, j) = T_sample(i) + T_comm(i→j) + T_Learner(j) + T_comm(j→i) + T_update(i)    (6)</p><p>Algorithm 2 (Heterogeneous System Composition): 1: Input: Primitive Device Assignment Matrix 2: # Step 1: Compute Primitive Placement 3: for each candidate device pair (i, j) do 4: compute T_itr(i, j) per Equation 6 5: Output D[Learner], D[RM] = argmin_{i,j} T_itr(i, j) 6: # Step 2: Memory Component Placement 7: Initialize D[Exp Memory]; min_traffic = ∞ 8: C_Learner = BS × (E + 1); C_Actor = N_actor × E; C_RM = BS 9: for i in [Learner, Actors, RM] do 10: Total data traffic = Σ_{i′ ∈ {Learner, Actors, RM}} C_{i′} / bandwidth(D[i], D[i′]) 11: if Total data traffic &lt; min_traffic then 12: min_traffic = Total data traffic; D[Exp Memory] = D[i]; 13: Output D[Exp Memory]</p><p>Based on a completed Primitive Device Assignment Matrix, we develop a Heterogeneous System Composition Algorithm (Algorithm 2). It first determines the best device assignment of the primitives to maximize achievable compute throughput, then places the memory component (Exp Memory) to minimize the total data traffic. In Step 1 (lines 2-5, Algorithm 2), the training throughput can be estimated using the processed batch size in each iteration, BS, and the iteration execution time, T_itr. T_itr is defined in Equation <ref type="formula">6</ref>. The critical path in an iteration is the priority sampling followed by SGD training and priority update, while the other replay operations overlap with the training process. The required costs of communication with other compute modules are encapsulated in each component of Equation <ref type="formula">6</ref>, corresponding to the candidate devices i, j for the RM and Learner, where i, j are permuted to include all the device assignment choices. When i = j, the latencies are sampled from the third column of the Compute-Performance Table. The complexity of Step 1 is O(N²), given N available devices on the heterogeneous platform. 
In Step 2 (lines 7-13, Algorithm 2), we decide on the device assignment of the Exp Memory. The data traffic with respect to the Exp Memory during each iteration includes BS words of sampling indices from D[Learner], BS × E sampled experiences to D[Learner] (where E is the size of each experience for the given benchmark), and N_actor × E inserted experiences from the Actors. These communication costs are denoted as C in Algorithm 2. We place the Exp Memory on the device that minimizes the total data traffic based on the available bandwidths between devices (e.g., PCIe) and within each device (e.g., DDR). The complexity of Step 2 is O(1), as the number of primitives is constant.</p></div>
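The two-step composition can be sketched end to end. The container shapes below (`latency` keyed by primitive/device, a single aggregate `comm` term per device pair, `traffic` in words, `bandwidth` per device pair) are our own assumptions for illustration, not PEARL's actual API; the structure follows Algorithm 2's placement-then-memory order.

```python
from itertools import product

def compose_system(latency, devices, traffic, bandwidth):
    """Hedged sketch of Algorithm 2. Step 1 searches all O(N^2) device pairs
    for the RM and Learner to minimize the critical-path iteration time;
    Step 2 places Exp Memory where the bandwidth-weighted traffic is lowest."""
    assign = {}
    best = float("inf")
    # Step 1: Compute Primitive Placement (RM on device i, Learner on device j).
    for i, j in product(devices, repeat=2):
        t_itr = (latency[("RM", i)] + latency[("Learner", j)]
                 + (0 if i == j else latency[("comm", (i, j))]))
        if t_itr < best:
            best, assign["RM"], assign["Learner"] = t_itr, i, j
    assign["Actors"] = "CPU"          # Actors always run on host threads
    # Step 2: Memory Component Placement, minimizing total transfer time.
    min_traffic = float("inf")
    for prim in ("Learner", "Actors", "RM"):
        cost = sum(traffic[p] / bandwidth[(assign[prim], assign[p])]
                   for p in ("Learner", "Actors", "RM"))
        if cost < min_traffic:
            min_traffic, assign["ExpMemory"] = cost, assign[prim]
    return assign
```

Ties in Step 2 are broken by the first (lowest-traffic) candidate, matching the paper's note that equal-throughput assignments are resolved by minimal data traffic.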
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">EVALUATION 6.1 Experiment Setup</head><p>To show the portability of our toolkit to different platforms, we conduct our experiments on two heterogeneous platforms. The first platform has a Host CPU and an integrated GPU that shares the same die. The second platform consists of a Host CPU connected to a GPU and an FPGA, both through PCIe with 16 GB/s bandwidth. The specifications of these platforms are summarized in Table <ref type="table">1</ref>. For FPGA bitstream generation, we follow the oneAPI development flow <ref type="bibr">[13]</ref>. We select three widely-used RL benchmarking environments: CartPole, MountainCar, and Pong, in the OpenAI Gym software simulation environment <ref type="bibr">[5]</ref>. We demonstrate our toolkit using two representative DRL algorithms widely applied in various applications, DQN <ref type="bibr">[18]</ref> and DDPG <ref type="bibr">[15]</ref>. The algorithm, the sizes of the states and actions, and the policy model for solving each benchmark are shown in Table <ref type="table">2</ref>. We evaluate the training throughput as the number of Experiences processed Per Second (EPS = BS / T_itr, where T_itr is the execution time of one training iteration defined in Equation <ref type="formula">6</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Performance of Accelerated Primitives</head><p>Since EPS is bounded by the latencies of the primitives in each iteration, we first show the device assignment trade-offs for each primitive.</p><p>In Figure 6, we present the total execution latencies for batched Replay Manager (RM) operations. They are plotted across a range of commonly used training batch sizes (a significant DRL hyperparameter affecting DRL iteration time). For the PCIe-connected GPU and FPGA on the second platform, all the latencies of primitives in Figure <ref type="figure">6</ref> include the data transfer (PCIe) time. Note that the latencies for priority retrieval and update are combined, since these operations are typically performed together during the priority insertion and update processes. Our observations reveal superior scalability of the GPU- and FPGA-accelerated replay operations compared to the multi-threaded CPU implementation. The RM operations are memory-bound. While GPU data-parallel compute resources exhibit good scalability, they are underutilized due to high-latency global memory accesses that cannot be hidden by the computations. The FPGA accelerator processes the sum tree operations in a near-memory manner, storing the data structure on-chip, thus delivering the highest scalability. In Figure <ref type="figure">7</ref>, we show the Learner execution times for one gradient update iteration. Batched layer propagations exhibit a higher arithmetic intensity compared to replay operations. Consequently, the advantages of utilizing data-parallel architectures (GPUs) lead to consistently lower gradient update latency compared to the CPU. The FPGA accelerator design surpasses GPU performance when arithmetic intensity is low. This is particularly evident when dealing with smaller neural network sizes and batch sizes. As the batch size increases, the execution time of training primitives on the GPU begins to outperform that on the FPGA. 
This shift is due to memory overheads being hidden at larger batch computations and the higher clock frequency of the GPU.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">System Composition</head><p>Figure <ref type="figure">8</ref> presents the achieved throughput EPS for all device assignment choices, as well as the compositions returned by the PEARL toolkit, on both heterogeneous platforms. In all the subfigures, the color gradients in the grids are proportional to the magnitudes of the achieved throughput for the corresponding device assignment. The stars denote the optimal mappings returned by our System Composer. We observe that the choice of device for the primitive with the highest latency significantly influences variations in throughput. Specifically, for small-batch computations (i.e., grid plots with batch size 32), the color gradient changes most drastically along the horizontal axis, because replay operations cause significant overheads while Learner computations are small. On the other hand, for large-batch computations (i.e., grids with batch size 512), the color gradient changes most drastically along the vertical axis, as the Learner dominates each training iteration and replay operation overheads are hidden. Note that when multiple device assignment choices lead to the same throughput, our toolkit selects the one with the lowest total data traffic (e.g., Figure <ref type="figure">8b</ref>).</p></div>
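The selection rule above can be sketched as an exhaustive search over primitive-to-device assignments: estimate EPS from per-primitive latencies and break throughput ties by total data traffic. All names and the cost model below are our own illustrative assumptions (in particular, the sketch assumes primitives overlap so the slowest one bounds the iteration); this is not PEARL's actual API.

```python
from itertools import product

def compose(latency, traffic, devices, primitives, batch_size):
    """Return the primitive-to-device assignment with the highest
    estimated EPS, breaking ties by lowest total data traffic.

    latency[(primitive, device)] -- seconds per iteration for that primitive
    traffic[(primitive, device)] -- bytes moved per iteration
    """
    best_key, best_assign = None, None
    for assign in product(devices, repeat=len(primitives)):
        # Assume primitives overlap: the slowest one bounds the iteration.
        t_itr = max(latency[p, d] for p, d in zip(primitives, assign))
        est_eps = batch_size / t_itr
        total_traffic = sum(traffic[p, d] for p, d in zip(primitives, assign))
        key = (est_eps, -total_traffic)  # prefer high EPS, then low traffic
        if best_key is None or key > best_key:
            best_key, best_assign = key, assign
    return dict(zip(primitives, best_assign))
```

With two primitives and two devices this enumerates four candidates; when the Learner dominates, any assignment placing it on the fastest device ties in EPS, and the tie-break keeps the RM on the device with less data movement.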
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Comparison with Existing DRL Libraries</head><p>We compare PEARL-generated optimal implementations with two state-of-the-art DRL frameworks, RLlib <ref type="bibr">[14]</ref> and OpenAI Stable Baselines 3 (SB3) <ref type="bibr">[21]</ref>, on Server 2. The performance of RLlib and SB3 is obtained using the optimal settings required by each of them (i.e., using the GPU for training). The detailed performance across different benchmarks is shown in Table <ref type="table">3</ref>.</p><p>System Throughput. The additional flexibility of supporting FPGA accelerators, along with our runtime optimizations, enables PEARL to achieve up to 1.9×, 2.2×, and 1.4× improvements in EPS for the three benchmarks. Even using the same set of hardware (CPU-GPU), our novel scheduling and resource allocation leads to 21% to 55% higher EPS. We also evaluate the effect of our runtime dynamic heterogeneous resource allocation. In our experiments, the cases where CPU actor threads are re-allocated for collaborative training are labeled with * in Table <ref type="table">3</ref>. These are the scenarios where the Learner requires large-batched data or a large model for training, and this re-allocation leads to a 15% to 35% improvement in EPS. Another study focused on mapping DRL onto FPGA-based heterogeneous platforms <ref type="bibr">[28]</ref>, and evaluated using the CartPole benchmark. Due to the different hardware and optimal device assignments, EPS is not directly comparable. Nonetheless, we compare the effective heterogeneous resource utilization (achieved throughput given the peak throughput of all the processors and accelerators in the platform). For CartPole DQN batch-32 training, PEARL achieves 7.9K EPS using a CPU-FPGA with a total peak performance of 0.46 TFLOPS; <ref type="bibr">[28]</ref> achieved an amortized throughput of 7.1K EPS using a CPU-FPGA with 0.72 TFLOPS. 
Despite having 36% lower available device performance, our result shows an 11% higher EPS. Portability. To show the performance portability of our toolkit, we adopt the portability metric for a framework consistent with that described in <ref type="bibr">[20]</ref>: Φ(S) = |S| / (Σ_{i∈S} 1/EPS_i) if EPS_i > 0 for all i in S, and Φ(S) = 0 otherwise,</p><p>where S can be either D or P: D denotes the set of device assignment choices on a single heterogeneous platform; P denotes a set of heterogeneous platforms; EPS_i is the achieved EPS using the i-th device assignment choice or platform in the set S. If the implementation is not portable to the i-th device assignment choice or platform in the set, EPS_i = 0. The results are shown in the last two rows of Table <ref type="table">3</ref>. Φ(D) quantifies the ability to use the different heterogeneous resources provided by a single platform. Existing works that do not support accelerated RM or an FPGA-based Learner are not portable to these device assignments (∃i ∈ D, EPS_i = 0), and thus have Φ(D) = 0. In contrast, our work is portable to all assignment choices provided by Server 2. Our work enables utilizing the compute power of a wider range of heterogeneous devices, thus achieving better device portability and higher performance. Φ(P) quantifies the ability to achieve performance across different platforms (i.e., both Server 1 and Server 2), where EPS_i is the highest throughput achieved on the i-th platform. Our toolkit consistently achieves higher platform-throughput portability Φ(P) compared with existing works. Algorithm Convergence. We also examine the reward convergence across runs of each algorithm-benchmark pair. For all the algorithms and benchmark applications, we consistently observe faster convergence, meaning our implementation improves throughput without significantly sacrificing algorithm performance in terms of reward and convergence rate.</p></div>
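The portability metric amounts to a harmonic mean over the set, zeroed out if any member is unsupported. Below is a minimal sketch of that computation; the function name is ours, and the inputs are illustrative EPS values rather than measured results.

```python
def phi(eps_values):
    """Performance portability over a set S of device assignments (D) or
    platforms (P): |S| / sum(1/EPS_i), the harmonic mean of achieved EPS,
    or 0 if the implementation cannot run on some member (EPS_i = 0)."""
    if not eps_values or any(e == 0 for e in eps_values):
        return 0.0
    return len(eps_values) / sum(1.0 / e for e in eps_values)

# A framework that cannot target one device assignment scores zero:
print(phi([8000.0, 0.0]))      # 0.0
# Otherwise Phi is the harmonic mean of the achieved EPS values:
print(phi([4096.0, 4096.0]))   # 4096.0
```

The harmonic mean penalizes a framework that is fast on one platform but slow on another, which is why supporting more device assignments at reasonable speed raises Φ.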
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5">User Productivity</head><p>For a quick assessment of programmability, we enlisted 5 graduate students familiar with RL but lacking expertise in heterogeneous hardware, aligning with PEARL's target user community, to implement two algorithms using PEARL. We exclude the compilation time in Table <ref type="table">4</ref> (consistent with established practice <ref type="bibr">[8]</ref>), since it is an integral part of the oneAPI workflow and is not a step directly specified by PEARL users. In addition to illustrating the effort required for developing a specific algorithm, we also present the Code Divergence (CD) to demonstrate productivity differences between development on the two distinct platforms. CD between platforms i and j is computed as CD = 1 − |c_i ∩ c_j| / |c_i ∪ c_j| <ref type="bibr">[20]</ref>, where c_i represents the lines of user code for platform i. The CD value falls within the range [0,1]: a value of 0 indicates that a "single-source" code can be shared between both platforms, while a value of 1 implies that the user code is entirely different for the two platforms. In our case, CD is close to 0, as the only required changes when porting to different devices involve modifying the paths to input files. Overall, DRL application development through training in simulation is for tuning the best model and set of hyper-parameters before physical deployment. This requires repeated rounds of testing with different algorithms, hyper-parameters, and environment scenarios to ensure the reliability of the agent. In state-of-the-art data centers, it is unrealistic for application users to hand-tune each round of testing. With PEARL, developers write only dozens of lines of code to generate an accelerated DRL implementation within minutes, significantly reducing the development effort and leading to more robust AI agents with faster development cycles.</p></div>
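The code divergence computation above is a Jaccard distance over the sets of user-code lines; a minimal sketch follows, where the function name and the example code lines are made up for illustration.

```python
def code_divergence(code_i, code_j):
    """Code Divergence: the Jaccard distance between the sets of
    user-code lines for two platforms. 0 means a single source serves
    both platforms; 1 means entirely different user code."""
    a, b = set(code_i), set(code_j)
    union = a | b
    if not union:
        return 0.0  # no user code on either platform: treat as fully shared
    return 1.0 - len(a & b) / len(union)

# Two ports that differ only in an input-file path stay close to 0
# (the more shared lines, the smaller the divergence):
shared = ["env = CartPole()", "agent = DQN(env)", "agent.train()"]
print(code_divergence(shared + ["cfg = 'server1.json'"],
                      shared + ["cfg = 'server2.json'"]))
```

With realistic user programs of dozens of shared lines and a one-line path change, the printed value approaches 0, matching the near-zero CD reported above.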
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSION &amp; FUTURE WORK</head><p>We presented PEARL, a toolkit for productive development of performance-portable DRL on heterogeneous platforms. Future directions include scaling the primitives across heterogeneous nodes, and developing general-purpose tools based on intermediate graph representations for mapping custom-defined training algorithms onto heterogeneous hardware.</p></div></body>
		</text>
</TEI>
