<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Endurance-Aware Deep Neural Network Real-Time Scheduling on ReRAM Accelerators</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE International Conference on Computational Science and Computational Intelligence (CSCI)</publisher>
				<date>12/13/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10540856</idno>
					<idno type="doi">10.1109/CSCI62032.2023.00072</idno>
					
					<author>Shi Sha</author><author>Xiaokun Yang</author><author>Trent M Szczecinski</author><author>Daniel Whitman</author><author>Wujie Wen</author><author>Gang Quan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
<abstract><ab><![CDATA[Achieving accurate multi-modal Deep Neural Network (DNN) testing often requires operating rich model parameters under limited computing and memory resources. The low-cost Resistive Random-Access Memory (ReRAM)-based DNN accelerator is a promising solution for such applications thanks to its inherent processing-in-memory capability. However, its lifetime, which is crucial for applications with strict reliability standards such as self-driving cars, can be significantly limited due to: 1) the need to frequently switch weights among different models for real-time streaming applications; 2) the endurance of ReRAM devices being orders of magnitude lower than that of DRAMs. This work proposed an Endurance-Aware multi-modal DNN Scheduling (EAS) strategy to address this issue using real-time techniques. First, a pre-processing methodology transformed a DNN into an end-to-end execution sequence for partitioning and scheduling. Then, a periodic real-time scheduling method was developed via data reuse for extending ReRAM programming cycles under deadline constraints. The experimental results showed that our EAS approach can extend the baseline ReRAM accelerator's lifetime from 0.98 years to 3.14 years on average at a low scheduling overhead.]]></ab></abstract>
			<textClass><keywords><term>Resistive random-access memory (ReRAM)</term><term>deep neural network (DNN)</term><term>real-time</term><term>reliability</term><term>streaming applications</term></keywords></textClass>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Multi-Modal Deep Neural Network (DNN) testing with constrained response latency is indispensable for future intelligent systems <ref type="bibr">[1]</ref>. For example, autonomous driving needs the fusion of heterogeneous neural network architectures, including CNNs and RNNs for sensor-streamed video, audio, location tracking, and thermal video processing <ref type="bibr">[2]</ref>. Moving from computing to a data-centric paradigm, the miniature device parameters-such as cost, area, and energy consumptionimpose intra-/inter-task resource contention upon intrinsic processing demands, which could be even more challenging when sustaining hardware endurance upon complicated scheduling policies. Since aging-induced memristor defects downgrade inference accuracy significantly <ref type="bibr">[3]</ref>, this work focuses on prolonging the Resistive Random-Access Memory (ReRAM) lifetime in the context of distinguished DNN data flow properties using real-time scheduling techniques.</p><p>ReRAM-based Processing-In-Memory (PIM) accelerators achieve fast and energy-efficient DNN execution by inherent in-situ multiply-accumulate operations <ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, e.g. the ReRAM convolution computing efficiency can achieve &gt; 10 3 times over traditional ASIC designs <ref type="bibr">[2]</ref>. However, a ReRAM can only sustain a limited number of programming cycles &#8776; 10 8 <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref> as opposed to &gt; 10 16 for DRAM memory <ref type="bibr">[11]</ref>. For example, the ReRAM lifetime is 0.95 years for sequential DNN testing <ref type="bibr">[12]</ref> on an autonomous vehicle at 40 frames per second (FPS) sensing rate <ref type="bibr">[13]</ref> and running 2 hours per day. The need to simultaneously execute multimodal DNNs with targeted deadlines in such resource-limited PIM accelerators can further aggravate the lifetime issue since weights shall be frequently reprogrammed to execute different models.</p><p>Frequently programming ReRAM not only expedites memristors' aging effect but also extends response latency, increases local power density and temperature, and downgrades inference accuracy. Writing on ReRAM cells costs a typical latency of 1 write-verify cycle (&#8776;500ns) for 1-bit/cell <ref type="bibr">[14]</ref>, during which the high voltages and large currents are applied to establish the conductive filaments at targeting memristor resistances. Meanwhile, the large conductance and current result in soaring local power density that increases internal and ambient temperature <ref type="bibr">[15]</ref>. As memristors' resistivity is strongly temperature-sensitive, the thermal instability leads to inaccurate programming conductance. Consequently, conductance deviation from the pre-trained DNN can downgrade inference accuracy <ref type="bibr">[16]</ref>.</p><p>The ReRAM endurance can be leveraged by reducing the memory writes <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref>, adjusting programming modes <ref type="bibr">[16]</ref>, or managing runtime thermal status <ref type="bibr">[15]</ref>, <ref type="bibr">[9]</ref>. 
However, some solutions exhibited either accuracy degradation <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[20]</ref> or scalability limitations <ref type="bibr">[19]</ref>, <ref type="bibr">[9]</ref>. To satisfy real-time constraints, some works took advantage of the "intrinsic resilience" of DNNs to approximate results by judiciously reducing the data volume and the response latency <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>, e.g., formulating the DNN as a scalable computational model <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref> or adjusting the NN hyperparameters via compression <ref type="bibr">[26]</ref>, network pruning <ref type="bibr">[27]</ref>, precision scaling <ref type="bibr">[28]</ref>, etc. However, these solutions still suffered from accuracy degradation. We propose to exploit the slack time between a task's arrival and its deadline to reuse the mapped data on the ReRAM and test a maximal number of feature maps sharing the same kernels, without shrinking network architectures or incurring any accuracy loss. This work utilized real-time techniques to reschedule the DNN data flow for ReRAM endurance enhancement under deadline constraints, which is orthogonal to other state-of-the-art works. Our proposed methodologies can also be applied to other reprogrammable devices to mitigate the lifetime reduction induced by intensive reprogramming in the real-time domain. The contributions include:</p><p>&#8226; We designed a DNN pre-processing methodology that converts a DNN into an end-to-end execution sequence with bounded layer-wise computational and data storage demands for partitioning and scheduling. &#8226; We developed a real-time DNN inference scheduling framework on ReRAMs that reuses the mapped kernel weights to minimize memristor writes and enhance ReRAM endurance under time constraints. &#8226; Our experimental results showed that our approach can extend the baseline ReRAM lifetime by 3.2 times at high feasibility for different deadlines and workload intensities with low scheduling overhead. The rest of this paper is organized as follows. Section II presents the system architecture, task model, and problem formulation. Section III presents our approach. Section IV shows the experimental results, followed by the conclusion in Section V.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. PRELIMINARIES</head><p>This section presents the architecture and task models followed by the problem formulation. The bold characters represent the vectors and matrices, and non-bold characters are used for ordinary variables and coefficients. All of the matrices/vectors/values are in the real number domain R.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Architecture Overview</head><p>We adopt the 3D stacking ReRAM <ref type="bibr">[9]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[29]</ref>, <ref type="bibr">[8]</ref> as a full-fledged DNN accelerator R, as shown in Figure <ref type="figure">1</ref>. The ReRAM accelerator consists of N tiles as R = {T 1 , &#8226; &#8226; &#8226; , T N }, which can be independently controlled and configured. The data and signal are transmitted through vertical Through-Silicon-Via (TSV). Each tile, similar to <ref type="bibr">[30]</ref>, has a mesh of multiply-accumulator (MA) units, each of which consists of arrays of homogeneous crossbars (Xbar). Other components, such as shift-and-add, activation (&#963;), and pooling units are built on each tile and their execution times are added to the MA operations for bounding the layer-wise worst-case execution time (WCET). Each tile controls memory access independently and has a fixed size of eDRAMs [18] for storing transient results. Without losing generality, this work assumes that each tile cannot be shared by multiple applications, and all tiles run at a uniform frequency. DNN tasks use memristors to perform analog computing and their transient results are stored in eDRAMs. Let M and E represent the sizes of memristors and eDRAMs of a tile, respectively. The total sizes of memristors and eDRAMs on the ReRAM are N &#8226;M and N &#8226;E, respectively. Assume n k tiles are assigned to the k-th task &#915; k &#8712; &#915; &#915; &#915;, the allocated sizes of memristors and eDRAMs are n k &#8226; M and n k &#8226; E, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Task Model</head><p>Consider that a DNN inference task set &#915; &#915; &#915; contains &#952; independent tasks of different models as</p><p>contains s k real-time streaming instances sharing the same kernel set, and is constrained by an end-to-end deadline equal to the period of D. For &#915; k , each layer abstracts the characteristic elements by strikingly convolute a set of kernels with data from the previous layer (layer l -1) to generate feature maps of the next layer (layer l) <ref type="bibr">[31]</ref>, where the output feature map of layer l -1, denoted as o l-1 k , is temporarily stored in eDRAMs and serves as the input feature map of layer l. Let &#915; l k be the lth layer of &#915; k . Then, the j-th computational instance of &#915; l k can be denoted as <ref type="bibr">[32]</ref>. Next, an activation function, e.g. sigmoid &#963;(&#8226;), rectifies the convoluted instance as o l k,j = &#963;(y l k,j ), where o l k,j is the j-th instance's output neuron value of &#915; l k . The latter layer can be executed only after the completion of the previous one due to data dependency <ref type="bibr">[33]</ref>.</p><p>To quantify the inferencing resource demands, assuming that executing &#915; l k needs a size of m l k memristors to load weights and a size of e l k eDRAMs to store intermediate feature maps. Then, in the run time, if loading d consecutive layers' kernel weights to its partition at the same time, the total size of weights</p><p>k &#8712; J, need to be stored in the eDRAMs, the total size of o l k &#8712;J e l k , should not exceed n k &#8226; E. Since convolutional operations dominate the DNN computation <ref type="bibr">[34]</ref>, we let &#964; l k be the WCET of &#915; l k , including convolution, pooling, activation, and their communications through buses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Problem Formulation</head><p>DNN inference involves a combination of convolutional, normalization, pooling, and classification operations <ref type="bibr">[35]</ref>. Since the number of weight parameters is magnitudes larger than neurons on most DNNs <ref type="bibr">[32]</ref>, this work focuses on reusing the mapped weights to reduce the memristor writes and improve ReRAM endurance. When a ReRAM is large enough to load all the kernels in &#915; &#915; &#915;, i.e. N &#8226; M &#8805; &#952; k=1 Lk l=1 m l k , memristors only need to be programmed once. However, when multiple concurrent DNNs share one ReRAM under deadline constraints, it needs to update the weights frequently as network inference proceeds. Therefore, the short programming cycles, which are detrimental to hardware endurance, motivate us to develop innovative multi-modal DNN scheduling methodologies to maximally attain endurance under the targeted deadline. With the above analyses, the problem to be solved in this work is formulated as follows.</p><p>Problem 1: Given a real-time streaming DNN inference task set </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. OUR APPROACH</head><p>With the formulated problem, we present our ReRAM partitioning, DNN pre-processing, and real-time scheduling approaches for endurance enhancement. The main idea is to convert heterogeneous DNN layers into a sequence of homogeneous sub-layers to fit into partitioned resources and reuse the mapped kernel weights in consecutive layers to prolong the ReRAM endurance without violating deadline constraints.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. ReRAM Partitioning</head><p>For a multi-modal DNN task set, the number of partitioned tiles for each task is proportional to the workload intensity. Assuming a uniform sensing rate is applied to the multimodal DNN tests, e.g. in-sync video and audio for autonomous driving, increasing s k leads to a longer response latency, as explained later in Figure <ref type="figure">(4b</ref>), so more tiles should be allocated to &#915; k under deadline constraints as</p><p>in the run time. However, the sizes of parameters and feature maps for different layers vary significantly as shown in Figure <ref type="figure">2</ref>, which may exceed the allocated tiles' total capacity or lead to an unbalanced execution pipeline. To this end, we first develop a DNN pre-processing method to decompose "large" layers into a sequence of end-to-end sub-layers for fitting each DNN into partitioned resources. Then, an endurance-aware real-time scheduling approach is introduced. l l+1 Fig. 3. Layer decomposition: (a) Given DNN workload; (b) A layer before decomposition; (c) Split filters into multiple sub-layers; (d) Split the feature map inputs into multiple sub-layers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. DNN Pre-processing</head><p>When allocating a DNN to its partitioned tiles, the heterogeneity of the layer-wise parameters may exceed the allocated resources. In the meantime, different volumes of multiplyaccumulation operations cause varying layer-wise latency that can be detrimental to throughput performance due to the unbalanced execution pipeline. To resolve the layer heterogeneityinduced resource contention and unbalanced pipeline for a better response latency, we propose a pre-processing stage for bounding the sizes of parameters, the input feature maps, and the WCET for each layer. Then, each DNN inference task can be treated as a sequence of homogeneous layers for</p><p>... ... ... ... ... ... scheduling.</p><p>To put the DNN pre-processing into perspective, given an arbitrary network as Figure <ref type="figure">3</ref>(a), the kernel filters convolute through the input feature maps on layer l to output feature maps for layer l+1 as Figure <ref type="figure">3(b)</ref>. Assume layer l is pending to be decomposed, let m k be the layer-wise parameter size bound and f k be the feature map size bound for &#915; k . We first split kernels according to m k without reshaping the input feature maps as Figure <ref type="figure">3</ref>(c). Then, we split its input feature maps according to f k into multiple sub-layers as Figure <ref type="figure">3(d)</ref>. Note that splitting kernels or input feature maps also impacts the output feature map sizes. The post-processing DNN should contain all consecutive layers with bounded kernel and feature map sizes, which caps the total multiply-accumulation operations. Since the multiply-accumulation dominates and is proportional to the WCET, the post-processing layer-wise latency is also bounded as &#964; k . Overall, the pre-processing stage could dramatically enhance DNN's real-time predictability and schedulability.</p><p>To enhance the ReRAM endurance, we propose to reuse mapped kernel weights to test as many feature maps as possible under deadline constraints. In addition, the tasks sharing the same kernels are expected to be executed in the pipeline to minimize the response latency <ref type="bibr">[33]</ref>, <ref type="bibr">[36]</ref>. When the partitioned tiles cannot hold all kernels at once, we define reconfiguration as a renewal of kernel weights on the memristors as DNN inference proceeds. Then, each configuration can only hold the kernels of several consecutive layers, if not all. Therefore, an inference task usually needs multiple reconfigurations as shown in Figure <ref type="figure">4(a)</ref>. In addition, due to the continuity of DNN execution, once the inference starts, it must proceed until the completion of all layers in consecutive configurations. Otherwise, storing intermediate results consumes eDRAM capacity, and squeezes later configurations' data-preserving space.</p><p>Specifically, as shown in Figure <ref type="figure">4</ref>(b), the kernels that belong to one configuration can be reused by multiple inferencing instances, and partitioned eDRAMs are shared among them. For each inferencing instance, the previous layer's output feature map is stored in the eDRAM and immediately transferred to the next layer, except that the last output feature map of a configuration should be stored in the eDRAM longer to serve as the first input feature map for the following configuration. 
Since the feature map sizes are bounded by $f_k$ for task $\Gamma_k$, and the number of intermediate feature maps to be stored in the eDRAM is identical for all configurations, eDRAM overflow never occurs.</p></div>
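<p>The filter-splitting pass of Figure 3(c) can be sketched as below; the whole-filter split granularity is our simplifying assumption, and a second, analogous pass would tile the input feature maps against $f_k$ as in Figure 3(d).</p>
<code lang="C++"><![CDATA[
// Illustrative decomposition of one "large" layer along its output
// filters so that no sub-layer exceeds the weight bound m_k (Fig. 3(c)).
// The whole-filter granularity is a simplifying assumption.
#include <algorithm>
#include <cstdint>
#include <vector>

struct ConvLayer { int in_ch, out_ch, kh, kw, oh, ow; };

std::vector<ConvLayer> split_filters(const ConvLayer& l, std::int64_t m_bound) {
    // Memristor cells consumed by one output filter's weights.
    std::int64_t per_filter = (std::int64_t)l.in_ch * l.kh * l.kw;
    int step = (int)std::max<std::int64_t>(1, m_bound / per_filter);
    std::vector<ConvLayer> subs;
    for (int f = 0; f < l.out_ch; f += step) {
        ConvLayer s = l;                        // same inputs, fewer filters
        s.out_ch = std::min(step, l.out_ch - f);
        subs.push_back(s);                      // sub-layers execute end-to-end
    }
    return subs;
}
]]></code>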
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Endurance-Aware Real-Time Scheduling</head><p>For each task &#915; k &#8712; &#915; &#915; &#915;, we are ready to create its realtime schedule for maximizing the programming cycles (C k ) and enhancing the ReRAM endurance under the hardware and timing constraints. Let v k be the number of instances reusing the same set of kernels when executing &#915; k , where v k is a positive integer (v k &#8712; Z + ). The larger the v k , the more input feature maps can reuse the same mapped kernel weights, and, thus, ReRAM has longer programming cycles and better endurance. However, from Figure <ref type="figure">4</ref>(b), we can see that increasing v k leads to a longer inference latency and a higher eDRAM utilization in each configuration. Let v D k and v E k be the maximal v k values concerning the deadline and the assigned eDRAM size, respectively. Then, we have</p><p>To identify the maximal v D k , let d k be the execution pipeline depth. The value of d k is constrained by the largest number of layers that can be concurrently loaded onto the partitioned tiles in one configuration as</p><p>where</p><p>Solving Equation ( <ref type="formula">2</ref>), we have </p><p>The pre-processing procedures and endurance-aware realtime scheduling approaches are summarized in Algorithm 1.</p><p>In Algorithm 1, the larger the d k , the less r k and the higher the endurance. Thus, assuming v D k = s k in Equation ( <ref type="formula">2</ref>), we linearly search the maximum d k without exceeding the partitioned eDRAMs capacity or violating deadline D for determining m k and f k in line 6. With the bounded memristors and feature maps, Algorithm 1 lines 4-12 can be applied to decompose kernels and feature map inputs. Next, we can readily create endurance-aware real-time DNN inference schedules for each S k (v k , C k , d k , r k ) as shown in lines 13-21. The effectiveness of Algorithm 1 are formalized and proved in Lemma 3.1 and 3.2, assuming that the post-processing DNNs have their layer-wise size of weights and size of feature maps ideally equal to their corresponding bounds.</p><p>Lemma 3.1:</p><p>where is a very small value.</p><p>Proof When D &#8594; &#8734;, we have v D k &#8594; &#8734; from Equation (3). In addition, since s k &#8594; &#8734;, from Equation (1), we have v k =</p><p>TABLE I RERAM ACCELERATOR PARAMETERS [30] Component Spec Value ReRAM N T L 192 tiles MA number 12 per tile number 8 per MA Xbar size 128 &#215; 128 cells bit 2 per cell eDRAM num. of banks 2 per tile size 64 KB</p><p>With Lemma 3.1 and 3.2, we can infer that Algorithm 1 can maximally explore the temporal slack time and spatial eDRAM storage when reusing the mapped kernels on the ReRAM, either constrained by the given real-time deadline or by the partitioned eDRAM storage size, when s k is not a limiting factor.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EXPERIMENTAL RESULTS</head><p>In Section IV, we verified our design and compared their endurance and real-time performances with state-of-the-art techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Experimental Setup</head><p>We adopted the similar ReRAM accelerator settings as <ref type="bibr">[30]</ref> in Table <ref type="table">I</ref>. All DNN inferences employed 16-bit operation if not otherwise specified. The DNN inference task set was randomly composed of VGG16, AlexNet, GoogleNet, SqueezeNet, and ResNet, which had different kernel sizes, output shapes, and parameter volumes <ref type="bibr">[12]</ref>. The execution time of each layer was normalized to the system ticks. We assumed an optimized and negligible memristor writing overhead <ref type="bibr">[16]</ref>, <ref type="bibr">[31]</ref> compared with the longer inference latency in the experiment. The simulation was written in C++ and ran at Intel-i9 CPU at 2.40 GHz. We evaluated the effectiveness of our proposed approach by comparing the following methods.</p><p>&#8226; Baseline Approach (BL): The single-priority sequential execution as the default setting in the state-of-art DNN frameworks, such as Caffe, TensorFlow, and Torch <ref type="bibr">[12]</ref>. &#8226; Pipelined Approach (PA): The best-effort pipelined DNN execution to maximize the system throughput <ref type="bibr">[33]</ref>, <ref type="bibr">[36]</ref>. &#8226; Endurance-aware DNN scheduling Approach (EAS): Our proposed endurance-aware real-time multi-modal DNN scheduling in Algorithm 1. The experiment was conducted by setting the deadlines D from 30 to 240ms with an increment of 30ms. Let Ub(s k ) be the upper boundary of s k and serve as an index for the workload intensity given fixed DNN models. For each D setting, we increased Ub(s k ) from 2 to 24 at the step size of 2. Then, for each pair of D and Ub(s k ), the composition of the task set &#915; &#915; &#915;, in terms of &#915; k and s k , were generated for 1K random cases and took their average results. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Endurance and Feasibility Comparisons</head><p>We adopted the typical ReRAM endurance as 4.14 &#215; 10 8 writes <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref>. Assuming a ReRAM in a commercial autonomous vehicle runs 8 hours/day at 40 frames per second <ref type="bibr">[13]</ref> sensing rate, it can sustain 0.98 years using the BL approach. Although PA's pipelined execution pattern benefits its response time, it does not consider saving memristor writes for endurance attainment and has the same endurance as BL.</p><p>The proposed EAS's lifetime monotonically decreases as the total amount of workload increases. As shown in Figure <ref type="figure">5</ref>(a), when Ub(s k ) increased from 2 to 24, the average ReRAM lifetime reduced from 12.89 to 0.98 years in an average of 3.14 years. Overall, the proposed EAS can attain more than 3.2&#215; lifetime compared with the baseline.</p><p>Figure <ref type="figure">5</ref>(b) compares the feasibility of different approaches. The BL approach has the lowest feasibility because it executes DNNs in sequential patterns without pipelining or overlapping different tasks, which results in the longest response time. For example, as Ub(s k ) increased from 4 to 10, the feasibility of BL decreased from 62.4% to 9.7% and became totally infeasible when Ub(s k ) &#8805; 12. On the contrary, the PA approach pipelines different layers to achieve the best-effort throughput, so the feasibility of PA is the highest among these methods, which exceeded 87.5% for all the cases.</p><p>Our proposed EAS approach feasibility is better than BL, but lower than PA. The reason is that EAS adopts the pipelined execution pattern; however, due to endurance consideration, when reusing the mapped weights, the early finished layers must wait until the completion of other tasks that share the same kernel. Therefore, the EAS response time for each task becomes longer than PA. As shown in Figure <ref type="figure">5(b)</ref>, when Ub(s k ) increased from 4 to 22, the feasibility of EAS decreased from 99.6% to 10.1% with an average of 60.3%.</p><p>Figure <ref type="figure">5</ref>(c) evaluates the feasibility of our proposed EAS approach for different D and Ub(s k ). For each deadline, the feasibility decreases as the amount of workload increases. Further, as the deadline becomes larger, it could achieve better feasibility since more tasks can meet their deadlines. For example, when D increased from 30 to 90ms, the earliest all-infeasible Ub(s k ) extended from 12 to 22 and kept at 22 when D &#8805; 90ms, which indicated the feasibility improvement as the deadline was relaxed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. System Utilization</head><p>This section evaluates the system utilization on memristors and eDRAMs for the proposed EAS approach. The utilization of memristors is defined as the number of engaged memristors versus the total ReRAM memristors within one period. Similarly, the utilization of eDRAMs is defined as the size of eDRAMs storing intermediate results versus the system-wide eDRAM capacity in one period.</p><p>Figure <ref type="figure">6</ref> shows that both memristors' and eDRAMs' utilization increase as the total amount of workload become larger for all different deadline settings. For example, when D = 120ms and Ub(s k ) = 2, the memristor and eDRAM utilization was 38.1% and 22.4%, respectively. As Ub(s k ) increased to 20, the memristor and eDRAM utilization became 74.7% and 41.7%, respectively. The reason is that when there are fewer inferencing instances for a task, the completion time may be earlier than the deadline, so both memristors' and eDRAMs' utilization could be low because of idle time. However, as Ub(s k ) increased to the extent that the completion time of &#915; &#915; &#915; violated the deadline, the infeasible cases' utilization for memristors and eDRAMs was set as 0.</p><p>We also observed that for all different deadline settings, the memristor utilization was higher than the eDRAM utilization for feasible cases. The reason is that the number of weights in most DNNs is several magnitudes larger than neurons. Therefore, in Figure <ref type="figure">6</ref>, the maximum memristor utilization can reach 75.7% at D = 240ms, but the maximum eDRAM utilization can only achieve 47.2% at D = 150ms. The "headroom" of the memristor utilization was caused by the margin between the actual size of weights m l k and the bounded size of weights m k per layer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Execution Overhead</head><p>Lastly, we evaluated the execution overhead of our proposed EAS method. We found that the partitioning and scheduling consumed a negligible overhead, e.g. the average partitioning and scheduling overheads were 0.29&#956;s and 5.58&#956;s, respectively. A majority of the time was spent on decomposition at 0.45ms on average. As shown in Table <ref type="table">II</ref>, the CPU time increases as Ub(s k ) becomes larger. The reason is that the larger number of work items that can share the same kernel, the lower bounded layer-wise weights, and feature map sizes may apply, and, thus, layer decomposition is needed</p><p>TABLE II CPU TIME OF EAS FOR DIFFERENT WORKLOAD INTENSITIES. (IN ms) Ub(s k ) 4 8 12 16 20 24 Avg CPU time 0.23 0.34 0.39 0.43 0.61 0.71 0.45</p><p>to generate more sub-layers in pre-processing. Overall, our proposed EAS method is computationally efficient.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONCLUSION</head><p>ReRAM exhibits high computing and energy efficiency, but the low endurance limits its lifetime and reliability. When multi-modal DNNs share limited crossbars under constrained latency, the high programming density exacerbates the ReRAM lifetime. In this work, we developed a novel endurance-enhancement real-time multi-DNN scheduling approach, including resource partitioning, DNN pre-processing, and real-time scheduling. The proposed approach can maximally explore the temporal and spatial ReRAM resources for reusing the mapped kernels and extending programming cycles, which dramatically leverages ReRAM endurance without any accuracy loss. The experimental results showed that our proposed method can effectively and efficiently enhance the ReRAM lifetime by 3.2&#215; compared to the state-of-the-art techniques.</p></div></body>
		</text>
</TEI>
