<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Processing-in-Memory Accelerator for Dynamic Neural Network with Run-Time Tuning of Accuracy, Power and Latency</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>09/08/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10295349</idno>
					<idno type="doi">10.1109/SOCC49529.2020.9524770</idno>
					<title level='j'>2020 IEEE 33rd International System-on-Chip Conference (SOCC)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Li Yang</author><author>Zhezhi He</author><author>Shaahin Angizi</author><author>Deliang Fan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[With the wide deployment of powerful deep neural networks (DNNs) on smart but resource-limited IoT devices, many prior works have proposed compressing DNNs in a hardware-aware manner to reduce computing complexity while maintaining accuracy, for example through weight quantization, pruning, and convolution decomposition. However, typical DNN compression methods generate a smaller but fixed network structure from a relatively large background model for deployment on a resource-limited hardware accelerator. Such optimization lacks the ability to tune the structure on-the-fly to best fit dynamic hardware resource allocation and workloads. In this paper, we mainly review two of our prior works [1], [2] that address this issue, discussing how to construct a dynamic DNN structure through either uniform or non-uniform channel-selection-based sub-network sampling. The constructed dynamic DNN can tune its computing path to involve different numbers of channels, thus providing the ability to trade off speed, power and accuracy on-the-fly after model deployment. Correspondingly, an emerging Spin-Orbit Torque Magnetic Random-Access Memory (SOT-MRAM) based Processing-In-Memory (PIM) accelerator is also discussed for such a dynamic neural network structure.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Recently, Artificial Intelligence, especially the Deep Neural Network (DNN), has become the mainstream intelligent data processing technology due to its remarkable performance in various domains, such as image/video pattern recognition and natural language processing <ref type="bibr">[3]</ref>. However, to improve performance, DNNs have evolved into deeper, denser and wider architectures with rapidly growing network size and computing complexity, which makes it very challenging to deploy such a powerful technique on power-, size- and resource-limited computing hardware, such as IoT devices, embedded systems and mobile phones. To address this issue, many prior works have proposed compressing DNNs in a hardware-aware manner to reduce computing complexity while maintaining accuracy, for example through weight quantization <ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref>, pruning <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, and convolution decomposition <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>.</p><p>In general, such model compression techniques need to be optimized for each individual target computing platform, because each platform has unique computing resources.<note place="foot">This work is supported in part by the National Science Foundation under Grant No.2005209, No.2003749, No.1931871 and Semiconductor Research Corporation nCORE.</note> Moreover, for one specific computing hardware platform, DNN model optimization and compression generate a smaller but fixed network structure from a relatively large background model. However, the real world is dynamic: the computing hardware platform is dynamic (e.g., high-performance mode or low-battery mode, dynamic computing resource allocation), and so is the working environment (e.g., varying workloads from streaming input data due to dynamic sensing efforts or communication channels). 
Thus, there is an urgent need for a new dynamic DNN structure that can tune its structure on-the-fly to best fit dynamic hardware resource allocation and workloads.</p><p>To address the above challenge, in this paper we mainly review our two previous research works published in ASPDAC'20 and DAC'20 <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref> to summarize the construction of dynamic neural networks. <ref type="bibr">[1]</ref> samples multiple uniform sub-nets from a large background super-net by adjusting the channel-width ratio, which is fixed among all layers, and leverages knowledge distillation to improve the accuracy of each sub-net. Furthermore, <ref type="bibr">[2]</ref> presents a non-uniform sub-net selection method, inspired by the important observation that different DNN layers may have different sensitivities to capacity reduction. This observation has been reported in many prior works on model pruning <ref type="bibr">[10]</ref>- <ref type="bibr">[12]</ref> and Neural Architecture Search (NAS) <ref type="bibr">[13]</ref>- <ref type="bibr">[16]</ref>. In general, the overall flow of constructing a dynamic neural network can be divided into two successive stages: sub-nets generation and fused sub-nets training <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[17]</ref>. In the first stage, multiple sub-nets are sampled from a background model (a.k.a. super-net). Two different sub-net sampling methods are discussed, corresponding to uniform and non-uniform sub-nets respectively. 
Then, the sampled sub-nets are fused into a cross-entropy loss function for multi-objective optimization to construct a dynamic neural network.</p><p>Moreover, in terms of DNN hardware accelerator design, the traditional von Neumann computing architecture faces major challenges in such data- and compute-intensive applications, including long memory access latency, significant congestion at I/Os, limited memory bandwidth, high data communication energy, and large leakage power consumption for storing network parameters in volatile memory <ref type="bibr">[1]</ref>, <ref type="bibr">[18]</ref>- <ref type="bibr">[20]</ref>. To address these challenges, accelerating DNNs on a Processing-in-Memory (PIM) platform is now widely explored <ref type="bibr">[1]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref>. The key idea of using PIM to accelerate DNNs is to leverage logic-in-memory design to implement the required DNN computation directly within memory, without moving the large DNN model parameters off the memory chip, which greatly reduces computing energy and latency. From the device technology perspective, many recent works use emerging Non-Volatile Memories (NVMs) that are promising for realizing the PIM concept, for example Resistive Random-Access Memory (RRAM), Phase Change Memory (PCM), and Spin-Transfer Torque Magnetic Random-Access Memory (STT-MRAM). Compared with STT-MRAM, which is the leading non-volatile memory candidate, Spin-Orbit Torque Magnetic Random-Access Memory (SOT-MRAM) not only improves write/read stability but also requires a much smaller write current due to strong spin-orbit coupling <ref type="bibr">[23]</ref>. In this review paper, the SOT-MRAM based PIM accelerator is discussed for the uniform dynamic neural network <ref type="bibr">[1]</ref>.</p><p>Figure <ref type="figure">1</ref>: Overview of the two-stage dynamic neural network framework for both uniform and non-uniform sub-net sampling. In the first stage, N sub-nets of different sizes are sampled from the background model by a sub-nets generator. Note that the weights of these sub-nets are partially shared, since the sub-nets use the same weight channels. In the second stage, fused sub-nets training constructs the full-size dynamic model. In the end, these N sub-nets can each perform inference individually to achieve the trade-off among speed, power consumption and accuracy <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>. Labels recovered from the figure: Uniform Sub-nets Generation, Non-uniform, Sub-net1, Sub-netN, 0.25x, 0.5x, channel index, layer index.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. DYNAMIC NEURAL NETWORK</head><p>In this section, we first review the dynamic neural network framework, which leverages both uniform and non-uniform sampling methods, and then summarize the main experimental results. As illustrated in Fig. <ref type="figure">1</ref>, the overall flow of constructing a dynamic neural network consists of two stages: 1) Sub-nets generation: sub-nets of different sizes are sampled from the background model. Two sub-nets generation methods are discussed, for uniform and non-uniform sub-nets respectively; 2) Fused sub-nets training: the sub-nets sampled in stage 1 are assembled into a dynamic inference model, in which the overlapping weights are shared. To optimize the dynamic model, these sub-nets are fused into a cross-entropy loss with multi-objective optimization <ref type="bibr">[17]</ref>. By doing so, the dynamic model can switch between different sub-nets for a run-time trade-off between speed, power consumption and accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Problem definition</head><p>Suppose that we have a neural network <formula notation="TeX">f(x; W)</formula> with <formula notation="TeX">L</formula> layers, where <formula notation="TeX">x</formula> is the input data and <formula notation="TeX">W_l \in \mathbb{R}^{p \times q \times k_h \times k_w}</formula> is the weight tensor of layer <formula notation="TeX">l</formula>. To perform sub-net sampling, we also denote the binary weight mask of the i-th sub-net as <formula notation="TeX">M_{l,i} \in \{0, 1\}^{p \times q \times 1 \times 1}</formula>. Thus, the overall flow shown in Fig. <ref type="figure">1</ref> can be formulated as:</p><p>where <formula notation="TeX">N</formula> is the total number of sub-nets, and <formula notation="TeX">i \in \{1, ..., N\}</formula> is the sub-net index. Note that the weights are partially shared among all sub-nets. For uniform sub-nets generation, <formula notation="TeX">R</formula> represents the channel-width ratio, which is a fixed constant across all layers. For the non-uniform alternative, we have two hyper-parameters, <formula notation="TeX">\lambda_i</formula> and <formula notation="TeX">\beta_i</formula>, which are used to adjust model capacity during sub-net sampling. <formula notation="TeX">\mathcal{L}_i</formula> is the general loss function (i.e., cross-entropy) for sub-net i, and <formula notation="TeX">\mathcal{L}^*</formula> is the loss with a weight penalty term used to perform sampling. We discuss each stage in the two subsections below.</p></div>
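<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the mask notation in the problem definition, assuming NumPy, the snippet below builds binary masks of shape p &#215; q &#215; 1 &#215; 1 and applies them to one weight tensor by broadcasting. The helper name channel_mask, the choice of keeping the leading channels on each axis, and the toy layer shape are illustrative assumptions rather than details taken from [1] or [2].</p>
<eg xml:space="preserve"><![CDATA[
```python
import numpy as np


def channel_mask(p, q, ratio_p, ratio_q=1.0):
    """Binary mask of shape (p, q, 1, 1) that keeps the leading channels on each axis.

    ratio_p and ratio_q are hypothetical per-axis channel-width ratios.
    """
    keep_p = max(1, int(round(p * ratio_p)))
    keep_q = max(1, int(round(q * ratio_q)))
    mask = np.zeros((p, q, 1, 1), dtype=np.float32)
    mask[:keep_p, :keep_q] = 1.0
    return mask


rng = np.random.default_rng(0)
# Toy layer with p=8, q=4 and 3x3 kernels (shape chosen only for illustration).
W = rng.standard_normal((8, 4, 3, 3)).astype(np.float32)

# One mask per sub-net; the masks are nested, so the sub-nets share weights.
masks = {r: channel_mask(8, 4, r) for r in (0.25, 0.5, 1.0)}

# The (p, q, 1, 1) mask broadcasts over the kernel dimensions kh and kw.
sub_weights = {r: W * m for r, m in masks.items()}
```
]]></eg>
<p>Because every mask keeps a prefix of the same channels, the smaller sub-nets reuse the weights of the larger ones, which is one simple way to realize the partial weight sharing mentioned above.</p></div>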
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Sub-nets generation</head><p>1) Uniform sub-nets generation method: The channel-width ratio is used as a per-layer factor to select a sub-net. All layers share an identical, manually configured channel-width ratio (e.g., 0.25&#215;, 0.5&#215;). For example, the 0.5&#215; sub-net sequentially selects half of the output weight channels in every layer except the last one <ref type="bibr">[1]</ref>, <ref type="bibr">[17]</ref>.</p><p>2) Non-uniform sub-nets generation method: Unlike uniform sub-nets generation, which simply selects sub-nets by hand, the non-uniform dynamic neural network uses an optimized group-Lasso-based regularization method, named clipped group Lasso <ref type="bibr">[2]</ref>, <ref type="bibr">[7]</ref>, to learn the sub-net structures as described in Eq. 1.</p><p>Group-Lasso-based regularization is a general technique for structured pruning <ref type="bibr">[12]</ref>; it is a weight penalty term that acts uniformly on all the weights. In contrast, the clipped group Lasso only acts on the "important" weights (those with larger magnitude) by using an adaptive Weight Penalty Clipping (WPC) as the countermeasure <ref type="bibr">[2]</ref>, which is:</p><p>(2) where <formula notation="TeX">\lambda</formula> is a scaling factor that adjusts the weight penalty strength. By choosing different <formula notation="TeX">\lambda_i</formula>, multiple non-uniform sub-nets with different model sizes can be generated <ref type="bibr">[7]</ref>. Moreover, to prevent the weight penalty from acting unnecessarily on all weights, the adaptive threshold <formula notation="TeX">\tau_l</formula> is used to clip the penalty on weights with larger <formula notation="TeX">L_2</formula>-norm <formula notation="TeX">\|W_{l,i}\|_2</formula>. The clipping threshold <formula notation="TeX">\tau_l</formula> is calculated as the mean of <formula notation="TeX">\|W_{l,i}\|_2</formula> scaled by a hyperparameter <formula notation="TeX">\beta</formula>. Thus, during training, the weight penalty is applied only to the "important" weights, whose <formula notation="TeX">L_2</formula>-norm <formula notation="TeX">\|W_{l,i}\|_2</formula> is larger than the threshold <formula notation="TeX">\tau_l</formula>. 
Otherwise, the clipped group Lasso term is treated as a constant, which has no effect on the backward propagation of the corresponding weights.</p></div>
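<div xmlns="http://www.tei-c.org/ns/1.0"><p>The clipping behaviour described above can be sketched as follows, assuming NumPy. This is not a reproduction of Eq. 2 from [2]: it follows the prose literally, so a group whose L2-norm exceeds the threshold contributes its norm (and a gradient), while any other group contributes the constant threshold (and no gradient). Treating each slice along the first weight axis as one group, and the function name clipped_group_lasso, are assumptions made for illustration.</p>
<eg xml:space="preserve"><![CDATA[
```python
import numpy as np


def clipped_group_lasso(W, lam, beta):
    """Return (penalty, gradient, active) for one layer of shape (p, q, kh, kw).

    Assumption: each slice along the first axis forms one group.
    """
    groups = W.reshape(W.shape[0], -1)
    norms = np.linalg.norm(groups, axis=1)   # L2-norm of every group
    tau = beta * norms.mean()                # adaptive clipping threshold
    active = norms > tau                     # groups that receive the penalty
    # Inactive groups contribute the constant tau, so they get no gradient.
    penalty = lam * np.where(active, norms, tau).sum()
    grad = np.zeros_like(groups)
    grad[active] = lam * groups[active] / norms[active, None]
    return penalty, grad.reshape(W.shape), active


# Toy layer with four groups of norm 0.1, 0.2, 3.0 and 4.0.
W = np.zeros((4, 2, 1, 1))
W[:, 0, 0, 0] = [0.1, 0.2, 3.0, 4.0]
penalty, grad, active = clipped_group_lasso(W, lam=0.5, beta=1.0)
```
]]></eg>
<p>In this toy case the threshold is the mean norm, so only the two large groups are penalized. In such a sketch, a larger lam strengthens the penalty on the active groups, while beta moves the threshold that decides which groups are active.</p></div>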
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Fused sub-nets training</head><p>After multiple sub-nets have been sampled in stage one, fused sub-nets training fuses the sampled sub-nets into a cross-entropy loss function for multi-objective optimization (Eq. 1). Fig. <ref type="figure">1</ref> shows the training process. As when optimizing multiple individual networks, every sub-net goes through the forward and backward passes to compute its gradients. The difference is that these gradients are then gathered to update the weights once. Although the computational complexity increases compared with training a single network, the dynamic model framework reduces the developer cost of training these sub-nets individually. Furthermore, it provides a run-time trade-off among speed, power and accuracy for a trained inference model. Thus, with a dynamic neural network, users can switch the model to a suitable sub-net in real time according to the current hardware and environment requirements. In addition, to avoid the side effects of shared batch-norm layers, we follow <ref type="bibr">[17]</ref> and use independent batch-norm layers for each sub-net.</p><p>The fused training schemes differ slightly between the uniform and non-uniform counterparts, as shown in Fig. <ref type="figure">2</ref>. For uniform dynamic neural network training, knowledge distillation <ref type="bibr">[1]</ref>, <ref type="bibr">[24]</ref> is used to improve accuracy at the cost of additional memory and computation. The main idea is to train the sub-nets to mimic a pretrained larger model (the teacher net shown in Fig. <ref type="figure">2(a)</ref>). In practice, the sub-nets learn not only from the general cross-entropy loss with data labels, but also by distilling knowledge from the teacher network, whose output is treated as a soft label. 
On the other hand, non-uniform fused training simply computes a cross-entropy loss for all sampled sub-nets, as shown in Fig. <ref type="figure">2(b)</ref> <ref type="bibr">[2]</ref>.</p></div>
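<div xmlns="http://www.tei-c.org/ns/1.0"><p>The gather-then-update-once step described above can be sketched with a toy two-layer classifier, assuming NumPy. Each sub-net uses the first k hidden channels of the shared weights, its cross-entropy gradient is computed separately, and the summed gradient is applied in a single update. The function names, the toy data, the widths and the learning rate are all hypothetical; the independent batch-norm layers and the knowledge distillation term are omitted for brevity.</p>
<eg xml:space="preserve"><![CDATA[
```python
import numpy as np


def subnet_grads(W1, W2, x, y, k):
    """Cross-entropy loss and gradients of the sub-net using the first k hidden channels."""
    n = x.shape[0]
    hidden = np.maximum(x @ W1[:k].T, 0.0)               # ReLU, shape (n, k)
    logits = hidden @ W2[:, :k].T                        # shape (n, classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(n), y]).mean()
    dlogits = probs.copy()
    dlogits[np.arange(n), y] -= 1.0
    dlogits /= n
    gW1, gW2 = np.zeros_like(W1), np.zeros_like(W2)      # zero outside the sub-net
    gW2[:, :k] = dlogits.T @ hidden
    dhidden = (dlogits @ W2[:, :k]) * (hidden > 0)
    gW1[:k] = dhidden.T @ x
    return loss, gW1, gW2


def fused_step(W1, W2, x, y, widths, lr):
    """Gather the gradients of every sub-net, then update the shared weights once."""
    total, gW1, gW2 = 0.0, np.zeros_like(W1), np.zeros_like(W2)
    for k in widths:
        loss, g1, g2 = subnet_grads(W1, W2, x, y, k)
        total += loss
        gW1 += g1
        gW2 += g2
    W1 -= lr * gW1  # single in-place update of the shared weights
    W2 -= lr * gW2
    return total


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 5))
y = (x[:, 0] > 0).astype(int)           # toy binary labels
W1 = 0.1 * rng.standard_normal((8, 5))  # 8 hidden channels
W2 = 0.1 * rng.standard_normal((2, 8))
losses = [fused_step(W1, W2, x, y, widths=(2, 4, 8), lr=0.05) for _ in range(300)]
```
]]></eg>
<p>After training, any of the three widths can be evaluated with the same W1 and W2 by slicing, which mirrors how a trained dynamic model switches between sub-nets at run time.</p></div>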
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Dynamic neural network evaluation</head><p>We perform experiments for both uniform <ref type="bibr">[1]</ref> and non-uniform <ref type="bibr">[2]</ref> dynamic neural networks on the CIFAR-10 <ref type="bibr">[25]</ref> and ImageNet <ref type="bibr">[26]</ref> datasets using the ResNet <ref type="bibr">[27]</ref> architecture. Extensive experimental analysis and ablation studies have been conducted in our prior works <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>. Here we only show the important experimental results.</p><p>1) Uniform dynamic neural network: ResNet20 <ref type="bibr">[27]</ref> is tested on the CIFAR-10 dataset. It is compared with the Slimmable Neural Network (S-NN) <ref type="bibr">[17]</ref>, which also presents a channel-width-based uniform dynamic neural network but without knowledge distillation. Both methods use the same configurations and are compared under the same model capacity. Table <ref type="table">I</ref> shows the accuracy results. To deploy the dynamic model on our PIM platform later, instead of using the full-precision (FP) model, both weights and activations are further quantized to 8-bit and 16-bit with little accuracy degradation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table I</head><p>Columns: Network, Width, S-NN <ref type="bibr">[17]</ref>, Ours.</p><p>We further study the performance with ResNet50 on the ImageNet dataset. To validate the non-uniform sampling method, <ref type="bibr">[2]</ref> also compares against single networks of the same sizes that are trained independently. As depicted in Table <ref type="table">II</ref>, the best accuracy is achieved at each model size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. BIT-WISE PIM ACCELERATOR</head><p>In this section, we present a Processing-in-Memory (PIM) accelerator architecture for the uniform dynamic neural network <ref type="bibr">[1]</ref>. The presented platform, as depicted in Fig. <ref type="figure">3a</ref>, consists of image and kernel banks, computational memory sub-arrays, and a Digital Processing Unit (DPU). The DPU is essentially composed of three components: the Activation Function (Active.), the Quantizer (Quant.), and Batch Normalization (BN). A controller unit in every sub-array governs and synchronizes the platform to execute the various DNN layers. The presented design is inspired by our preliminary works in the PIM domain, mainly IMCE <ref type="bibr">[22]</ref>, CMP-PIM <ref type="bibr">[19]</ref>, and GraphS <ref type="bibr">[28]</ref>. Prior to the computation, the input feature maps (denoted by I) and kernels (W) must be stored in the Image and Kernel Banks, as shown in Fig. 3a. During the first computation stage (1), the kernels W can be instantly quantized by the DPU's Quantizer and then mapped into the computational sub-arrays. The PIM-enhanced sub-arrays are specially designed to take the computational load and process it in parallel. During stages (2) and (3), the sub-arrays, with add-on shifter/counter units, execute feature extraction. Eventually, the DPU activates the feature maps and generates the output (stage (4) in Fig. <ref type="figure">3a</ref>).</p><p>&#8226;Sub-array architecture. Fig. <ref type="figure">3b</ref> illustrates the processing-in-memory sub-array architecture implemented with SOT-MRAM. To realize dual-mode computation supporting both memory read/write and PIM operations, the sub-array is composed of a Memory Row Decoder (MRD), a Memory Column Decoder (MCD), a Write Driver (WD), and y Sense Amplifiers (SAs) (y &#8712; # of columns) that are configurable by the controller. 
The SOT-MRAM device is a composite of a Magnetic Tunnel Junction (MTJ) and a spin Hall metal (SHM). With the MTJ as the storage element, the parallel magnetization state of the two magnetic layers has the lower resistance and represents '0', while the anti-parallel state has the higher resistance and represents '1'. As shown in Fig. <ref type="figure">3b</ref>, a SOT-MRAM bit-cell is controlled by 5 signals, namely Write Word Line (WWL), Write Bit Line (WBL), Read Word Line (RWL), Read Bit Line (RBL), and Source Line (SL). The computational sub-array implements bulk bit-wise in-memory AND and addition operations between in-memory operands with 2-row activation and 3-row activation, respectively.</p><p>&#8226;Bit-line computing: In the 2-/3-row activation PIM mechanism, the MRD <ref type="bibr">[29]</ref> activates two/three bits stored in the same column. Then the SA can simultaneously sense the resistance of the connected SOT-MRAM cells on every bit-line. As shown in Fig. <ref type="figure">4</ref>, the configurable Sense Amplifier (CSA) consists of three sub-SAs and 5 reference resistors. The integrated references are enabled by the controller through enable bits (<formula notation="TeX">C_{M}</formula>, <formula notation="TeX">C_{OR3}</formula>, <formula notation="TeX">C_{MAJ}</formula>, <formula notation="TeX">C_{AND3}</formula>, <formula notation="TeX">C_{AND2}</formula>). The CSA is designed to implement the read operation and one-threshold logic operations such as (N)AND2/(N)OR2/(N)AND3/(N)OR3 by enabling one control bit at a time. In addition, the CSA can implement up to three logic operations at the same time using its sub-SAs. This scheme is used to produce 2-threshold logic operations such as XOR3/XNOR3. As shown in Fig. <ref type="figure">4</ref>, the CSA works as follows: by injecting a very small current <formula notation="TeX">I_{sense}</formula> through the selected cells in the memory, a sense voltage <formula notation="TeX">V_{sense}</formula> is generated on the RBL and fed to the first input of the 3 sub-SAs, which act as voltage comparators. At the same time, a reference current (<formula notation="TeX">I_{ref}</formula>) is injected into the reference resistance branches to generate <formula notation="TeX">V_{ref}</formula> at the second input of the sub-SAs. 
Depending on the enable-bit configuration, up to three references can be selected. The CSA's voltage comparison between <formula notation="TeX">V_{sense}</formula> and <formula notation="TeX">V_{ref}</formula> to realize (N)AND2/(N)OR2 is shown in Fig. 4. For example, by selecting <formula notation="TeX">R_{M}</formula> and <formula notation="TeX">R_{AND2}</formula>, the CSA executes the memory read and the 2-input AND function, respectively. We refer readers to Ref. <ref type="bibr">[1]</ref> for more information about reference tuning. In addition, the sub-array implements single-cycle add/sub operations leveraging the CSA. Comprehensive experiments on the process variation of the 2-/3-row activation mechanisms have been performed in our preliminary works <ref type="bibr">[22]</ref>, <ref type="bibr">[28]</ref>.</p></div>
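<div xmlns="http://www.tei-c.org/ns/1.0"><p>A purely functional sketch of the one-threshold and two-threshold logic described above is given below in plain Python. It abstracts the voltage comparison into counting how many activated cells store '1' and comparing that count with a threshold. Composing XOR3 from the OR3, MAJ and AND3 outputs, and using XOR3 as the sum bit with MAJ as the carry bit, is the standard full-adder identity and is assumed here rather than taken from the circuit in [1]. The loop in add_words models the arithmetic only, not the single-cycle timing of the hardware.</p>
<eg xml:space="preserve"><![CDATA[
```python
def sense(bits, threshold):
    """Threshold sensing: 1 if at least `threshold` of the activated cells store '1'."""
    return int(sum(bits) >= threshold)


def and2(a, b):
    return sense((a, b), 2)


def or3(a, b, c):
    return sense((a, b, c), 1)


def maj3(a, b, c):
    return sense((a, b, c), 2)


def and3(a, b, c):
    return sense((a, b, c), 3)


def xor3(a, b, c):
    # Two-threshold function: the count of ones is odd when it is exactly 1
    # (OR3 is set but MAJ is not) or exactly 3 (AND3 is set).
    return (or3(a, b, c) & (1 - maj3(a, b, c))) | and3(a, b, c)


def full_add(a, b, carry_in):
    """One-bit addition from a 3-row activation: (sum, carry) = (XOR3, MAJ)."""
    return xor3(a, b, carry_in), maj3(a, b, carry_in)


def add_words(x, y, width=8):
    """Add two unsigned words bit by bit; the result wraps modulo 2**width."""
    carry, out = 0, 0
    for i in range(width):
        s, carry = full_add((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out
```
]]></eg>
<p>For example, add_words(5, 9) evaluates to 14, and the complemented outputs such as NAND2 or XNOR3 follow by inverting the corresponding result.</p></div>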
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. HARDWARE EVALUATION RESULTS</head><p>In this section, we elaborate on the evaluation configuration, then study the performance trade-off between the hardware specifications (e.g., overhead, memory, energy) and accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Platform Setup</head><p>The PIM accelerator is configured with 256&#215;512 memory sub-arrays and 512Mb total capacity, organized in an H-tree routing fashion. We built a device-to-architecture evaluation framework starting from device modeling using the Non-Equilibrium Green's Function (NEGF) and the Landau-Lifshitz-Gilbert (LLG) equation with a spin Hall effect term <ref type="bibr">[22]</ref>, <ref type="bibr">[30]</ref>. Then we developed a Verilog-A model of the 2T1R SOT-MRAM cell and used it in Cadence Spectre to simulate it together with the other CMOS circuitry. To estimate circuit performance, we used the 45nm NCSU PDK library <ref type="bibr">[31]</ref>. Based on the circuit simulation results, we extensively modified NVSim <ref type="bibr">[32]</ref> by adding a dedicated PIM library at the architecture level. The simulator updates the configuration files (.cfg) w.r.t. the memory array organization and the model size. We then coded a behavioral simulator that calculates the latency and energy that the platform consumes to run the uniform dynamic neural network. Additionally, we considered a mapping optimization algorithm w.r.t. the resources to boost throughput. Here, we run the 8-bit quantized ResNet20 on the CIFAR-10 dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Energy Consumption</head><p>In this subsection, we analyze and report the energy consumption of the presented PIM platform, focusing on two intensive operations, i.e., the bit-wise AND and write-back. Fig. <ref type="figure">5a</ref> depicts the breakdown of the number of AND operations across the convolutional layers of ResNet20. The latency is likewise proportional to the computation load. Fig. <ref type="figure">5b</ref> shows the estimated energy consumption of ResNet20 for various model sizes, split into 4 components, namely AND-compute, AND-WB, bitcount-shift, and add operations. Note that we plot the energy consumption of the Bit-Counter and Shifter together in Fig. <ref type="figure">5b</ref>. The figure shows that the number of PIM operations, and hence the energy consumption, decreases as a result of the presented adaptive sub-array pruning technique. The technique essentially eliminates the energy overhead of the idle computational sub-arrays at the cost of accuracy. Based on our analysis, the energy consumption of the 0.25&#215; model is reduced by a factor of 2.7 compared with the baseline PIM platform.</p></div>
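<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the AND, bitcount, shift and add operation classes above concrete, the sketch below computes a dot product of unsigned integers through bit-plane decomposition in plain Python. This is the generic bit-wise formulation, assumed here for illustration; the exact data mapping, quantization scheme and sign handling of the platform in [1] are not reproduced, and the function names are hypothetical.</p>
<eg xml:space="preserve"><![CDATA[
```python
def bit_plane(values, m):
    """Pack bit m of every value into one integer, standing in for one memory row."""
    row = 0
    for idx, v in enumerate(values):
        row |= ((v >> m) & 1) << idx
    return row


def bitwise_dot(inputs, weights, bits=8):
    """Dot product of unsigned `bits`-bit integers via AND, bitcount, shift and add."""
    input_rows = [bit_plane(inputs, m) for m in range(bits)]
    weight_rows = [bit_plane(weights, n) for n in range(bits)]
    acc = 0
    for m, i_row in enumerate(input_rows):
        for n, w_row in enumerate(weight_rows):
            ones = bin(i_row & w_row).count("1")  # bulk AND, then bitcount
            acc += ones << (m + n)                # shift by bit significance, then add
    return acc


activations = [3, 200, 17, 255]  # toy 8-bit values
kernel = [5, 0, 99, 255]
result = bitwise_dot(activations, kernel)
```
]]></eg>
<p>The number of AND and bitcount steps in this formulation grows with the product of the two bit-widths and with the number of channels involved, which is consistent with the computation load shrinking for narrower sub-nets.</p></div>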
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Memory Storage Requirement</head><p>Fig. <ref type="figure">6</ref> reports the memory storage that the PIM platform requires to execute full-precision and quantized (8-bit) ResNet20 models for CIFAR-10 with various model sizes. We can see that adaptively tuning the channel width changes the memory storage requirement. In addition, Fig. <ref type="figure">6</ref> depicts the trade-off between accuracy and memory storage for the PIM design. In the figure, for example, the 8:0.50x configuration represents the 8-bit quantized ResNet20 with a model size of 0.50x.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Area Overhead</head><p>We consider 3 important hardware cost sources to estimate the area overhead of the design. (1) The CSAs' add-on transistors connected to every BL. (2) The enhanced multi-row activated MRD overhead; we modified every WL driver by inserting two extra transistors in the memory decoder buffer. (3) The overhead of the Ctrl units to handle the CSA's enable bits. We consider the area of the counter and shifter to be part of the Ctrl area. In general, the PIM platform incurs 7.9% overhead to the main memory die <ref type="bibr">[1]</ref>. Fig. <ref type="figure">7a</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONCLUSION</head><p>In this work, we explicitly review the method to construct a dynamic neural network, which mainly includes two steps: sub-nets generation and fused sub-nets training. Two different sub-nets generation methods are presented, which are used for the uniform and non-uniform dynamic neural networks respectively. Moreover, in terms of the hardware acceleration, we discuss a processing-in-MRAM platform for such neural network to accelerate dynamic inference. In the end, the performance trade-off between the hardware specification (e.g. memory, energy, overhead) and accuracy is elaborated.</p></div>
		</body>
		</text>
</TEI>
