<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Mobile or FPGA? A Comprehensive Evaluation on Energy Efficiency and a Unified Optimization Framework</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10357922</idno>
					<idno type="doi">10.1145/3528578</idno>
					<title level='j'>ACM Transactions on Embedded Computing Systems</title>
<idno>1539-9087</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Geng Yuan</author><author>Peiyan Dong</author><author>Mengshu Sun</author><author>Wei Niu</author><author>Zhengang Li</author><author>Yuxuan Cai</author><author>Yanyu Li</author><author>Jun Liu</author><author>Weiwen Jiang</author><author>Xue Lin</author><author>Bin Ren</author><author>Xulong Tang</author><author>Yanzhi Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Efficient deployment of Deep Neural Networks (DNNs) on edge devices (i.e., FPGAs and mobile platforms) is very challenging, especially given the recent growth in DNN model size and complexity. Model compression strategies, including weight quantization and pruning, are widely recognized as effective approaches to significantly reduce computation and memory intensities, and have been implemented in many DNNs on edge devices. However, most state-of-the-art works focus on ad-hoc optimizations, and a thorough study that comprehensively reveals the potential and constraints of different edge devices under different compression strategies is lacking. In this paper, we qualitatively and quantitatively compare the energy efficiency of FPGA-based and mobile-based DNN executions using mobile GPU and provide a detailed analysis. Based on the observations obtained from the analysis, we propose a unified optimization framework using block-based pruning to reduce the weight storage and accelerate the inference speed on mobile devices and FPGAs, achieving high hardware performance and energy-efficiency gain while maintaining accuracy.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The rapid development of DNNs in recent years has made them a core enabler for a broad spectrum of application areas, such as computer vision, natural language processing, medical engineering, autonomous driving, and virtual reality. Meanwhile, the deployment of these applications is increasingly shifting from traditional cloud computing platforms (e.g., servers and supercomputers) to edge devices (e.g., mobile and handheld platforms), for which power efficiency is one of the major constraints [1, <ref type="bibr">21,</ref><ref type="bibr">32,</ref><ref type="bibr">37,</ref><ref type="bibr">48,</ref><ref type="bibr">61,</ref><ref type="bibr">79]</ref>. Field Programmable Gate Arrays (FPGAs) and mobile devices, as the two most popular substrates in edge devices<ref type="foot">foot_0</ref>, have established their dominance by delivering promising energy/power efficiency and performance. On the one hand, the FPGA is known for its high data parallelism, and its dataflow-oriented design methodology makes deep and specialized optimizations possible for DNN acceleration. On the other hand, mobile devices are instruction-based computing devices with general-purpose mobile CPUs and GPUs that provide significant potential for parallel computation.</p><p>However, with the ever-increasing demand for DNN accuracy in different application scenarios, both the DNN model size and complexity have increased significantly. For instance, both the computation intensity and the memory intensity increase significantly when conducting 3D image segmentation in medical applications (e.g., MRI), compared to image classification on MNIST. 
Consequently, the computation and memory intensities have become a major obstacle to efficiently deploying DNN applications on edge devices with satisfactory energy efficiency and performance.</p><p>Substantial prior work aims to reduce the computation/memory intensity and improve the execution of DNNs on edge devices through model compression. Based on the high-level techniques they use, one can divide the prior efforts into two groups: quantization-based optimizations and pruning-based optimizations. In the former <ref type="bibr">[9,</ref><ref type="bibr">19,</ref><ref type="bibr">22,</ref><ref type="bibr">31,</ref><ref type="bibr">38,</ref><ref type="bibr">42,</ref><ref type="bibr">44]</ref>, high-precision floating-point values are mapped into low-precision values with fewer bits to reduce computation and memory consumption, whereas in the latter <ref type="bibr">[18,</ref><ref type="bibr">28,</ref><ref type="bibr">41,</ref><ref type="bibr">46,</ref><ref type="bibr">74,</ref><ref type="bibr">81]</ref>, the DNN model size is shrunk by removing redundant parameters. These two strategies are platform-independent and can potentially work collaboratively in some scenarios. In practice, however, the two platforms differ in how well they accommodate each strategy. First, FPGAs are good at handling calculations with linear-quantized (e.g., fixed-point) data as well as bit shift calculations with non-linear-quantized (e.g., power-of-2) data. As a result, a large block size is required for DNN implementations on FPGAs for maximum computation parallelism, whereas pruning introduces irregularity into the DNN models and is not compatible with large block sizes. Second, in mobile devices (especially mobile GPUs), fixed-point computation is not natively supported. Meanwhile, pruning can achieve a higher compression rate than quantization, which is especially important for bandwidth-bound DNNs.</p><p>In this paper, we ask several fundamental questions regarding accelerating DNN executions on FPGAs and mobile devices. 
First, given that the fundamental designs of FPGAs and mobile devices are different, how do these two platforms compare with each other when the optimizations are considered? Second, it is generally acknowledged that FPGAs have better energy efficiency than general-purpose mobile devices <ref type="bibr">[11,</ref><ref type="bibr">63]</ref>. Does this hold true for DNN executions as well? Third, the maximal potential of applying quantization and pruning on FPGAs and mobile devices is not clear. What are the maximum benefits one can obtain from applying these optimizations on both platforms?</p><p>To answer these questions, we conduct a thorough analysis of FPGA-based and mobile-based solutions under both baseline executions and optimized executions where state-of-the-art optimizations are applied. Based on our analysis, we qualitatively and quantitatively compare FPGAs and mobile devices. We observe that, for DNN executions, it is very difficult for FPGAs to beat mobile devices when considering both energy efficiency and model accuracy. In fact, even with state-of-the-art quantization applied to FPGAs, the energy efficiency of FPGAs is still lower than that of mobile devices without pruning being applied. To further explore the benefit of weight pruning, we propose a unified, hardware-friendly pruning strategy that can be effectively applied to both mobile devices and FPGAs, along with a compiler-assisted acceleration framework. For both mobile devices and FPGAs, the proposed pruning and acceleration framework can achieve high hardware performance gain while maintaining accuracy. We find that, under a similar accuracy, weight pruning provides higher benefits to mobile devices than to FPGAs, which further enlarges the energy efficiency advantage of mobile-based solutions over FPGA-based ones. This paper makes the following contributions:</p><p>&#8226; We conduct a thorough analysis of modern FPGA-based and mobile GPU-based DNN executions. 
The analysis is established through a detailed characterization of energy efficiency, performance, and accuracy for both FPGAs and mobile devices. &#8226; We qualitatively and quantitatively compare the FPGA-based and mobile-based DNN executions. Our comparison is conducted in a stepwise fashion where we start with the baseline execution, and then apply optimizations (i.e., quantization and pruning). &#8226; Based on our observation, we conclude that with the technology currently in wide usage, mobile GPU-based DNN executions are more energy-efficient than FPGA-based DNN executions, unless binary or ternary quantization is adopted on the FPGA at the cost of significant accuracy loss. &#8226; We further propose a block-based pruning scheme and a unified optimization framework to reduce the weight storage and accelerate the inference speed. &#8226; Different from the existing pruning schemes, our proposed pruning scheme is friendly to both mobile GPU-based and FPGA-based DNN executions, and the corresponding compiler optimizations are designed. &#8226; The experimental results demonstrate that our proposed framework outperforms the state-of-the-art mobile-based and FPGA-based DNN acceleration frameworks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">BACKGROUND 2.1 Weight Quantization</head><p>DNN quantization has become an effective compression technique for acceleration. It maps high-precision floating-point values into low-precision values. Linear quantization schemes with uniform quantization levels have been the focus of many works, including fixed-point <ref type="bibr">[9,</ref><ref type="bibr">10,</ref><ref type="bibr">19,</ref><ref type="bibr">22,</ref><ref type="bibr">38,</ref><ref type="bibr">84]</ref>, ternary (with values -1, 0, +1) <ref type="bibr">[31,</ref><ref type="bibr">40,</ref><ref type="bibr">85]</ref>, and binary (with values -1, +1) <ref type="bibr">[12,</ref><ref type="bibr">13,</ref><ref type="bibr">44,</ref><ref type="bibr">65]</ref>. Additionally, a non-linear logarithmic quantization scheme called power-of-2 (P2) has been applied to weight parameters in some studies <ref type="bibr">[39,</ref><ref type="bibr">42,</ref><ref type="bibr">54,</ref><ref type="bibr">78,</ref><ref type="bibr">83]</ref>. A variant of the P2 scheme is sum-of-power-of-2 (SP2), which uses the sum of two P2 numbers as a weight value for more precise data representation <ref type="bibr">[7]</ref>.</p><p>On FPGAs, weight quantization is a natural fit. Besides storage reduction, the additional benefits include (1) the DSPs on an FPGA can support multiple multiply-and-accumulate (MAC) computations with appropriate weight (and activation) quantization, and (2) the look-up table (LUT) computing resources can support low-precision computing <ref type="bibr">[24,</ref><ref type="bibr">25,</ref><ref type="bibr">35,</ref><ref type="bibr">36,</ref><ref type="bibr">47,</ref><ref type="bibr">48,</ref><ref type="bibr">55,</ref><ref type="bibr">58,</ref><ref type="bibr">71]</ref>. 
Low-bit-width fixed-point quantization is achieved in <ref type="bibr">[24]</ref> through a greedy solution that determines the radix position of each layer for quantization, and in <ref type="bibr">[71]</ref> with a hybrid quantization scheme that allows different bit-widths for weights to provide more flexibility. Binarized Neural Networks (BNNs) can be implemented with XNOR gates to execute multiplications <ref type="bibr">[25,</ref><ref type="bibr">55,</ref><ref type="bibr">58]</ref>, and replacing zero padding with odd-even padding makes it possible to realize a fully binarized neural network accelerator <ref type="bibr">[25]</ref>. The implementation with P2 quantization has been explored in <ref type="bibr">[48]</ref>.</p><p>On mobile devices, quantization is not commonly adopted in mobile-based DNN inference acceleration, mainly for two reasons: (1) mobile GPUs do not natively support quantization, and hence do not support low-bit or fixed-point computations [2]; (2) even though the mobile CPU can support quantization down to 8 bits, it is not a desirable computing device for DNN acceleration. This is because, first, the higher power of the mobile CPU (3W) compared to the mobile GPU (1.5W) makes it less energy-efficient, and second, the mobile CPU is usually occupied by the operating system and other background programs. Thus, quantization is less favored in mobile-based DNN acceleration than on FPGAs.</p></div>
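<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the quantization schemes named in this subsection concrete, the following minimal Python sketch maps a single floating-point weight to fixed-point, binary, ternary, and power-of-2 (P2) values. It is an illustration only and is not taken from any of the cited works: the function names, the clipping range w_max, the ternary threshold, and the P2 exponent range min_exp are all assumptions chosen for readability.</p>
<ab><![CDATA[
```python
import math


def quantize_fixed_point(w, bits=8, w_max=1.0):
    """Uniform (linear) quantization to signed levels in [-w_max, w_max]."""
    levels = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits
    step = w_max / levels                  # spacing between adjacent levels
    q = round(w / step)
    q = max(-levels, min(levels, q))       # clip to the representable range
    return q * step


def quantize_binary(w):
    """Binary quantization: every weight becomes -1 or +1."""
    return 1.0 if w >= 0 else -1.0


def quantize_ternary(w, threshold=0.05):
    """Ternary quantization: -1, 0, or +1 (threshold is an assumed value)."""
    if abs(w) < threshold:
        return 0.0
    return 1.0 if w > 0 else -1.0


def quantize_power_of_2(w, min_exp=-7):
    """P2 quantization: sign * 2**e with integer e in [min_exp, 0]."""
    if w == 0:
        return 0.0
    e = round(math.log2(abs(w)))           # nearest exponent in the log domain
    e = max(min_exp, min(0, e))
    return math.copysign(2.0 ** e, w)


if __name__ == "__main__":
    for w in (0.3, -0.5, 0.01):
        print(w, quantize_fixed_point(w), quantize_binary(w),
              quantize_ternary(w), quantize_power_of_2(w))
```
]]></ab>
<p>An SP2 weight, in the same spirit, would be the sum of two such P2 values; it is omitted here to keep the sketch short.</p></div>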
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Weight Pruning</head><p>Weight pruning for DNNs has been widely studied in recent years. It eliminates redundancies in the DNN model to reduce both the storage and computation consumption of DNN inference. Existing weight pruning research can be categorized into two major groups according to the pruning scheme: unstructured pruning <ref type="bibr">[14,</ref><ref type="bibr">26,</ref><ref type="bibr">27,</ref><ref type="bibr">53]</ref> and structured pruning <ref type="bibr">[18,</ref><ref type="bibr">28,</ref><ref type="bibr">30,</ref><ref type="bibr">41,</ref><ref type="bibr">46,</ref><ref type="bibr">49,</ref><ref type="bibr">72,</ref><ref type="bibr">74,</ref><ref type="bibr">81]</ref>.</p><p>Unstructured pruning, also called irregular pruning, removes weights at arbitrary locations as shown in Figure <ref type="figure">1</ref> (a). The early work <ref type="bibr">[27]</ref> focused on the pruning of fully connected (FC) layers by using a heuristic approach to prune the weights with the smallest magnitudes. The later work <ref type="bibr">[26]</ref> (dynamic network surgery) extended pruning to the convolutional (CONV) layers, which are the most compute-intensive. Extensions of iterative weight pruning, such as <ref type="bibr">[14]</ref> (NeST) and <ref type="bibr">[53]</ref>, use more delicate algorithms such as selective weight growing and pruning to achieve a higher compression rate. 
Though unstructured pruning can significantly decrease the number of weights in a DNN model, the resulting sparse matrix requires extra index storage and hinders parallel implementation, which results in limited improvement of hardware performance.</p><p>To overcome the limitation of unstructured, irregular weight pruning, much work <ref type="bibr">[18,</ref><ref type="bibr">28,</ref><ref type="bibr">30,</ref><ref type="bibr">41,</ref><ref type="bibr">45,</ref><ref type="bibr">46,</ref><ref type="bibr">49,</ref><ref type="bibr">72,</ref><ref type="bibr">74,</ref><ref type="bibr">81]</ref> has studied structured pruning at the level of filters and channels as shown in Figure <ref type="figure">1 (b)</ref>. The pruned model maintains a regular network structure that can be supported by prevalent DNN acceleration frameworks such as TVM and TensorFlow-Lite. Early work <ref type="bibr">[30,</ref><ref type="bibr">72]</ref> incorporates &#8467;1 or &#8467;2 regularization in the loss function to solve filter/channel pruning problems. The method in <ref type="bibr">[28]</ref> selects filters based on the &#8467;2 norm and updates the filters that have been previously pruned. <ref type="bibr">[41,</ref><ref type="bibr">81]</ref> incorporate the advanced optimization solution framework ADMM (Alternating Direction Method of Multipliers) to achieve a dynamic regularization penalty, thereby improving accuracy. <ref type="bibr">[29]</ref> proposes to adopt the Geometric Median, a classic robust estimator of centrality for data in Euclidean spaces. <ref type="bibr">[45]</ref> incorporates simulated annealing to automatically search the hyperparameters in weight pruning and accelerates the inference implementation as a further step.</p><p>On FPGAs, acceleration with pruning techniques has been investigated in several studies <ref type="bibr">[21,</ref><ref type="bibr">32,</ref><ref type="bibr">34,</ref><ref type="bibr">61,</ref><ref type="bibr">79]</ref>. 
Generally, FPGA implementations of DNNs adopt a large tiling size in the filter and channel dimensions for high computation parallelism, but the irregularity in DNNs introduced by pruning severely degrades the hardware utilization and prevents the FPGA implementations from achieving the maximum parallelism. Coarse-grained pruning, such as the filter pruning employed in <ref type="bibr">[21]</ref>, is more hardware-friendly than the unstructured pruning adopted in <ref type="bibr">[79]</ref>, but results in relatively high accuracy loss.</p><p>On mobile devices, state-of-the-art mobile acceleration frameworks, including TF-Lite [1], TVM <ref type="bibr">[8]</ref>, and Alibaba Mobile Neural Network (MNN) <ref type="bibr">[37]</ref>, have limited support for weight sparsity in DNNs. They inherently support only filter/channel pruning/sparsity and do not accommodate other, more versatile sparsity schemes. In fact, mobile CPUs and GPUs are instruction-based, general-purpose computing devices with flexible compiler-level software code generation. This provides a natural advantage in supporting weight pruning techniques through compiler-assisted software optimizations. Recently, pattern-based pruning has been proposed to prune the weights at an intra-kernel level by enforcing the locations of the remaining weights in a 3 &#215; 3 convolutional kernel to form a specific kernel pattern <ref type="bibr">[50,</ref><ref type="bibr">60]</ref>. In pattern-based pruning, each kernel pattern reserves 4 non-zero weights out of the original 3 &#215; 3 kernel to match the single-instruction multiple-data (SIMD) architecture of embedded CPU/GPU processors and maximize the hardware throughput. </p></div>
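<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three pruning styles discussed in this subsection can be contrasted with a small Python sketch operating on plain lists. This is a simplified illustration under stated assumptions, not the procedure of any cited work: weights are selected purely by magnitude (or by the filter's l2 norm) with no retraining, and the four entries of PATTERNS are a hypothetical pattern library, whereas real pattern sets are chosen by the respective methods.</p>
<ab><![CDATA[
```python
def prune_unstructured(weights, keep_ratio):
    """Magnitude pruning: zero the smallest-magnitude weights at any position."""
    n_keep = max(1, round(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)),
                    key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def prune_filters(filters, n_keep):
    """Structured pruning: remove whole filters with the smallest l2 norm."""
    norms = [sum(w * w for w in f) ** 0.5 for f in filters]
    ranked = sorted(range(len(filters)),
                    key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])         # preserve the original filter order
    return [filters[i] for i in kept]


# Hypothetical pattern library: each pattern keeps 4 positions of a
# flattened 3x3 kernel (indices 0..8, row-major).
PATTERNS = [(0, 1, 3, 4), (1, 2, 4, 5), (3, 4, 6, 7), (4, 5, 7, 8)]


def prune_pattern(kernel):
    """Pattern-based pruning: keep the pattern that preserves most magnitude."""
    best = max(PATTERNS, key=lambda p: sum(abs(kernel[i]) for i in p))
    return [w if i in best else 0.0 for i, w in enumerate(kernel)]


if __name__ == "__main__":
    print(prune_unstructured([0.1, -0.9, 0.05, 0.4], keep_ratio=0.5))
    print(prune_filters([[0.1, 0.1], [1.0, 1.0], [0.5, 0.5]], n_keep=2))
    print(prune_pattern([9, 9, 0, 9, 9, 0, 0, 0, 0]))
```
]]></ab>
<p>Note how the unstructured result keeps its original shape but needs per-weight indices to exploit the zeros, the filter-pruned result is simply a smaller dense layer, and the pattern-pruned kernel always has exactly 4 non-zero weights at one of a few known layouts.</p></div>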
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">COMPARISONS OF FPGA-BASED AND MOBILE-BASED SOLUTIONS 3.1 Methodology</head><p>To make a thorough analysis of modern FPGA-based and mobile-based DNN executions, we make a qualitative and quantitative comparison based on the state-of-the-art FPGA-based and mobile-based DNN solutions in terms of energy efficiency and accuracy.</p><p>As shown in Table <ref type="table">1</ref>, we collect representative FPGA-based DNN acceleration solutions from recent years for the image classification task on the ImageNet dataset <ref type="bibr">[15]</ref> and the object detection task on the Pascal VOC dataset <ref type="bibr">[20]</ref>, respectively. We also summarize the bit-width, frequency, number of DSPs, FPGA chip power (which is a better indicator than the whole board power), and FPGA platform used in each collected solution.</p><p>For the mobile-based solutions, we select three modern DNN acceleration frameworks: MNN <ref type="bibr">[37]</ref>, TVM <ref type="bibr">[8]</ref>, and TF-Lite <ref type="bibr">[1]</ref>. A Samsung Galaxy S20 smartphone is used as our mobile device, which integrates a Qualcomm Adreno 650 mobile GPU in the Qualcomm Snapdragon 865 processor using 7nm technology. Since its mobile GPU has lower power (1.5W) than the mobile CPU (3W), and to avoid the impact of resource contention caused by other programs running on the CPU during the measurement process, all the mobile-based results are measured on the mobile GPU using 16-bit floating-point representations. Note that some high-end mobile devices integrate an NPU, which generally leads to higher energy efficiency. In this paper, we mainly focus on DNN executions using the mobile GPU. The VGG16, ResNet18, MobileNet, and YOLO networks are used for the comparison. We mainly conduct comparisons of model accuracy and execution energy efficiency. 
For the former, we use the top-1 accuracy on the ImageNet dataset and mean Average Precision (mAP) on the Pascal VOC dataset, whereas for the latter, we use Giga-operations per second per watt (GOPS/W) as the metric for energy efficiency.</p></div>
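<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a small worked example of the GOPS/W metric, the sketch below derives it from an operation count, a measured latency, and a power figure. The function name and the workload numbers (30 billion operations per inference, 0.2 s latency) are hypothetical and are not measurements from this paper; only the 1.5 W mobile GPU power is taken from the text above.</p>
<ab><![CDATA[
```python
def energy_efficiency_gops_per_watt(ops_per_inference, latency_s, power_w):
    """Energy efficiency in GOPS/W: throughput (GOPS) divided by power (W)."""
    throughput_gops = ops_per_inference / latency_s / 1e9
    return throughput_gops / power_w


if __name__ == "__main__":
    # Hypothetical workload: 30e9 operations per inference finishing in 0.2 s
    # on a device drawing 1.5 W (the mobile GPU power quoted in the text).
    print(energy_efficiency_gops_per_watt(30e9, 0.2, 1.5))
```
]]></ab>
<p>Because the metric divides by power, a platform with lower raw throughput can still rank higher if it draws proportionally less power.</p></div>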
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Baseline Comparisons</head><p>First, in order to investigate the baseline performance (energy efficiency) of FPGAs and mobile devices, we select representative solutions from both the FPGA and mobile sides. Note that we exclude pruning techniques from this baseline comparison (hence the original, unpruned network is executed). 16-bit floating-point computation is utilized for the mobile baseline, as further quantization is not supported [2]. 32-bit or 16-bit fixed-point quantization is selected for the FPGA baseline, and lower-bit quantization (i.e., 8-bit quantization or beyond) is considered an extension that will be discussed in Section 3.3. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>[Figure 2: Top-1 Accuracy (%) versus Energy Efficiency (GOPS/W) on ImageNet. Recoverable labels: MobileNetV2 (70.9, 21.4); ResNet-18.]</p><p>Figure <ref type="figure">2</ref> shows the energy efficiency comparison of FPGA-based and mobile-based solutions in terms of GOPS/W. The star shapes in the figure represent mobile-based solutions, whereas the circle, square, and triangle shapes represent FPGA-based solutions using 32-bit, 16-bit, and 8-bit weight representations, respectively. Because model accuracy is considerably affected by the training recipe used (e.g., the hyperparameters, the augmentations, the training tricks), the accuracy can vary considerably even with the same model structure (e.g., VGG16 in Figure <ref type="figure">2</ref>). Therefore, we do not focus on the accuracy comparison. Some designs such as Zhang &amp; Li, 2017 <ref type="bibr">[80]</ref>, Ma et al., 2017 <ref type="bibr">[52]</ref>, and <ref type="bibr">Bai et al., 2018 [4]</ref> use high-end FPGAs (e.g., Virtex 690t and Arria-10 GX1150) that have more resources and computing power (as shown in Table <ref type="table">1</ref>), so they may not be considered edge devices. For MobileNetV2, an 8-bit FPGA-based solution is included for comparison since there is no valid FPGA solution with 16- or 32-bit weight representation. The red dashed line in the figure separates the results of the solutions based on FPGAs from those based on mobile devices. From Figure <ref type="figure">2</ref>, we make the following key observations. Observation 1. We observe that both the mobile-based and FPGA-based solutions achieve higher energy efficiency on VGG16 than on MobileNetV2. This is because VGG16 has a relatively regular network architecture with uniform 3 &#215; 3 kernels and requires more intensive computations for each layer, which can better leverage hardware parallelism. 
MobileNetV2 introduces skip-connections and downsampling layers with 1 &#215; 1 kernels, which decrease the regularity of the network structure and lower the energy efficiency in GOPS/W compared to VGG16.</p><p>Observation 2. From the mobile device perspective, three mobile-based solutions are compared (i.e., MNN, TVM, and TF-Lite). Since the same models are used, the accuracy is the same for all three solutions. MNN consistently achieves the highest energy efficiency, outperforming TF-Lite by 2.2&#215; and 1.7&#215; on VGG16 and MobileNetV2, respectively. All three frameworks can be used on different types of devices. The reason that MNN outperforms TVM and TF-Lite is that MNN applies more optimizations dedicated to mobile devices.</p><p>Observation 3. From the FPGA perspective, <ref type="bibr">[80]</ref> is considered the best baseline solution, with the highest energy efficiency and accuracy among the solutions on VGG16. This is because <ref type="bibr">[80]</ref> effectively leverages the large number of DSPs in the high-end FPGA and achieves a high operating frequency through efficient implementation. Note that we could not find an FPGA-based solution for MobileNetV2 that does not use quantization, so <ref type="bibr">[4]</ref>, which uses 8-bit quantization, is used for the comparison on MobileNetV2.</p><p>Observation 4. Comparing the FPGA-based solutions to mobile-based solutions with the same network model, one can observe that the energy efficiency of mobile-based solutions is 1.5&#215; &#8764; 3.4&#215; and 1.9&#215; &#8764; 3.2&#215; higher than that of the best FPGA-based baseline solutions on VGG16 and MobileNetV2, respectively. This contradicts the general belief that FPGAs have better energy efficiency than general-purpose mobile devices <ref type="bibr">[11,</ref><ref type="bibr">63]</ref>. One reason is that mobile devices tend to adopt the most advanced process technologies. 
For example, the Samsung Galaxy S20 smartphone adopts the Qualcomm Snapdragon 865 processor using 7nm technology, whereas high-end FPGAs such as the Arria-10 are still using 20nm technology. This gives mobile devices a natural advantage in terms of energy efficiency and operating frequency.</p><p>Takeaway. A network with higher structural regularity and more intensive computation is more likely to have higher energy efficiency in terms of GOPS/W on both FPGAs and mobile devices. While the effective use of DSPs can also improve FPGA execution energy efficiency, without the help of appropriate compression-based optimizations, FPGAs are not competitive with mobile devices in any of the DNN executions we have studied.</p><p>However, we cannot yet draw the final conclusion about the potential of FPGAs and mobile devices. From Figure <ref type="figure">2</ref> one can also observe that the FPGA solutions with 16-bit weights show a significant advantage over FPGA solutions with 32-bit weights. This indicates the great potential of weight (and activation) quantization in improving the energy efficiency of FPGA-based designs. We analyze the effect of quantization in detail in Section 3.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Optimizations on FPGA Solutions</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Effectiveness Analysis of Various Optimizations.</head><p>In this section, we analyze the performance of various FPGA designs and explore the potential of quantization and other optimizations for improving the energy efficiency of FPGA designs. Figure <ref type="figure">3</ref> compares the energy efficiency (GOPS/W) and accuracy of state-of-the-art FPGA-based solutions for DNNs on the ImageNet dataset. The performance of FPGA implementations can be affected by the following factors:</p><p>(1) The type of FPGA device adopted in the design. By comparing the two designs in <ref type="bibr">[70]</ref>, which adopt an identical design methodology but on different types of FPGA boards, it can be seen that the design on the XC7Z045 device achieves higher energy efficiency (38.9 GOPS/W) than the one on the XC7Z020 device (27.7 GOPS/W) given the same operating frequency, as the XC7Z045 has more available DSPs to accommodate more computations in each clock cycle while maintaining relatively low power consumption.</p><p>(2) Quantization with different bit-widths. Lower bit-width quantization generally leads to higher energy efficiency. Take the VGG16 designs shown in Figure <ref type="figure">3</ref> for instance: the 32-bit implementations attain energy efficiency from 9.4 to 15.7 GOPS/W, and for 16-bit this range is 13.6 to 47.7 GOPS/W. The design in <ref type="bibr">[23]</ref> is built on the XC7Z020 with fewer DSPs than other implementations, but 8-bit quantization helps this design achieve energy efficiency (41 GOPS/W) comparable with the 16-bit VGG16 designs on larger FPGAs. This can be explained by the DSP consumption for computations in different bit-widths. While one MAC operation for 32-bit floating-point data may utilize five DSPs <ref type="bibr">[33]</ref>, one MAC for 16-bit fixed-point data requires only one DSP. 
Quantization with lower bit-width will further enhance the performance by using one DSP for multiple operations, as discussed below.</p><p>(3) Utilizing one DSP for multiple multiplications. Low bit-width (e.g., 8-bit or even lower bit-width) quantization provides the opportunity for FPGA implementations to handle more than one multiplication in each DSP. For instance, the design in <ref type="bibr">[75]</ref> leverages 8-bit quantization and decomposes each Xilinx DSP48E1 25-bit &#215; 18-bit multiplier into two 8-bit &#215; 8-bit multipliers. This allows the MobileNetV3 implementation, despite its more complicated network architecture, to achieve higher energy efficiency (19.9 GOPS/W) than the MobileNetV2 design in <ref type="bibr">[4]</ref> (11.3 GOPS/W).</p><p>(4) Power-of-2 (P2) and sum-of-power-of-2 (SP2) quantization that utilizes look-up tables (LUTs) only. The P2 scheme replaces each multiplication with a bit shift operation consuming LUTs, which have higher energy efficiency and integration density than DSPs on an FPGA. SP2, proposed in <ref type="bibr">[7]</ref>, provides better data representation by adding two P2 results to replace one multiplication, and therefore results in higher accuracy than P2. P2 consumes fewer LUTs than SP2, thus attaining higher energy efficiency, but incurs more significant accuracy loss. As a result, P2 and SP2 achieve energy efficiency of 24.7 GOPS/W and 21.9 GOPS/W, respectively, comparable with the VGG16 designs.</p><p>(5) Mixed quantization scheme and mixed precision. The designs mentioned above are based on either fixed-point data only or the P2/SP2 scheme only. One shortcoming is that these designs consume only one type of computation resource in FPGAs (i.e., either LUTs or DSPs), and thus cannot fully utilize the resources of FPGAs. In <ref type="bibr">[7]</ref>, the mixed-scheme (MS) design combines the 4-bit fixed-point scheme and the 4-bit SP2 scheme to exploit both DSP and LUT resources on the FPGA fully and in a balanced manner. 
The ratio of fixed-point to SP2 quantization is determined according to the hardware resource analysis for a given FPGA. The mixed-scheme and mixed-precision (MSP) design further introduces 8-bit fixed-point quantization into MS. Among ResNet-18 implementations, the one based on MSP performs the best (29.5 GOPS/W), better than MS (23.3 GOPS/W), because MSP has a higher capability to maintain the accuracy and its first and last layers are quantized, while these layers are not quantized in the MS design. As for the MobileNet-V2 implementation based on MS, the energy efficiency reaches 22.1 GOPS/W.</p></div>
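<div xmlns="http://www.tei-c.org/ns/1.0"><p>Two of the factors above, sharing one DSP among multiple multiplications (item 3) and replacing a multiplication with a bit shift under P2 quantization (item 4), can be mimicked in software. The Python sketch below is a simplified illustration under stated assumptions: it handles unsigned 8-bit operands only and a shared second operand, whereas the actual hardware design in the cited work may treat signed values and operand sharing differently. Both function names are hypothetical.</p>
<ab><![CDATA[
```python
def packed_dual_multiply(a, b, c):
    """Compute a*c and b*c with a single wide multiplication.

    Assumes unsigned 8-bit operands. The packed word is 24 bits wide, so it
    fits a 25-bit multiplier input, and c fits an 18-bit input.
    """
    for v in (a, b, c):
        if not 0 <= v <= 255:
            raise ValueError("operands must be unsigned 8-bit values")
    packed = (a << 16) | b      # a in the upper field, b in the lower field
    product = packed * c        # one wide multiplication
    low = product & 0xFFFF      # b*c is at most 255*255 = 65025, below 2**16
    high = product >> 16        # a*c, untouched because low never overflows
    return high, low


def p2_multiply(activation, exponent, negative=False):
    """Multiply an integer activation by +/- 2**(-exponent) using a shift.

    The right shift truncates, mirroring fixed-point shift hardware.
    """
    shifted = activation >> exponent
    return -shifted if negative else shifted


if __name__ == "__main__":
    print(packed_dual_multiply(200, 17, 255))
    print(p2_multiply(64, 3), p2_multiply(64, 3, negative=True))
```
]]></ab>
<p>The packing works because the lower product can never carry into the upper field; with signed operands an additional correction step would be needed, which this sketch deliberately leaves out.</p></div>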
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2">Analysis of Normalized GOPS.</head><p>As discussed above, the energy efficiency is affected by the FPGA board adopted as the implementation platform. To eliminate the impact of platforms as much as possible and illustrate the efficacy of quantization with various optimization techniques, we calculate the "nominal" GOPS for each design by multiplying the number of used DSPs by the actual operating frequency, based on the following assumptions: (1) the DSP utilization rate is a key factor for the energy efficiency of the FPGA design; (2) each DSP handles one fixed-point multiplication in most cases. We then normalize the GOPS of each design so that the normalized GOPS is the ratio of the actual GOPS to the nominal GOPS. Figure <ref type="figure">4</ref> displays the normalized GOPS versus accuracy on FPGAs for DNNs on the ImageNet dataset. Ideally, if all DSPs are fully utilized, the normalized GOPS should be 1. A normalized GOPS less than 1 might result from two main reasons: (1) DSPs are required to handle DNN operations other than MACs, such as the pooling operations that exist in almost all DNN models; (2) the irregularity in some models (especially the MobileNet models) degrades the overall performance. Another interesting observation is that in some cases the normalized GOPS is larger than 1. This can be achieved through optimization techniques such as integrating multiple multiplications in one DSP, which is implemented in <ref type="bibr">[7,</ref><ref type="bibr">52]</ref>. Specifically, the work <ref type="bibr">[52]</ref> uses an Arria-10 FPGA, where each DSP can support two 18-bit &#215; 18-bit multiplications, and the DSP utilization rate in this design reaches 100%. In addition, the quantization schemes in <ref type="bibr">[7]</ref> all achieve normalized GOPS above 1 due to the low-bit-width precision, and the MSP scheme, with a normalized GOPS of 1.7, performs the best among all designs. 
Another technique for achieving high normalized GOPS is performing DNN operations in the frequency domain through the Fast Fourier Transform (FFT) <ref type="bibr">[76]</ref>.</p><p>From Figures <ref type="figure">3</ref> and <ref type="figure">4</ref>, we conclude that the potential optimization directions include (1) utilizing each DSP for multiple multiplications, and (2) quantizing with mixed scheme and mixed precision. However, the improvement resulting from these techniques is still not sufficient for FPGA implementations to compete with their mobile counterparts. Taking the ResNet-18 designs as an example, the best-optimized FPGA design is based on MSP and achieves an energy efficiency of 29.5 GOPS/W, whereas the best baseline mobile design, namely MNN, achieves 118.2 GOPS/W, outperforming the FPGA design by about 4&#215;.</p><p>Here, we raise the following questions: (1) Is this the best result that can be achieved through quantization techniques? (2) Is it possible for FPGA-based solutions to achieve better energy efficiency than mobile-based solutions with the technology currently in use? In the subsequent sections, we quantitatively answer these two questions.</p></div>
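<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the normalization used in this analysis, the following Python snippet computes the nominal and normalized GOPS as defined in the text (number of used DSPs multiplied by the operating frequency). The function names and the example design (900 DSPs at 200 MHz delivering 153 GOPS) are hypothetical and chosen only for illustration; the last lines recompute the ratio between the 118.2 GOPS/W and 29.5 GOPS/W figures quoted above.</p>
<p><![CDATA[
```python
def nominal_gops(num_dsps: int, freq_mhz: float) -> float:
    """Nominal GOPS: used DSPs times operating frequency.

    Follows the text's assumption that each DSP performs one
    fixed-point multiplication per cycle.
    """
    return num_dsps * freq_mhz * 1e6 / 1e9


def normalized_gops(actual_gops: float, num_dsps: int, freq_mhz: float) -> float:
    """Ratio of the measured (actual) GOPS to the nominal GOPS."""
    return actual_gops / nominal_gops(num_dsps, freq_mhz)


# Hypothetical design, for illustration only (not taken from any cited work).
example = normalized_gops(actual_gops=153.0, num_dsps=900, freq_mhz=200.0)
print(f"normalized GOPS = {example:.2f}")

# Energy-efficiency gap between MNN and the MSP-based FPGA design (values from the text).
gap = 118.2 / 29.5
print(f"mobile / FPGA energy-efficiency ratio = {gap:.2f}")
```
]]></p>
<p>A value above 1 in this sketch would indicate that, on average, more than one multiplication is executed per DSP per cycle, as in the DSP-packing designs discussed above.</p></div>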
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3">Analysis of Binary &amp; Ternary Quantization.</head><p>With a more aggressive quantization strategy, i.e., binary or ternary weight quantization, the original MAC operations between input data (activations) and weights can be performed as addition/subtraction operations, which only need LUTs instead of DSPs. Thus, the energy efficiency can be further boosted.</p><p>Since very few FPGA-based works adopt binary/ternary quantization on the ImageNet dataset, we compare the energy efficiency of FPGA-based implementations that use binary quantization with non-quantized mobile-based solutions on the Pascal VOC dataset for the object detection task, as shown in Figure <ref type="figure">5</ref>. Note that we also include other representative works that do not use binary/ternary quantization to make the comparison more comprehensive.</p><p>Observation 1. Figure <ref type="figure">5</ref> shows that, with binary quantization, <ref type="bibr">[57]</ref> outperforms the best mobile-based solution, MNN, in energy efficiency by 1.15&#215;, and <ref type="bibr">[59]</ref> achieves an energy efficiency of 102.6 GOPS/W, which is slightly lower than MNN. However, compared to the original mAP of YOLOv2, <ref type="bibr">[57]</ref> and <ref type="bibr">[59]</ref> incur mAP drops of 9.2% and 12.64%, respectively.</p><p>Observation 2. Besides the solutions with binary weight quantization, the design in <ref type="bibr">[17]</ref> also adopts several optimizations. It incorporates the block-circulant <ref type="bibr">[16]</ref> compression technique and replaces the original multiplication operations with Fast Fourier Transform (FFT)-based fast multiplications to reduce the weight storage and computation complexity. 
Moreover, higher energy efficiency can be achieved by mixing the 8-bit fixed-point quantization with P2 quantization to exploit both DSP and LUT resources, whereas the other result only adopts 8-bit fixed-point quantization. With the mixed quantization scheme, the energy efficiency is improved by 1.65&#215; compared to the 8-bit fixed-point quantization solution. This result more directly demonstrates the effectiveness of the mixed quantization scheme in improving the energy efficiency of FPGA-based designs. Although <ref type="bibr">[17]</ref> achieves comparable energy efficiency (i.e., 11% lower than MNN) under multiple optimizations, it sacrifices 3.3% model mAP (as reported in <ref type="bibr">[17]</ref>).</p><p>Takeaway. Based on our observations, we conclude that FPGA-based solutions can achieve higher energy efficiency than mobile-based solutions when adopting binary or ternary quantization, but at the cost of a significant accuracy loss. Otherwise, with the technology currently in wide use, mobile-based solutions are more energy-efficient than FPGA-based solutions.</p></div>
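<div xmlns="http://www.tei-c.org/ns/1.0"><p>To illustrate why binary or ternary weights remove the need for multipliers, the following Python sketch computes a dot product with ternary weights using only additions and subtractions and compares it with the ordinary multiply-accumulate form. It is a software analogy under simple assumptions (integer activations, weights already ternarized to -1, 0, or +1); the function names are hypothetical, and no particular ternarization method or FPGA design from the cited works is implied.</p>
<p><![CDATA[
```python
def ternary_dot(activations, ternary_weights):
    """Dot product with weights in {-1, 0, +1} using only add/subtract."""
    acc = 0
    for a, w in zip(activations, ternary_weights):
        if w == 1:
            acc += a      # +1 weight: accumulate the activation
        elif w == -1:
            acc -= a      # -1 weight: subtract the activation
        # 0 weight: the activation is skipped entirely
    return acc


def mac_dot(activations, weights):
    """Reference multiply-accumulate (MAC) form of the same dot product."""
    return sum(a * w for a, w in zip(activations, weights))


acts = [3, 5, -2, 7]
weights = [1, -1, 0, 1]
print(ternary_dot(acts, weights), mac_dot(acts, weights))
```
]]></p>
<p>Both functions return the same value; only the first avoids multiplications, which is what allows such designs to map the computation to LUTs instead of DSPs.</p></div>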
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">WEIGHT PRUNING OPTIMIZATION FRAMEWORK AND EVALUATIONS ON MOBILE DEVICES AND FPGAS</head><p>As discussed in Section 3, even with aggressive optimizations, FPGA-based solutions cannot achieve energy efficiency comparable to that of mobile-based solutions without a significant sacrifice of accuracy. This raises the question of how much performance mobile devices can achieve with appropriate optimizations. Since quantization techniques are not favored by mobile GPUs, as mentioned in Section 2.1, we adopt weight pruning to evaluate the improvement on mobile devices. We also evaluate the benefit of weight pruning on FPGAs. The conventional pruning schemes, namely unstructured pruning and coarse-grained structured pruning, either are not hardware friendly or introduce non-negligible accuracy loss, so they are not well suited to DNN acceleration. To solve this issue, the pattern-based pruning scheme was proposed in <ref type="bibr">[51,</ref><ref type="bibr">60]</ref> and is tailored to mobile devices. It prunes weights by constraining the locations of the remaining weights in a 3&#215;3 convolutional kernel to form a specific kernel pattern. Each kernel pattern retains 4 non-zero weights out of the original 3&#215;3 kernel to match the single-instruction multiple-data (SIMD) architecture of mobile processors and maximize the hardware throughput.</p><p>However, the inherent disadvantage of pattern-based pruning is that it can only be applied to 3&#215;3 CONV layers. In many networks, a large portion of the weights and computations is contributed by non-3&#215;3 CONV layers (e.g., 1&#215;1 CONV layers and fully connected layers).</p><p>Thus, we propose a DNN acceleration framework that mainly consists of two parts: 1) a unified block-based pruning scheme for both mobile devices and FPGAs; 2) compiler-based optimization for mobile devices. 
Our proposed block-based pruning scheme combines the advantages of unstructured pruning and structured pruning, achieving high accuracy and hardware friendliness simultaneously. It can be generally applied to all types of DNN layers and is compatible with both mobile devices and FPGAs. With our compiler-based optimization, the computation and storage reduction provided by weight pruning can be fully leveraged by mobile devices, and more efficient DNN execution code can be generated for faster DNN inference.</p></div>
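<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the pattern constraint of pattern-based pruning concrete, the sketch below applies a 4-entry pattern to a flattened 3&#215;3 kernel. The four candidate patterns and the magnitude-based selection rule are assumptions made purely for illustration; the actual pattern library and selection procedure are those of <ref type="bibr">[51,</ref><ref type="bibr">60]</ref> and are not reproduced here.</p>
<p><![CDATA[
```python
# Hypothetical pattern library: each pattern lists the 4 kept positions
# of a 3x3 kernel in row-major order (indices 0..8).
PATTERNS = [
    (0, 1, 3, 4),
    (1, 2, 4, 5),
    (3, 4, 6, 7),
    (4, 5, 7, 8),
]


def apply_best_pattern(kernel):
    """Prune a flat 3x3 kernel (9 weights) to the pattern that keeps the most magnitude.

    Returns the pruned kernel and the chosen pattern.
    """
    if len(kernel) != 9:
        raise ValueError("expected a flattened 3x3 kernel")
    best = max(PATTERNS, key=lambda p: sum(abs(kernel[i]) for i in p))
    pruned = [w if i in best else 0.0 for i, w in enumerate(kernel)]
    return pruned, best


kernel = [0.1, 0.9, 0.8,
          0.0, 0.5, 0.7,
          0.2, 0.1, 0.3]
pruned_kernel, chosen = apply_best_pattern(kernel)
print(chosen, pruned_kernel)
```
]]></p>
<p>Because the helper assumes exactly nine weights per kernel, it also illustrates the limitation noted above: the scheme has no direct counterpart for 1&#215;1 CONV or fully connected layers.</p></div>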
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Our Compiler-assisted Acceleration Framework for Mobile Devices</head><p>To achieve the best performance for different DNNs on a specific mobile device, our compiler framework incorporates multiple optimizations such as operator fusion, operation replacement, and constant folding to effectively optimize the DNN computation graph. We also adopt scheduling, nested parallelism, dense kernel reordering, and SIMD operation optimization techniques to enhance the execution efficiency. Auto-tuning is another key technique used in our framework, which can automatically obtain the best-suited runtime configuration for each unique case, e.g., tiling size, loop permutation, and data format.</p><p>Fig. <ref type="figure">6</ref>. Block-based pruning with 2D and 4D formats.</p><p>To quantitatively demonstrate the effectiveness and superiority of our proposed compiler optimization, we compare our framework with the mobile-based solutions MNN, TVM, and TF-Lite using dense models without weight pruning. As shown in Table <ref type="table">2</ref>, the dense models of VGG-16, ResNet-18, and MobileNet-V2 are evaluated on the ImageNet dataset. Even without weight pruning, our framework improves mobile-based DNN execution over the state-of-the-art mobile-based solutions.</p><p>Besides the conventional unstructured pruning and coarse-grained filter/channel pruning, our optimized compiler also adds the flexibility to support a variety of pruning schemes, including pattern-based pruning and our proposed block-based pruning. In fact, the unstructured and filter/channel pruning schemes are just extreme cases of our block-based pruning, as shall be seen next. This flexibility is not available in existing mobile-based solutions such as MNN, TVM, and TF-Lite.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Block-based Pruning</head><p>Inspired by pattern-based pruning, our block-based pruning also adopts a fine-grained pruning strategy while maintaining high regularity. Figure <ref type="figure">6</ref> (a) and (b) show our proposed block-based pruning in the 4D weight tensor format and the 2D weight matrix format, respectively. We divide the entire weight matrix of a DNN layer into a number of equal-sized blocks, each containing the weights from j channels of i filters. Then we prune entire column(s) within each block to maintain the regular block shape. Moreover, our block-based pruning scheme requires pruning a group of weights at the same location of all filters and all channels within a block to leverage hardware parallelism from both memory and computation perspectives. Note that conventional unstructured pruning and coarse-grained structured pruning can be considered special cases of our proposed block-based pruning, with a block size of 1&#215;1 and of the whole weight matrix, respectively.</p><p>The key advantage of block-based pruning is that it can simultaneously achieve the advantages of unstructured pruning (high accuracy) and coarse-grained structured pruning (high hardware inference performance). The high accuracy is attributed to the structural flexibility. Since block-based pruning uses a finer pruning granularity than coarse-grained structured pruning, higher structural flexibility is preserved when searching for the desired pruning structures. On the other hand, the high hardware inference performance is attributed to the appropriate block size. Since both FPGAs and mobile devices have limited computation resources, each block can still be large enough to exploit high hardware parallelism even when the weight matrix is divided into multiple smaller blocks. 
Generally, a smaller block size provides higher structural flexibility and thus higher accuracy, but at the cost of reduced speed. Conversely, a larger block size can better leverage the hardware parallelism to achieve higher acceleration, but it may cause a more severe accuracy loss. Thus, the block size is critical and should be chosen according to the target device.</p><p>To determine an appropriate block size for mobile devices, we first determine the number of channels contained in each block by considering the computation resources of the device. To achieve high parallelism, we set the number of channels in each block equal to the length of the vector registers in the mobile GPU. Then we determine the number of filters in each block by choosing the minimum number that satisfies the inference speed requirement. A relatively larger block size is required for FPGAs than for mobile devices to fully leverage the hardware parallelism.</p><p>Note that a model with block-based sparsity can be obtained through offline training using different pruning algorithms, such as group Lasso regularization <ref type="bibr">[72]</ref> or the Alternating Direction Method of Multipliers (ADMM) <ref type="bibr">[81]</ref>. Different pruning algorithms may yield different accuracy, but the energy efficiency remains the same as long as the same pruning scheme and ratio are used. Since we focus on energy efficiency in this paper, we do not explore the performance of different pruning algorithms.</p><p>In this paper, we adopt the ADMM-based pruning method <ref type="bibr">[81]</ref> to obtain the pruned models with our proposed block-based sparsity. Specifically, we replace the unstructured sparsity constraints used in the ADMM-based method <ref type="bibr">[81]</ref> with our proposed block-based sparsity constraints mentioned above. 
An advantage of using the ADMM-based method is that it can automatically determine the desirable column and row pruning rates for each block given a predefined pruning rate for the entire layer.</p></div>
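<div xmlns="http://www.tei-c.org/ns/1.0"><p>The block structure described in this section can be sketched in a few lines of Python. The snippet divides a 2D weight matrix (rows as filters, columns as channels) into equal-sized blocks and zeroes whole columns inside each block. Note the assumptions: the columns to prune are chosen here by smallest L2 norm with a fixed count per block, which is a simple stand-in and not the ADMM-based procedure adopted in this paper; the function name and the toy matrix are hypothetical.</p>
<p><![CDATA[
```python
def block_prune(weights, block_rows, block_cols, cols_to_prune):
    """Zero `cols_to_prune` whole columns inside every block of a 2D weight matrix.

    Assumption: columns are ranked by squared L2 norm within the block
    (a magnitude heuristic used only for this sketch).
    """
    n_rows, n_cols = len(weights), len(weights[0])
    pruned = [row[:] for row in weights]
    for r0 in range(0, n_rows, block_rows):
        rows = range(r0, min(r0 + block_rows, n_rows))
        for c0 in range(0, n_cols, block_cols):
            cols = list(range(c0, min(c0 + block_cols, n_cols)))
            norms = {c: sum(weights[r][c] ** 2 for r in rows) for c in cols}
            drop = sorted(cols, key=lambda c: norms[c])[:cols_to_prune]
            for c in drop:
                for r in rows:
                    # Same location pruned for every row (filter) in the block.
                    pruned[r][c] = 0.0
    return pruned


W = [
    [0.9, 0.1, 0.2, 0.8],
    [0.7, 0.2, 0.1, 0.6],
    [0.1, 0.5, 0.9, 0.3],
    [0.2, 0.6, 0.8, 0.1],
]
W_pruned = block_prune(W, block_rows=2, block_cols=2, cols_to_prune=1)
for row in W_pruned:
    print(row)
```
]]></p>
<p>In this sketch, a block size of 1&#215;1 reduces to element-wise (unstructured) pruning and a single block covering the whole matrix reduces to coarse-grained column pruning, mirroring the two special cases noted above.</p></div>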
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Compiler Optimizations for Mobile Device</head><p>In this section, we introduce several critical compiler optimizations used in our framework, including general support for block-based pruning with different block sizes, a layer fusion mechanism that fuses different layers to shrink the intermediate results of each operator, and an auto-tuning strategy to further tune the performance for different models and different devices.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1">General Support for Block-based Pruning Scheme with Different Block Sizes.</head><p>In general, a DNN model is constructed by stacking different layers (operators). To provide a layer-wise representation (LR) that describes each DNN layer, we design a domain-specific language (DSL) to represent the DNN model. During DNN execution, the computational graph is built based on the DSL, and further computational graph optimizations can also be applied to the DSL.</p><p>In our block-based pruning, we adopt the constraint that the same locations are pruned across all filters and all channels within a block. Because the same locations are pruned across filters, these filters skip reading the same input data, which mitigates the memory pressure among the threads processing these filters. Pruning the same locations across channels within a block ensures that all of these channels share the same computation indices, which eliminates the computation divergence among the threads processing the channels within each block. Such a design makes block-based pruning more efficient for computation-intensive CONV layers. To compress the model storage, we store the non-zero weights in a compact storage format. Inspired by the traditional CSR format, we use a blocked compressed storage that further compresses the index arrays. Thanks to the layer-wise representation, the same optimization method can be used for layers with different block sizes.</p></div>
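<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea behind a blocked compressed storage can be sketched as follows: because every row in a block keeps the same columns, one column-index list per block suffices, whereas CSR stores one index per non-zero weight. The layout below (a list of dictionaries with the hypothetical keys <code>row_start</code>, <code>cols</code>, and <code>values</code>) is an assumption made for illustration and is not the exact format used by our compiler.</p>
<p><![CDATA[
```python
def to_blocked_storage(pruned, block_rows, block_cols):
    """Encode a block-pruned matrix with one shared column-index list per block."""
    n_rows, n_cols = len(pruned), len(pruned[0])
    blocks = []
    for r0 in range(0, n_rows, block_rows):
        rows = range(r0, min(r0 + block_rows, n_rows))
        for c0 in range(0, n_cols, block_cols):
            cols = range(c0, min(c0 + block_cols, n_cols))
            # A column is kept if any row of the block has a non-zero weight in it.
            kept = [c for c in cols if any(pruned[r][c] != 0.0 for r in rows)]
            values = [[pruned[r][c] for c in kept] for r in rows]
            blocks.append({"row_start": r0, "cols": kept, "values": values})
    return blocks


def blocked_matvec(blocks, x, n_rows):
    """Matrix-vector product on the blocked format."""
    y = [0.0] * n_rows
    for blk in blocks:
        gathered = [x[c] for c in blk["cols"]]  # inputs read once, reused by all rows
        for i, row_vals in enumerate(blk["values"]):
            y[blk["row_start"] + i] += sum(v * g for v, g in zip(row_vals, gathered))
    return y


# Toy block-pruned matrix (2x2 blocks, one column pruned per block).
M = [
    [0.9, 0.0, 0.0, 0.8],
    [0.7, 0.0, 0.0, 0.6],
    [0.0, 0.5, 0.9, 0.0],
    [0.0, 0.6, 0.8, 0.0],
]
storage = to_blocked_storage(M, block_rows=2, block_cols=2)
num_indices = sum(len(b["cols"]) for b in storage)
y = blocked_matvec(storage, [1.0, 2.0, 3.0, 4.0], n_rows=4)
print(num_indices, y)
```
]]></p>
<p>For this toy matrix the blocked format stores 4 column indices for 8 non-zero weights, while a CSR encoding would store 8; the shared <code>gathered</code> list also reflects the reuse of input data across the filters of a block.</p></div>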
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">Layer Fusion Mechanism.</head><p>Our compiler optimization uses a layer fusion technique. It fuses computation operators in the computation graph to reduce both the memory consumption of intermediate results and the number of operators, thereby effectively reducing the inference latency.</p><p>We identify all fusion candidates in a model based on two kinds of properties in the polynomial calculation: computation laws (i.e., the associative, commutative, and distributive properties) and data access patterns. Compared to prior works using loop fusion <ref type="bibr">[3,</ref><ref type="bibr">5,</ref><ref type="bibr">6]</ref>, our fusion method is more aggressive yet avoids very expensive exploration. We restrict the search in two ways: (i) we explore only the opportunities offered by the above properties, and (ii) we consider only two cost metrics for fusion, namely enlarging the overall computation to improve CPU/GPU utilization and reducing memory accesses to improve memory performance.</p></div>
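<div xmlns="http://www.tei-c.org/ns/1.0"><p>One simple instance of fusion enabled by the distributive and associative properties is folding a per-output scale and shift into the preceding linear operator, since s(Wx + b) + t = (sW)x + (sb + t). The Python sketch below checks this identity numerically. It is illustrative only: the operators and helper names are hypothetical, and it does not represent the candidate search or cost metrics described above.</p>
<p><![CDATA[
```python
def linear(W, b, x):
    """y = W x + b for a small dense layer given as nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def scale_shift(y, s, t):
    """Per-output affine operator: y_i * s_i + t_i."""
    return [yi * si + ti for yi, si, ti in zip(y, s, t)]


def fuse_linear_scale_shift(W, b, s, t):
    """Fold the scale/shift into the linear layer: (sW) x + (s b + t)."""
    W_fused = [[si * w for w in row] for row, si in zip(W, s)]
    b_fused = [si * bi + ti for bi, si, ti in zip(b, s, t)]
    return W_fused, b_fused


W = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -0.5]
s = [2.0, 0.5]
t = [1.0, 1.0]
x = [1.0, -1.0]

unfused = scale_shift(linear(W, b, x), s, t)   # two operators, one intermediate result
W_f, b_f = fuse_linear_scale_shift(W, b, s, t)
fused = linear(W_f, b_f, x)                    # one operator, no intermediate result
print(unfused, fused)
```
]]></p>
<p>The fused form produces the same output with one operator instead of two and without materializing the intermediate vector, which is the kind of saving that fusion targets.</p></div>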
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.3">Auto-tuning.</head><p>In general, DNN execution involves many performance-critical tuning parameters, such as the data placement in GPU memory, matrix tiling sizes, and loop unrolling factors. It is hard to identify the best-suited configuration of these parameters manually. Like other DNN inference frameworks such as TVM, our compiler therefore incorporates auto-tuning. To make the auto-tuning process efficient, we use a Genetic Algorithm to explore the best configuration automatically; it allows the parameter search to start from an arbitrary number of initial chromosomes, which better exploits parallelism. These optimizations, together with the layer-fusion optimization, enable our framework to outperform other acceleration frameworks.</p></div>
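<div xmlns="http://www.tei-c.org/ns/1.0"><p>A genetic search over tuning parameters can be sketched as below. Everything in this snippet is an assumption made for illustration: the search space, the GA hyperparameters, and in particular the cost function <code>synthetic_latency</code>, which stands in for an on-device latency measurement. It is not the tuner implemented in our framework.</p>
<p><![CDATA[
```python
import random

# Hypothetical search space of tuning parameters.
SEARCH_SPACE = {
    "tile": [4, 8, 16, 32, 64],
    "unroll": [1, 2, 4, 8],
    "layout": ["NCHW", "NHWC"],
}


def synthetic_latency(cfg):
    """Stand-in cost model; a real tuner would time the generated kernel on the device."""
    cost = 10.0 + 0.5 * abs(cfg["tile"] - 16) + 1.0 * abs(cfg["unroll"] - 4)
    return cost + (0.0 if cfg["layout"] == "NHWC" else 2.0)


def random_config(rng):
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Each gene is inherited from one of the two parents."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SEARCH_SPACE}


def mutate(cfg, rng, rate=0.2):
    """Each gene is resampled with probability `rate`."""
    return {k: (rng.choice(SEARCH_SPACE[k]) if rng.random() < rate else v)
            for k, v in cfg.items()}


def genetic_tune(cost_fn, population_size=8, generations=10, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=cost_fn)
        parents = population[: population_size // 2]  # elitism: best half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return min(population, key=cost_fn)


best = genetic_tune(synthetic_latency)
print(best, synthetic_latency(best))
```
]]></p>
<p>The population size is a free parameter here, and the chromosomes of one generation can be evaluated independently, which is what makes this kind of search easy to parallelize.</p></div>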
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Optimization Results for Mobile Device</head><p>By incorporating weight pruning with our compiler optimizations, our framework can further boost mobile-based DNN execution. As shown in Figure <ref type="figure">7</ref>, we evaluate our framework on VGG16 and ResNet50 for the image classification task, YOLO-V4 for the object detection task, and C3D for the 3D activity detection task, on the ImageNet <ref type="bibr">[15]</ref>, MS-COCO <ref type="bibr">[43]</ref>, and UCF-101 <ref type="bibr">[68]</ref> datasets, respectively. We compare the results of our framework with the representative mobile-based solutions MNN, TVM, and TF-Lite. To evaluate the effectiveness of our proposed block-based pruning scheme separately, we also compare our block-based pruning results to the dense models and to the models pruned by coarse-grained filter/channel pruning, with all results optimized by our compiler optimization.</p><p>From Figure <ref type="figure">7</ref>, we observe that by adopting our compiler optimization and block-based pruning, a much lower inference latency (faster inference speed) is consistently achieved compared to MNN, TVM, and TF-Lite. Our framework has no accuracy loss on VGG16 and ResNet50, and a slight accuracy degradation on YOLO-V4 and C3D.</p><p>We also show a clear advantage of our proposed block-based pruning scheme over coarse-grained structured (filter/channel) pruning in terms of both accuracy and latency. Benefiting from the finer granularity, our block-based pruning scheme preserves higher structural flexibility for searching for a more desirable pruning structure, eventually leading to higher accuracy. 
With higher accuracy maintained, a much higher compression rate can be achieved by using block-based pruning than coarse-grained structured pruning, resulting in a lower inference latency.</p><p>Besides, we compare the energy efficiency of our proposed framework to FPGA-based solutions and mobile-based solutions. We use frames per second per watt (FPS/W), which accounts for the end-to-end DNN inference speed, to demonstrate the improvement in energy efficiency achieved by our compiler optimizations and block-based pruning. This is a better metric for evaluating the result of weight pruning, because pruning reduces the total computation. As shown in Figure <ref type="figure">8</ref>, with similar accuracy, our proposed framework outperforms the best mobile-based solution, MNN, by 2.36&#215;, 3.26&#215;, and 7.71&#215; in terms of FPS/W on ResNet18, MobileNetV2, and VGG16, respectively.</p><p>In addition to the Samsung Galaxy S20 smartphone, we also evaluate the energy efficiency on a Samsung Galaxy S10 smartphone. The Samsung Galaxy S10, which has a Qualcomm Snapdragon 855 chipset and an Adreno 640 mobile GPU, is less powerful than the Samsung Galaxy S20. In this way, we can evaluate the transferability of our framework and its energy efficiency relative to the FPGA-based designs. The energy efficiency results using the Samsung Galaxy S10 are shown in Table <ref type="table">3</ref>. It can be observed that our framework still consistently outperforms MNN, TVM, TF-Lite, and the FPGA-based designs.</p></div>
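<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following sketch illustrates why FPS/W is more informative than GOPS/W once pruning reduces the amount of computation per frame. All numbers are made up for illustration (a dense model needing 10 GOP per frame and a pruned model needing 2.5 GOP per frame, both at 4 W) and are not measurements from this paper; the function names are hypothetical.</p>
<p><![CDATA[
```python
def fps_per_watt(latency_ms: float, power_w: float) -> float:
    """End-to-end frames per second divided by power draw."""
    return (1000.0 / latency_ms) / power_w


def gops_per_watt(giga_ops_per_frame: float, latency_ms: float, power_w: float) -> float:
    """Executed giga-operations per second divided by power draw."""
    return giga_ops_per_frame / (latency_ms / 1000.0) / power_w


# Hypothetical dense and pruned versions of the same model on the same device.
dense = {"gop": 10.0, "latency_ms": 100.0, "power_w": 4.0}
pruned = {"gop": 2.5, "latency_ms": 25.0, "power_w": 4.0}

dense_fps_w = fps_per_watt(dense["latency_ms"], dense["power_w"])
pruned_fps_w = fps_per_watt(pruned["latency_ms"], pruned["power_w"])
dense_gops_w = gops_per_watt(dense["gop"], dense["latency_ms"], dense["power_w"])
pruned_gops_w = gops_per_watt(pruned["gop"], pruned["latency_ms"], pruned["power_w"])

print(f"FPS/W:  dense {dense_fps_w:.2f}, pruned {pruned_fps_w:.2f}")
print(f"GOPS/W: dense {dense_gops_w:.2f}, pruned {pruned_gops_w:.2f}")
```
]]></p>
<p>In this made-up case the pruned model is four times better in FPS/W while its GOPS/W is unchanged, because it simply executes fewer operations per frame; GOPS/W alone would hide the benefit of pruning.</p></div>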
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Optimization Results for FPGA</head><p>A key challenge in FPGA acceleration under our block-based sparsity is workload imbalance, because block-based pruning may lead to different degrees of sparsity across blocks and may not provide enough intra-block parallelism to fully exploit the large volume of computing resources on an FPGA.</p><p>The workload sharing technique <ref type="bibr">[67]</ref> can be used to address the workload imbalance issue. With the help of compiler-level code generation, the workload can be shared in an online and adaptive manner.</p><p>Figure <ref type="figure">9</ref> shows the speedup and accuracy results of our block-based pruning on FPGA under different block sizes. The speedup is normalized to the dense (unpruned) model, with weights and activations in 8 bits. We observe that, with block sizes of 16&#215;16 and 32&#215;32, our block-based pruning achieves a much higher speedup than unstructured pruning with only a minor accuracy drop. This indicates that our proposed block-based pruning can effectively accelerate DNN execution on FPGAs while maintaining accuracy. A larger block size is needed to fully exploit the hardware parallelism on FPGAs and achieve a higher speedup, but it also leads to a larger accuracy drop.</p><p>In a nutshell, our proposed block-based pruning can benefit both mobile-based and FPGA-based acceleration. It is important to note that when weight pruning is applied to both mobile devices and FPGAs, the hardware performance of both is enhanced while high accuracy is maintained. It therefore remains difficult for FPGAs to close the gap and achieve energy efficiency comparable to that of mobile devices at the same accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSION</head><p>In this paper, we conduct a thorough analysis of FPGAs and mobile devices under both baseline executions and optimized executions where state-of-the-art optimizations are applied. We observe that, for DNN executions, it is very difficult for FPGAs to beat mobile devices when considering both energy efficiency and model accuracy.</p><p>To further explore the benefit of weight pruning, we propose a novel hardware-friendly pruning strategy that can be effectively applied to both mobile devices and FPGAs, along with a compiler-assisted acceleration framework.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>We focus on off-the-shelf computing devices in this paper.</p></note>
		</body>
		</text>
</TEI>
