<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Pruning Parameterization with Bi-level Optimization for Efficient Semantic Segmentation on the Edge</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2023 June</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10417481</idno>
					<idno type="doi"></idno>
					<title level='j'>The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Changdi Yang</author><author>Pu Zhao</author><author>Yanyu Li</author><author>Wei Niu</author><author>Jiexiong Guan</author><author>Hao Tang</author><author>Minghai Qin</author><author>Bin Ren</author><author>Xue Lin</author><author>Yanzhi Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[With the ever-increasing popularity of edge devices, it is necessary to implement real-time segmentation on the edge for autonomous driving and many other applications. Vision Transformers (ViTs) have shown considerably stronger results for many vision tasks. However, ViTs with the fullattention mechanism usually consume a large number of computational resources, leading to difficulties for realtime inference on edge devices. In this paper, we aim to derive ViTs with fewer computations and fast inference speed to facilitate the dense prediction of semantic segmentation on edge devices. To achieve this, we propose a pruning parameterization method to formulate the pruning problem of semantic segmentation. Then we adopt a bi-level optimization method to solve this problem with the help of implicit gradients. Our experimental results demonstrate that we can achieve 38.9 mIoU on ADE20K val with a speed of 56.5 FPS on Samsung S21, which is the highest mIoU under the same computation constraint with real-time inference.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>module, which is the main cost of computations and latency (over 60%). To achieve this, we first formulate the problem with pruning parameterization to build a pruning framework with a soft mask as a representation of the pruning policy. With this soft mask, we further adopt thresholding to convert the soft mask into a binary mask so that the model is trained with actual pruned weights to obtain pruning results directly. This is significantly different from other methods <ref type="bibr">[27,</ref><ref type="bibr">47]</ref> to train with unpruned small non-zero weights and use fine-tuning to mitigate the performance degradation after applying pruning. Besides, to update the soft mask as long as the pruning policy, we adopt to straight though estimator (STE) method to make the soft mask differentiable. Thus, we can build the pruning parameterization framework with minimal overhead.</p><p>Based on this framework, we need to search the bestsuited layer width for each layer in the token pyramid module. It is non-trivial to perform the search. As the token pyramid module needs to extract multi-scale information from multiple spatial resolutions, the large hierarchical search space leads to difficulties of convergence. To resolve this problem, we adopt a bi-level optimization method. In the outer optimization, we try to obtain the pruning policy based on the pruning parameters (the soft mask). In the inner optimization, the optimized model weights with the best segmentation performance under this soft mask can be obtained. Compared with a typical pruning method, our work incorporates the implicit gradients with second-order derivatives to further guide the update of the soft mask and achieve better performance. Our experimental results demonstrate that we can achieve 38.9 mIoU (mean classwise intersection-over-union) on the ADE20K dataset with a speed of 56.5 FPS on Samsung S21, which is the highest mIoU under the same computation constraint with real-time inference speed. As demonstrated in Figure <ref type="figure">1</ref>, our models can achieve a better tradeoff between the mIoU and speed.</p><p>We summarize our contributions below, &#8226; We propose a pruning parameterization method to build a pruning framework with a soft mask. We further use a threshold-based method to convert the soft mask into the binary mask to perform actual pruning during model training and inference. Besides, STE is adopted to update the soft mask efficiently through gradient descent optimizers. &#8226; To solve the pruning problem formulated with the framework of pruning parameterization, we propose a bi-level optimization method to utilize implicit gradients for better results. We show that the second-order derivatives in the implicit gradients can be efficiently obtained through first-order derivatives, saving computations and memory. &#8226; Our experimental results demonstrate that we can achieve the highest mIoU under the same computation constraint on various datasets. Specifically, we can achieve 38.9 mIoU on the ADE20K dataset with a real-time inference speed of 56.5 FPS on the Samsung S21.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Real-Time Semantic Segmentation. Though semantic segmentation based on CNNs <ref type="bibr">[6,</ref><ref type="bibr">22,</ref><ref type="bibr">33,</ref><ref type="bibr">58,</ref><ref type="bibr">71]</ref> can achieve great performance, it typically costs large amounts of computations with slow inference speed. Furthermore, with the ever-increasing popularity of edge devices such as mobile phones, it is necessary to achieve fast inference speed for semantic segmentation on edge devices <ref type="bibr">[4]</ref>. Besides human designed lightweight models <ref type="bibr">[21,</ref><ref type="bibr">40,</ref><ref type="bibr">67]</ref>, neural architecture search (NAS) methods <ref type="bibr">[1,</ref><ref type="bibr">7,</ref><ref type="bibr">42,</ref><ref type="bibr">45,</ref><ref type="bibr">69]</ref> are also adopted to search lightweight models. As a pioneer in handcrafted real-time segmentation, ENet <ref type="bibr">[50]</ref> designs a lightweight model for fast inference. DeepLabV3+ <ref type="bibr">[6]</ref> adopts atrous separable convolution to reduce computation counts and uses the lightweight Mo-bileNetV2 <ref type="bibr">[56]</ref> as the backbone. BiSeNet <ref type="bibr">[66,</ref><ref type="bibr">67]</ref> and STDC <ref type="bibr">[21]</ref> utilizes a two-branch architecture, where the deep branch extracts spatial information, and the shallow branch learns details. SFNet <ref type="bibr">[40]</ref> uses flow alignment module to fuse context and spatial information.</p><p>Inspired by the recent success of NAS, some works automatically search lightweight segmentation models. To reduce the computation cost, FasterSeg <ref type="bibr">[7]</ref> incorporates latency regularization during search. DCNAS <ref type="bibr">[69]</ref> uses a densely connected search space and employs a gradientbased direct search method. NASViT <ref type="bibr">[24]</ref> proposes a gradient projection algorithm to deal with the gradient conflict issues and improve the convergence performance. HR-NAS <ref type="bibr">[15]</ref> keeps high-resolution representations in its encoder and the search process to maintain high accuracy in dense prediction. RTSeg <ref type="bibr">[43]</ref> redesigns backbone and proposes a latency-driven search framework.</p><p>Although many lightweight segmentation models are developed, it is still hard for them to run real-time inference on resource-and power-limited GPUs of edge devices. Vision Transformers. By utilizing the self-attention mechanism, ViTs <ref type="bibr">[17]</ref> can achieve competitive results against CNN models in vision tasks. Inspired by the great success of ViTs, many research efforts are devoted to dense prediction tasks with ViTs on complex datasets. SETR <ref type="bibr">[72]</ref> utilizes a ViT-based encoder to extract highlevel semantic information. Instead of per-pixel prediction, MaskFormer <ref type="bibr">[11]</ref> performs mask-based prediction with a customize backbone and a transformer-based decoder. With a transformer decoder to explore masked attention, mask2Former <ref type="bibr">[10]</ref> proposes a universal architecture and achieves SOTA performance for various segmentation tasks. MobileViT <ref type="bibr">[48]</ref> and MobileFormer <ref type="bibr">[9]</ref> mix CNN and ViT in their architectures, but they are not fast enough for real-time inference on edge devices. TopFormer <ref type="bibr">[68]</ref> uti-lizes c CNN-based token pyramid module to extract multilevel tokens and lightweight ViT blocks to extract semantic information. It achieves low latency and high accuracy on complex datasets. Neural Architecture Search and Pruning. NAS and pruning are commonly used to find lightweight models and reduce the computation overhead. Given a search space, NAS tries to identify a superior model automatically. Reinforcement learning (RL) based NAS <ref type="bibr">[51,</ref><ref type="bibr">75]</ref> and evolution-based NAS <ref type="bibr">[18,</ref><ref type="bibr">53]</ref> usually need to train and evaluate each candidate model, leading to tremendous searching cost. To mitigate this, gradient-based NAS <ref type="bibr">[5,</ref><ref type="bibr">12,</ref><ref type="bibr">29,</ref><ref type="bibr">46]</ref> is proposed to formulate a supernet with all candidate architectures and search the outstanding architecture through gradient descent methods. But it costs huge memory to incorporate all candidate architectures.</p><p>Network pruning is a compression technique to effectively reduce the DNN storage and computation cost <ref type="bibr">[31]</ref>. In this work, we focus on structured pruning <ref type="bibr">[39,</ref><ref type="bibr">61]</ref> to remove entire filters or channels of CONV layers, which can be accelerated effectively for fast inference.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Problem Formulation with Pruning Parameterization</head><p>In our method, we first formulate the pruning problem with pruning parameterization. Previous pruning methods usually depend on the magnitudes of model weights and adopt the in-differentiable sorting operations, leading to inconsistent performance after applying pruning with sorting and additional overhead with fine-tuning. To mitigate this, we propose a pruning parameterization method, which uses a soft mask (rather than magnitudes of weights) to indicate whether to prune and get rid of sorting operations. With the STE method <ref type="bibr">[3]</ref>, we are able to represent the pruning with parameters and directly train the pruning like a typical model training. Then we formulate our problem with pruning parameterization and introduce our solution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Soft Mask Construction</head><p>We first introduce how to construct the soft mask. To accelerate the inference, we adopt channel pruning to search for a suitable width for each convolution (CONV) layer. Specifically, we insert a depth-wise CONV layer following each CONV layer that is supposed to be pruned as below,</p><p>where &#8857; denotes the convolution operation. w l &#8712; R o&#215;i&#215;k&#215;k is the weight parameters in l-th CONV layer, with o output channels, i input channels, and kernels of size k &#215; k. a l &#8712; R n&#215;o&#215;t&#215;t &#8242; represents the output features of the l-th layer, with o channels and t &#215; t &#8242; feature size. n denotes batch size. s l &#8712; R o&#215;1&#215;1&#215;1 is the weights of the depth-wise CONV layer. Each output channel of w l &#8857; a l-1 corresponds to one single element of s l . Thus s l can serve as a soft mask or pruning indicator for the l-th CONV layer to indicate whether to prune the corresponding channels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Forward and Backward Propagation</head><p>Since s l is only a soft mask with continuous values, it can not represent the binary operation of pruning. To solve this, we adopt a threshold, and the forward pass with the mask is represented as</p><p>where b l &#8712; {0, 1} o&#215;1&#215;1&#215;1 is the binarized s l , and &#964; is a threshold which is simply set to 0.5 in our case. Each element of b l corresponds to one output channel of w l &#8857; a l-1 .</p><p>In the forward pass, we first convert the soft mask into a binary mask through a threshold and then perform the depth-wise CONV to perform actual pruning. Thus the output channels corresponding to the zero elements in b l are pruned, and the rest channels are preserved. We show the proof in Appendix A.</p><p>Advantages of the Binary Operation. With the binary operation to obtain the binary mask, the pruned weights become zero during computations of both training and inference. This is different from some pruning works <ref type="bibr">[27,</ref><ref type="bibr">30,</ref><ref type="bibr">47]</ref> to perform training with small but non-zero weights and inference with binary weights, which has inconsistent performance between training and inference, and requires additional finetuning.</p><p>Why &#964; = 0.5. Since the s l parameters in Equation ( <ref type="formula">2</ref>) are initialized randomly between 0 and 1, we set &#964; = 0.5 as a threshold for the binary operation. &#964; can be set to other values (such as 0.3 or 0.6). The key point is that s l can be updated to increase above &#964; to keep the corresponding channel or decrease below &#964; to prune the channel. The binary operation in Equation ( <ref type="formula">2</ref>) is non-differential, leading to difficulties for back-propagation. To solve this, we propose to use STE <ref type="bibr">[3,</ref><ref type="bibr">64]</ref> for back-propagation below,</p><p>where we directly pass the gradients of b l to s l so that we can update the soft mask. Why Named Pruning Parameterization. With the binarization and STE method, the pruning process can be represented with the soft mask s = {s l }. We can update the soft mask to update the pruning policy based on its gradients. Thus s is the pruning parameters to denote and control pruning, i.e., pruning parameterization.</p><p>Difference with Other Pruning Works. Based on pruning parameterization, we decouple the pruning policy from model parameter magnitudes so that pruning does not further depend on the weight magnitudes. Unlike previous works on pruning to force pruned weights to be or as close as to zeros <ref type="bibr">[27,</ref><ref type="bibr">30,</ref><ref type="bibr">47]</ref>, our method does not have such a constraint that pruned weights should be zero. Instead, once the corresponding binary mask is turned from 1 to 0, the information in pruned channels is preserved rather than zeroed out since zeros in b l can block gradient flow to the corresponding weights. As a result, pruned channels are free to recover and contribute to accuracy if their corresponding elements in b l switch from 0 to 1.</p><p>Difference with Other Mask Methods. Although some other works also adopt indicator/mask-based pruning such as <ref type="bibr">[27,</ref><ref type="bibr">28,</ref><ref type="bibr">36,</ref><ref type="bibr">37]</ref>, our method is more straightforward and effective. For example, unlike our method to train the soft mask with STE directly, DAIS <ref type="bibr">[27]</ref> relaxes the binarized channel indicators to be continuous. To bridge the nonnegligible discrepancy between the continuous model and the target binarized model, it further uses an annealingbased procedure to steer the indicator convergence toward binarized states. Some works <ref type="bibr">[36,</ref><ref type="bibr">65]</ref> adopts batchnorm (BN) layers and uses the cumulative density function (CDF) of a Gaussian distribution as the mask variable, with a CDFbased loss function and the Gumbel-Softmax trick to update the mask at the cost of additional random sampling and complex gradient revision. Besides, the works <ref type="bibr">[23,</ref><ref type="bibr">65]</ref> create a mask to multiple the channels weights. This is different from our design with the depth-wise CONV operation for easy mask creation and direct mask training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Training Loss with Pruning Parameterization</head><p>Based on the soft mask, we can train and prune the model with the following loss function,</p><p>where L(w, s) is the cross-entropy loss, and L reg is the regularization term related to the sparsity or pruning ratio. For simplicity, we take Multiply-Accumulate operations (MACs) as the regularization rather than parameter number to estimate the on-device execution cost more precisely. &#946; can weight the loss and stabilize training. L reg can be defined as the squared &#8467; 2 norm of the difference between current MACs and target MACs C,</p><p>where o &#8242; l is the number of remaining channels after pruning, and t l &#215; t &#8242; l is the output feature size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Problem Formulation</head><p>With pruning parameterization, the per-layer width search problem can be formulated as follows</p><p>It is a bi-level optimization problem <ref type="bibr">[25,</ref><ref type="bibr">35]</ref>. In the inner optimization of Equation ( <ref type="formula">7</ref>), we optimize the model parameters under a given soft mask with a commonly used squared &#8467; 2 norm as a regularization to mitigate the overfitting problem. In the outer optimization of Equation ( <ref type="formula">6</ref>), we optimize the soft mask to minimize the loss. Each time before updating s, we first need to obtain w * .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Proposed Method with Bi-level Optimization</head><p>The objective is to find the soft mask (search a suitable width) for each layer to minimize the loss with optimized model weights. We adopt a bi-level pruning method to solve this problem. Compared with a typical gradient descent method to update the parameters with first-order derivatives, the bi-level optimization method incorporates implicit gradients with second-order derivatives to adjust the first-order term, leading to higher training efficiency with better convergence results. For easy of expression, since each channel has a corresponding mask, in this section, we broadcast the masks s and b so that the masks have the same dimension as the model weights w.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Bi-level Optimization with Implicit Gradients</head><p>From the inner optimization, w * is a function of s and different s can lead to different w * . Thus, to minimize L m (w * , s) in Problem <ref type="bibr">(6)</ref>, we need to compute the gradients with reference to s as below, dLm(w * , s) ds = dw * ds &#8711;wLm(w * , s) + &#8711;sLm(w * , s), <ref type="bibr">(8)</ref> where &#8711; w and &#8711; s denote the partial derivatives of the loss function with reference to w and s, respectively. dw * ds represents the vector-wise full derivative, and we omit the transpose expression. Since w * is implicitly defined as an optimization problem in Equation ( <ref type="formula">7</ref>), dw * ds is also known as implicit gradients <ref type="bibr">[52,</ref><ref type="bibr">54,</ref><ref type="bibr">55,</ref><ref type="bibr">70]</ref>. With g(w, s) = L(w, s) + 1 2&#955; w T w, dw * ds can be obtained through the following,</p><p>where &#8711; 2 sw and &#8711; 2 w are the second-order partial derivatives. We show the proof in Appendix B.</p><p>The Hessian matrix &#8711; 2 w g(w * , s) can be given by</p><p>where we adopt a Hessian-free approximation that &#8711; 2 w L(w * , s) = 0 as DNNs usually have piece-wise linear decision boundary with ReLU functions. Thus, Equation (8) can be transformed to</p><p>Compared with a typical gradient descent method, the bilevel optimization incorporates the second-order derivatives &#8711; 2 sw L(w * , s)&#8711; w L m (w * , s) to adjust the first-order term &#8711; s L m (w * , s).</p><p>It is difficult to obtain the second-order derivatives &#8711; 2 sw L(w * , s). Usually certain approximation methods such as finite difference <ref type="bibr">[8,</ref><ref type="bibr">20]</ref> may be adopted to save computation cost. But here we show that in this specific problem, we can directly obtain the analytical solution with first-order derivatives, which greatly saves computation efforts without approximation. Note that each channel has its corresponding mask, and we denote the masked channels as w b = w * &#8226; b where &#8226; means the element-wise multiplication and b is defined in Equation ( <ref type="formula">2</ref>). We can obtain that</p><p>where diag(&#8226;) represents formulating a diagonal matrix with the diagonal vector. We show the proof in Appendix C.</p><p>Then we can transform Equation ( <ref type="formula">11</ref>) into the following,</p><p>Thus although the implicit gradients incorporate secondorder derivatives, it can be analytically expressed with firstorder derivatives, greatly saving computation cost without any approximations.</p><p>In Equation ( <ref type="formula">13</ref>), we need to create two copies w b and w to obtain their derivatives, which are still not memory efficient. To further reduce the complexity, we can obtain &#8711; w b L(w b ) in the following,</p><p>The pruned weights (b = 0) do not contribute to the loss, so their gradients are zero. The difference between L(&#8226;) and L m (&#8226;) is just the regularization term L reg (&#8226;), which only relates to the sparsity and does not care the weight values. So &#8711; w b L(w b ) and &#8711; w L m (w) are equal if their weights are not pruned (b = 1). Combining Equation ( <ref type="formula">13</ref>) and ( <ref type="formula">14</ref>), we can obtain the following,</p><p>We can see that there is no need to keep a copy and compute the gradients of w b , thus saving memory. During practical implementation, since s and b are the channel-wise masks, we accumulate the gradients of all weights in each channel to update the channel mask following Equation <ref type="bibr">(15)</ref>. The Case without Implicit Gradients. In Equation ( <ref type="formula">8</ref>), we can omit the first term with implicit gradients. Thus we have dLm(w * ,s) ds = &#8711; s L m (w * , s) and there is no need to deal with the second-order term. It is easier to solve. But we demonstrate in the experiments that our method with implicit gradients can help to boost the performance without a significant increase of the computation cost in Section 5.4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Bi-level Optimization Framework</head><p>In each iteration, our framework has two steps, including the model weights training step and the mask updating step. In the first step, we update the weights w with a few training steps for a fixed mask s. Next in the second step, we update s with implicit gradients following Equation <ref type="bibr">(15)</ref>. Then we move on to the next iteration. In each step, we only update w or s without changing the other parameter. We discuss the advantages of the proposed method in the following. No Need for Finetuning. After training with the proposed method, we can obtain the final layer-wise pruning policy following s and the sparse model. Since the model is already trained with the binary masks, it can achieve good segmentation performance without further fine-tuning. First-Order Optimization. During the computation, we only adopt first-order optimization. Though we incorporate second-order derivatives in implicit gradients, we show that it can be analytically expressed with first-order derivatives in Equation ( <ref type="formula">12</ref>), which greatly saves computation cost. We further optimize the computation process in Equation ( <ref type="formula">15</ref>) to save memory cost. Recoverable Contribution. During pruning, though some channels are pruned, their weights are not set to 0 due to the protection of the mask. When their mask is updated from 0 to 1, they can recover and contribute the accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Datasets and Evaluation Metrics</head><p>ADE20K. ADE20K <ref type="bibr">[73,</ref><ref type="bibr">74]</ref> is a scene parsing dataset containing 25k images in the training set and 2k images in the validation set with 150 label classes. Cityscapes. Cityscapes <ref type="bibr">[14]</ref> is a dataset of urban street scenes from cars collected in 50 cities. It includes 5,000 finely annotated images, in which 2,975 images are used for training, 500 for validation, and 1,525 for testing. We exclude the extra training data with coarse labels throughout this paper. This dataset has 30 label classes, and 19 of them are used for segmentation. The resolution of images is 2048 &#215; 1024. Cityscapes dataset is an intensively studied benchmark for semantic segmentation, but it is challenging to perform real-time inference on such a high resolution. Pascal VOC. PASCAL Visual Object Classes (VOC) 2012 <ref type="bibr">[19]</ref> is a widely used dataset for semantic segmentation, classification, and object detection tasks. There are 1,464 images for training and 1,449 images for validation. We show the results on Pascal VOC in Table <ref type="table">4</ref>. Evaluation Metrics. For semantic segmentation evaluation, we use the mean of class-wise intersection-over-union (mIoU) to measure the accuracy performance. We introduce the details of mIoU in Appendix D. For our results, we run our algorithm three times and report the mean and variance of the mIoU performance. For the baseline methods, some Table <ref type="table">1</ref>. Comparison of our searched model and prior arts on the ADE20K val dataset. We compare with popular handcraft baselines in the first segment, NAS-based models in the second segment, pruning-based methods in the third segment and lightweight ViT-based models in the fourth segment. We measure the FPS on the Qualcomm Adreno 660 GPU of the Samsung Galaxy S21 mobile phone. Some FPS results are not available due to unsupported operations on the mobile device. methods cost too many resources and we can hardly rerun their experiments. Some methods provide their well-trained models, and we can test with the trained model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Category</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Experimental Settings</head><p>Train Settings. We perform the pruning to search for a suitable width for each CONV layer in the TopFormer model.</p><p>To enable a larger search space, we adopt the TopFormer architecture and use a larger per-layer width compared with the TopFormer-Base model. So our unpruned model has 4.1GMACs (with 39.9 mIoU), larger than the 1.8GMACs of the TopFormer-Base model. We use stochastic gradient descent (SGD) optimizer, and momentum is set to 0.9, and set the batch size to 8 on each GPU. For ADE20K, the initial learning rate is set to 1.2 &#215; 10 -4 , and the "poly" learning rate policy is applied. For the Cityscapes dataset, the initial learning rate is set to 3 &#215; 10 -4 , and we apply the "poly" learning rate policy. For PASCAL VOC 2012, we set the initial learning rate as 0.01. Learning rate value is determined as 1 -iter total iter 0.9</p><p>where iter refers to the current iteration number. For the ADE20K dataset, we incorporate data augmentation by resizing with the random ratio between 0.5 and 2.0 as well as random flipping. On the Cityscapes dataset, multiple random scaling {0.5, 0.75, 1.0, 2.0} and fixed size cropping of 512 &#215; 1024 are adopted for data augmentation. We choose the crop size for a better trade-off between mobile capacity and accuracy. We set hyperparameters &#946; = 0.01 and &#955; = 0.1 in the experiments. Test Settings. Instead of muti-scale testing, we employ single scale testing for a fair comparison. For the ADE20K dataset, we use 512 &#215; 512 as the input resolution. For the Cityscapes dataset, 512 &#215; 1024 (rather than 1024 &#215; 2048) is used as the inference resolution during testing for the following reasons. (i) In practice, we cannot use the resolution 1024 &#215; 2048 since it causes memory overflow problems on our selected mobile phone. (ii) Besides, since the screens on edge devices such as mobile phones are not very large, the resolution of 512 &#215; 1024 is good enough to serve on the small screens. (iii) Moreover, we find that this 512 &#215; 1024 resolution can greatly speedup the inference on the mobile phones without significant accuracy degradation. Experiment Environments. We train and prune the model using PyTorch 1.9 and CUDA 11.1 on 8 NVIDIA RTX TI-TAN GPUs. We measure the mobile latency on the GPU of an Samsung Galaxy S21 smartphone, with Qualcomm Snapdragon 888 mobile platform integrated with Qualcomm Kryo 680 Octa-core CPU and a Qualcomm Adreno 660 GPU. Note that for most baseline works, even the reduced resolution (512 &#215; 1024) can cause an out-of-memory problem on the selected mobile device. Compiler Framework on Mobile Devices. We need to compile the models before they can be executed on mobile devices. For TopFormer, we adopt the compiler TNN <ref type="bibr">[13]</ref> to report the speed performance, which is also used in the original TopFormer paper. To further improve the inference speed, we adopt several compiler optimization methods and Table <ref type="table">2</ref>. Our search results on the Cityscapes val dataset. We compare with popular handcraft baselines, NAS-based models, pruning based methods and lightweight ViT-based models. We measure the FPS on the Qualcomm Adreno 660 GPU of the Samsung Galaxy S21 mobile phone. Some FPS results are not available due to unsupported operations on the mobile device. develop an advanced compiler framework to compile our models and test their on-mobile speed. We show the details about our compiler optimization in Appendix E.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Experimental Results and Analysis</head><p>Segmentation Performance. Based on the TopFormer model, we obtain three models (Ours-Base, Ours-Small, and Ours-Tiny) with different computations. We show the comparison results on ADE20K in Table <ref type="table">1</ref> and Cityscapes in Table <ref type="table">2</ref>. (i) For ADE20K, we can observe that the humandesigned segmentation models and NAS-based models usually consume many computations (such as DeepLabV3+ with 26.9GMACs) in terms of MACs (multiply-accumulate operations). They can hardly achieve real-time inference on edge devices. Some NAS-based models with a small number of computations are not able to achieve high mIoU (such as HR-NAS-A with 33.2 mIoU). (ii) For the comparison with other pruning methods, we start from the same dense model and set the target GMACs to the same number (1.2GMACs) to make a fair comparison. We can see that our small model can achieve higher mIoU given the same GMACs. (iii) Compared with the ViT-based TopFormer, our models can achieve higher mIoUs with non-trivial improvements under the same computation budget (such as Ours-Small 37.5 mIoU v.s. 36.1 of TopFormer-Small under the same 1.2GMACs). Our models can achieve a faster inference speed (FPS higher than 50) on mobile phones compared with TopFormer models. We show the comparison with the baselines in terms of mIoU and FPS in Figure <ref type="figure">1</ref>.</p><p>We can achieve a better trade-off between mIoU and FPS. (i) On the Cityscapes dataset, compared with handcrafted CNN-based segmentation models, including BiSeNetV2 and STDC, our searched base model greatly reduces the GMACs (such as Ours-Base 3.6GMACs v.s. 24.6GMACs of BiSeNetV2) and achieves non-trivial better accuracy (Ours-Base 74.7 mIoU v.s. 73.4 mIoU of BiSeNetV2), as shown in Table <ref type="table">2</ref>. (ii) Compared with NAS-based methods such as FasterSeg, our models can achieve higher mIoU with much smaller computation costs. Other NAS methods consume too many computations (e.g., DCNAS with 300GMACs) to be deployed on edge devices. (iii) Compared with other pruning baselines, under the same computation budgets (2.4GMACs), our small model can achieve higher mIoU. (iv) Compared with transformerbased methods such as SegFormer-B0, similarly, our models achieve higher mIoU with less computations. Our small and tiny models have similar computations compared with TopFormer-Base and TopFormer-Tiny. But we can achieve higher mIoU. Speed Performance on Mobile Devices. Our base model can achieve 56.5 FPS on the mobile device (Samsung Galaxy S21), which implements real-time execution with competitive segmentation performance, as shown in Table 1. Other baseline methods except TopFormer can hardly achieve real-time segmentation on edge devices, usually with FPS lower than 10. Our faster inference speed than TopFormer is achieved with the compiler optimization techniques, detailed in Appendix E. Search Overhead. We show the search cost in Table <ref type="table">3</ref>. We only show the cost of our small model since our base, small and tiny models have similar search costs. It takes approximately 1.3 GPU days, which is smaller than the search cost of most other NAS-based segmentation methods. Our method can efficiently search out a compact model with Our method is general and can be applied to other model structures. We show the results for DeepLabV3+ <ref type="bibr">[6]</ref> on Cityscapes in Table <ref type="table">5</ref>. Our two searched models have fewer parameters and computations with accelerated on-mobile inference speed, while achieving better mIoU compared with the original DeepLabV3+ model. Other models such as BiSeNetV2 or STDC cost much more computations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Ablation Study</head><p>In our bi-level optimization, we incorporate implicit gradients in Equation <ref type="bibr">(11)</ref>. To demonstrate the advantages with implicit gradients, we consider the case without implicit gradients which omits the second term in Equation <ref type="bibr">(11)</ref> with just dLm(w * ,s) ds = &#8711; s L m (w * , s). We compare the performance of the solution without implicit gradients and our bilevel solution with implicit gradients on Cityscapes in Table 3. We can observe that our solution with implicit gradients has a search cost slightly higher than the solution without implicit gradients, demonstrating that our first-order solution for the second-order derivatives in Equation ( <ref type="formula">12</ref>) can effectively save computation cost. For mIoU, our method can achieve higher mIoU with non-trivial improvements, demonstrating that incorporating implicit gradients can effectively boost the performance. Hyperparameter Tuning. We show the results with various &#946; and &#955; in Appendix G. To compare with baselines under certain computations, we mainly show the results of our three sparse models (Ours-Base, Ours-Small, and Ours-Tiny). We show the results of more models under other computations in Appendix H.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>We propose pruning parameterization with the thresholding and STE methods to build a pruning framework. Based on the framework, we formulate the problem and propose a bi-level optimization method with the implicit gradients. Our experimental results demonstrate that we can achieve the highest mIoU under the same computation constraint on various datasets. Specifically, we can achieve 38.9 mIoU on the ADE20K with a real-time inference speed of 56.5 FPS on the Samsung S21.</p></div></body>
		</text>
</TEI>
