skip to main content

Title: TPrune: Efficient Transformer Pruning for Mobile Devices
The invention of Transformer model structure boosts the performance of Neural Machine Translation (NMT) tasks to an unprecedented level. Many previous works have been done to make the Transformer model more execution-friendly on resource-constrained platforms. These researches can be categorized into three key fields: Model Pruning, Transfer Learning, and Efficient Transformer Variants. The family of model pruning methods are popular for their simplicity in practice and promising compression rate and have achieved great success in the field of convolution neural networks (CNNs) for many vision tasks. Nonetheless, previous Transformer pruning works did not perform a thorough model analysis and evaluation on each Transformer component on off-the-shelf mobile devices. In this work, we analyze and prune transformer models at the line-wise granularity and also implement our pruning method on real mobile platforms. We explore the properties of all Transformer components as well as their sparsity features, which are leveraged to guide Transformer model pruning. We name our whole Transformer analysis and pruning pipeline as TPrune. In TPrune, we first propose Block-wise Structured Sparsity Learning (BSSL) to analyze Transformer model property. Then, based on the characters derived from BSSL, we apply Structured Hoyer Square (SHS) to derive the final pruned models. Comparing with the state-of-the-art Transformer pruning methods, TPrune is able to achieve a higher model compression rate with less performance degradation. Experimental results show that our pruned models achieve 1.16×–1.92× speedup on mobile devices with 0%–8% BLEU score degradation compared with the original Transformer model.  more » « less
Award ID(s):
1717657 1822085 1937435
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
ACM Transactions on Cyber-Physical Systems
Page Range / eLocation ID:
1 to 22
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Deep convolutional neural network (DNN) has demonstrated phenomenal success and been widely used in many computer vision tasks. However, its enormous model size and high computing complexity prohibits its wide deployment into resource limited embedded system, such as FPGA and mGPU. As the two most widely adopted model compression techniques, weight pruning and quantization compress DNN model through introducing weight sparsity (i.e., forcing partial weights as zeros) and quantizing weights into limited bit-width values, respectively. Although there are works attempting to combine the weight pruning and quantization, we still observe disharmony between weight pruning and quantization, especially when more aggressive compression schemes (e.g., Structured pruning and low bit-width quantization) are used. In this work, taking FPGA as the test computing platform and Processing Elements (PE) as the basic parallel computing unit, we first propose a PE-wise structured pruning scheme, which introduces weight sparsification with considering of the architecture of PE. In addition, we integrate it with an optimized weight ternarization approach which quantizes weights into ternary values ({-1,0,+1}), thus converting the dominant convolution operations in DNN from multiplication-and-accumulation (MAC) to addition-only, as well as compressing the original model (from 32-bit floating point to 2-bit ternary representation) by at least 16 times. Then, we investigate and solve the coexistence issue between PE-wise Structured pruning and ternarization, through proposing a Weight Penalty Clipping (WPC) technique with self-adapting threshold. Our experiment shows that the fusion of our proposed techniques can achieve the best state-of-the-art ∼21× PE-wise structured compression rate with merely 1.74%/0.94% (top-1/top-5) accuracy degradation of ResNet-18 on ImageNet dataset. 
    more » « less
  2. There have been many recent attempts to extend the successes of convolutional neural networks (CNNs) from 2-dimensional (2D) image classification to 3-dimensional (3D) video recognition by exploring 3D CNNs. Considering the emerging growth of mobile or Internet of Things (IoT) market, it is essential to investigate the deployment of 3D CNNs on edge devices. Previous works have implemented standard 3D CNNs (C3D) on hardware platforms, however, they have not exploited model compression for acceleration of inference. This work proposes a hardware-aware pruning approach that can fully adapt to the loop tiling technique of FPGA design and is applied onto a novel 3D network called R(2+1)D. Leveraging the powerful ADMM, the proposed pruning method achieves simultaneous high accuracy and significant acceleration of computation on FPGA. With layer-wise pruning rates up to 10× and negligible accuracy loss, the pruned model is implemented on a Xilinx ZCU102 FPGA board, where the pruned model achieves 2.6× speedup compared with the unpruned version, and 2.3× speedup and 2.3× power efficiency improvement compared with state-of-the-art FPGA implementation of C3D. 
    more » « less
  3. Deep Neural Networks (DNNs) are pervasively used in a significant number of applications and platforms. To enhance the execution efficiency of large-scale DNNs, previous attempts focus mainly on client-server paradigms, relying on powerful external infrastructure, or model compression, with complicated pre-processing phases. Though effective, these methods overlook the optimization of DNNs on distributed mobile devices. In this work, we design and implement MeDNN, a local distributed mobile computing system with enhanced partitioning and deployment tailored for large-scale DNNs. In MeDNN, we first propose Greedy Two Dimensional Partition (GTDP), which can adaptively partition DNN models onto several mobile devices w.r.t. individual resource constraints. We also propose Structured Model Compact Deployment (SMCD), a mobile-friendly compression scheme which utilizes a structured sparsity pruning technique to further accelerate DNN execution. Experimental results show that, GTDP can accelerate the original DNN execution time by 1.86 – 2.44⇥ with 2 – 4 worker nodes. By utilizing SMCD, 26.5% of additional computing time and 14.2% of extra communication time are saved, on average, with negligible effect on the model accuracy. 
    more » « less
  4. The memristor crossbar array has emerged as an intrinsically suitable matrix computation and low-power acceleration framework for DNN applications. Many techniques such as memristor-based weight pruning and memristor-based quantization have been studied. However, the high accuracy solution for the above techniques is still waiting for unraveling. In this paper, we propose a memristor-based DNN framework which combines both structured weight pruning and quantization by incorporating ADMM algorithm for better pruning and quantization performance. We also discover the non-optimality of the ADMM solution in weight pruning and the unused data path in a structured pruned model. We design a software-hardware co-optimization framework which contains the first proposed Network Purification and Unused Path Removal algorithms targeting on post-processing a structured pruned model after ADMM steps. By taking memristor hardware constraints into our whole framework, we achieve extreme high compression rate with minimum accuracy loss. For quantizing structured pruned model, our framework achieves nearly no accuracy loss after quantizing weights to 8-bit memristor weight representation. We share our models at anonymous link 
    more » « less
  5. It has been shown by many researchers that transformers perform as well as convolutional neural networks in many computer vision tasks. Meanwhile, the large computational costs of its attention module hinder further studies and applications on edge devices. Some pruning methods have been developed to construct efficient vision transformers, but most of them have considered image classification tasks only. Inspired by these results, we propose SiDT, a method for pruning vision transformer backbones on more complicated vision tasks like object detection, based on the search of transformer dimensions. Experiments on CIFAR-100 and COCO datasets show that the backbones with 20% or 40% dimensions/parameters pruned can have similar or even better performance than the unpruned models. Moreover, we have also provided the complexity analysis and comparisons with the previous pruning methods. 
    more » « less