The memristor crossbar array has emerged as an intrinsically suitable matrix computation and low-power acceleration framework for DNN applications. Many techniques such as memristor-based weight pruning and memristor-based quantization have been studied. However, the high accuracy solution for the above techniques is still waiting for unraveling. In this paper, we propose a memristor-based DNN framework which combines both structured weight pruning and quantization by incorporating ADMM algorithm for better pruning and quantization performance. We also discover the non-optimality of the ADMM solution in weight pruning and the unused data path in a structured pruned model. We design a software-hardware co-optimization framework which contains the first proposed Network Purification and Unused Path Removal algorithms targeting on post-processing a structured pruned model after ADMM steps. By taking memristor hardware constraints into our whole framework, we achieve extreme high compression rate with minimum accuracy loss. For quantizing structured pruned model, our framework achieves nearly no accuracy loss after quantizing weights to 8-bit memristor weight representation. We share our models at anonymous link https://bit.ly/2VnMUy0.
more »
« less
Towards Real-Time Segmentation on the Edge
There have been many recent attempts to extend the successes of convolutional neural networks (CNNs) from 2-dimensional (2D) image classification to 3-dimensional (3D) video recognition by exploring 3D CNNs. Considering the emerging growth of mobile or Internet of Things (IoT) market, it is essential to investigate the deployment of 3D CNNs on edge devices. Previous works have implemented standard 3D CNNs (C3D) on hardware platforms, however, they have not exploited model compression for acceleration of inference. This work proposes a hardware-aware pruning approach that can fully adapt to the loop tiling technique of FPGA design and is applied onto a novel 3D network called R(2+1)D. Leveraging the powerful ADMM, the proposed pruning method achieves simultaneous high accuracy and significant acceleration of computation on FPGA. With layer-wise pruning rates up to 10× and negligible accuracy loss, the pruned model is implemented on a Xilinx ZCU102 FPGA board, where the pruned model achieves 2.6× speedup compared with the unpruned version, and 2.3× speedup and 2.3× power efficiency improvement compared with state-of-the-art FPGA implementation of C3D.
more »
« less
- PAR ID:
- 10417486
- Date Published:
- Journal Name:
- AAAI'23: The Thirty-Seventh AAAI Conference on Artificial Intelligence
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
It is appealing but challenging to achieve real-time deep neural network (DNN) inference on mobile devices because even the powerful modern mobile devices are considered “resource-constrained” when executing large-scale DNNs. It necessitates the sparse model inference via weight pruning, i.e., DNN weight sparsity, and it is desirable to design a new DNN weight sparsity scheme that can facilitate real-time inference on mobile devices while preserving a high sparse model accuracy. This paper designs a novel mobile inference acceleration framework GRIM that is General to both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) and that achieves Real-time execution and high accuracy, leveraging fine-grained structured sparse model Inference and compiler optimizations for Mobiles. We start by proposing a new fine-grained structured sparsity scheme through the Block-based Column-Row (BCR) pruning. Based on this new fine-grained structured sparsity, our GRIM framework consists of two parts: (a) the compiler optimization and code generation for real-time mobile inference; and (b) the BCR pruning optimizations for determining pruning hyperparameters and performing weight pruning. We compare GRIM with Alibaba MNN, TVM, TensorFlow-Lite, a sparse implementation based on CSR, PatDNN, and ESE (a representative FPGA inference acceleration framework for RNNs), and achieve up to 14.08× speedup.more » « less
-
The ever-increasing number of layers, millions of parameters, and large data volume make deep learning workloads resource-intensive and power-hungry. In this paper, we develop a convolutional neural network (CNN) acceleration framework, named MLCNN, which explores algorithm-hardware co-design to achieve cross-layer cooperative optimization and acceleration. MLCNN dramatically reduces computation and on-off chip communication, improving CNN’s performance. To achieve this, MLCNN reorders the position of nonlinear activation layers and pooling layers, which we prove results in a negligible accuracy loss; then the convolutional layer and pooling layer are cooptimized by means of redundant multiplication elimination, local addition reuse, and global addition reuse. To the best of our knowledge, MLCNN is the first of its kind that incorporates cooperative optimization across convolutional, activation, and pooling layers. We further customize the MLCNN accelerator to take full advantage of cross-layer CNN optimization to reduce both computation and on-off chip communication. Our analysis shows that MLCNN can significantly reduce (up to 98%) multiplications and additions. We have implemented a prototype of MLCNN and evaluated its performance on several widely used CNN models using both an accelerator-level cycle and energy model and RTL implementation. Experimental results show that MLCNN achieves 3.2× speedup and 2.9× energy efficiency compared with dense CNNs. MLCNN’s optimization methods are orthogonal to other CNN acceleration techniques, such as quantization and pruning. Combined with quantization, our quantized MLCNN gains a 12.8× speedup and 11.3× energy efficiency compared with DCNN.more » « less
-
With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for reasons: 1) the different dimensions within same-type layers, 2) the different convolution layers especially transposed and dilated convolutions, and 3) CNN’s complex dataflow graph. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency by employing dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA to process normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully-pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN in the ONNX representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve 2.3 × performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups, with up to 15.98 × and 13.42 × for transposed and dilated convolutions, with a 6% average area overhead. The pipelined integration leads to a 5 × speedup for OpenPose.more » « less
-
Recurrent Neural Networks (RNNs) are becoming increasingly important for time series-related applications which require efficient and real-time implementations. The two major types are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. It is a challenging task to have real-time, efficient, and accurate hardware RNN implementations because of the high sensitivity to imprecision accumulation and the requirement of special activation function implementations. Recently two works have focused on FPGA implementation of inference phase of LSTM RNNs with model compression. First, ESE uses a weight pruning based compressed RNN model but suffers from irregular network structure after pruning. The second work C-LSTM mitigates the irregular network limitation by incorporating block-circulant matrices for weight matrix representation in RNNs, thereby achieving simultaneous model compression and acceleration. A key limitation of the prior works is the lack of a systematic design optimization framework of RNN model and hardware implementations, especially when the block size (or compression ratio) should be jointly optimized with RNN type, layer size, etc. In this paper, we adopt the block-circulant matrixbased framework, and present the Efficient RNN (E-RNN) framework for FPGA implementations of the Automatic Speech Recognition (ASR) application. The overall goal is to improve performance/energy efficiency under accuracy requirement. We use the alternating direction method of multipliers (ADMM) technique for more accurate block-circulant training, and present two design explorations providing guidance on block size and reducing RNN training trials. Based on the two observations, we decompose E-RNN in two phases: Phase I on determining RNN model to reduce computation and storage subject to accuracy requirement, and Phase II on hardware implementations given RNN model, including processing element design/optimization, quantization, activation implementation, etc. 1 Experimental results on actual FPGA deployments show that E-RNN achieves a maximum energy efficiency improvement of 37.4× compared with ESE, and more than 2× compared with C-LSTM, under the same accuracy.more » « less