Title: AutoQ: Automated Kernel-Wise Neural Network Quantization
Network quantization is one of the most hardware-friendly techniques for deploying convolutional neural networks (CNNs) on low-power mobile devices. Recent network quantization techniques quantize each weight kernel in a convolutional layer independently to improve inference accuracy, since the weight kernels in a layer exhibit different variances and hence different amounts of redundancy. The quantization bitwidth, or bit number (QBN), directly determines inference accuracy, latency, energy, and hardware overhead. To effectively reduce this redundancy and accelerate CNN inference, different weight kernels should be quantized with different QBNs. However, prior works use only one QBN for each convolutional layer or even for the entire CNN, because the design space of choosing a QBN for every weight kernel is too large. Hand-crafting heuristics for the kernel-wise QBN search is so involved that domain experts obtain only sub-optimal results, and even DDPG-based deep reinforcement learning (DRL) agents struggle to find a kernel-wise QBN configuration that achieves reasonable inference accuracy. In this paper, we propose AutoQ, a hierarchical-DRL-based kernel-wise network quantization technique that automatically searches a QBN for each weight kernel and a separate QBN for each activation layer. Compared to models quantized by state-of-the-art DRL-based schemes, the same models quantized by AutoQ reduce inference latency by 54.06% and inference energy consumption by 50.69% on average, while achieving the same inference accuracy.
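To make the kernel-wise idea concrete, the following PyTorch-style sketch (written for illustration here, not taken from the paper) quantizes each output-channel kernel of a convolutional weight tensor with its own QBN. The symmetric linear quantizer, the function name, and the example bitwidths are assumptions; AutoQ's actual contribution, the hierarchical-DRL agent that searches these per-kernel QBNs, is not shown.

```python
import torch

def quantize_kernelwise(weight, qbns):
    """Linearly quantize each output-channel kernel of a conv weight
    tensor (out_channels, in_channels, kH, kW) with its own bitwidth."""
    out = torch.empty_like(weight)
    for k, bits in enumerate(qbns):
        kernel = weight[k]
        # Symmetric linear quantization: map the kernel onto 2^(bits-1) - 1
        # signed levels, then de-quantize back to floating point.
        scale = kernel.abs().max() / (2 ** (bits - 1) - 1)
        if scale == 0:
            out[k] = kernel
            continue
        q = torch.round(kernel / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        out[k] = q * scale
    return out

# Example: a layer with four output kernels, each assigned a different QBN.
w = torch.randn(4, 3, 3, 3)
w_q = quantize_kernelwise(w, qbns=[2, 4, 8, 3])
```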
Award ID(s): 1908992, 1909509
PAR ID: 10167903
Journal Name: International Conference on Learning Representations
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. The ever-growing parameter size and computation cost of convolutional neural network (CNN) models hinder their deployment on resource-constrained platforms. Network pruning techniques remove redundancy in CNN parameters to produce a sparse model, and sparse-aware accelerators leverage that sparsity to reduce the computation cost and memory bandwidth requirements of inference. The irregularity of sparse patterns, however, limits the efficiency of those designs. Researchers have addressed this issue by creating regular sparsity patterns through hardware-aware pruning algorithms, but the pruning rate of these solutions is largely limited by the enforced sparsity patterns. This limitation motivates us to explore compression methods beyond pruning. With its two decoupled computation stages, kernel decomposition can potentially take the processing of the sparse pattern off the critical path of inference and achieve a high compression ratio without enforcing sparsity patterns. To exploit these advantages, we propose ESCALATE, an algorithm-hardware co-design approach based on kernel decomposition. At the algorithm level, ESCALATE reorganizes the two computation stages of the decomposed convolution to enable stream processing of the intermediate feature map, and applies a hybrid quantization that exploits the different reuse frequencies of each part of the decomposed weight. At the architecture level, ESCALATE proposes a novel 'Basis-First' dataflow and a corresponding microarchitecture design to maximize the benefits brought by the decomposed convolution.
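The two decoupled computation stages of a decomposed convolution can be illustrated with the hedged PyTorch sketch below: the input is first convolved with a small set of shared basis kernels, and the basis responses are then combined into output channels by per-channel coefficients. The shapes and the plain 1×1 combination are illustrative assumptions; ESCALATE's 'Basis-First' dataflow, hybrid quantization, and stream processing are not modeled here.

```python
import torch
import torch.nn.functional as F

def decomposed_conv(x, basis, coeff, padding=1):
    """Two-stage decomposed convolution (illustrative only):
      stage 1: convolve the input with a small set of shared basis kernels;
      stage 2: combine the basis responses into output channels with a
               1x1 convolution of per-channel coefficients.
    basis: (num_basis, in_channels, kH, kW); coeff: (out_channels, num_basis)."""
    basis_maps = F.conv2d(x, basis, padding=padding)       # stage 1
    return F.conv2d(basis_maps, coeff[:, :, None, None])   # stage 2

x = torch.randn(1, 16, 32, 32)
basis = torch.randn(8, 16, 3, 3)      # 8 shared basis kernels (assumed shapes)
coeff = torch.randn(64, 8)            # 64 output channels
y = decomposed_conv(x, basis, coeff)  # (1, 64, 32, 32)
```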
  2. Convolutional neural networks (CNNs) play an important role in today's mobile and edge computing systems for vision-based tasks such as object classification and detection. However, state-of-the-art CNN acceleration methods are trapped either in limited practical latency speed-up on general computing platforms or in latency speed-up with severe accuracy loss. In this paper, we propose NeuLens, a spatial-based dynamic CNN acceleration framework for mobile and edge platforms. Specifically, we design a novel dynamic inference mechanism, the assemble region-aware convolution (ARAC) supernet, which peels off as many redundant operations inside CNN models as possible based on spatial redundancy and channel slicing. In the ARAC supernet, the CNN inference flow is split into multiple independent micro-flows, and the computational cost of each can be adjusted autonomously based on its tiled-input content and application requirements. These micro-flows can be loaded onto hardware such as GPUs as single models, so the operation reduction translates well into latency speed-up and is compatible with hardware-level acceleration. Moreover, inference accuracy is well preserved by identifying critical regions in images and processing them at the original resolution with a large micro-flow. In our evaluation, NeuLens outperforms baseline methods by up to 58% latency reduction at the same accuracy and by up to 67.9% accuracy improvement under the same latency/memory constraints.
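As a rough, heavily hedged sketch of the micro-flow idea (not NeuLens's actual ARAC supernet or policy), the code below splits an input into spatial tiles and spends less computation on tiles that a cheap criticality score deems unimportant; the variance-based score, the tile size, and the half-resolution fallback are placeholders for the framework's learned, region-aware decisions.

```python
import torch
import torch.nn.functional as F

def tiled_microflows(image, model, tile=112, threshold=0.1):
    """Split the input into spatial tiles and run each tile as an independent
    'micro-flow'. Tiles judged non-critical (low variance here, a placeholder
    criterion) are downsampled before inference to save computation.
    `model` is assumed to accept variable input sizes (e.g., fully convolutional)."""
    _, _, height, width = image.shape
    outputs = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            patch = image[:, :, top:top + tile, left:left + tile]
            if patch.var() < threshold:  # non-critical region: cheaper micro-flow
                patch = F.interpolate(patch, scale_factor=0.5, mode="bilinear")
            outputs.append(model(patch))  # each tile processed independently
    return outputs
```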
  3. The ever-increasing number of layers, millions of parameters, and large data volumes make deep learning workloads resource-intensive and power-hungry. In this paper, we develop a convolutional neural network (CNN) acceleration framework, named MLCNN, which explores algorithm-hardware co-design to achieve cross-layer cooperative optimization and acceleration. MLCNN dramatically reduces computation and on-/off-chip communication, improving CNN performance. To achieve this, MLCNN reorders the positions of nonlinear activation layers and pooling layers, which we prove results in negligible accuracy loss; the convolutional and pooling layers are then co-optimized by means of redundant-multiplication elimination, local addition reuse, and global addition reuse. To the best of our knowledge, MLCNN is the first of its kind to incorporate cooperative optimization across convolutional, activation, and pooling layers. We further customize the MLCNN accelerator to take full advantage of this cross-layer CNN optimization and reduce both computation and on-/off-chip communication. Our analysis shows that MLCNN can significantly reduce (by up to 98%) multiplications and additions. We have implemented a prototype of MLCNN and evaluated its performance on several widely used CNN models using both an accelerator-level cycle/energy model and an RTL implementation. Experimental results show that MLCNN achieves 3.2× speedup and 2.9× energy efficiency compared with dense CNNs (DCNNs). MLCNN's optimization methods are orthogonal to other CNN acceleration techniques such as quantization and pruning; combined with quantization, our quantized MLCNN gains a 12.8× speedup and 11.3× energy efficiency over the DCNN baseline.
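One simple instance of the kind of reordering MLCNN builds on can be checked directly: because ReLU is monotone non-decreasing, it commutes with max-pooling, so the pooling layer can be moved in front of the activation and the activation is evaluated on a smaller feature map. The snippet below only verifies this specific max-pool/ReLU case; MLCNN's proof and its multiplication- and addition-reuse optimizations go beyond it.

```python
import torch
import torch.nn.functional as F

# ReLU is monotone non-decreasing, so it commutes with max-pooling:
# relu(maxpool(x)) == maxpool(relu(x)). Pooling first means the activation
# runs on a 4x smaller feature map (for 2x2 pooling).
x = torch.randn(1, 8, 16, 16)
pool_then_act = F.relu(F.max_pool2d(x, kernel_size=2))
act_then_pool = F.max_pool2d(F.relu(x), kernel_size=2)
assert torch.equal(pool_then_act, act_then_pool)
```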
  4. Although state-of-the-art (SOTA) CNNs achieve outstanding performance on various tasks, their high computation demand and massive number of parameters make it difficult to deploy them onto resource-constrained devices. Previous works on CNN acceleration use low-rank approximation of the original convolution layers to reduce computation cost. However, these methods are difficult to apply to sparse models, which limits execution speedup because the redundancies within the CNN model are not fully exploited. We argue that kernel-granularity decomposition can be conducted under a low-rank assumption while still exploiting the redundancy within the remaining compact coefficients. Based on this observation, we propose PENNI, a CNN model compression framework that achieves model compactness and hardware efficiency simultaneously by (1) implementing kernel sharing in convolution layers via a small number of basis kernels and (2) alternately adjusting the bases and coefficients with sparse constraints. Experiments show that we can prune 97% of the parameters and 92% of the FLOPs of ResNet18 on CIFAR10 with no accuracy loss, while achieving a 44% reduction in run-time memory consumption and a 53% reduction in inference latency.
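The kernel-sharing step can be sketched as a plain truncated SVD over a layer's flattened 2-D kernels, which yields a few shared basis kernels plus per-kernel coefficients. This illustrative decomposition omits PENNI's alternating retraining and the sparsity constraints on the coefficients, and the shapes and rank below are arbitrary choices.

```python
import torch

def basis_decompose(weight, num_basis):
    """Approximate every 2-D kernel in a conv weight (C_out, C_in, kH, kW)
    as a linear combination of `num_basis` shared basis kernels, via a
    truncated SVD of the flattened kernels. Returns (basis, coeff, approx)."""
    c_out, c_in, kh, kw = weight.shape
    flat = weight.reshape(c_out * c_in, kh * kw)            # one row per 2-D kernel
    U, S, Vh = torch.linalg.svd(flat, full_matrices=False)
    basis = Vh[:num_basis]                                  # (num_basis, kh*kw)
    coeff = U[:, :num_basis] * S[:num_basis]                # (c_out*c_in, num_basis)
    approx = (coeff @ basis).reshape(c_out, c_in, kh, kw)
    return basis.reshape(num_basis, kh, kw), coeff, approx

w = torch.randn(64, 32, 3, 3)
basis, coeff, w_hat = basis_decompose(w, num_basis=4)
rel_err = (w - w_hat).norm() / w.norm()   # reconstruction error of the shared bases
```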
  5. High-quality 3D image recognition is an important component of many vision and robotics systems. However, accurately processing these images requires compute-expensive 3D convolutional neural networks (CNNs). To address this challenge, we propose the use of spiking neural networks (SNNs) that are generated from iso-architecture CNNs and trained with quantization-aware gradient descent to optimize their weights, membrane leak, and firing thresholds. During both training and inference, the analog pixel values of a 3D image are applied directly to the input layer of the SNN without conversion to a spike train. This significantly reduces training and inference latency and yields a high degree of activation sparsity, which brings significant improvements in computational efficiency. However, it also introduces energy-hungry digital multiplications in the first layer of our models, which we propose to mitigate with a processing-in-memory (PIM) architecture. To evaluate our proposal, we develop a 3D and a 3D/2D hybrid SNN-compatible convolutional architecture and choose hyperspectral imaging (HSI) as the application for 3D image recognition. We achieve overall test accuracies of 98.68%, 99.50%, and 97.95% with 5 time steps (inference latency) and 6-bit weight quantization on the Indian Pines, Pavia University, and Salinas Scene datasets, respectively. In particular, our models implemented on standard digital hardware achieve accuracy similar to the state of the art (SOTA) with ~560.6× and ~44.8× less average energy than an iso-architecture full-precision CNN and a 6-bit quantized CNN, respectively. Adopting the PIM architecture in the first layer further improves the average energy, delay, and energy-delay product (EDP) by 30%, 7%, and 38%, respectively.
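Quantization-aware gradient descent is commonly implemented with a straight-through estimator: the forward pass sees fake-quantized weights while gradients flow to the full-precision copies. The sketch below shows only that generic trick for 6-bit weights; it is not this paper's exact training scheme and does not touch the membrane-leak or firing-threshold optimization.

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    """Fake-quantize a weight tensor to `bits` signed levels in the forward
    pass; pass gradients straight through to the full-precision weights in
    the backward pass (a generic QAT trick, assumed here for illustration)."""
    @staticmethod
    def forward(ctx, w, bits):
        scale = w.abs().max() / (2 ** (bits - 1) - 1)
        q = torch.round(w / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through estimator; no grad for `bits`

w = torch.randn(16, 3, 3, 3, requires_grad=True)
w_q = QuantizeSTE.apply(w, 6)   # 6-bit fake-quantized weights used in the forward pass
w_q.sum().backward()            # gradients reach the full-precision tensor w
```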