Network quantization is one of the most hardware-friendly techniques for deploying convolutional neural networks (CNNs) on low-power mobile devices. Recent network quantization techniques quantize each weight kernel in a convolutional layer independently for higher inference accuracy, since the weight kernels in a layer exhibit different variances and hence different amounts of redundancy. The quantization bitwidth or bit number (QBN) directly determines the inference accuracy, latency, energy and hardware overhead. To effectively reduce redundancy and accelerate CNN inference, different weight kernels should be quantized with different QBNs. However, prior works use only one QBN for each convolutional layer or for the entire CNN, because the design space of searching for a QBN per weight kernel is too large. Hand-crafted heuristics for kernel-wise QBN search are so intricate that even domain experts obtain only sub-optimal results, and it is difficult even for DDPG-based deep reinforcement learning (DRL) agents to find a kernel-wise QBN configuration that achieves reasonable inference accuracy. In this paper, we propose AutoQ, a hierarchical-DRL-based kernel-wise network quantization technique that automatically searches for a QBN for each weight kernel and chooses another QBN for each activation layer. Compared to models quantized by state-of-the-art DRL-based schemes, on average, the same models quantized by AutoQ reduce inference latency by 54.06% and decrease inference energy consumption by 50.69%, while achieving the same inference accuracy.
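For illustration, the sketch below shows what kernel-wise quantization means in practice: each output-channel kernel of a convolutional layer is quantized with its own QBN using a plain uniform symmetric quantizer. This is only a minimal sketch; the per-kernel bitwidths are hypothetical placeholders, and the hierarchical-DRL search that AutoQ would use to choose them is not reproduced here.

```python
# Minimal sketch of kernel-wise quantization (not the AutoQ search itself):
# each output-channel kernel gets its own quantization bit number (QBN).
import numpy as np

def quantize_kernel(w, bits):
    """Uniform symmetric quantization of one weight kernel to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit signed values
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        scale = 1.0
    return scale * np.clip(np.round(w / scale), -qmax, qmax)

def quantize_layer_kernelwise(weight, qbns):
    """weight: (out_channels, in_channels, kH, kW); qbns: one QBN per kernel."""
    assert weight.shape[0] == len(qbns)
    return np.stack([quantize_kernel(weight[k], qbns[k])
                     for k in range(weight.shape[0])])

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16, 3, 3))              # toy convolutional layer
qbns = [2, 3, 4, 4, 5, 6, 8, 8]                 # hypothetical per-kernel bitwidths
w_q = quantize_layer_kernelwise(w, qbns)
print("mean absolute quantization error:", np.abs(w - w_q).mean())
```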
Sigma-Delta and distributed noise-shaping quantization methods for random Fourier features
Abstract: We propose the use of low bit-depth Sigma-Delta and distributed noise-shaping methods for quantizing the random Fourier features (RFFs) associated with shift-invariant kernels. We prove that our quantized RFFs, even in the case of 1-bit quantization, allow a high-accuracy approximation of the underlying kernels, and the approximation error decays at least polynomially fast as the dimension of the RFFs increases. We also show that the quantized RFFs can be further compressed, yielding an excellent trade-off between memory use and accuracy. Namely, the approximation error now decays exponentially as a function of the bits used. The quantization algorithms we propose are intended for digitizing RFFs without explicit knowledge of the application for which they will be used. Nevertheless, as we empirically show by testing the performance of our methods on several machine learning tasks, our method compares favourably with other state-of-the-art quantization methods.
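As a rough illustration of the ingredients named above, the following minimal sketch generates Gaussian-kernel RFFs and applies a first-order Sigma-Delta quantizer to the feature vectors, producing one bit per feature. It assumes a Gaussian kernel and a first-order scheme only; the distributed noise-shaping variants and the kernel estimators the paper builds from the quantized bits (via a condensation/normalization step) are not reproduced.

```python
# A minimal sketch, assuming a Gaussian kernel and a first-order Sigma-Delta
# quantizer; the paper's distributed noise-shaping variants and its kernel
# estimators built from the quantized bits are omitted here.
import numpy as np

rng = np.random.default_rng(0)

def rff(X, W, b):
    """Random Fourier features z(x) = sqrt(2/m) * cos(W x + b)."""
    m = W.shape[0]
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

def sigma_delta_1bit(Z):
    """First-order Sigma-Delta quantization of each row to {-1, +1}."""
    Q = np.empty_like(Z)
    for i, row in enumerate(Z):
        u = 0.0                                  # internal state (accumulated error)
        for j, val in enumerate(row):
            q = 1.0 if val + u >= 0 else -1.0
            u += val - q                         # carry the quantization error forward
            Q[i, j] = q
    return Q

d, m, sigma = 5, 2048, 1.0
W = rng.normal(scale=1.0 / sigma, size=(m, d))   # frequencies for a Gaussian kernel
b = rng.uniform(0, 2 * np.pi, size=m)
x, y = rng.normal(size=d), rng.normal(size=d)

k_true = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
Z = rff(np.vstack([x, y]), W, b)
print("unquantized RFF estimate:", Z[0] @ Z[1], " true kernel:", k_true)

# Quantize the raw cosine features (values in [-1, 1]) to one bit each.
C = np.cos(np.vstack([x, y]) @ W.T + b)
Q = sigma_delta_1bit(C)
# Noise-shaping property: partial sums of the bits track partial sums of the
# input with uniformly bounded error, which is what allows accurate kernel
# reconstruction after the (omitted) condensation step.
print("max partial-sum error:", np.max(np.abs(np.cumsum(C[0]) - np.cumsum(Q[0]))))
```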
- PAR ID: 10483390
- Publisher / Repository: Oxford University Press
- Date Published:
- Journal Name: Information and Inference: A Journal of the IMA
- Volume: 13
- Issue: 1
- ISSN: 2049-8772
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Matrices are exceptionally useful in various fields of study as they provide a convenient framework to organize and manipulate data in a structured manner. However, modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Although prohibitively large, such matrices are often approximately low rank. We propose an algorithm that exploits this structure to obtain a low rank decomposition of any matrix A as A ≈ LR, where L and R are the low rank factors. The total number of elements in L and R can be significantly less than that in A. Furthermore, the entries of L and R are quantized to low precision formats, compressing A by giving us a low rank and low precision factorization. Our algorithm first computes an approximate basis of the range space of A by randomly sketching its columns, followed by a quantization of the vectors constituting this basis. It then computes approximate projections of the columns of A onto this quantized basis. We derive upper bounds on the approximation error of our algorithm, and analyze the impact of target rank and quantization bit-budget. The tradeoff between compression ratio and approximation accuracy allows for flexibility in choosing these parameters based on specific application requirements. We empirically demonstrate the efficacy of our algorithm in image compression, nearest neighbor classification of image and text embeddings, and compressing the layers of LlaMa-7b. Our results illustrate that we can achieve compression ratios as aggressive as one bit per matrix coordinate, all while surpassing or maintaining the performance of traditional compression techniques.
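A minimal sketch of the pipeline described in this abstract (randomized range finding, then quantization of the basis and of the projection coefficients) follows. The bit-widths, oversampling amount, and the plain uniform quantizer are illustrative assumptions, not the authors' exact algorithm or its analyzed error bounds.

```python
# Sketch of a low-rank, low-precision factorization A ~= L @ R:
# randomized range finder, quantize the basis, project, quantize coefficients.
import numpy as np

def uniform_quantize(M, bits):
    """Uniform scalar quantization of a matrix to 2**bits levels."""
    lo, hi = M.min(), M.max()
    scale = (hi - lo) / (2 ** bits - 1) if hi > lo else 1.0
    return lo + scale * np.round((M - lo) / scale)

def lowrank_lowprec(A, rank, bits_L=4, bits_R=4, oversample=5, seed=None):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # 1) Randomized sketch of the column space: Y = A @ Omega.
    Y = A @ rng.normal(size=(n, rank + oversample))
    Q, _ = np.linalg.qr(Y)                       # approximate range basis
    L = uniform_quantize(Q[:, :rank], bits_L)    # 2) quantize the basis vectors
    # 3) Project the columns of A onto the quantized basis (least squares),
    #    then quantize the coefficients as well.
    R, *_ = np.linalg.lstsq(L, A, rcond=None)
    R = uniform_quantize(R, bits_R)
    return L, R

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 40)) @ rng.normal(size=(40, 300))   # approximately low-rank matrix
L, R = lowrank_lowprec(A, rank=40)
print("relative error:", np.linalg.norm(A - L @ R) / np.linalg.norm(A))
```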
-
We consider the post-training quantization problem, which discretizes the weights of pre-trained deep neural networks without re-training the model. We propose multipoint quantization, a quantization method that approximates a full-precision weight vector using a linear combination of multiple vectors of low-bit numbers; this is in contrast to typical quantization methods that approximate each weight using a single low precision number. Computationally, we construct the multipoint quantization with an efficient greedy selection procedure, and adaptively decide the number of low precision points for each quantized weight vector based on the error of its output. This allows us to achieve higher precision levels for important weights that greatly influence the outputs, yielding an 'effect of mixed precision' without physical mixed precision implementations (which require specialized hardware accelerators). Empirically, our method can be implemented using common operands, bringing almost no memory or computation overhead. We show that our method outperforms a range of state-of-the-art methods on ImageNet classification, and it can be generalized to more challenging tasks such as PASCAL VOC object detection.
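To make the idea concrete, here is a minimal sketch of multipoint quantization as a greedy residual-fitting loop with a fixed low-bit quantizer. The paper's actual selection procedure and its output-error criterion for choosing the number of points per weight vector are not reproduced; the bitwidth and point counts below are illustrative assumptions.

```python
# Sketch: approximate a weight vector by a linear combination of a few
# low-bit vectors, chosen greedily by quantizing the running residual.
import numpy as np

def quantize_lowbit(v, bits=2):
    """Symmetric uniform quantizer applied element-wise."""
    qmax = 2 ** (bits - 1) - 1
    s = np.max(np.abs(v)) / qmax if np.any(v) else 1.0
    return s * np.clip(np.round(v / s), -qmax, qmax)

def multipoint_quantize(w, num_points=3, bits=2):
    """Approximate w by a linear combination of `num_points` low-bit vectors."""
    residual = w.copy()
    points, coeffs = [], []
    for _ in range(num_points):
        q = quantize_lowbit(residual, bits)      # greedy: quantize the residual
        if not np.any(q):
            break
        a = (residual @ q) / (q @ q)             # least-squares coefficient
        points.append(q)
        coeffs.append(a)
        residual = residual - a * q
    return coeffs, points

rng = np.random.default_rng(0)
w = rng.normal(size=256)
for k in (1, 2, 3):
    coeffs, points = multipoint_quantize(w, num_points=k)
    w_hat = sum(a * q for a, q in zip(coeffs, points))
    print(k, "points -> relative error", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```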
-
Coarsely quantized MIMO signalling methods have gained popularity in recent developments of massive MIMO, as they open up opportunities for massive MIMO implementation using cheap and power-efficient radio-frequency front-ends. This paper presents a new one-bit MIMO precoding approach using spatial Sigma-Delta (ΣΔ) modulation. Previous one-bit MIMO precoding research has mainly focused on using optimization to tackle the difficult binary signal optimization problem that arises from the precoding design. Our approach takes a different route. Assuming angular MIMO channels, we apply ΣΔ modulation—a classical concept in analog-to-digital conversion of temporal signals—in space. The resulting ΣΔ precoding approach has two main advantages. First, we no longer need to deal with binary optimization in the ΣΔ precoding design; in particular, the binary signal restriction is replaced by convex signal amplitude constraints. Second, the impact of the quantization error can be well controlled via modulator design under appropriate operating conditions. Through symbol error probability analysis, we reveal that the very large number of antennas in massive MIMO provides favorable operating conditions for ΣΔ precoding. In addition, we develop a new ΣΔ modulation architecture that is capable of adapting to the channel to achieve nearly zero quantization error for a targeted user. Furthermore, we consider multi-user ΣΔ precoding using zero-forcing and symbol-level precoding schemes. As simulation results show, these two ΣΔ precoding schemes perform considerably better than their directly one-bit quantized counterparts.
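The sketch below illustrates only the core spatial ΣΔ idea for a single broadside user served by a uniform linear array with a beam-steered transmit signal. The zero-forcing and symbol-level precoding designs, the angle-adapting architecture, and the multi-user case are not reproduced, and the array size, symbol, and amplitude are illustrative assumptions.

```python
# Sketch: first-order Sigma-Delta quantization applied along the antenna
# (spatial) dimension of a uniform linear array; the paper's precoding
# designs are omitted and the setup below is an illustrative assumption.
import numpy as np

def spatial_sigma_delta_1bit(x):
    """One bit per real dimension per antenna, with the quantization error
    carried from one antenna to the next (spatial noise shaping)."""
    q = np.empty_like(x)
    u = 0.0 + 0.0j
    for n, xn in enumerate(x):
        v = xn + u
        qr = 1.0 if v.real >= 0 else -1.0
        qi = 1.0 if v.imag >= 0 else -1.0
        q[n] = (qr + 1j * qi) / np.sqrt(2)       # unit-power one-bit output
        u = v - q[n]                             # pass the error to the next antenna
    return q

N = 256                                          # antennas (massive MIMO regime)
theta = 0.0                                      # target user at broadside
a = np.exp(1j * np.pi * np.arange(N) * np.sin(theta))   # ULA steering vector

s = (1 + 1j) / np.sqrt(2)                        # unit-energy QPSK symbol
x = 0.25 * s * a                                 # beam-steered transmit signal, |x_n| = 0.25

y_ideal = a.conj() @ x                           # noiseless received signal, unquantized
y_sd = a.conj() @ spatial_sigma_delta_1bit(x)    # one-bit Sigma-Delta transmit signal
# Small relative error: the shaped quantization noise carries little energy
# at the user's (broadside) spatial frequency.
print("relative error at the user:", abs(y_ideal - y_sd) / abs(y_ideal))
```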
-
Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piece-wise constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. A coarse gradient is generally not the gradient of any function but an artificial ascent direction. The weight update of BCGD applies a coarse gradient correction to a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-widths such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we show convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data and prove that the expected coarse gradient correlates positively with the underlying true gradient.
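A minimal sketch of the blended update with a straight-through coarse gradient is given below, on a toy least-squares problem with binary (sign-times-scale) weights. The loss, quantizer, and hyperparameters are illustrative assumptions, not the paper's ResNet-18/ImageNet setting or its convergence analysis.

```python
# Sketch of a blended coarse gradient descent step on a toy regression task
# with 1-bit weights; all choices below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
w_true = rng.normal(size=16)
y = X @ w_true

def binarize(w):
    """1-bit weight quantization: a single scale times the sign pattern."""
    return np.mean(np.abs(w)) * np.sign(w)

def coarse_grad(w):
    """Gradient of the loss evaluated at the quantized weights, passed
    straight through to the float weights (a coarse, not true, gradient)."""
    q = binarize(w)
    return X.T @ (X @ q - y) / len(y)

w = rng.normal(size=16)
rho, lr = 0.1, 0.05
for t in range(300):
    # Blended update: w <- (1 - rho) * w + rho * Q(w) - lr * coarse_grad(w),
    # i.e. a coarse gradient correction of a weighted average of the
    # full-precision weights and their quantization.
    w = (1 - rho) * w + rho * binarize(w) - lr * coarse_grad(w)

q = binarize(w)
print("loss with binary weights:", 0.5 * np.mean((X @ q - y) ** 2))
```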