Matrices are exceptionally useful in various fields of study as they provide a convenient framework to organize and manipulate data in a structured manner. However, modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Although prohibitively large, such matrices are often approximately low rank. We propose an algorithm that exploits this structure to obtain a low-rank decomposition of any matrix A as A ≈ LR, where L and R are the low-rank factors. The total number of elements in L and R can be significantly less than that in A. Furthermore, the entries of L and R are quantized to low-precision formats, compressing A by giving us a low-rank and low-precision factorization. Our algorithm first computes an approximate basis of the range space of A by randomly sketching its columns, followed by a quantization of the vectors constituting this basis. It then computes approximate projections of the columns of A onto this quantized basis. We derive upper bounds on the approximation error of our algorithm, and analyze the impact of the target rank and quantization bit budget. The tradeoff between compression ratio and approximation accuracy allows for flexibility in choosing these parameters based on specific application requirements. We empirically demonstrate the efficacy of our algorithm in image compression, nearest-neighbor classification of image and text embeddings, and compressing the layers of LLaMA-7b. Our results illustrate that we can achieve compression ratios as aggressive as one bit per matrix coordinate, all while surpassing or maintaining the performance of traditional compression techniques.
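The two-stage procedure described above can be illustrated with a short NumPy sketch: a randomized sketch of the column space yields an approximate range basis, that basis is quantized, and the columns of A are then projected onto the quantized basis. The uniform quantizer, rank, and bit-width below are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def uniform_quantize(X, bits=4):
    # Illustrative uniform quantizer: snap entries of X to 2**bits evenly spaced levels.
    lo, hi = X.min(), X.max()
    scale = (hi - lo) / (2 ** bits - 1) if hi > lo else 1.0
    return np.round((X - lo) / scale) * scale + lo

def low_rank_low_precision(A, rank=64, bits=4, seed=0):
    rng = np.random.default_rng(seed)
    # Stage 1: randomized sketch of the column space, then quantize the basis.
    S = rng.standard_normal((A.shape[1], rank))      # Gaussian sketching matrix
    Q, _ = np.linalg.qr(A @ S)                       # approximate range basis
    L = uniform_quantize(Q, bits)                    # low-precision left factor
    # Stage 2: project the columns of A onto the quantized basis, then quantize.
    R, *_ = np.linalg.lstsq(L, A, rcond=None)        # least-squares projection
    return L, uniform_quantize(R, bits)              # low-precision right factor

A = np.random.default_rng(1).standard_normal((256, 512))
L, R = low_rank_low_precision(A)
rel_err = np.linalg.norm(A - L @ R) / np.linalg.norm(A)
```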
QuanTaichi: a compiler for quantized simulations
High-resolution simulations can deliver great visual quality, but they are often limited by available memory, especially on GPUs. We present a compiler for physical simulation that can achieve both high performance and significantly reduced memory costs, by enabling flexible and aggressive quantization. Low-precision (quantized) numerical data types are used and packed to represent simulation states, leading to reduced memory space and bandwidth consumption. Quantized simulation allows higher resolution simulation with less memory, which is especially attractive on GPUs. Implementing a quantized simulator that has high performance and packs the data tightly for aggressive storage reduction would be extremely labor-intensive and error-prone using a traditional programming language. To make the creation of quantized simulation practical, we have developed a new set of language abstractions and a compilation system. A suite of tailored domain-specific optimizations ensures quantized simulators often run as fast as the full-precision simulators, despite the overhead of encoding-decoding the packed quantized data types. Our programming language and compiler, based on Taichi, allow developers to effortlessly switch between different full-precision and quantized simulators, to explore the full design space of quantization schemes, and ultimately to achieve a good balance between space and precision. The creation of quantized simulation with our system has large benefits in terms of memory consumption and performance, on a variety of hardware, from mobile devices to workstations with high-end GPUs. We can simulate with levels of resolution that were previously only achievable on systems with much more memory, such as multiple GPUs. For example, on a single GPU, we can simulate a Game of Life with 20 billion cells (8× compression per pixel), an Eulerian fluid system with 421 million active voxels (1.6× compression per voxel), and a hybrid Eulerian-Lagrangian elastic object simulation with 235 million particles (1.7× compression per particle). At the same time, quantized simulations create physically plausible results. Our quantization techniques are complementary to existing acceleration approaches of physical simulation: they can be used in combination with these existing approaches, such as sparse data structures, for even higher scalability and performance.
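The core storage idea, packing low-precision encodings of simulation state into machine words, can be sketched in plain NumPy. This is only an illustration of quantized, bit-packed state, not the Taichi language abstractions; the field names, value ranges, and 16-bit fixed-point format are assumptions.

```python
import numpy as np

VEL_RANGE = 8.0   # assumed bound on |velocity|
RHO_RANGE = 4.0   # assumed bound on density

def pack(vel, rho):
    # Encode two fields as 16-bit fixed point and pack them into one uint32
    # per cell, halving memory versus two float32 fields.
    v = np.clip(np.round((vel / VEL_RANGE + 1.0) * 0.5 * 65535), 0, 65535).astype(np.uint32)
    r = np.clip(np.round((rho / RHO_RANGE) * 65535), 0, 65535).astype(np.uint32)
    return (v << 16) | r

def unpack(word):
    v = ((word >> 16) & 0xFFFF).astype(np.float64) / 65535 * 2.0 - 1.0
    r = (word & 0xFFFF).astype(np.float64) / 65535
    return v * VEL_RANGE, r * RHO_RANGE

vel = np.random.uniform(-8.0, 8.0, size=(512, 512))
rho = np.random.uniform(0.0, 4.0, size=(512, 512))
packed = pack(vel, rho)            # one 32-bit word per cell
vel2, rho2 = unpack(packed)        # recovered with bounded quantization error
```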
- Award ID(s):
- 2019786
- PAR ID:
- 10603380
- Publisher / Repository:
- Association for Computing Machinery (ACM)
- Date Published:
- Journal Name:
- ACM Transactions on Graphics
- Volume:
- 40
- Issue:
- 4
- ISSN:
- 0730-0301
- Format(s):
- Medium: X
- Size(s):
- p. 1-16
- Sponsoring Org:
- National Science Foundation
More Like this
-
Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise-constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of the coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. The coarse gradient is generally not a gradient of any function but an artificial ascent direction. The BCGD weight update applies a coarse gradient correction to a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-widths such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we show a convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data and prove that the expected coarse gradient correlates positively with the underlying true gradient. (An illustrative sketch of the blended update follows this list.)
-
Hamiltonian simulation is a central application of quantum computing, with significant potential in modeling physical systems and solving complex optimization problems. Existing compilers for such simulations typically focus on low-level representations based on Pauli operators, limiting programmability and offering no formal guarantees of correctness across the compilation pipeline. We introduce QBlue, a high-level, formally verified framework for compiling Hamiltonian simulations. QBlue is based on the formalism of second quantization, which provides a natural and expressive way to describe quantum particle systems using creation and annihilation operators. To ensure safety and correctness, QBlue includes a type system that tracks particle types and enforces Hermitian structure. The framework supports compilation to both digital and analog quantum circuits and captures multiple layers of semantics, from static constraints to dynamic evolution. All components of QBlue, including its language design, type system, and compilation correctness, are fully mechanized in the Rocq proof framework, making it the first end-to-end verified compiler for second-quantized Hamiltonian simulation. (A small numerical illustration of second-quantized operators follows this list.)
-
We present LBW-Net, an efficient optimization-based method for quantization and training of low bit-width convolutional neural networks (CNNs). Specifically, we quantize the weights to zero or powers of 2 by minimizing the Euclidean distance between full-precision weights and quantized weights during back-propagation (weight learning). We characterize the combinatorial nature of the low bit-width quantization problem. For 2-bit (ternary) CNNs, the quantization of N weights can be done by an exact formula in O(N log N) complexity. When the bit-width is 3 and above, we further propose a semi-analytical thresholding scheme with a single free parameter for quantization that is computationally inexpensive. The free parameter is further determined by network retraining and object detection tests. LBW-Net has several desirable advantages over full-precision CNNs, including considerable memory savings, energy efficiency, and faster deployment. Our experiments on the PASCAL VOC dataset show that compared with its 32-bit floating-point counterpart, the performance of the 6-bit LBW-Net is nearly lossless in object detection tasks, and can even do better in real-world visual scenes, while empirically enjoying more than 4× faster deployment. (An illustrative power-of-two quantizer follows this list.)
-
We consider the post-training quantization problem, which discretizes the weights of pre-trained deep neural networks without re-training the model. We propose multipoint quantization, a quantization method that approximates a full-precision weight vector using a linear combination of multiple vectors of low-bit numbers; this is in contrast to typical quantization methods that approximate each weight using a single low-precision number. Computationally, we construct the multipoint quantization with an efficient greedy selection procedure, and adaptively decide the number of low-precision points on each quantized weight vector based on the error of its output. This allows us to achieve higher precision levels for important weights that greatly influence the outputs, yielding an 'effect of mixed precision' but without physical mixed-precision implementations (which require specialized hardware accelerators). Empirically, our method can be implemented by common operands, bringing almost no memory and computation overhead. We show that our method outperforms a range of state-of-the-art methods on ImageNet classification, and it can be generalized to more challenging tasks like PASCAL VOC object detection. (A greedy sketch of multipoint quantization follows this list.)
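For the blended coarse gradient descent (BCGD) abstract above, here is a minimal NumPy sketch of the blended update: the full-precision weights are averaged with their quantization and then corrected by the coarse gradient. The sign-based binarizer, blending weight rho, and learning rate are illustrative assumptions; in practice the coarse gradient comes from backpropagating through a surrogate of the piecewise-constant activation.

```python
import numpy as np

def binarize(w):
    # Illustrative binary quantizer with a per-tensor scale.
    return np.mean(np.abs(w)) * np.sign(w)

def bcgd_step(w, coarse_grad, lr=0.01, rho=1e-4):
    # Blending: pull the full-precision weights toward their quantization,
    # then apply the coarse-gradient correction.
    blended = (1.0 - rho) * w + rho * binarize(w)
    return blended - lr * coarse_grad

w = np.random.randn(1024) * 0.05
g = np.random.randn(1024) * 0.01      # stand-in for a coarse gradient
w = bcgd_step(w, g)
w_deploy = binarize(w)                # quantized weights used at inference
```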
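For the QBlue abstract above, the second-quantized objects can be made concrete with a tiny numerical example (illustrative only, not QBlue syntax): creation and annihilation operators for one truncated bosonic mode, assembled into a simple Hamiltonian whose Hermitian structure is exactly the property a type system of this kind is meant to enforce.

```python
import numpy as np

d = 4                                          # truncation of the bosonic mode
a = np.diag(np.sqrt(np.arange(1, d)), k=1)     # annihilation operator
a_dag = a.conj().T                             # creation operator
omega = 1.0                                    # assumed mode frequency
H = omega * a_dag @ a                          # number-operator Hamiltonian
assert np.allclose(H, H.conj().T)              # Hermitian, so evolution is unitary
```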
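For the LBW-Net abstract above, an illustrative NumPy version of the projection step snaps each weight to the nearest element of a zero-or-power-of-two codebook; the per-layer scale and codebook size are assumptions, not the paper's exact thresholding scheme.

```python
import numpy as np

def quantize_pow2(w, bits=3):
    # Codebook alpha * ({0} U {±1, ±1/2, ±1/4, ...}); nearest-codeword lookup
    # minimizes the elementwise Euclidean distance to the quantized weights.
    alpha = float(np.max(np.abs(w))) or 1.0
    mags = alpha * 2.0 ** -np.arange(2 ** (bits - 1) - 1)
    codebook = np.concatenate(([0.0], mags, -mags))
    idx = np.argmin(np.abs(w[..., None] - codebook), axis=-1)
    return codebook[idx]

w = np.random.randn(64, 3, 3, 3) * 0.1        # stand-in convolutional weights
w_q = quantize_pow2(w, bits=3)                # values in {0, ±alpha, ±alpha/2, ±alpha/4}
```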
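For the multipoint quantization abstract above, a greedy NumPy sketch of the idea: repeatedly quantize the current residual to low-bit values and add a least-squares-scaled copy, so the weight vector is approximated by a small linear combination of low-bit vectors. The symmetric quantizer, number of points, and selection rule are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def quantize_lowbit(v, bits=2):
    # Symmetric low-bit quantizer, e.g. scaled ternary {-1, 0, +1} for 2 bits.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(v)) / qmax if np.any(v) else 1.0
    return np.clip(np.round(v / scale), -qmax, qmax) * scale

def multipoint_quantize(w, n_points=4, bits=2):
    residual = w.astype(float).copy()
    approx = np.zeros_like(residual)
    terms = []
    for _ in range(n_points):
        q = quantize_lowbit(residual, bits)
        if not np.any(q):
            break
        c = float(q @ residual) / float(q @ q)   # least-squares coefficient
        terms.append((c, q))
        approx += c * q
        residual = w - approx
    return terms, approx

w = np.random.randn(4096)
terms, w_hat = multipoint_quantize(w)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
```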