skip to main content

This content will become publicly available on October 17, 2024

Title: Matrix Compression via Randomized Low Rank and Low Precision Factorization
Matrices are exceptionally useful in various fields of study as they provide a convenient framework to organize and manipulate data in a structured manner. However, modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Although prohibitively large, such matrices are often approximately low rank. We propose an algorithm that exploits this structure to obtain a low rank decomposition of any matrix A as A≈LR, where L and R are the low rank factors. The total number of elements in L and R can be significantly less than that in A. Furthermore, the entries of L and R are quantized to low precision formats −− compressing A by giving us a low rank and low precision factorization. Our algorithm first computes an approximate basis of the range space of A by randomly sketching its columns, followed by a quantization of the vectors constituting this basis. It then computes approximate projections of the columns of A onto this quantized basis. We derive upper bounds on the approximation error of our algorithm, and analyze the impact of target rank and quantization bit-budget. The tradeoff between compression ratio and approximation accuracy allows for flexibility in choosing these parameters based on specific application requirements. We empirically demonstrate the efficacy of our algorithm in image compression, nearest neighbor classification of image and text embeddings, and compressing the layers of LlaMa-7b. Our results illustrate that we can achieve compression ratios as aggressive as one bit per matrix coordinate, all while surpassing or maintaining the performance of traditional compression techniques.  more » « less
Award ID(s):
Author(s) / Creator(s):
Publisher / Repository:
Date Published:
Journal Name:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The distance matrix of a dataset X of n points with respect to a distance function f represents all pairwise distances between points in X induced by f. Due to their wide applicability, distance matrices and related families of matrices have been the focus of many recent algorithmic works. We continue this line of research and take a broad view of algorithm design for distance matrices with the goal of designing fast algorithms, which are specifically tailored for distance matrices, for fundamental linear algebraic primitives. Our results include efficient algorithms for computing matrix-vector products for a wide class of distance matrices, such as the l1 metric for which we get a linear runtime, as well as a quadratic lower bound for any algorithm which computes a matrix-vector product for the l_infty case. Our upper bound results have many further downstream applications, including the fastest algorithm for computing a relative error low-rank approximation for the distance matrix induced by l1 and l2 functions and the fastest algorithm for computing an additive error lowrank approximation for the l2 metric, in addition to applications for fast matrix multiplication among others. We also give algorithms for constructing distance matrices and show that one can construct an approximate l2 distance matrix in time faster than the bound implied by the Johnson-Lindenstrauss lemma. 
    more » « less
  2. null (Ed.)
    Recently decentralized optimization attracts much attention in machine learning because it is more communication-efficient than the centralized fashion. Quantization is a promising method to reduce the communication cost via cutting down the budget of each single communication using the gradient compression. To further improve the communication efficiency, more recently, some quantized decentralized algorithms have been studied. However, the quantized decentralized algorithm for nonconvex constrained machine learning problems is still limited. Frank-Wolfe (a.k.a., conditional gradient or projection-free) method is very efficient to solve many constrained optimization tasks, such as low-rank or sparsity-constrained models training. In this paper, to fill the gap of decentralized quantized constrained optimization, we propose a novel communication-efficient Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm for non-convex constrained learning models. We first design a new counterexample to show that the vanilla decentralized quantized stochastic Frank-Wolfe algorithm usually diverges. Thus, we propose DQSFW algorithm with the gradient tracking technique to guarantee the method will converge to the stationary point of non-convex optimization safely. In our theoretical analysis, we prove that to achieve the stationary point our DQSFW algorithm achieves the same gradient complexity as the standard stochastic Frank-Wolfe and centralized Frank-Wolfe algorithms, but has much less communication cost. Experiments on matrix completion and model compression applications demonstrate the efficiency of our new algorithm. 
    more » « less
  3. Tucker decomposition is a low-rank tensor approximation that generalizes a truncated matrix singular value decomposition (SVD). Existing parallel software has shown that Tucker decomposition is particularly effective at compressing terabyte-sized multidimensional scientific simulation datasets, computing reduced representations that satisfy a specified approximation error. The general approach is to get a low-rank approximation of the input data by performing a sequence of matrix SVDs of tensor unfoldings, which tend to be short-fat matrices. In the existing approach, the SVD is performed by computing the eigendecomposition of the Gram matrix of the unfolding. This method sacrifices some numerical stability in exchange for lower computation costs and easier parallelization. We propose using a more numerically stable though more computationally expensive way to compute the SVD by pre- processing with a QR decomposition step and computing an SVD of only the small triangular factor. The more numerically stable approach allows us to achieve the same accuracy with half the working precision (for example, single rather than double precision). We demonstrate that our method scales as well as the existing approach, and the use of lower precision leads to an overall reduction in running time of up to a factor of 2 when using 10s to 1000s of processors. Using the same working precision, we are also able to compute Tucker decompositions with much smaller approximation error. 
    more » « less
  4. Deep convolutional neural network (DNN) has demonstrated phenomenal success and been widely used in many computer vision tasks. However, its enormous model size and high computing complexity prohibits its wide deployment into resource limited embedded system, such as FPGA and mGPU. As the two most widely adopted model compression techniques, weight pruning and quantization compress DNN model through introducing weight sparsity (i.e., forcing partial weights as zeros) and quantizing weights into limited bit-width values, respectively. Although there are works attempting to combine the weight pruning and quantization, we still observe disharmony between weight pruning and quantization, especially when more aggressive compression schemes (e.g., Structured pruning and low bit-width quantization) are used. In this work, taking FPGA as the test computing platform and Processing Elements (PE) as the basic parallel computing unit, we first propose a PE-wise structured pruning scheme, which introduces weight sparsification with considering of the architecture of PE. In addition, we integrate it with an optimized weight ternarization approach which quantizes weights into ternary values ({-1,0,+1}), thus converting the dominant convolution operations in DNN from multiplication-and-accumulation (MAC) to addition-only, as well as compressing the original model (from 32-bit floating point to 2-bit ternary representation) by at least 16 times. Then, we investigate and solve the coexistence issue between PE-wise Structured pruning and ternarization, through proposing a Weight Penalty Clipping (WPC) technique with self-adapting threshold. Our experiment shows that the fusion of our proposed techniques can achieve the best state-of-the-art ∼21× PE-wise structured compression rate with merely 1.74%/0.94% (top-1/top-5) accuracy degradation of ResNet-18 on ImageNet dataset. 
    more » « less
  5. Low-rank approximation is a classic tool in data analysis, where the goal is to approximate a matrix A with a low-rank matrix L so as to minimize the error ||A-L||_F. However in many applications, approximating some entries is more important than others, which leads to the weighted low rank approximation problem. However, the addition of weights makes the low-rank approximation problem intractable. Thus many works have obtained efficient algorithms under additional structural assumptions on the weight matrix (such as low rank, and appropriate block structure). We study a natural greedy algorithm for weighted low rank approximation and develop a simple condition under which it yields bi-criteria approximation up to a small additive factor in the error. The algorithm involves iteratively computing the top singular vector of an appropriately varying matrix, and is thus easy to implement at scale. Our methods also allow us to study the problem of low rank approximation under L_p norm error. 
    more » « less