The distance matrix of a dataset X of n points with respect to a distance function
f represents all pairwise distances between points in X induced by f. Due to their
wide applicability, distance matrices and related families of matrices have been
the focus of many recent algorithmic works. We continue this line of research
and take a broad view of algorithm design for distance matrices, with the goal of
designing fast algorithms, specifically tailored to distance matrices, for
fundamental linear-algebraic primitives. Our results include efficient algorithms
for computing matrix-vector products for a wide class of distance matrices, such
as the ℓ1 metric, for which we obtain a linear runtime, as well as a quadratic lower
bound for any algorithm that computes a matrix-vector product in the ℓ∞ case.
Our upper-bound results have many
further downstream applications, including the fastest algorithm for computing
a relative-error low-rank approximation of the distance matrices induced by the ℓ1
and ℓ2 functions, and the fastest algorithm for computing an additive-error low-rank
approximation for the ℓ2 metric, in addition to applications to fast matrix
multiplication, among others. We also give algorithms for constructing distance
matrices and show that one can construct an approximate ℓ2 distance matrix in
time faster than the bound implied by the Johnson-Lindenstrauss lemma.
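The linear-time ℓ1 matrix-vector product rests on a standard observation: the ℓ1 distance decomposes per coordinate, and in one dimension the sum over j of |x_i − x_j| z_j can be evaluated for every i with one sort and two prefix sums. A minimal numpy sketch of this idea (an illustration of the technique, not the paper's exact algorithm):

```python
import numpy as np

def l1_distmat_matvec(X, z):
    """Compute y = D @ z where D[i, j] = ||X[i] - X[j]||_1,
    in O(d * n log n) time, without forming the n x n matrix D."""
    n, d = X.shape
    y = np.zeros(n)
    for c in range(d):                  # ell_1 distance separates per coordinate
        x = X[:, c]
        order = np.argsort(x)
        xs, zs = x[order], z[order]
        S = np.cumsum(zs)               # prefix sums of z
        T = np.cumsum(xs * zs)          # prefix sums of x * z
        Sn, Tn = S[-1], T[-1]
        # for sorted index i:
        #   sum_{j<=i} (x_i - x_j) z_j  +  sum_{j>i} (x_j - x_i) z_j
        ys = xs * S - T + (Tn - T) - xs * (Sn - S)
        y[order] += ys
    return y
```

Each coordinate contributes one sort and a constant number of linear passes, so a full matrix-vector product never materializes the quadratic-size matrix.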
This content will become publicly available on October 17, 2024
Matrix Compression via Randomized Low Rank and Low Precision Factorization
Matrices are exceptionally useful in various fields of study as they provide a convenient framework to organize and manipulate data in a structured manner. However, modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Although prohibitively large, such matrices are often approximately low rank. We propose an algorithm that exploits this structure to obtain a low-rank decomposition of any matrix A as A ≈ LR, where L and R are the low-rank factors. The total number of elements in L and R can be significantly less than that in A. Furthermore, the entries of L and R are quantized to low-precision formats, compressing A by giving us a low-rank and low-precision factorization. Our algorithm first computes an approximate basis of the range space of A by randomly sketching its columns, followed by a quantization of the vectors constituting this basis. It then computes approximate projections of the columns of A onto this quantized basis. We derive upper bounds on the approximation error of our algorithm, and analyze the impact of the target rank and quantization bit budget. The trade-off between compression ratio and approximation accuracy allows for flexibility in choosing these parameters based on specific application requirements. We empirically demonstrate the efficacy of our algorithm in image compression, nearest-neighbor classification of image and text embeddings, and compressing the layers of LlaMa-7b. Our results illustrate that we can achieve compression ratios as aggressive as one bit per matrix coordinate, all while surpassing or maintaining the performance of traditional compression techniques.
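The three steps described above (sketch the columns, quantize the basis, project onto it) can be sketched in numpy. This is an illustrative toy, not the paper's implementation: `uniform_quantize` is a simple per-matrix uniform quantizer standing in for the paper's quantization scheme, and the function names are assumptions.

```python
import numpy as np

def uniform_quantize(M, bits):
    """Uniformly quantize the entries of M to 2**bits levels.
    Returns dequantized values; in practice one stores the integer
    codes plus the (offset, step) pair."""
    levels = 2 ** bits
    lo, hi = M.min(), M.max()
    step = (hi - lo) / (levels - 1)
    return lo + np.round((M - lo) / step) * step

def lplr(A, rank, bits, oversample=5, seed=0):
    """Sketch of a randomized low-rank, low-precision factorization A ≈ L @ R."""
    rng = np.random.default_rng(seed)
    # 1. approximate basis of the range of A via random column sketching
    Y = A @ rng.standard_normal((A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(Y)
    Q = Q[:, :rank]
    # 2. quantize the basis vectors
    L = uniform_quantize(Q, bits)
    # 3. project the columns of A onto the quantized basis (least squares)
    R, *_ = np.linalg.lstsq(L, A, rcond=None)
    R = uniform_quantize(R, bits)
    return L, R
```

For an approximately rank-r matrix, L and R together hold (m + n) · r quantized entries instead of m · n full-precision ones, which is where the compression comes from.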
Award ID(s): 2037304
NSF-PAR ID: 10482591
Publisher / Repository: arXiv
Date Published:
Journal Name: arXiv.org
ISSN: 2331-8422
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


Recently, decentralized optimization has attracted much attention in machine learning because it is more communication-efficient than the centralized fashion. Quantization is a promising method to reduce the communication cost by cutting down the budget of each single communication using gradient compression. To further improve the communication efficiency, more recently, some quantized decentralized algorithms have been studied. However, the quantized decentralized algorithm for non-convex constrained machine learning problems is still limited. The Frank-Wolfe (a.k.a. conditional gradient, or projection-free) method is very efficient for solving many constrained optimization tasks, such as training low-rank or sparsity-constrained models. In this paper, to fill the gap in decentralized quantized constrained optimization, we propose a novel communication-efficient Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm for non-convex constrained learning models. We first design a new counterexample to show that the vanilla decentralized quantized stochastic Frank-Wolfe algorithm usually diverges. Thus, we propose the DQSFW algorithm with the gradient tracking technique to guarantee that the method will safely converge to a stationary point of the non-convex optimization. In our theoretical analysis, we prove that to achieve a stationary point our DQSFW algorithm achieves the same gradient complexity as the standard stochastic Frank-Wolfe and centralized Frank-Wolfe algorithms, but has much lower communication cost. Experiments on matrix completion and model compression applications demonstrate the efficiency of our new algorithm.
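As a point of reference for the projection-free step this abstract builds on, here is a minimal single-machine Frank-Wolfe loop for a trace-norm-ball constraint, applied to matrix completion. The decentralization, quantization, and gradient tracking that are the paper's actual contribution are deliberately omitted, and the function names are illustrative.

```python
import numpy as np

def frank_wolfe_tracenorm(grad, X0, radius, steps=300):
    """Projection-free (Frank-Wolfe) minimization over the trace-norm
    ball {X : ||X||_* <= radius}. The linear minimization oracle is a
    rank-one matrix built from the top singular pair of the gradient,
    so no projection onto the constraint set is ever needed."""
    X = X0
    for t in range(steps):
        G = grad(X)
        # LMO: argmin_{||S||_* <= radius} <G, S> = -radius * u1 @ v1.T
        U, s, Vt = np.linalg.svd(G)
        S = -radius * np.outer(U[:, 0], Vt[0])
        gamma = 2.0 / (t + 2)          # standard diminishing step size
        X = (1 - gamma) * X + gamma * S
    return X
```

Each iterate is a convex combination of rank-one atoms, which is exactly why Frank-Wolfe is attractive for low-rank-constrained training: the low-rank structure is maintained for free.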

Tucker decomposition is a low-rank tensor approximation that generalizes a truncated matrix singular value decomposition (SVD). Existing parallel software has shown that Tucker decomposition is particularly effective at compressing terabyte-sized multidimensional scientific simulation datasets, computing reduced representations that satisfy a specified approximation error. The general approach is to obtain a low-rank approximation of the input data by performing a sequence of matrix SVDs of tensor unfoldings, which tend to be short-fat matrices. In the existing approach, the SVD is performed by computing the eigendecomposition of the Gram matrix of the unfolding. This method sacrifices some numerical stability in exchange for lower computation costs and easier parallelization. We propose using a more numerically stable, though more computationally expensive, way to compute the SVD by preprocessing with a QR decomposition step and computing an SVD of only the small triangular factor. The more numerically stable approach allows us to achieve the same accuracy with half the working precision (for example, single rather than double precision). We demonstrate that our method scales as well as the existing approach, and the use of lower precision leads to an overall reduction in running time of up to a factor of 2 when using 10s to 1000s of processors. Using the same working precision, we are also able to compute Tucker decompositions with much smaller approximation error.
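The contrast between the Gram-matrix route and the QR-based route can be illustrated on a single short-fat unfolding. This is a serial numpy toy under the assumption of a generic dense matrix, not the parallel implementation the abstract describes; the function names are illustrative.

```python
import numpy as np

def left_singular_vectors_gram(A, rank):
    """Gram route: eigendecomposition of A @ A.T. Cheaper, but forming
    the Gram matrix squares the condition number (less stable)."""
    w, V = np.linalg.eigh(A @ A.T)       # eigenvalues in ascending order
    return V[:, ::-1][:, :rank]          # reorder: largest eigenvalues first

def left_singular_vectors_qr(A, rank):
    """QR route: factor A.T = Q R, so A = R.T @ Q.T, and the left singular
    vectors of A are those of the small triangular factor R.T."""
    Q, R = np.linalg.qr(A.T)             # A.T is tall-skinny when A is short-fat
    U, s, Vt = np.linalg.svd(R.T)        # SVD of a small square matrix only
    return U[:, :rank]
```

For an m × n unfolding with m much smaller than n, both routes reduce the SVD to an m × m problem; the QR route avoids the Gram matrix's loss of precision, which is what lets the proposed method run in half the working precision.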

Deep convolutional neural networks (DNNs) have demonstrated phenomenal success and been widely used in many computer vision tasks. However, their enormous model size and high computing complexity prohibit wide deployment into resource-limited embedded systems, such as FPGAs and mGPUs. As the two most widely adopted model compression techniques, weight pruning and quantization compress a DNN model by introducing weight sparsity (i.e., forcing partial weights to zero) and quantizing weights into limited bit-width values, respectively. Although there are works attempting to combine weight pruning and quantization, we still observe disharmony between them, especially when more aggressive compression schemes (e.g., structured pruning and low bit-width quantization) are used. In this work, taking an FPGA as the test computing platform and Processing Elements (PEs) as the basic parallel computing units, we first propose a PE-wise structured pruning scheme, which introduces weight sparsification with consideration of the architecture of the PEs. In addition, we integrate it with an optimized weight ternarization approach which quantizes weights into ternary values ({-1, 0, +1}), thus converting the dominant convolution operations in the DNN from multiplication-and-accumulation (MAC) to addition-only, as well as compressing the original model (from 32-bit floating point to 2-bit ternary representation) by at least 16 times. Then, we investigate and solve the coexistence issue between PE-wise structured pruning and ternarization by proposing a Weight Penalty Clipping (WPC) technique with a self-adapting threshold. Our experiments show that the fusion of our proposed techniques can achieve the best state-of-the-art ∼21× PE-wise structured compression rate with merely 1.74%/0.94% (top-1/top-5) accuracy degradation of ResNet-18 on the ImageNet dataset.
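For illustration, here is a minimal threshold-based weight ternarization in the spirit described above. The paper's optimized ternarization and the WPC technique are more involved; the threshold factor below is an assumption, chosen as a common heuristic, not the paper's value.

```python
import numpy as np

def ternarize(W, thresh_factor=0.7):
    """Ternarize a weight matrix to alpha * {-1, 0, +1}: zero out small
    weights, keep the sign of the rest, and pick the scale alpha that
    minimizes the L2 reconstruction error over the kept entries."""
    delta = thresh_factor * np.mean(np.abs(W))   # magnitude threshold
    T = np.sign(W) * (np.abs(W) > delta)         # ternary pattern in {-1, 0, +1}
    mask = T != 0
    alpha = np.abs(W[mask]).mean() if mask.any() else 0.0
    return alpha * T, T.astype(np.int8)
```

Storing the int8 pattern (2 bits suffice) plus one scale per layer is what yields the at-least-16x size reduction relative to 32-bit floats, and convolutions against a {-1, 0, +1} pattern need only additions and subtractions.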

Low-rank approximation is a classic tool in data analysis, where the goal is to approximate a matrix A with a low-rank matrix L so as to minimize the error ‖A − L‖_F. However, in many applications, approximating some entries is more important than others, which leads to the weighted low-rank approximation problem. Unfortunately, the addition of weights makes the low-rank approximation problem intractable. Thus many works have obtained efficient algorithms under additional structural assumptions on the weight matrix (such as low rank, or appropriate block structure). We study a natural greedy algorithm for weighted low-rank approximation and develop a simple condition under which it yields a bicriteria approximation up to a small additive factor in the error. The algorithm involves iteratively computing the top singular vector of an appropriately varying matrix, and is thus easy to implement at scale. Our methods also allow us to study the problem of low-rank approximation under ℓp-norm error.
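A hypothetical numpy sketch of such a greedy scheme: each iteration takes the top singular pair of the weighted residual and adds an optimally scaled rank-one term in that direction. This illustrates the general idea only; it is not the paper's exact algorithm, and the condition under which it provably succeeds is the paper's contribution, not reproduced here.

```python
import numpy as np

def greedy_weighted_lowrank(A, W, rank):
    """Greedy rank-one updates for weighted low-rank approximation,
    minimizing sum_ij W_ij * (A - L)_ij^2. Each step only needs the
    top singular vector of the current weighted residual."""
    L = np.zeros_like(A)
    for _ in range(rank):
        R = W * (A - L)                      # weighted residual
        U, s, Vt = np.linalg.svd(R)
        uv = np.outer(U[:, 0], Vt[0])        # top rank-one direction
        # optimal step size for the weighted objective along uv
        denom = np.sum(W * uv ** 2)
        c = np.sum(W * (A - L) * uv) / denom if denom > 0 else 0.0
        L = L + c * uv
    return L
```

Because each iteration reduces to one top-singular-vector computation, the scheme can use iterative methods (e.g., power iteration) and thus scales to large sparse weight matrices.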