Summary This paper describes efficient algorithms for computing rank‐revealing factorizations of matrices that are too large to fit in main memory (RAM), and must instead be stored on slow external memory devices such as disks (out‐of‐core or out‐of‐memory). Traditional algorithms for computing rank‐revealing factorizations (such as the column pivoted QR factorization and the singular value decomposition) are very communication intensive as they require many vector‐vector and matrix‐vector operations, which become prohibitively expensive when data is not in RAM. Randomization allows to reformulate new methods so that large contiguous blocks of the matrix are processed in bulk. The paper describes two distinct methods. The first is a blocked version of column pivoted Householder QR, organized as a “left‐looking” method to minimize the number of the expensive write operations. The second method results employs a UTV factorization. It is organized as an algorithm‐by‐blocks to overlap computations and I/O operations. As it incorporates power iterations, it is much better at revealing the numerical rank. Numerical experiments on several computers demonstrate that the new algorithms are almost as fast when processing data stored on slow memory devices as traditional algorithms are for data stored in RAM.
more »
« less
Algorithm 1022: Efficient Algorithms for Computing a RankRevealing UTV Factorization on Parallel Computing Architectures
Randomized singular value decomposition (RSVD) is by now a wellestablished technique for efficiently computing an approximate singular value decomposition of a matrix. Building on the ideas that underpin RSVD, the recently proposed algorithm “randUTV” computes a full factorization of a given matrix that provides lowrank approximations with nearoptimal error. Because the bulk of randUTV is cast in terms of communicationefficient operations such as matrixmatrix multiplication and unpivoted QR factorizations, it is faster than competing rankrevealing factorization methods such as columnpivoted QR in most highperformance computational settings. In this article, optimized randUTV implementations are presented for both sharedmemory and distributedmemory computing environments. For shared memory, randUTV is redesigned in terms of an algorithmbyblocks that, together with a runtime task scheduler, eliminates bottlenecks from data synchronization points to achieve acceleration over the standard blocked algorithm based on a purely forkjoin approach. The distributedmemory implementation is based on the ScaLAPACK library. The performance of our new codes compares favorably with competing factorizations available on both sharedmemory and distributedmemory architectures.
more »
« less
 Award ID(s):
 2012606
 NSFPAR ID:
 10356458
 Date Published:
 Journal Name:
 ACM Transactions on Mathematical Software
 Volume:
 48
 Issue:
 2
 ISSN:
 00983500
 Page Range / eLocation ID:
 1 to 42
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


QR factorization is a key tool in mathematics, computer science, operations research, and engineering. This paper presents the roundofferrorfree (REF) QR factorization framework comprising integerpreserving versions of the standard and the thin QR factorizations and associated algorithms to compute them. Specifically, the standard REF QR factorization factors a given matrix $A \in \Z^{m \times n}$ as $A=QDR$, where $Q \in \Z^{m \times m}$ has pairwise orthogonal columns, $D$ is a diagonal matrix, and $R \in \Z^{m \times n}$ is an upper trapezoidal matrix; notably, the entries of $Q$ and $R$ are integral, while the entries of $D$ are reciprocals of integers. In the thin REF QR factorization, $Q \in \Z^{m \times n}$ also has pairwise orthogonal columns, and $R \in \Z^{n \times n}$ is also an upper triangular matrix. In contrast to traditional (i.e., floatingpoint) QR factorizations, every operation used to compute these factors is integral; thus, REF QR is guaranteed to be an exact orthogonal decomposition. Importantly, the bitlength of every entry in the REF QR factorizations (and within the algorithms to compute them) is bounded polynomially. Notable applications of our REF QR factorizations include finding exact least squares or exact basic solutions (i.e., a rational ndimensional vector $x$) to any given full column rank or rank deficient linear system $A x = b$, respectively. In addition, our exact factorizations can be used as a subroutine within exact and/or highprecision quadratic programming. Altogether, REF QR provides a framework to obtain exact orthogonal factorizations of any rational matrix (as any rational/decimal matrix can be easily transformed into an integral matrix).more » « less

null (Ed.)A cumbersome operation in numerical analysis and linear algebra, optimization, machine learning and engineering algorithms; is inverting large fullrank matrices which appears in various processes and applications [1]. This has both numerical stability and complexity issues, as well as high expected time to compute. We address the latter issue, by proposing an algorithm which uses a blackbox least squares optimization solver as a subroutine, to give an estimate of the inverse (and pseudoinverse) of real nonsingular matrices; by estimating its columns. This also gives it the flexibility to be performed in a distributed manner, thus the estimate can be obtained a lot faster, and can be made robust to stragglers. Furthermore, we assume a centralized network with no message passing between the computing nodes, and do not require a matrix factorization; e.g. LU, SVD or QR decomposition beforehand.more » « less

Abstract The generalized singular value decomposition (GSVD) is a valuable tool that has many applications in computational science. However, computing the GSVD for large‐scale problems is challenging. Motivated by applications in hyper‐differential sensitivity analysis (HDSA), we propose new randomized algorithms for computing the GSVD which use randomized subspace iteration and weighted QR factorization. Detailed error analysis is given which provides insight into the accuracy of the algorithms and the choice of the algorithmic parameters. We demonstrate the performance of our algorithms on test matrices and a large‐scale model problem where HDSA is used to study subsurface flow.

Modern deep neural networks (DNNs) often require high memory consumption and large computational loads. In order to deploy DNN algorithms efficiently on edge or mobile devices, a series of DNN compression algorithms have been explored, including factorization methods. Factorization methods approximate the weight matrix of a DNN layer with the multiplication of two or multiple lowrank matrices. However, it is hard to measure the ranks of DNN layers during the training process. Previous works mainly induce lowrank through implicit approximations or via costly singular value decomposition (SVD) process on every training step. The former approach usually induces a high accuracy loss while the latter has a low efficiency. In this work, we propose SVD training, the first method to explicitly achieve lowrank DNNs during training without applying SVD on every step. SVD training first decomposes each layer into the form of its fullrank SVD, then performs training directly on the decomposed weights. We add orthogonality regularization to the singular vectors, which ensure the valid form of SVD and avoid gradient vanishing/exploding. Lowrank is encouraged by applying sparsityinducing regularizers on the singular values of each layer. Singular value pruning is applied at the end to explicitly reach a lowrank model. We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy, comparing to not only previous factorization methods but also stateoftheart filter pruning methods.more » « less