NSF PAR Search | NSF Public Access Repository

Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations

https://doi.org/10.1145/3727344

Al Daas, Hussam; Ballard, Grey; Grigori, Laura; Kumar, Suraj; Rouse, Kathryn; Verite, Mathieu (April 2025, ACM Transactions on Parallel Computing)

In this article, we focus on the communication costs of three symmetric matrix computations: (i) multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK); (ii) adding the result of the multiplication of a matrix with the transpose of another matrix and the transpose of that result, known as a symmetric rank-2k update (SYR2K); and (iii) performing matrix multiplication with a symmetric input matrix (SYMM). All three computations appear in the Level 3 Basic Linear Algebra Subroutines (BLAS) and have wide use in applications involving symmetric matrices. We establish communication lower bounds for these kernels using sequential and distributed-memory parallel computational models, and we show that our bounds are tight by presenting communication-optimal algorithms for each setting. Our lower bound proofs rely on applying a geometric inequality for symmetric computations and analytically solving constrained nonlinear optimization problems. In the optimal algorithms, the symmetric matrix is accessed and its corresponding computations are performed according to a triangular block partitioning scheme.
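As a rough illustration of what the three kernels compute (not of the communication-optimal algorithms the article presents), the sketch below writes SYRK, SYR2K and SYMM as plain Python loops over nested lists. The function names, the nested-list representation and the choice to work from the lower triangle are assumptions made for this example; the BLAS routines also take scaling factors and side/triangle options that are omitted here.

```python
def syrk(A):
    """C = A * A^T. Only entries with j <= i are computed, then mirrored."""
    n, k = len(A), len(A[0])
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(A[i][p] * A[j][p] for p in range(k))
            C[i][j] = s
            C[j][i] = s  # symmetry: C[j][i] equals C[i][j]
    return C


def syr2k(A, B):
    """C = A * B^T + B * A^T, again computing only the lower triangle."""
    n, k = len(A), len(A[0])
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(A[i][p] * B[j][p] + B[i][p] * A[j][p] for p in range(k))
            C[i][j] = s
            C[j][i] = s
    return C


def symm(A_lower, B):
    """C = A * B with A symmetric; only the lower triangle of A is read."""
    n, m = len(A_lower), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for p in range(n):
            # Entries above the diagonal are taken from their mirror image.
            a = A_lower[i][p] if p <= i else A_lower[p][i]
            for j in range(m):
                C[i][j] += a * B[p][j]
    return C
```

In `syrk` and `syr2k` the inner sum runs only for `j <= i`, which is where the saving over a general matrix product comes from; `symm` instead exploits symmetry on the input side by never reading the upper triangle of `A_lower`.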
Free, publicly-accessible full text available April 10, 2026
Communication Lower Bounds and Optimal Algorithms for Multiple Tensor-Times-Matrix Computation

https://doi.org/10.1137/22M1510443

Al Daas, Hussam; Ballard, Grey; Grigori, Laura; Kumar, Suraj; Rouse, Kathryn (March 2024, SIAM Journal on Matrix Analysis and Applications)

Multiple tensor-times-matrix (Multi-TTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower bounds that determine how much data movement is required (under mild conditions) to perform the Multi-TTM computation in parallel. The crux of the proof relies on analytically solving a constrained, nonlinear optimization problem. We also present a parallel algorithm to perform this computation that organizes the processors into a logical grid with twice as many modes as the input tensor. We show that, with correct choices of grid dimensions, the communication cost of the algorithm attains the lower bounds and is therefore communication optimal. Finally, we show that our algorithm can significantly reduce communication compared to the straightforward approach of expressing the computation as a sequence of tensor-times-matrix operations when the input and output tensors vary greatly in size.
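To make the computation concrete, here is a minimal sequential sketch of Multi-TTM for a 3-way tensor stored as nested lists, written both as a sequence of single tensor-times-matrix products (the "straightforward approach" the abstract mentions) and as one fused sum. The helper names and the convention that each matrix has shape (new size) x (old size of that mode) are assumptions for this example; the paper may use a transposed convention, and nothing here reflects its parallel algorithm.

```python
def ttm(X, M, mode):
    """Multiply the 3-way tensor X by matrix M along the given mode (0, 1 or 2)."""
    I, J, K = len(X), len(X[0]), len(X[0][0])
    R = len(M)
    if mode == 0:
        return [[[sum(M[r][i] * X[i][j][k] for i in range(I))
                  for k in range(K)] for j in range(J)] for r in range(R)]
    if mode == 1:
        return [[[sum(M[r][j] * X[i][j][k] for j in range(J))
                  for k in range(K)] for r in range(R)] for i in range(I)]
    if mode == 2:
        return [[[sum(M[r][k] * X[i][j][k] for k in range(K))
                  for r in range(R)] for j in range(J)] for i in range(I)]
    raise ValueError("mode must be 0, 1 or 2")


def multi_ttm_sequence(X, mats):
    """Multi-TTM as a chain of TTMs, one mode at a time."""
    Y = X
    for mode, M in enumerate(mats):
        Y = ttm(Y, M, mode)
    return Y


def multi_ttm_direct(X, mats):
    """Multi-TTM as a single fused sum over all three input indices."""
    A, B, C = mats
    I, J, K = len(X), len(X[0]), len(X[0][0])
    return [[[sum(A[a][i] * B[b][j] * C[c][k] * X[i][j][k]
                  for i in range(I) for j in range(J) for k in range(K))
              for c in range(len(C))] for b in range(len(B))]
            for a in range(len(A))]
```

Both functions return the same tensor; they differ in which intermediate tensors are formed, which is the kind of difference that affects data movement when the input and output sizes are very unequal.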
Full Text Available
Fast Exact Leverage Score Sampling from Khatri-Rao Products with Applications to Tensor Decomposition

Bharadwaj, Vivek; Malik, Osman Asif; Murray, Riley; Grigori, Laura; Buluc, Aydin; Demmel, James (December 2023, Neural Information Processing Systems 2023)

We present a data structure to randomly sample rows from the Khatri-Rao product of several matrices according to the exact distribution of its leverage scores. Our proposed sampler draws each row in time logarithmic in the height of the Khatri-Rao product and quadratic in its column count, with persistent space overhead at most the size of the input matrices. As a result, it tractably draws samples even when the matrices forming the Khatri-Rao product have tens of millions of rows each. When used to sketch the linear least squares problems arising in CANDECOMP / PARAFAC tensor decomposition, our method achieves lower asymptotic complexity per solve than recent state-of-the-art methods. Experiments on billion-scale sparse tensors validate our claims, with our algorithm achieving higher accuracy than competing methods as the decomposition rank grows.
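For orientation, the quantities being sampled can be written out using the standard definitions. The notation below ($U_k$, $I_k$, $R$, $G$, $\ell$) is chosen for this note and is not taken from the paper; $\odot$ denotes the Khatri-Rao (column-wise Kronecker) product and $\circledast$ the elementwise product.

```latex
\begin{align*}
A &= U_1 \odot U_2 \odot \cdots \odot U_N \in \mathbb{R}^{(I_1 \cdots I_N) \times R},
  \qquad U_k \in \mathbb{R}^{I_k \times R}, \\
A[(i_1,\dots,i_N),:] &= U_1[i_1,:] \circledast U_2[i_2,:] \circledast \cdots \circledast U_N[i_N,:], \\
G = A^\top A &= (U_1^\top U_1) \circledast (U_2^\top U_2) \circledast \cdots \circledast (U_N^\top U_N), \\
\ell_{(i_1,\dots,i_N)} &= A[(i_1,\dots,i_N),:] \; G^{+} \; A[(i_1,\dots,i_N),:]^\top,
  \qquad \sum_{i_1,\dots,i_N} \ell_{(i_1,\dots,i_N)} = \operatorname{rank}(A).
\end{align*}
```

Sampling by exact leverage scores means drawing row $(i_1,\dots,i_N)$ with probability $\ell_{(i_1,\dots,i_N)} / \operatorname{rank}(A)$. Evaluating every score directly would require visiting all $I_1 \cdots I_N$ rows, which is the cost the abstract's data structure avoids.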
An Improved Analysis and Unified Perspective on Deterministic and Randomized Low-Rank Matrix Approximation

https://doi.org/10.1137/21M1391316

Demmel, James; Grigori, Laura; Rusciano, Alexander (June 2023, SIAM Journal on Matrix Analysis and Applications)

We introduce a Generalized LU Factorization (GLU) for low-rank matrix approximation. We relate this to past approaches and extensively analyze its approximation properties. The established deterministic guarantees are combined with sketching ensembles satisfying Johnson-Lindenstrauss properties to present complete bounds. Particularly good performance is shown for the subsampled randomized Hadamard transform (SRHT) ensemble. Moreover, the factorization is shown to unify and generalize many past algorithms, sometimes providing strictly better approximations. It also helps to explain the effect of sketching on the growth factor during Gaussian elimination.
Full Text Available
Parallel Memory-Independent Communication Bounds for SYRK

https://doi.org/10.1145/3558481.3591072

Al Daas, Hussam; Ballard, Grey; Grigori, Laura; Kumar, Suraj; Rouse, Kathryn (June 2023, Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures)

In this paper, we focus on the parallel communication cost of multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK). SYRK requires half the computation of general matrix multiplication because of the symmetry of the output matrix. Recent work (Beaumont et al., SPAA '22) has demonstrated that the sequential I/O complexity of SYRK is also a constant factor smaller than that of general matrix multiplication. Inspired by this progress, we establish memory-independent parallel communication lower bounds for SYRK with smaller constants than general matrix multiplication, and we show that these constants are tight by presenting communication-optimal algorithms. The crux of the lower bound proof relies on extending a key geometric inequality to symmetric computations and analytically solving a constrained nonlinear optimization problem. The optimal algorithms use a triangular blocking scheme for parallel distribution of the symmetric output matrix and corresponding computation.
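The "half the computation" claim can be checked with a short count. The symbols $n$ and $k$ (an $n \times k$ input matrix $A$, giving an $n \times n$ output $C = AA^\top$) are assumptions introduced for this note, and the count covers scalar multiply-add operations only.

```latex
\begin{align*}
\text{general product } C = AB^\top,\ B \in \mathbb{R}^{n \times k}: &\quad n^2 k \ \text{multiply-adds}, \\
\text{SYRK } C = AA^\top \ \text{(entries with } j \le i \text{ only)}: &\quad \frac{n(n+1)}{2}\, k \ \text{multiply-adds}, \\
\text{ratio}: &\quad \frac{n(n+1)k/2}{n^2 k} = \frac{n+1}{2n} \;\longrightarrow\; \frac{1}{2} \quad (n \to \infty).
\end{align*}
```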
Full Text Available
Brief Announcement: Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds

https://doi.org/10.1145/3490148.3538552

Al Daas, Hussam; Ballard, Grey; Grigori, Laura; Kumar, Suraj; Rouse, Kathryn (July 2022, Proceedings of the 34th Annual ACM Symposium on Parallelism in Algorithms and Architectures)

Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has demonstrated that more careful analysis improves the best known constants for some classical matrix multiplication lower bounds and helps to identify more efficient algorithms that match the leading-order terms in the lower bounds exactly and improve practical performance. The main result of this work is the establishment of memory-independent communication lower bounds with tight constants for parallel matrix multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the aspect ratios of the matrices.
Full Text Available
