The matricizedtensor times KhatriRao product (MTTKRP) computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation.
more »
« less
This content will become publicly available on March 31, 2025
Communication Lower Bounds and Optimal Algorithms for Multiple TensorTimesMatrix Computation
Multiple tensortimesmatrix (MultiTTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower bounds that determine how much data movement is required (under mild conditions) to perform the MultiTTM computation in parallel. The crux of the proof relies on analytically solving a constrained, nonlinear optimization problem. We also present a parallel algorithm to perform this computation that organizes the processors into a logical grid with twice as many modes as the input tensor. We show that, with correct choices of grid dimensions, the communication cost of the algorithm attains the lower bounds and is therefore communication optimal. Finally, we show that our algorithm can significantly reduce communication compared to the straightforward approach of expressing the computation as a sequence of tensortimesmatrix operations when the input and output tensors vary greatly in size.
more »
« less
 Award ID(s):
 1942892
 NSFPAR ID:
 10510811
 Publisher / Repository:
 SIAM
 Date Published:
 Journal Name:
 SIAM Journal on Matrix Analysis and Applications
 Volume:
 45
 Issue:
 1
 ISSN:
 08954798
 Page Range / eLocation ID:
 450 to 477
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Our goal is to establish lower bounds on the communication required to perform the MatricizedTensor Times KhatriRao Product (MTTKRP) computation on a distributedmemory parallel machine. MTTKRP is the bottleneck computation within algorithms for computing the CP tensor decomposition, which is an approximation by a sum of rankone tensors and frequently used in multidimensional data analysis. The main result of this paper is a communication lower bound that generalizes previous results, tightening the bound so that it is attainable even when the tensor dimensions vary (the tensor is not cubical) and when the number of processors is small relative to the tensor dimensions. The attainability of the bound proves that the algorithm that attains it, which is based on a block distribution of the tensor and communicating only factor matrices, is communication optimal. The proof technique utilizes an established inequality that relates computations to data access as well as a novel approach based on convex optimization.more » « less

The CP tensor decomposition is a lowrank approximation of a tensor. We present a distributedmemory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensors that can enforce nonnegativity of the computed lowrank factors. The principal task is to parallelize the MatricizedTensor Times KhatriRao Product (MTTKRP) bottleneck subcomputation. The algorithm is computation efficient, using dimension trees to avoid redundant computation across MTTKRPs within the alternating method. Our approach is also communication efficient, using a data distribution and parallel algorithm across a multidimensional processor grid that can be tuned to minimize communication. We benchmark our software on synthetic as well as hyperspectral image and neuroscience dynamic functional connectivity data, demonstrating that our algorithm scales well to 100s of nodes (up to 4096 cores) and is faster and more general than the currently available parallel software.more » « less

In this paper, we focus on the parallel communication cost of multiplying a matrix with its transpose, known as a symmetric rankk update (SYRK). SYRK requires half the computation of general matrix multiplication because of the symmetry of the output matrix. Recent work (Beaumont et al., SPAA '22) has demonstrated that the sequential I/O complexity of SYRK is also a constant factor smaller than that of general matrix multiplication. Inspired by this progress, we establish memoryindependent parallel communication lower bounds for SYRK with smaller constants than general matrix multiplication, and we show that these constants are tight by presenting communicationoptimal algorithms. The crux of the lower bound proof relies on extending a key geometric inequality to symmetric computations and analytically solving a constrained nonlinear optimization problem. The optimal algorithms use a triangular blocking scheme for parallel distribution of the symmetric output matrix and corresponding computation.more » « less

null (Ed.)Motivated by the increasing need to understand the distributed algorithmic foundations of largescale graph computations, we study some fundamental graph problems in a messagepassing model for distributed computing where k ≥ 2 machines jointly perform computations on graphs with n nodes (typically, n >> k). The input graph is assumed to be initially randomly partitioned among the k machines, a common implementation in many realworld systems. Communication is pointtopoint, and the goal is to minimize the number of communication rounds of the computation. Our main contribution is the General Lower Bound Theorem , a theorem that can be used to show nontrivial lower bounds on the round complexity of distributed largescale data computations. This result is established via an informationtheoretic approach that relates the round complexity to the minimal amount of information required by machines to solve the problem. Our approach is generic, and this theorem can be used in a “cookbook” fashion to show distributed lower bounds for several problems, including nongraph problems. We present two applications by showing (almost) tight lower bounds on the round complexity of two fundamental graph problems, namely, PageRank computation and triangle enumeration . These applications show that our approach can yield lower bounds for problems where the application of communication complexity techniques seems not obvious or gives weak bounds, including and especially under a stochastic partition of the input. We then present distributed algorithms for PageRank and triangle enumeration with a round complexity that (almost) matches the respective lower bounds; these algorithms exhibit a round complexity that scales superlinearly in k , improving significantly over previous results [Klauck et al., SODA 2015]. Specifically, we show the following results: PageRank: We show a lower bound of Ὼ(n/k 2 ) rounds and present a distributed algorithm that computes an approximation of the PageRank of all the nodes of a graph in Õ(n/k 2 ) rounds. Triangle enumeration: We show that there exist graphs with m edges where any distributed algorithm requires Ὼ(m/k 5/3 ) rounds. This result also implies the first nontrivial lower bound of Ὼ(n 1/3 ) rounds for the congested clique model, which is tight up to logarithmic factors. We then present a distributed algorithm that enumerates all the triangles of a graph in Õ(m/k 5/3 + n/k 4/3 ) rounds.more » « less