Multiple tensortimesmatrix (MultiTTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower bounds that determine how much data movement is required (under mild conditions) to perform the MultiTTM computation in parallel. The crux of the proof relies on analytically solving a constrained, nonlinear optimization problem. We also present a parallel algorithm to perform this computation that organizes the processors into a logical grid with twice as many modes as the input tensor. We show that, with correct choices of grid dimensions, the communication cost of the algorithm attains the lower bounds and is therefore communication optimal. Finally, we show that our algorithm can significantly reduce communication compared to the straightforward approach of expressing the computation as a sequence of tensortimesmatrix operations when the input and output tensors vary greatly in size.
more »
« less
General MemoryIndependent Lower Bound for MTTKRP
Our goal is to establish lower bounds on the communication required to perform the MatricizedTensor Times KhatriRao Product (MTTKRP) computation on a distributedmemory parallel machine. MTTKRP is the bottleneck computation within algorithms for computing the CP tensor decomposition, which is an approximation by a sum of rankone tensors and frequently used in multidimensional data analysis. The main result of this paper is a communication lower bound that generalizes previous results, tightening the bound so that it is attainable even when the tensor dimensions vary (the tensor is not cubical) and when the number of processors is small relative to the tensor dimensions. The attainability of the bound proves that the algorithm that attains it, which is based on a block distribution of the tensor and communicating only factor matrices, is communication optimal. The proof technique utilizes an established inequality that relates computations to data access as well as a novel approach based on convex optimization.
more »
« less
 NSFPAR ID:
 10215020
 Date Published:
 Journal Name:
 SIAM Conference on Parallel Processing for Scientific Computing
 Page Range / eLocation ID:
 1  11
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


The matricizedtensor times KhatriRao product (MTTKRP) computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation.more » « less

TaShma, Amnon (Ed.)Threeplayer Number On the Forehead communication may be thought of as a threeplayer Number In the Hand promise model, in which each player is given the inputs that are supposedly on the other two players' heads, and promised that they are consistent with the inputs of the other players. The set of all allowed inputs under this promise may be thought of as an order3 tensor. We surprisingly observe that this tensor is exactly the matrix multiplication tensor, which is widely studied in the design of fast matrix multiplication algorithms. Using this connection, we prove a number of results about both Number On the Forehead communication and matrix multiplication, each by using known results or techniques about the other. For example, we show how the Laser method, a key technique used to design the best matrix multiplication algorithms, can also be used to design communication protocols for a variety of problems. We also show how known lower bounds for Number On the Forehead communication can be used to bound properties of the matrix multiplication tensor such as its zeroing out subrank. Finally, we substantially generalize known methods based on slicerank for studying communication, and show how they directly relate to the matrix multiplication exponent ω.more » « less

Recent works have shown that imposing tensor structures on the coefficient tensor in regression problems can lead to more reliable parameter estimation and lower sample complexity compared to vectorbased methods. This work investigates a new lowrank tensor model, called Low Separation Rank (LSR), in Generalized Linear Model (GLM) problems. The LSR model – which generalizes the wellknown Tucker and CANDECOMP/PARAFAC (CP) models, and is a special case of the Block Tensor Decomposition (BTD) model – is imposed onto the coefficient tensor in the GLM model. This work proposes a block coordinate descent algorithm for parameter estimation in LSRstructured tensor GLMs. Most importantly, it derives a minimax lower bound on the error threshold on estimating the coefficient tensor in LSR tensor GLM problems. The minimax bound is proportional to the intrinsic degrees of freedom in the LSR tensor GLM problem, suggesting that its sample complexity may be significantly lower than that of vectorized GLMs. This result can also be specialised to lower bound the estimation error in CP and Tuckerstructured GLMs. The derived bounds are comparable to tight bounds in the literature for Tucker linear regression, and the tightness of the minimax lower bound is further assessed numerically. Finally, numerical experiments on synthetic datasets demonstrate the efficacy of the proposed LSR tensor model for three regression types (linear, logistic and Poisson). Experiments on a collection of medical imaging datasets demonstrate the usefulness of the LSR model over other tensor models (Tucker and CP) on real, imbalanced data with limited available samples. License: Creative Commons Attribution 4.0 International (CC BY 4.0)more » « less

As techniques for faulttolerant quantum computation keep improving, it is natural to ask: what is the fundamental lower bound on space overhead? In this paper, we obtain a lower bound on the space overhead required for ϵ accurate implementation of a large class of operations that includes unitary operators. For the practically relevant case of subexponential depth and sublinear gate size, our bound on space overhead is tighter than the known lower bounds. We obtain this bound by connecting faulttolerant computation with a set of finite blocklength quantum communication problems whose accuracy requirements satisfy a joint constraint. The lower bound on space overhead obtained here leads to a strictly smaller upper bound on the noise threshold for noise that are not degradable. Our bound directly extends to the case where noise at the outputs of a gate are noni.i.d. but noise across gates are i.i.d.more » « less