Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

The CP tensor decomposition is used in applications such as machine learning and signal processing to discover latent lowrank structure in multidimensional data. Computing a CP decomposition via an alternating least squares (ALS) method reduces the problem to several linear least squares problems. The standard way to solve these linear least squares subproblems is to use the normal equations, which inherit special tensor structure that can be exploited for computational efficiency. However, the normal equations are sensitive to numerical illconditioning, which can compromise the results of the decomposition. In this paper, we develop versions of the CPALS algorithm using the QR decomposition and the singular value decomposition, which are more numerically stable than the normal equations, to solve the linear least squares problems. Our algorithms utilize the tensor structure of the CPALS subproblems efficiently, have the same complexity as the standard CPALS algorithm when the input is dense and the rank is small, and are shown via examples to produce more stable results when illconditioning is present. Our MATLAB implementation achieves the same running time as the standard algorithm for small ranks, and we show that the new methods can obtain lower approximation error.more » « lessFree, publiclyaccessible full text available June 5, 2024

In this paper, we focus on the parallel communication cost of multiplying a matrix with its transpose, known as a symmetric rankk update (SYRK). SYRK requires half the computation of general matrix multiplication because of the symmetry of the output matrix. Recent work (Beaumont et al., SPAA '22) has demonstrated that the sequential I/O complexity of SYRK is also a constant factor smaller than that of general matrix multiplication. Inspired by this progress, we establish memoryindependent parallel communication lower bounds for SYRK with smaller constants than general matrix multiplication, and we show that these constants are tight by presenting communicationoptimal algorithms. The crux of the lower bound proof relies on extending a key geometric inequality to symmetric computations and analytically solving a constrained nonlinear optimization problem. The optimal algorithms use a triangular blocking scheme for parallel distribution of the symmetric output matrix and corresponding computation.more » « lessFree, publiclyaccessible full text available June 17, 2024

Joint Nonnegative Matrix Factorization (JointNMF) is a hybrid method for mining information from datasets that contain both feature and connection information. We propose distributedmemory parallelizations of three algorithms for solving the JointNMF problem based on Alternating Nonnegative Least Squares, Projected Gradient Descent, and Projected GaussNewton. We extend wellknown communicationavoiding algorithms using a single processor grid case to our coupled case on two processor grids. We demonstrate the scalability of the algorithms on up to 960 cores (40 nodes) with 60\% parallel efficiency. The more sophisticated Alternating Nonnegative Least Squares (ANLS) and GaussNewton variants outperform the firstorder gradient descent method in reducing the objective on largescale problems. We perform a topic modelling task on a large corpus of academic papers that consists of over 37 million paper abstracts and nearly a billion citation relationships, demonstrating the utility and scalability of the methods.more » « lessFree, publiclyaccessible full text available June 21, 2024

Free, publiclyaccessible full text available February 28, 2024

Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has demonstrated that more careful analysis improves the best known constants for some classical matrix multiplication lower bounds and helps to identify more efficient algorithms that match the leadingorder terms in the lower bounds exactly and improve practical performance. The main result of this work is the establishment of memoryindependent communication lower bounds with tight constants for parallel matrix multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the aspect ratios of the matrices.more » « less

The design and analysis of parallel algorithms are both fundamental to the set of highperformance, parallel, and distributed computing skills required to use modern computing resources efficiently. In this work, we present an approach of teaching parallel computing within an undergraduate algorithms course that combines the paradigms of dynamic programming and multithreaded parallelization. We have developed a visualization tool built with the Thread Safe Graphics Library that enables interactive demonstration of parallelization techniques for two fundamental dynamic programming problems, 0/1 Knapsack and Longest Common Subsequence. We describe the implementation of the tool, the realtime animation it produces, and the results of using it in class. The tool is publicly available to be used directly or as a basis on which to build visualizations of other parallel dynamic programming algorithms.more » « less

Tucker decomposition is a lowrank tensor approximation that generalizes a truncated matrix singular value decomposition (SVD). Existing parallel software has shown that Tucker decomposition is particularly effective at compressing terabytesized multidimensional scientific simulation datasets, computing reduced representations that satisfy a specified approximation error. The general approach is to get a lowrank approximation of the input data by performing a sequence of matrix SVDs of tensor unfoldings, which tend to be shortfat matrices. In the existing approach, the SVD is performed by computing the eigendecomposition of the Gram matrix of the unfolding. This method sacrifices some numerical stability in exchange for lower computation costs and easier parallelization. We propose using a more numerically stable though more computationally expensive way to compute the SVD by pre processing with a QR decomposition step and computing an SVD of only the small triangular factor. The more numerically stable approach allows us to achieve the same accuracy with half the working precision (for example, single rather than double precision). We demonstrate that our method scales as well as the existing approach, and the use of lower precision leads to an overall reduction in running time of up to a factor of 2 when using 10s to 1000s of processors. Using the same working precision, we are also able to compute Tucker decompositions with much smaller approximation error.more » « less

Matrix multiplication is one of the bottleneck computations for training the weights within deep neural networks. To speed up the training phase, we propose to use faster algorithms for matrix multiplication known as Arbitrary Precision Approximating (APA) algorithms. APA algorithms perform asymptotically fewer arithmetic operations than the classical algorithm, but they compute an approximate result with an error that can be made arbitrarily small in exact arithmetic. Practical APA algorithms provide significant reduction in computation time and still provide enough accuracy for many applications like neural network training. We demonstrate that APA algorithms can be efficiently implemented and parallelized for multicore CPUs to obtain up to 28% and 21% speedups over the fastest implementation of the classical algorithm using one core and 12 cores, respectively. Furthermore, using these algorithms to train a MultiLayer Perceptron (MLP) network yields no significant reduction in the training or testing error. Our performance results on a large MLP network show overall sequential and multithreaded performance improvements of up to 25% and 13%, respectively. We also demonstrate up to 15% improvement when training the fully connected layers of the VGG19 image classification network.more » « less