Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

Undergraduate algorithms courses are a natural setting for teaching many of the theoretical ideas of parallel computing. Mergesort is a fundamental sequential divideand conquer algorithm often analyzed in such courses. In this work, we present a visualization tool to help demonstrate a novel PRAM algorithm for mergesort that is work efficient and has polylogarithmic span. Our implementation uses the ThreadSafe Graphics Library, which has an existing visualization of parallel mergesort. We demonstrate that our proposed algorithm has better work and span than the one currently visualized.more » « lessFree, publiclyaccessible full text available May 27, 2025

Undergraduate algorithms courses are a natural setting for teaching many of the theoretical ideas of parallel computing. Mergesort is a fundamental sequential divideand conquer algorithm often analyzed in such courses. In this work, we present a visualization tool to help demonstrate a novel PRAM algorithm for mergesort that is work efficient and has polylogarithmic span. Our implementation uses the ThreadSafe Graphics Library, which has an existing visualization of parallel mergesort. We demonstrate that our proposed algorithm has better work and span than the one currently visualized.more » « lessFree, publiclyaccessible full text available May 27, 2025

The Tucker tensor decomposition is a natural extension of the singular value decomposition (SVD) to multiway data. We propose to accelerate Tucker tensor decomposition algorithms by using randomization and parallelization. We present two algorithms that scale to large data and many processors, significantly reduce both computation and communication cost compared to previous deterministic and randomized approaches, and obtain nearly the same approximation errors. The key idea in our algorithms is to perform randomized sketches with Kroneckerstructured random matrices, which reduces computation compared to unstructured matrices and can be implemented using a fundamental tensor computational kernel. We provide probabilistic error analysis of our algorithms and implement a new parallel algorithm for the structured randomized sketch. Our experimental results demonstrate that our combination of randomization and parallelization achieves accurate Tucker decompositions much faster than alternative approaches. We observe up to a 16X speedup over the fastest deterministic parallel implementation on 3D simulation data.more » « lessFree, publiclyaccessible full text available April 30, 2025

In this work, we design, analyze, and optimize sequential and sharedmemory parallel algorithms for partitioned local depths (PaLD). Given a set of data points and pairwise distances, PaLD is a method for identifying strength of pairwise relationships based on relative distances, enabling the identification of strong ties within dense and sparse communities even if their sizes and withincommunity absolute distances vary greatly. We design two algorithmic variants that perform community structure analysis through triplet comparisons of pairwise distances. We present theoretical analyses of computation and communication costs and prove that the sequential algorithms are communication optimal, up to constant factors. We introduce performance optimization strategies that yield sequential speedups of up to 29x over a baseline sequential implementation and parallel speedups of up to 26.2x over optimized sequential implementations using up to 32 threads on an Intel multicore CPU.more » « lessFree, publiclyaccessible full text available March 5, 2025

In this work, we design, analyze, and optimize sequential and sharedmemory parallel algorithms for partitioned local depths (PaLD). Given a set of data points and pairwise distances, PaLD is a method for identifying strength of pairwise relationships based on relative distances, enabling the identification of strong ties within dense and sparse communities even if their sizes and withincommunity absolute distances vary greatly. We design two algorithmic variants that perform community structure analysis through triplet comparisons of pairwise distances. We present theoretical analyses of computation and communication costs and prove that the sequential algorithms are communication optimal, up to constant factors. We introduce performance optimization strategies that yield sequential speedups of up to 29x over a baseline sequential implementation and parallel speedups of up to 26.2x over optimized sequential implementations using up to 32 threads on an Intel multicore CPU.more » « lessFree, publiclyaccessible full text available March 5, 2025

Multiple tensortimesmatrix (MultiTTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower bounds that determine how much data movement is required (under mild conditions) to perform the MultiTTM computation in parallel. The crux of the proof relies on analytically solving a constrained, nonlinear optimization problem. We also present a parallel algorithm to perform this computation that organizes the processors into a logical grid with twice as many modes as the input tensor. We show that, with correct choices of grid dimensions, the communication cost of the algorithm attains the lower bounds and is therefore communication optimal. Finally, we show that our algorithm can significantly reduce communication compared to the straightforward approach of expressing the computation as a sequence of tensortimesmatrix operations when the input and output tensors vary greatly in size.more » « lessFree, publiclyaccessible full text available March 31, 2025

The CP tensor decomposition is used in applications such as machine learning and signal processing to discover latent lowrank structure in multidimensional data. Computing a CP decomposition via an alternating least squares (ALS) method reduces the problem to several linear least squares problems. The standard way to solve these linear least squares subproblems is to use the normal equations, which inherit special tensor structure that can be exploited for computational efficiency. However, the normal equations are sensitive to numerical illconditioning, which can compromise the results of the decomposition. In this paper, we develop versions of the CPALS algorithm using the QR decomposition and the singular value decomposition, which are more numerically stable than the normal equations, to solve the linear least squares problems. Our algorithms utilize the tensor structure of the CPALS subproblems efficiently, have the same complexity as the standard CPALS algorithm when the input is dense and the rank is small, and are shown via examples to produce more stable results when illconditioning is present. Our MATLAB implementation achieves the same running time as the standard algorithm for small ranks, and we show that the new methods can obtain lower approximation error.more » « less

In this paper, we focus on the parallel communication cost of multiplying a matrix with its transpose, known as a symmetric rankk update (SYRK). SYRK requires half the computation of general matrix multiplication because of the symmetry of the output matrix. Recent work (Beaumont et al., SPAA '22) has demonstrated that the sequential I/O complexity of SYRK is also a constant factor smaller than that of general matrix multiplication. Inspired by this progress, we establish memoryindependent parallel communication lower bounds for SYRK with smaller constants than general matrix multiplication, and we show that these constants are tight by presenting communicationoptimal algorithms. The crux of the lower bound proof relies on extending a key geometric inequality to symmetric computations and analytically solving a constrained nonlinear optimization problem. The optimal algorithms use a triangular blocking scheme for parallel distribution of the symmetric output matrix and corresponding computation.more » « less

Joint Nonnegative Matrix Factorization (JointNMF) is a hybrid method for mining information from datasets that contain both feature and connection information. We propose distributedmemory parallelizations of three algorithms for solving the JointNMF problem based on Alternating Nonnegative Least Squares, Projected Gradient Descent, and Projected GaussNewton. We extend wellknown communicationavoiding algorithms using a single processor grid case to our coupled case on two processor grids. We demonstrate the scalability of the algorithms on up to 960 cores (40 nodes) with 60\% parallel efficiency. The more sophisticated Alternating Nonnegative Least Squares (ANLS) and GaussNewton variants outperform the firstorder gradient descent method in reducing the objective on largescale problems. We perform a topic modelling task on a large corpus of academic papers that consists of over 37 million paper abstracts and nearly a billion citation relationships, demonstrating the utility and scalability of the methods.more » « less