NSF PAR Search | NSF Public Access Repository


A supernodal all-pairs shortest path algorithm

https://doi.org/10.1145/3332466.3374533

Sao, Piyush; Kannan, Ramakrishnan; Gera, Prasun; Vuduc, Richard (February 2020, Symposium on Principles and Practice of Parallel Programming)
We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (APSP) problem. Floyd-Warshall is an attractive choice for APSP on high-performing systems due to its structural similarity to solving dense linear systems and matrix multiplication. However, if sparsity of the input graph is not properly exploited, Floyd-Warshall will perform asymptotically unnecessary work and thus may not be a suitable choice for many input graphs. To overcome this limitation, the key idea in our approach is to use the known algebraic relationship between Floyd-Warshall and Gaussian elimination, and import several algorithmic techniques from sparse Cholesky factorization, namely, fill-in reducing ordering, symbolic analysis, supernodal traversal, and elimination tree parallelism. When combined, these techniques reduce computation, improve locality, and enhance parallelism. We implement these ideas in an efficient shared-memory parallel prototype that is orders of magnitude faster than an efficient multi-threaded baseline Floyd-Warshall that does not exploit sparsity. Our experiments suggest that the Floyd-Warshall algorithm can compete with Dijkstra’s algorithm (the algorithmic core of Johnson’s algorithm) for several classes of sparse graphs.
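The relationship to Gaussian elimination mentioned in this abstract can be illustrated with a minimal sketch. It is a plain dense Floyd-Warshall, not the supernodal algorithm of the paper; the function name `floyd_warshall` and the list-of-lists input layout are assumptions made for illustration. Each pivot `k` updates every entry with `d[i][k] + d[k][j]`, which mirrors the elimination update `a[i][j] -= a[i][k] * a[k][j] / a[k][k]` with multiplication replaced by addition and subtraction replaced by `min`. Skipping infinite `d[i][k]` entries is only the simplest per-entry use of sparsity.

```python
import math


def floyd_warshall(dist):
    """Return all-pairs shortest path lengths (hypothetical helper).

    dist[i][j] is the weight of edge i -> j, math.inf if the edge is
    absent, and 0 on the diagonal. Negative cycles are assumed absent.
    """
    n = len(dist)
    d = [row[:] for row in dist]  # work on a copy
    for k in range(n):  # pivot vertex, like a pivot column in elimination
        row_k = d[k]
        for i in range(n):
            d_ik = d[i][k]
            if d_ik == math.inf:
                continue  # no path i -> k yet, so pivot k cannot improve row i
            row_i = d[i]
            for j in range(n):
                cand = d_ik + row_k[j]  # path i -> k -> j
                if cand < row_i[j]:
                    row_i[j] = cand
    return d
```

On a graph with few edges, most `d[i][k]` entries start out infinite, so the skip avoids some work; the techniques listed in the abstract go much further by choosing the vertex order and grouping pivots.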
Full Text Available
Scalable Knowledge Graph Analytics at 136 Petaflop/s

https://doi.org/10.1109/SC41405.2020.00010

Kannan, Ramakrishnan; Sao, Piyush; Lu, Hao; Herrmannova, Drahomira; Thakkar, Vijay; Patton, Robert; Vuduc, Richard; Potok, Thomas (November 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis)
We are motivated by newly proposed methods for data mining large-scale corpora of scholarly publications, such as the full biomedical literature, which may consist of tens of millions of papers spanning decades of research. In this setting, analysts seek to discover how concepts relate to one another. They construct graph representations from annotated text databases and then formulate the relationship-mining problem as one of computing all-pairs shortest paths (APSP), which becomes a significant bottleneck. In this context, we present a new high-performance algorithm and implementation of the Floyd-Warshall algorithm for distributed-memory parallel computers accelerated by GPUs, which we call DSNAPSHOT (Distributed Accelerated Semiring All-Pairs Shortest Path). For our largest experiments, we ran DSNAPSHOT on a connected input graph with millions of vertices using 4,096 nodes (24,576 GPUs) of the Oak Ridge National Laboratory's Summit supercomputer system. We find DSNAPSHOT achieves a sustained performance of 136×10^15 floating-point operations per second (136 petaflop/s) at a parallel efficiency of 90% under weak scaling and, in absolute speed, 70% of the best possible performance given our computation (in the single-precision tropical semiring or “min-plus” algebra). Looking forward, we believe this novel capability will enable the mining of scholarly knowledge corpora when embedded and integrated into artificial intelligence-driven natural language processing workflows at scale.
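The tropical semiring ("min-plus" algebra) named in this abstract can be sketched as a matrix product in which `min` plays the role of addition and `+` plays the role of multiplication. The function name `min_plus_matmul` and the list-of-lists layout below are assumptions for illustration; this is not the DSNAPSHOT kernel, only the algebraic operation it is built around.

```python
import math


def min_plus_matmul(a, b):
    """Min-plus product: c[i][j] = min over k of a[i][k] + b[k][j].

    Hypothetical helper; a is n x m and b is m x p, as lists of lists.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[math.inf] * p for _ in range(n)]  # inf is the additive identity of min
    for i in range(n):
        for k in range(m):
            a_ik = a[i][k]
            for j in range(p):
                cand = a_ik + b[k][j]
                if cand < c[i][j]:
                    c[i][j] = cand
    return c
```

The loop nest has the same shape as an ordinary matrix multiplication, which is why each `+` and each `min` can be counted like a floating-point operation when reporting throughput.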
Full Text Available
Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters

https://doi.org/10.1145/3431379.3460651

Sao, Piyush; Lu, Hao; Kannan, Ramakrishnan; Thakkar, Vijay; Vuduc, Richard; Potok, Thomas (June 2020, HPDC '21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing)
We present an optimized Floyd-Warshall algorithm that computes all-pairs shortest paths (APSP) on GPU-accelerated clusters. The Floyd-Warshall algorithm, due to its structural similarities to matrix multiplication, is well suited for highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and addressing limited GPU memory. To reduce high communication costs, we redesign the parallel algorithm to (a) expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) use tailored MPI collectives. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to the GPU on demand. The proposed optimizations are supported with detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOP/s on 256 nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.
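The structural similarity to matrix multiplication mentioned in this abstract shows up most clearly in the textbook blocked formulation of Floyd-Warshall, sketched below on a single machine. This is not the distributed GPU implementation from the paper; the names `blocked_floyd_warshall` and `_relax` and the `block` parameter are assumptions for illustration. For each diagonal block the sketch closes that block, updates its row and column panels, and then updates every remaining block from the two panels. The last phase is a min-plus matrix product, and it dominates the work.

```python
import math


def _relax(d, rows, cols, pivots):
    """Apply d[i][j] = min(d[i][j], d[i][k] + d[k][j]) over the given index ranges."""
    for k in pivots:
        row_k = d[k]
        for i in rows:
            d_ik = d[i][k]
            row_i = d[i]
            for j in cols:
                cand = d_ik + row_k[j]
                if cand < row_i[j]:
                    row_i[j] = cand


def blocked_floyd_warshall(dist, block):
    """Blocked all-pairs shortest paths (hypothetical helper).

    dist uses math.inf for missing edges and 0 on the diagonal;
    negative cycles are assumed absent.
    """
    n = len(dist)
    d = [row[:] for row in dist]
    blocks = [range(s, min(s + block, n)) for s in range(0, n, block)]
    for kb, K in enumerate(blocks):
        _relax(d, K, K, K)  # phase 1: close the diagonal block
        for b, B in enumerate(blocks):
            if b != kb:
                _relax(d, K, B, K)  # phase 2a: row panel
                _relax(d, B, K, K)  # phase 2b: column panel
        for ib, I in enumerate(blocks):
            if ib == kb:
                continue
            for jb, J in enumerate(blocks):
                if jb != kb:
                    _relax(d, I, J, K)  # phase 3: min-plus outer update
    return d
```

In phase 3 the blocks are independent of one another, which is the source of parallelism that GPU and multi-node implementations typically exploit.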
Full Text Available
A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

https://doi.org/10.1016/j.jpdc.2019.03.004

Sao, Piyush; Li, Xiaoye S.; Vuduc, Richard (September 2019, Journal of Parallel and Distributed Computing)

Full Text Available
A communication-avoiding 3D sparse triangular solver

https://doi.org/10.1145/3330345.3330357

Sao, Piyush; Kannan, Ramakrishnan; Li, Xiaoye Sherry; Vuduc, Richard (January 2019, ACM International Conference on Supercomputing)

Full Text Available
A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices

Sao, Piyush; Li, Xiaoye; Vuduc, Richard (May 2018, Proceedings - IEEE International Parallel and Distributed Processing Symposium)

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D sparse LU algorithm uses a three-dimensional MPI process grid, aggressively exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., from 2D grid or mesh domains) and certain non-planar graphs (specifically for 3D grids and meshes). For planar graphs with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of O(sqrt(log n)) and latency by a factor of O(log n). For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by O(n^(1/3)) times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30.
Full Text Available