Title: A supernodal all-pairs shortest path algorithm

We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (APSP) problem. Floyd-Warshall is an attractive choice for APSP on high-performing systems due to its structural similarity to solving dense linear systems and matrix multiplication. However, if the sparsity of the input graph is not properly exploited, Floyd-Warshall performs asymptotically more work than necessary and thus may not be a suitable choice for many input graphs. To overcome this limitation, the key idea in our approach is to use the known algebraic relationship between Floyd-Warshall and Gaussian elimination, and import several algorithmic techniques from sparse Cholesky factorization, namely, fill-in reducing ordering, symbolic analysis, supernodal traversal, and elimination tree parallelism. When combined, these techniques reduce computation, improve locality, and enhance parallelism. We implement these ideas in an efficient shared-memory parallel prototype that is orders of magnitude faster than an efficient multi-threaded baseline Floyd-Warshall that does not exploit sparsity. Our experiments suggest that the Floyd-Warshall algorithm can compete with Dijkstra’s algorithm (the algorithmic core of Johnson’s algorithm) for several classes of sparse graphs.
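The link between pivot ordering and wasted work can be illustrated with a small sketch. This is not the supernodal algorithm from the paper; it is plain Floyd-Warshall that skips infinite entries and pivots in a caller-supplied order, with hypothetical helper names (`star_graph`, `floyd_warshall_ordered`) chosen only for this example. On a star graph, pivoting on the leaves first creates no fill-in, while pivoting on the center first makes the matrix dense immediately.

```python
import math

INF = math.inf


def star_graph(n):
    """Distance matrix of an undirected star: vertex 0 is the center,
    vertices 1..n-1 are leaves, every edge has weight 1."""
    dist = [[INF] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0
    for leaf in range(1, n):
        dist[0][leaf] = 1
        dist[leaf][0] = 1
    return dist


def floyd_warshall_ordered(dist, order):
    """Run Floyd-Warshall in place, pivoting on the vertices in `order`.

    Rows and columns with no finite path to or from the pivot are skipped,
    which is the min-plus analogue of skipping zeros in sparse elimination.
    Returns the number of relaxations attempted (a rough work count).
    Assumes a zero diagonal and no negative cycles.
    """
    n = len(dist)
    work = 0
    for k in order:
        rows = [i for i in range(n) if dist[i][k] < INF]
        cols = [j for j in range(n) if dist[k][j] < INF]
        for i in rows:
            d_ik = dist[i][k]
            for j in cols:
                work += 1
                cand = d_ik + dist[k][j]
                if cand < dist[i][j]:
                    dist[i][j] = cand  # a new finite entry here is "fill-in"
    return work


n = 6
center_first = star_graph(n)
work_center = floyd_warshall_ordered(center_first, [0, 1, 2, 3, 4, 5])

leaf_first = star_graph(n)
work_leaf = floyd_warshall_ordered(leaf_first, [1, 2, 3, 4, 5, 0])

# Both orderings give the same distances; only the work differs.
print(work_center, work_leaf)  # 216 56
```

Floyd-Warshall is correct for any pivot order, so the two runs agree on every distance. The leaf-first order does roughly n^2 relaxations on this graph, against n^3 for the center-first order, which is the effect a fill-in reducing ordering aims for.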

Sao, Piyush; Lu, Hao; Kannan, Ramakrishnan; Thakkar, Vijay; Vuduc, Richard; Potok, Thomas (HPDC '21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing)

We present an optimized Floyd-Warshall algorithm that computes the all-pairs shortest paths (APSP) on GPU-accelerated clusters. The Floyd-Warshall algorithm, due to its structural similarities to matrix multiplication, is well suited to highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and coping with limited GPU memory. To reduce communication costs, we redesign the parallel Floyd-Warshall algorithm to (a) expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) use tailored MPI collectives. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to the GPU on demand. The proposed optimizations are supported by detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOP/s on 256 nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.
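The similarity to matrix multiplication is easiest to see in the standard blocked formulation of Floyd-Warshall, sketched below. This is a single-process, pure-Python illustration under the assumption of a zero diagonal and no negative cycles; it is not the GPU or MPI implementation described in the abstract, and the names `_relax_block` and `blocked_floyd_warshall` are hypothetical. Phase 3, which dominates the work, is a min-plus matrix product on independent blocks.

```python
import math

INF = math.inf


def _relax_block(dist, i0, j0, k0, b, n):
    """Update block (i0, j0) through pivots k0 .. k0+b-1:
    C = min(C, A (min,+) B), with the pivot loop outermost."""
    for k in range(k0, min(k0 + b, n)):
        for i in range(i0, min(i0 + b, n)):
            d_ik = dist[i][k]
            if d_ik == INF:
                continue
            for j in range(j0, min(j0 + b, n)):
                cand = d_ik + dist[k][j]
                if cand < dist[i][j]:
                    dist[i][j] = cand


def blocked_floyd_warshall(dist, b):
    """Blocked Floyd-Warshall, in place, with block size b."""
    n = len(dist)
    starts = range(0, n, b)
    for k0 in starts:
        # Phase 1: Floyd-Warshall on the diagonal (pivot) block.
        _relax_block(dist, k0, k0, k0, b, n)
        # Phase 2: update the pivot block row and the pivot block column.
        for j0 in starts:
            if j0 != k0:
                _relax_block(dist, k0, j0, k0, b, n)
        for i0 in starts:
            if i0 != k0:
                _relax_block(dist, i0, k0, k0, b, n)
        # Phase 3: min-plus "outer product" update of all remaining blocks.
        # These block updates are independent of one another.
        for i0 in starts:
            if i0 == k0:
                continue
            for j0 in starts:
                if j0 != k0:
                    _relax_block(dist, i0, j0, k0, b, n)
    return dist


# Small directed example: a 5-vertex cycle 0 -> 1 -> 2 -> 3 -> 4 -> 0.
n = 5
d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
for v in range(n):
    d[v][(v + 1) % n] = 1
blocked_floyd_warshall(d, 2)
print(d[0][4], d[4][3])  # 4 4
```

Because the phase 3 blocks do not depend on each other within one pivot step, they are the natural unit to distribute across threads, GPUs, or nodes.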

Chuzhoy, Julia (Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023)

Expander graphs play a central role in graph theory and algorithms. With a number of powerful algorithmic tools developed around them, such as the Cut-Matching game, expander pruning, expander decomposition, and algorithms for decremental All-Pairs Shortest Paths (APSP) in expanders, to name just a few, the use of expanders in the design of graph algorithms has become ubiquitous. Specific applications of interest to us are fast deterministic algorithms for cut problems in static graphs, and algorithms for dynamic distance-based graph problems, such as APSP. Unfortunately, the use of expanders in these settings incurs a number of drawbacks. For example, the best currently known algorithm for decremental APSP in constant-degree expanders can only achieve a (log n)^{O(1/ε^2)}-approximation with n^{1+O(ε)} total update time for any ε. All currently known algorithms for the Cut Player in the Cut-Matching game are either randomized, or provide rather weak guarantees: expansion 1/(log n)^{1/ε} with running time n^{1+O(ε)}. This, in turn, leads to somewhat weak algorithmic guarantees for several central cut problems: the best current almost linear time deterministic algorithms for Sparsest Cut, Lowest Conductance Cut, and Balanced Cut can only achieve approximation factor (log n)^{ω(1)}. Lastly, when relying on expanders in distance-based problems, such as dynamic APSP, via current methods, it seems inevitable that one has to settle for approximation factors that are at least Ω(log n). In contrast, we do not have any negative results that rule out a factor-5 approximation with near-linear total update time. In this paper we propose the use of well-connected graphs, and introduce a new algorithmic toolkit for such graphs that, in a sense, mirrors the above mentioned algorithmic tools for expanders. One of these new tools is the Distanced Matching game, an analogue of the Cut-Matching game for well-connected graphs.
We demonstrate the power of these new tools by obtaining better results for several of the problems mentioned above. First, we design an algorithm for decremental APSP in expanders with significantly better guarantees: in a constant-degree expander, the algorithm achieves a (log n)^{1+o(1)}-approximation, with total update time n^{1+o(1)}. We also obtain a deterministic algorithm for the Cut Player in the Cut-Matching game that achieves expansion 1/(log n)^{5+o(1)} in time n^{1+o(1)}, deterministic almost linear-time algorithms for Sparsest Cut, Lowest-Conductance Cut, and Minimum Balanced Cut with approximation factors O(poly log n), as well as an improved deterministic algorithm for Expander Decomposition. We believe that the use of well-connected graphs instead of expanders in various dynamic distance-based problems (such as APSP in general graphs) has the potential of providing much stronger guarantees, since we are no longer necessarily restricted to superlogarithmic approximation factors.

Schoeneman, Frank; Zola, Jaroslaw (Proceedings of the 48th International Conference on Parallel Processing)

Algorithms for computing All-Pairs Shortest-Paths (APSP) are critical building blocks underlying many practical applications. The standard sequential algorithms, such as Floyd-Warshall and Johnson, quickly become infeasible for large input graphs, necessitating parallel approaches. In this work, we propose, implement and thoroughly analyse different strategies for APSP on distributed memory clusters with Apache Spark. Our solvers are designed for large undirected weighted graphs, and differ in complexity and degree of reliance on techniques outside of pure Spark API. We demonstrate that the best performing solver is able to handle APSP problems with over 200,000 vertices on a 1024-core cluster. However, it requires auxiliary shared persistent storage to compensate for missing Spark functionality.

Motivated by the increasing need to understand the distributed algorithmic foundations of large-scale graph computations, we study some fundamental graph problems in a message-passing model for distributed computing where k ≥ 2 machines jointly perform computations on graphs with n nodes (typically, n >> k). The input graph is assumed to be initially randomly partitioned among the k machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication rounds of the computation. Our main contribution is the General Lower Bound Theorem, a theorem that can be used to show non-trivial lower bounds on the round complexity of distributed large-scale data computations. This result is established via an information-theoretic approach that relates the round complexity to the minimal amount of information required by machines to solve the problem. Our approach is generic, and this theorem can be used in a “cookbook” fashion to show distributed lower bounds for several problems, including non-graph problems. We present two applications by showing (almost) tight lower bounds on the round complexity of two fundamental graph problems, namely, PageRank computation and triangle enumeration. These applications show that our approach can yield lower bounds for problems where the application of communication complexity techniques seems not obvious or gives weak bounds, including and especially under a stochastic partition of the input. We then present distributed algorithms for PageRank and triangle enumeration with a round complexity that (almost) matches the respective lower bounds; these algorithms exhibit a round complexity that scales superlinearly in k, improving significantly over previous results [Klauck et al., SODA 2015].
Specifically, we show the following results. PageRank: We show a lower bound of Ω̃(n/k^2) rounds and present a distributed algorithm that computes an approximation of the PageRank of all the nodes of a graph in Õ(n/k^2) rounds. Triangle enumeration: We show that there exist graphs with m edges where any distributed algorithm requires Ω̃(m/k^{5/3}) rounds. This result also implies the first non-trivial lower bound of Ω̃(n^{1/3}) rounds for the congested clique model, which is tight up to logarithmic factors. We then present a distributed algorithm that enumerates all the triangles of a graph in Õ(m/k^{5/3} + n/k^{4/3}) rounds.

Ge, Lulu; Payani, Ali; Latapie, Hugo; Parhi, Keshab K. (2023 57th Asilomar Conference on Signals, Systems, and Computers)

Hyperdimensional computing (HDC) has drawn significant attention because its performance is comparable to that of traditional machine learning techniques. HDC classifiers achieve high parallelism, consume less power, and are well-suited for edge applications. Encoding approaches such as record-based encoding and N-gram-based encoding have been used to generate features from input signals and images. These features are mapped to hypervectors and are input to HDC classifiers. This paper considers the group-based classification of graphs constructed from time series. The graph is encoded to a hypervector and the graph hypervectors are used to train the HDC classifier. This paper applies HDC to brain graph classification using fMRI data. Both the record-based encoding and GrapHD encoding are explored. Experimental results show that: 1) among HDC encoding approaches, GrapHD encoding can achieve comparable classification performance while requiring significantly less memory than record-based encoding; and 2) exploiting sparsity can achieve higher performance than using fully connected brain graphs. Both a threshold strategy and the minimum redundancy maximum relevance (mRMR) algorithm are employed to generate sub-graphs, where mRMR achieves higher performance for three binary classification problems: emotion vs. gambling, emotion vs. no-task, and gambling vs. no-task. The corresponding AUCs are 0.87, 0.88, and 0.88, respectively.

Sao, Piyush, Kannan, Ramakrishnan, Gera, Prasun, and Vuduc, Richard. A supernodal all-pairs shortest path algorithm. Retrieved from https://par.nsf.gov/biblio/10204486. Symposium on Principles and Practice of Parallel Programming. Web. doi:10.1145/3332466.3374533.

Sao, Piyush, Kannan, Ramakrishnan, Gera, Prasun, & Vuduc, Richard. A supernodal all-pairs shortest path algorithm. Symposium on Principles and Practice of Parallel Programming. Retrieved from https://par.nsf.gov/biblio/10204486. https://doi.org/10.1145/3332466.3374533

@article{osti_10204486,
place = {Country unknown/Code not available},
title = {A supernodal all-pairs shortest path algorithm},
url = {https://par.nsf.gov/biblio/10204486},
DOI = {10.1145/3332466.3374533},
abstractNote = {We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (APSP) problem. Floyd-Warshall is an attractive choice for APSP on high-performing systems due to its structural similarity to solving dense linear systems and matrix multiplication. However, if the sparsity of the input graph is not properly exploited, Floyd-Warshall performs asymptotically more work than necessary and thus may not be a suitable choice for many input graphs. To overcome this limitation, the key idea in our approach is to use the known algebraic relationship between Floyd-Warshall and Gaussian elimination, and import several algorithmic techniques from sparse Cholesky factorization, namely, fill-in reducing ordering, symbolic analysis, supernodal traversal, and elimination tree parallelism. When combined, these techniques reduce computation, improve locality, and enhance parallelism. We implement these ideas in an efficient shared-memory parallel prototype that is orders of magnitude faster than an efficient multi-threaded baseline Floyd-Warshall that does not exploit sparsity. Our experiments suggest that the Floyd-Warshall algorithm can compete with Dijkstra’s algorithm (the algorithmic core of Johnson’s algorithm) for several classes of sparse graphs.},
journal = {Symposium on Principles and Practice of Parallel Programming},
author = {Sao, Piyush and Kannan, Ramakrishnan and Gera, Prasun and Vuduc, Richard},
}
