skip to main content

Title: A supernodal all-pairs shortest path algorithm
We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (Apsp) problem.Floyd-Warshall is an attractive choice for APSP on high-performing systems due to its structural similarity to solving dense linear systems and matrix multiplication. However, if sparsity of the input graph is not properly exploited,Floyd-Warshall will perform unnecessary asymptotic work and thus may not be a suitable choice for many input graphs. To overcome this limitation, the key idea in our approach is to use the known algebraic relationship between Floyd-Warshall and Gaussian elimination, and import several algorithmic techniques from sparse Cholesky factorization, namely, fill-in reducing ordering, symbolic analysis, supernodal traversal, and elimination tree parallelism. When combined, these techniques reduce computation, improve locality and enhance parallelism. We implement these ideas in an efficient shared memory parallel prototype that is orders of magnitude faster than an efficient multi-threaded baseline Floyd-Warshall that does not exploit sparsity. Our experiments suggest that the Floyd-Warshall algorithm can compete with Dijkstra’s algorithm (the algorithmic core of Johnson’s algorithm) for several classes sparse graphs.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Symposium on Principles and Practice of Parallel Program-ming
Page Range / eLocation ID:
250 to 261
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    We present an optimized Floyd-Warshall (Floyd-Warshall) algorithm that computes the All-pairs shortest path (APSP) for GPU accelerated clusters. The Floyd-Warshall algorithm due to its structural similarities to matrix-multiplication is well suited for highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and addressing limited GPU memory. To reduce high communication costs, we redesign the parallel (a) to expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) tailored MPI-collective. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to GPU on-demand. The proposed optimizations are supported with detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOPS/sec on 256~nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time. 
    more » « less
  2. Expander graphs play a central role in graph theory and algorithms. With a number of powerful algorithmic tools developed around them, such as the Cut-Matching game, expander pruning, expander decomposition, and algorithms for decremental All-Pairs Shortest Paths (APSP) in expanders, to name just a few, the use of expanders in the design of graph algorithms has become ubiquitous. Specific applications of interest to us are fast deterministic algorithms for cut problems in static graphs, and algorithms for dynamic distance-based graph problems, such as APSP. Unfortunately, the use of expanders in these settings incurs a number of drawbacks. For example, the best currently known algorithm for decremental APSP in constant-degree expanders can only achieve a (log n) O(1/ 2 ) -approximation with n 1+O( ) total update time for any . All currently known algorithms for the Cut Player in the Cut-Matching game are either randomized, or provide rather weak guarantees: expansion 1/(log n) 1/ with running time n 1+O( ) . This, in turn, leads to somewhat weak algorithmic guarantees for several central cut problems: the best current almost linear time deterministic algorithms for Sparsest Cut, Lowest Conductance Cut, and Balanced Cut can only achieve approximation factor (log n) ω(1). Lastly, when relying on expanders in distancebased problems, such as dynamic APSP, via current methods, it seems inevitable that one has to settle for approximation factors that are at least Ω(log n). In contrast, we do not have any negative results that rule out a factor-5 approximation with near-linear total update time. In this paper we propose the use of well-connected graphs, and introduce a new algorithmic toolkit for such graphs that, in a sense, mirrors the above mentioned algorithmic tools for expanders. One of these new tools is the Distanced Matching game, an analogue of the Cut-Matching game for well-connected graphs. We demonstrate the power of these new tools by obtaining better results for several of the problems mentioned above. First, we design an algorithm for decremental APSP in expanders with significantly better guarantees: in a constant-degree expander, the algorithm achieves (log n) 1+o(1)-approximation, with total update time n 1+o(1). We also obtain a deterministic algorithm for the Cut Player in the Cut-Matching game that achieves expansion 1 (log n) 5+o(1) in time n 1+o(1), deterministic almost linear-time algorithms for Sparsest Cut, Lowest-Conductance Cut, and Minimum Balanced Cut with approximation factors O(poly log n), as well as improved deterministic algorithm for Expander Decomposition. We believe that the use of well-connected graphs instead of expanders in various dynamic distance-based problems (such as APSP in general graphs) has the potential of providing much stronger guarantees, since we are no longer necessarily restricted to superlogarithmic approximation factors. 
    more » « less
  3. Algorithms for computing All-Pairs Shortest-Paths (APSP) are critical building blocks underlying many practical applications. The standard sequential algorithms, such as Floyd-Warshall and Johnson, quickly become infeasible for large input graphs, necessitating parallel approaches. In this work, we propose, implement and thoroughly analyse different strategies for APSP on distributed memory clusters with Apache Spark. Our solvers are designed for large undirected weighted graphs, and differ in complexity and degree of reliance on techniques outside of pure Spark API. We demonstrate that the best performing solver is able to handle APSP problems with over 200,000 vertices on a 1024-core cluster. However, it requires auxiliary shared persistent storage to compensate for missing Spark functionality. 
    more » « less
  4. null (Ed.)
    Motivated by the increasing need to understand the distributed algorithmic foundations of large-scale graph computations, we study some fundamental graph problems in a message-passing model for distributed computing where k ≥ 2 machines jointly perform computations on graphs with n nodes (typically, n >> k). The input graph is assumed to be initially randomly partitioned among the k machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication rounds of the computation. Our main contribution is the General Lower Bound Theorem , a theorem that can be used to show non-trivial lower bounds on the round complexity of distributed large-scale data computations. This result is established via an information-theoretic approach that relates the round complexity to the minimal amount of information required by machines to solve the problem. Our approach is generic, and this theorem can be used in a “cookbook” fashion to show distributed lower bounds for several problems, including non-graph problems. We present two applications by showing (almost) tight lower bounds on the round complexity of two fundamental graph problems, namely, PageRank computation and triangle enumeration . These applications show that our approach can yield lower bounds for problems where the application of communication complexity techniques seems not obvious or gives weak bounds, including and especially under a stochastic partition of the input. We then present distributed algorithms for PageRank and triangle enumeration with a round complexity that (almost) matches the respective lower bounds; these algorithms exhibit a round complexity that scales superlinearly in k , improving significantly over previous results [Klauck et al., SODA 2015]. Specifically, we show the following results: PageRank: We show a lower bound of Ὼ(n/k 2 ) rounds and present a distributed algorithm that computes an approximation of the PageRank of all the nodes of a graph in Õ(n/k 2 ) rounds. Triangle enumeration: We show that there exist graphs with m edges where any distributed algorithm requires Ὼ(m/k 5/3 ) rounds. This result also implies the first non-trivial lower bound of Ὼ(n 1/3 ) rounds for the congested clique model, which is tight up to logarithmic factors. We then present a distributed algorithm that enumerates all the triangles of a graph in Õ(m/k 5/3 + n/k 4/3 ) rounds. 
    more » « less
  5. Abstract

    Task‐based execution of graph workloads allows various ordered and unordered implementations, with tasks representing dependencies between graph vertices and edges. This work explores graph algorithms in the context of ordered and unordered task‐based implementations, that trade‐off work‐efficiency with parallelism. The monotonicity of convergent graph solutions is the reason behind the trade‐off between work‐efficiency and parallelism. This trade‐off results in variable performance‐based choices within and across different machines (CPUs and GPUs), graph algorithms, implementations (ordered, relaxed, and unordered). Input graphs also augment this choice space, with this work analyzing temporally changing graphs in addition to the static graphs explored by prior works. These algorithmic and architectural choices are first explored in this work, and it is seen that different graph workload‐input combinations perform optimally on diverse architectural configurations. The resulting choice space is analyzed and this work represents it in the form of characteristic variables that correlate with each choice space. Using these characteristic variables, this work proposes analytical and neural network models to correlate these choice spaces to find the best performing implementation. The variables and the prediction models proposed in this work are also integrated with a state‐of‐the‐art performance predictor on a multiaccelerator setup, and shows geometric performance gains of 54% on a CPU, 14% on a GPU, and 31.5% in a multiaccelerator setup over baseline implementations without performance prediction.

    more » « less