 Award ID(s):
 1710371
 NSFPAR ID:
 10204486
 Date Published:
 Journal Name:
 Symposium on Principles and Practice of Parallel Programming
 Page Range / eLocation ID:
 250 to 261
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

null (Ed.)We present an optimized FloydWarshall (FloydWarshall) algorithm that computes the Allpairs shortest path (APSP) for GPU accelerated clusters. The FloydWarshall algorithm due to its structural similarities to matrixmultiplication is well suited for highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and addressing limited GPU memory. To reduce high communication costs, we redesign the parallel (a) to expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) tailored MPIcollective. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to GPU ondemand. The proposed optimizations are supported with detailed performance models for tuning. Our optimized parallel FloydWarshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOPS/sec on 256~nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.more » « less

Expander graphs play a central role in graph theory and algorithms. With a number of powerful algorithmic tools developed around them, such as the CutMatching game, expander pruning, expander decomposition, and algorithms for decremental AllPairs Shortest Paths (APSP) in expanders, to name just a few, the use of expanders in the design of graph algorithms has become ubiquitous. Specific applications of interest to us are fast deterministic algorithms for cut problems in static graphs, and algorithms for dynamic distancebased graph problems, such as APSP. Unfortunately, the use of expanders in these settings incurs a number of drawbacks. For example, the best currently known algorithm for decremental APSP in constantdegree expanders can only achieve a (log n) O(1/ 2 ) approximation with n 1+O( ) total update time for any . All currently known algorithms for the Cut Player in the CutMatching game are either randomized, or provide rather weak guarantees: expansion 1/(log n) 1/ with running time n 1+O( ) . This, in turn, leads to somewhat weak algorithmic guarantees for several central cut problems: the best current almost linear time deterministic algorithms for Sparsest Cut, Lowest Conductance Cut, and Balanced Cut can only achieve approximation factor (log n) ω(1). Lastly, when relying on expanders in distancebased problems, such as dynamic APSP, via current methods, it seems inevitable that one has to settle for approximation factors that are at least Ω(log n). In contrast, we do not have any negative results that rule out a factor5 approximation with nearlinear total update time. In this paper we propose the use of wellconnected graphs, and introduce a new algorithmic toolkit for such graphs that, in a sense, mirrors the above mentioned algorithmic tools for expanders. One of these new tools is the Distanced Matching game, an analogue of the CutMatching game for wellconnected graphs. We demonstrate the power of these new tools by obtaining better results for several of the problems mentioned above. First, we design an algorithm for decremental APSP in expanders with significantly better guarantees: in a constantdegree expander, the algorithm achieves (log n) 1+o(1)approximation, with total update time n 1+o(1). We also obtain a deterministic algorithm for the Cut Player in the CutMatching game that achieves expansion 1 (log n) 5+o(1) in time n 1+o(1), deterministic almost lineartime algorithms for Sparsest Cut, LowestConductance Cut, and Minimum Balanced Cut with approximation factors O(poly log n), as well as improved deterministic algorithm for Expander Decomposition. We believe that the use of wellconnected graphs instead of expanders in various dynamic distancebased problems (such as APSP in general graphs) has the potential of providing much stronger guarantees, since we are no longer necessarily restricted to superlogarithmic approximation factors.more » « less

Algorithms for computing AllPairs ShortestPaths (APSP) are critical building blocks underlying many practical applications. The standard sequential algorithms, such as FloydWarshall and Johnson, quickly become infeasible for large input graphs, necessitating parallel approaches. In this work, we propose, implement and thoroughly analyse different strategies for APSP on distributed memory clusters with Apache Spark. Our solvers are designed for large undirected weighted graphs, and differ in complexity and degree of reliance on techniques outside of pure Spark API. We demonstrate that the best performing solver is able to handle APSP problems with over 200,000 vertices on a 1024core cluster. However, it requires auxiliary shared persistent storage to compensate for missing Spark functionality.more » « less

null (Ed.)Motivated by the increasing need to understand the distributed algorithmic foundations of largescale graph computations, we study some fundamental graph problems in a messagepassing model for distributed computing where k ≥ 2 machines jointly perform computations on graphs with n nodes (typically, n >> k). The input graph is assumed to be initially randomly partitioned among the k machines, a common implementation in many realworld systems. Communication is pointtopoint, and the goal is to minimize the number of communication rounds of the computation. Our main contribution is the General Lower Bound Theorem , a theorem that can be used to show nontrivial lower bounds on the round complexity of distributed largescale data computations. This result is established via an informationtheoretic approach that relates the round complexity to the minimal amount of information required by machines to solve the problem. Our approach is generic, and this theorem can be used in a “cookbook” fashion to show distributed lower bounds for several problems, including nongraph problems. We present two applications by showing (almost) tight lower bounds on the round complexity of two fundamental graph problems, namely, PageRank computation and triangle enumeration . These applications show that our approach can yield lower bounds for problems where the application of communication complexity techniques seems not obvious or gives weak bounds, including and especially under a stochastic partition of the input. We then present distributed algorithms for PageRank and triangle enumeration with a round complexity that (almost) matches the respective lower bounds; these algorithms exhibit a round complexity that scales superlinearly in k , improving significantly over previous results [Klauck et al., SODA 2015]. Specifically, we show the following results: PageRank: We show a lower bound of Ὼ(n/k 2 ) rounds and present a distributed algorithm that computes an approximation of the PageRank of all the nodes of a graph in Õ(n/k 2 ) rounds. Triangle enumeration: We show that there exist graphs with m edges where any distributed algorithm requires Ὼ(m/k 5/3 ) rounds. This result also implies the first nontrivial lower bound of Ὼ(n 1/3 ) rounds for the congested clique model, which is tight up to logarithmic factors. We then present a distributed algorithm that enumerates all the triangles of a graph in Õ(m/k 5/3 + n/k 4/3 ) rounds.more » « less

Abstract Task‐based execution of graph workloads allows various ordered and unordered implementations, with tasks representing dependencies between graph vertices and edges. This work explores graph algorithms in the context of ordered and unordered task‐based implementations, that trade‐off work‐efficiency with parallelism. The monotonicity of convergent graph solutions is the reason behind the trade‐off between work‐efficiency and parallelism. This trade‐off results in variable performance‐based choices within and across different machines (CPUs and GPUs), graph algorithms, implementations (ordered, relaxed, and unordered). Input graphs also augment this choice space, with this work analyzing temporally changing graphs in addition to the static graphs explored by prior works. These algorithmic and architectural choices are first explored in this work, and it is seen that different graph workload‐input combinations perform optimally on diverse architectural configurations. The resulting choice space is analyzed and this work represents it in the form of characteristic variables that correlate with each choice space. Using these characteristic variables, this work proposes analytical and neural network models to correlate these choice spaces to find the best performing implementation. The variables and the prediction models proposed in this work are also integrated with a state‐of‐the‐art performance predictor on a multiaccelerator setup, and shows geometric performance gains of 54% on a CPU, 14% on a GPU, and 31.5% in a multiaccelerator setup over baseline implementations without performance prediction.