skip to main content


Title: Optimizing Ordered Graph Algorithms with GraphIt
Many graph problems can be solved using ordered parallel graph algorithms that achieve significant speedup over their unordered counterparts by reducing redundant work. This paper introduces a new priority-based extension to GraphIt, a domain-specific language for writing graph applications, to simplify writing high-performance parallel ordered graph algorithms. The extension enables vertices to be processed in a dynamic order while hiding low-level implementation details from the user. We extend the compiler with new program analyses, transformations, and code generation to produce fast implementations of ordered parallel graph algorithms. We also introduce bucket fusion, a new performance optimization that fuses together different rounds of ordered algorithms to reduce synchronization overhead, resulting in 1.2x--3x speedup over the fastest existing ordered algorithm implementations on road networks with large diameters. With the extension, GraphIt achieves up to 3x speedup on six ordered graph algorithms over state-of-the-art frameworks and hand-optimized implementations (Julienne, Galois, and GAPBS) that support ordered algorithms.  more » « less
Award ID(s):
1845763
NSF-PAR ID:
10137101
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
International Symposium on Code Generation and Optimization
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction.Exploiting output sparsityallows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in “GraphBLAST”, the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse andGBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework ,while offering a simpler and more concise programming model. 
    more » « less
  2. null (Ed.)
    Graph processing workloads are memory intensive with irregular access patterns and large memory footprint resulting in low data locality. Their popular software implementations typically employ either Push or Pull style propagation of changes through the graph over multiple iterations that follow the Bulk Synchronous Model. The performance of these algorithms on traditional computing systems is limited by random reads/writes of vertex values, synchronization overheads, and additional overheads for tracking active sets of vertices or edges across iterations. In this paper, we present GraphPulse, a hardware framework for asynchronous graph processing with event-driven scheduling that overcomes the performance limitations of software frameworks. Event-driven computation model enables a parallel dataflow-style execution where atomic updates and active sets tracking are inherent to the model; thus, scheduling complexity is reduced and scalability is enhanced. The dataflow nature of the architecture also reduces random reads of vertex values by carrying the values in the events themselves. We capitalize on the update properties commonly present in graph algorithms to coalesce in-flight events and substantially reduce the event storage requirement and the processing overheads incurred. GraphPulse event-model naturally supports asynchronous graph processing, enabling substantially faster convergence by exploiting available parallelism, reducing work, and eliminating synchronization at iteration boundaries. The framework provides easy to use programming interface for faster development of hardware graph accelerators. A single GraphPulse accelerator achieves up to 74x speedup (28x on average) over Ligra, a state of the art software framework, running on a 12 core CPU. It also achieves an average of 6.2x speedup over Graphicionado, a state of the art graph processing accelerator. 
    more » « less
  3. We design and implement parallel graph coloring algorithms on the GPU using two different abstractions—one datacentric (Gunrock), the other linear-algebra-based (GraphBLAS). We analyze the impact of variations of a baseline independent-set algorithm on quality and runtime. We study how optimizations such as hashing, avoiding atomics, and a max-min heuristic affect performance. Our Gunrock graph coloring implementation has a peak 2x speed-up, a geomean speed-up of 1.3x and produces 1.6x more colors over previous hardwired state-of-theart implementations on real-world datasets. Our GraphBLAS implementation of Luby’s algorithm produces 1.9x fewer colors than the previous state-of-the-art parallel implementation at the cost of 3x extra runtime, and 1.014x fewer colors than a greedy, sequential algorithm with a geomean speed-up of 2.6x. 
    more » « less
  4. We design and implement parallel graph coloring algorithms on the GPU using two different abstractions—one data-centric (Gunrock), the other linear-algebra-based (GraphBLAS). We analyze the impact of variations of a baseline independent-set algorithm on quality and runtime. We study how optimizations such as hashing, avoiding atomics, and a max-min heuristic affect performance. Our Gunrock graph coloring implementation has a peak 2x speed-up, a geomean speed-up of 1.3x and produces 1.6x more colors over previous hardwired state-of-the-art implementations on real-world datasets. Our GraphBLAS implementation of Luby's algorithm produces 1.9x fewer colors than the previous state-of-the-art parallel implementation at the cost of 3x extra runtime, and 1.014x fewer colors than a greedy, sequential algorithm with a geomean speed-up of 2.6x. 
    more » « less
  5. null (Ed.)
    We develop and study FPGA implementations of algorithms for charged particle tracking based on graph neural networks. The two complementary FPGA designs are based on OpenCL, a framework for writing programs that execute across heterogeneous platforms, and hls4ml, a high-level-synthesis-based compiler for neural network to firmware conversion. We evaluate and compare the resource usage, latency, and tracking performance of our implementations based on a benchmark dataset. We find a considerable speedup over CPU-based execution is possible, potentially enabling such algorithms to be used effectively in future computing workflows and the FPGA-based Level-1 trigger at the CERN Large Hadron Collider. 
    more » « less