

Search for: All records
Award ID contains: 1629657

Note: When you click a Digital Object Identifier (DOI) link, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available without charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. High-performance implementations of graph algorithms are challenging to develop on new parallel hardware such as GPUs for three reasons: (1) the difficulty of devising graph building blocks, (2) load imbalance on parallel hardware, and (3) the low arithmetic intensity of graph problems. To address some of these challenges, GraphBLAS is an innovative, ongoing effort by the graph analytics community to propose building blocks based on sparse linear algebra, which allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output, in a single vectorized computation, they do not want computed. Load balancing distributes work evenly among parallel workers; we describe the load-balancing features needed to handle graphs with different characteristics. The design principles described in this paper have been implemented in GraphBLAST, the first open-source, high-performance, linear-algebra-based graph framework on NVIDIA GPUs. The results show that on a single GPU, GraphBLAST has on average at least an order-of-magnitude speedup over the previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and the shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.
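    Both of these principles surface directly in the GraphBLAS interface: input sparsity is the sparsity of the operand vector, from which the backend can choose a push (SpMSpV-style) or pull (SpMV-style) traversal, while output sparsity is expressed through the mask argument. Below is a minimal sketch of one masked frontier-advance step in the GraphBLAS C API; the handles next, visited, frontier, and A are hypothetical and assumed to be set up elsewhere.

        // One frontier-advance step: next<!visited> = frontier ||.&& A.
        // The complemented mask tells the backend which outputs NOT to
        // compute (output sparsity); the backend can inspect nnz(frontier)
        // to choose a push or pull traversal (input sparsity), so the user
        // writes no direction-specific code.
        GrB_vxm(next, visited, GrB_NULL, GrB_LOR_LAND_SEMIRING_BOOL,
                frontier, A, GrB_DESC_RC);  // GrB_DESC_RC: replace output, complement mask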
  2. This work addresses the 2019 Sparse Deep Neural Network Graph Challenge with an implementation using the GraphBLAS programming model. We demonstrate our solution with GraphBLAST, a GraphBLAS implementation on the GPU, and compare it to SuiteSparse, a GraphBLAS implementation on the CPU. The GraphBLAST implementation is 1.94× faster than SuiteSparse; the primary opportunity to increase performance on the GPU is a higher-performance sparse-matrix-times-sparse-matrix (SpGEMM) kernel.
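    As a concrete illustration (a sketch in the GraphBLAS C API, not the paper's exact code), one inference layer of the challenge, Y = ReLU(Y W + b) with values capped at 32, reduces to a handful of calls; the GrB_mxm is the SpGEMM kernel the abstract identifies as the performance bottleneck. Y, W, and bias are hypothetical handles, and the in-place use of Y assumes an implementation that supports input/output aliasing, as SuiteSparse:GraphBLAS does.

        // One sparse-DNN inference layer: Y = min(max(Y * W + bias, 0), 32).
        // Y and W are sparse GrB_FP32 matrices; bias is a per-layer scalar.
        GrB_mxm(Y, GrB_NULL, GrB_NULL, GrB_PLUS_TIMES_SEMIRING_FP32,
                Y, W, GrB_NULL);                                          // SpGEMM: the hot kernel
        GrB_apply(Y, GrB_NULL, GrB_NULL, GrB_PLUS_FP32, Y, bias, GrB_NULL);     // add bias to stored entries
        GrB_select(Y, GrB_NULL, GrB_NULL, GrB_VALUEGT_FP32, Y, 0.0f, GrB_NULL); // ReLU: keep positive entries (v2.0 API)
        GrB_apply(Y, GrB_NULL, GrB_NULL, GrB_MIN_FP32, Y, 32.0f, GrB_NULL);     // cap values at 32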
  3. In this paper, we propose a novel method, GSM, to compute graph matching (subgraph isomorphism) on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based and thus can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model, and we evaluate it in runtime and memory consumption against previous state-of-the-art work. We sustain a peak traversed-edges-per-second (TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and also outperforms all existing distributed CPU implementations. This work specifically focuses on leveraging our implementation on the triangle counting problem for the Subgraph Isomorphism Graph Challenge, demonstrating a geometric-mean speedup of 3.84x over the 2018 champion.
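    For the triangle-counting focus, the standard linear-algebra formulation (commonly used by GraphBLAS-based challenge entries; the paper's GSM itself is Gunrock-based and BFS-driven) counts each triangle exactly once from the strictly lower-triangular part L of the symmetric adjacency matrix, as ntri = sum((L * L) .* L). A sketch in the GraphBLAS C API, with L (stored entries equal to 1), n, C, and ntri as hypothetical names:

        // C<L> = L * L : wedge counts, computed only where an edge (i,j)
        // closes the triangle; the mask is used structurally (GrB_DESC_S).
        GrB_Matrix C;
        GrB_Matrix_new(&C, GrB_UINT64, n, n);
        GrB_mxm(C, L, GrB_NULL, GrB_PLUS_TIMES_SEMIRING_UINT64, L, L, GrB_DESC_S);
        uint64_t ntri = 0;
        GrB_reduce(&ntri, GrB_NULL, GrB_PLUS_MONOID_UINT64, C, GrB_NULL);
        GrB_free(&C);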
  4. In 2013, we released a position paper to launch a community effort to define a common set of building blocks for constructing graph algorithms in the language of linear algebra. This led to the GraphBLAS. We released a specification for the C programming language binding to the GraphBLAS in 2017. Since that release, multiple libraries that conform to the GraphBLAS C specification have been produced. In this position paper, we launch the next phase of this ongoing community effort: a project to assemble a set of high-level graph algorithms built on top of the GraphBLAS. While many of these algorithms are well known with high-quality implementations available, they have not been assembled in one place and integrated with the GraphBLAS. We call this project the LAGraph graph algorithms project, and with this position paper we put out a call for collaborators to join us. While the initial goal is simply to assemble these algorithms into a single framework, the long-term goal is a library of production-worthy code, with the LAGraph library serving as an open-source repository of verified graph algorithms that use the GraphBLAS.
  5. We design and implement parallel graph coloring algorithms on the GPU using two different abstractions: one data-centric (Gunrock), the other linear-algebra-based (GraphBLAS). We analyze the impact of variations of a baseline independent-set algorithm on quality and runtime. We study how optimizations such as hashing, avoiding atomics, and a max-min heuristic affect performance. Our Gunrock graph coloring implementation achieves a peak 2x speed-up and a geomean speed-up of 1.3x, and produces 1.6x more colors than previous hardwired state-of-the-art implementations on real-world datasets. Our GraphBLAS implementation of Luby's algorithm produces 1.9x fewer colors than the previous state-of-the-art parallel implementation at the cost of 3x extra runtime, and 1.014x fewer colors than a greedy, sequential algorithm with a geomean speed-up of 2.6x.
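    The GraphBLAS coloring above is built on Luby's algorithm. The core of one round, in the GraphBLAS C idiom (patterned after the C API specification's maximal-independent-set example, not GraphBLAST's exact GPU code), selects every still-uncolored vertex whose random weight beats the weights of all of its uncolored neighbors; weight, max_nb, indep, and A are hypothetical handles.

        // weight : random GrB_FP32 weights, stored only on uncolored vertices.
        // max_nb[j] = max weight among j's uncolored neighbors, via the
        //             (max, first) semiring: max_nb = weight' *(max.first)* A.
        GrB_vxm(max_nb, GrB_NULL, GrB_NULL, GrB_MAX_FIRST_SEMIRING_FP32,
                weight, A, GrB_NULL);
        // indep = (weight > max_nb) on the intersection of the two patterns;
        // these local maxima form an independent set and all receive the
        // current color. (A full implementation must also add vertices with
        // no surviving neighbor and drop the resulting false entries.)
        GrB_eWiseMult(indep, GrB_NULL, GrB_NULL, GrB_GT_FP32,
                      weight, max_nb, GrB_NULL);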
  6. We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and on load balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory-access pattern that allows efficient access into both input and output matrices, which is crucial to getting excellent SpMM performance. By combining these two ingredients, (i) merge-based load balancing and (ii) row-major coalesced memory access, we demonstrate a 4.1x peak speedup and a 31.7% geomean speedup over state-of-the-art SpMM implementations on real-world datasets.
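    The row-major coalesced-access ingredient is easiest to see in a reference CSR SpMM written in plain C: the inner loop streams across a contiguous row of B, which is the access pattern that maps to coalesced loads when parallelized across GPU threads. This is a sequential sketch of the access pattern, not the paper's GPU kernel.

        // Reference CSR SpMM: C = A * B, with A sparse (m x k, CSR) and
        // B (k x n), C (m x n) dense row-major.
        void spmm_csr(int m, int n, const int *row_ptr, const int *col_idx,
                      const float *val, const float *B, float *C)
        {
            for (int i = 0; i < m; ++i) {
                for (int j = 0; j < n; ++j) C[(size_t)i*n + j] = 0.0f;
                for (int e = row_ptr[i]; e < row_ptr[i+1]; ++e) {
                    float a = val[e];
                    const float *Brow = B + (size_t)col_idx[e] * n;
                    for (int j = 0; j < n; ++j)
                        C[(size_t)i*n + j] += a * Brow[j];  // contiguous, row-major access
                }
            }
        }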
  7. We factor Beamer's push-pull, also known as direction-optimized breadth-first search (DOBFS), into three separable optimizations and analyze them for generalizability, asymptotic speedup, and contribution to overall speedup. We demonstrate that masking is critical for high performance and can be generalized to all graph algorithms where the sparsity pattern of the output is known a priori. We show that these graph algorithm optimizations, which together constitute DOBFS, can be neatly and separably described using linear algebra and can be expressed in the GraphBLAS linear-algebra-based framework. We provide experimental evidence that, with these optimizations, a DOBFS expressed in a linear-algebra-based graph framework attains competitive performance with state-of-the-art graph frameworks on the GPU and on a multi-threaded CPU, achieving 101 GTEPS on a Scale 22 RMAT graph.
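    The masking the paper highlights maps directly onto the GraphBLAS mask argument. Below is a minimal level-synchronous BFS in the GraphBLAS C API, close to the C specification's canonical example: the complemented mask on v supplies the a-priori output sparsity, and a direction-optimizing backend is free to execute each GrB_vxm as either a push or a pull step. The handle names and the boolean adjacency matrix A are assumptions for the sketch.

        #include <GraphBLAS.h>
        #include <stdbool.h>

        // BFS from source s on adjacency matrix A (n x n, GrB_BOOL);
        // on return, (*v)[i] holds the 1-based level at which i was visited.
        void bfs(GrB_Vector *v, GrB_Matrix A, GrB_Index s)
        {
            GrB_Index n;
            GrB_Matrix_nrows(&n, A);
            GrB_Vector_new(v, GrB_INT32, n);

            GrB_Vector q;                        // current frontier
            GrB_Vector_new(&q, GrB_BOOL, n);
            GrB_Vector_setElement(q, true, s);   // seed with the source

            GrB_Index nvals = 1;
            for (int32_t level = 1; nvals > 0; ++level) {
                // v<q> = level : record levels of newly visited vertices
                GrB_assign(*v, q, GrB_NULL, level, GrB_ALL, n, GrB_NULL);
                // q<!v> = q ||.&& A : advance, masking out visited vertices
                GrB_vxm(q, *v, GrB_NULL, GrB_LOR_LAND_SEMIRING_BOOL,
                        q, A, GrB_DESC_RC);
                GrB_Vector_nvals(&nvals, q);
            }
            GrB_free(&q);
        }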