We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored for TS-SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by incorporating both local and remote computations. On average, our TS-SpGEMM algorithm attains a 5× speedup over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 nodes (or 65,536 cores) on NERSC Perlmutter.
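To make the TS-SpGEMM pattern concrete, the sketch below expresses multi-source BFS as repeated products of a square adjacency matrix with a tall-and-skinny frontier matrix. This is a minimal single-node SciPy illustration, not the paper's distributed algorithm; the function name and the masking strategy are our assumptions.

```python
import numpy as np
import scipy.sparse as sp

def multi_source_bfs(A: sp.csr_matrix, sources: list) -> np.ndarray:
    """Return an (n, k) matrix of BFS levels (-1 = unreached).

    A[i, j] != 0 means an edge i -> j; column j of the frontier
    tracks the search started from sources[j].
    """
    n = A.shape[0]
    k = len(sources)
    levels = -np.ones((n, k), dtype=np.int64)
    # Tall-and-skinny frontier: one column per BFS source.
    frontier = sp.csr_matrix(
        (np.ones(k), (sources, range(k))), shape=(n, k)
    )
    level = 0
    while frontier.nnz > 0:
        rows, cols = frontier.nonzero()
        levels[rows, cols] = level
        # The core TS-SpGEMM: square matrix times tall-skinny matrix.
        nxt = (A.T @ frontier).tocoo()
        # Mask out vertices already visited in each column.
        keep = levels[nxt.row, nxt.col] == -1
        frontier = sp.csr_matrix(
            (np.ones(keep.sum()), (nxt.row[keep], nxt.col[keep])),
            shape=(n, k),
        )
        level += 1
    return levels
```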
AIRES: Accelerating Out-of-Core GCNs via Algorithm-System Co-Design
Graph convolutional networks (GCNs) are fundamental in various scientific applications, ranging from biomedical protein-protein interaction (PPI) analysis to large-scale recommendation systems. An essential component for modeling graph structures in GCNs is sparse general matrix-matrix multiplication (SpGEMM). As the size of graph data continues to scale up, SpGEMMs are often conducted in an out-of-core fashion due to limited GPU memory in resource-constrained systems. Despite recent efforts to alleviate the memory constraints of out-of-core SpGEMM through GPU feature caching, hybrid CPU-GPU memory layouts, or computation in sparse format, current systems suffer from both high I/O latency and GPU under-utilization. In this paper, we first identify the bottlenecks of existing systems, where sparse-format data alignment and memory allocation limit performance, and then propose AIRES, a novel algorithm-system co-design that accelerates out-of-core SpGEMM computation for GCNs. From the algorithm angle, AIRES alleviates data alignment issues at the block level for matrices in sparse formats and develops a tiling algorithm to facilitate row block-wise alignment. On the system level, AIRES employs three-phase dynamic scheduling featuring a dual-way data transfer strategy over a tiered memory system that integrates GPU memory, GPU Direct Storage (GDS), and host memory to reduce I/O latency and improve throughput. Evaluations show that AIRES significantly outperforms state-of-the-art methods, achieving up to 1.8× lower latency on real-world graph processing benchmarks.
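To illustrate the row block-wise tiling idea, here is a minimal sketch of splitting a CSR matrix into fixed-height row blocks so each tile's row pointers, column indices, and values can be staged as one contiguous unit. This is an assumption-laden illustration, not AIRES's actual tiling algorithm; `block_rows` is a hypothetical tuning parameter.

```python
import scipy.sparse as sp

def row_block_tiles(A: sp.csr_matrix, block_rows: int):
    """Yield (row_start, tile) pairs covering A in row blocks."""
    n = A.shape[0]
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        # Row slicing a CSR matrix is cheap and keeps each tile's
        # storage contiguous, which is what makes per-tile transfers
        # and alignment tractable in an out-of-core setting.
        yield start, A[start:stop, :]
```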
- Award ID(s): 1907765
- PAR ID: 10637801
- Publisher / Repository: IEEE
- Date Published:
- Journal Name: Proceedings
- ISSN: 2160-052X
- ISBN: 979-8-3315-9552-4
- Page Range / eLocation ID: 1 to 8
- Format(s): Medium: X
- Location: Vancouver, BC, Canada
- Sponsoring Org: National Science Foundation
More Like this
Graph processing has recently received intense interest in light of a wide range of needs to understand relationships. It is well known for poor locality and high memory-bandwidth requirements. On conventional architectures, it incurs a significant amount of data movement and energy consumption, which motivates several hardware graph processing accelerators. Current graph processing accelerators rely on memory access optimizations or on placing computation logic close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations with low hardware and energy cost. Analog computation is suitable for graph processing because: 1) the algorithms are iterative and can inherently tolerate imprecision; 2) both probability calculations (e.g., PageRank and collaborative filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be performed efficiently by a ReRAM crossbar. We show that this assumption holds for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engines (GEs). The core graph computations are performed in sparse matrix format in the GEs (ReRAM crossbars). Vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain from performing parallel operations outweighs the waste due to sparsity. Experimental results show that GRAPHR achieves a 16.01× (up to 132.67×) speedup and a 33.82× energy saving on geometric mean compared to a CPU baseline system. Compared to a GPU, GRAPHR achieves a 1.69× to 2.19× speedup and consumes 4.77× to 8.91× less energy. GRAPHR gains a speedup of 1.16× to 4.12× and is 3.67× to 10.96× more energy efficient compared to a PIM-based architecture.
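As a sketch of the SpMV formulation GRAPHR relies on, one PageRank iteration can be written as a single sparse matrix-vector product, which is the pattern a ReRAM crossbar evaluates in analog. This is a plain CPU illustration under standard PageRank definitions, not the accelerator's dataflow.

```python
import numpy as np
import scipy.sparse as sp

def pagerank_step(A: sp.csr_matrix, r: np.ndarray, d: float = 0.85) -> np.ndarray:
    """One iteration r' = d * M r + (1 - d)/n, with M column-stochastic."""
    n = A.shape[0]
    out_deg = np.asarray(A.sum(axis=1)).ravel()
    out_deg[out_deg == 0] = 1.0      # avoid division by zero at sink vertices
    # Column-normalize: M[i, j] = A[j, i] / deg(j), so M r is one SpMV.
    M = A.T.multiply(1.0 / out_deg)
    return d * (M @ r) + (1.0 - d) / n
```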
Graph neural networks (GNNs) have emerged as a powerful tool for processing graph-based data in fields such as communication networks, molecular interactions, chemistry, social networks, and neuroscience. GNNs are characterized by the ultra-sparse nature of their adjacency matrices, which necessitates dedicated hardware beyond general-purpose sparse matrix multipliers. While there has been extensive research on dedicated hardware accelerators for GNNs, few works have explored the impact of the sparse storage format on the efficiency of GNN accelerators. This paper proposes SCV-GNN with the novel sparse compressed vectors (SCV) format optimized for the aggregation operation. We use Z-Morton ordering to derive a data-locality-based computation ordering and partitioning scheme, and we show how SCV-GNN scales on a vector processing system. Experimental results over various datasets show that the proposed method achieves a geometric-mean speedup of 7.96× and 7.04× over CSC and CSR aggregation operations, respectively, and reduces memory traffic by factors of 3.29× and 4.37× over compressed sparse column (CSC) and compressed sparse row (CSR) formats, respectively. Thus, the proposed aggregation format reduces both latency and memory accesses for GNN inference.
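For intuition on the Z-Morton ordering used above, the following small sketch interleaves the bits of block-row and block-column indices so that nearby tiles receive nearby keys; 16-bit coordinates are an assumption for brevity, and this is not SCV-GNN's implementation.

```python
def morton_key(row: int, col: int) -> int:
    """Interleave the bits of row and col into a single Z-order key."""
    key = 0
    for bit in range(16):
        key |= ((row >> bit) & 1) << (2 * bit + 1)
        key |= ((col >> bit) & 1) << (2 * bit)
    return key

# Tiles sorted by this key are traversed in Z-order, so consecutive
# tiles tend to touch overlapping rows/columns and reuse cached data.
tiles = [(r, c) for r in range(4) for c in range(4)]
tiles.sort(key=lambda rc: morton_key(*rc))
```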
Due to increasing data sparsity in scientific data sets and pruned neural networks, it becomes more challenging to compute efficiently with these kinds of sparse data sets. Several works discuss efficient sparse matrix-vector multiplication (SpMV). However, because of index irregularity in compactly stored matrices, sparse matrix-matrix multiplication (SpGEMM) still suffers from a trade-off between space and computational efficiency. In this work, we propose PrGEMM, a multiple-reduction scheme that (1) computes SpGEMM in a compact storage format without expanding the operands and (2), using index lookahead, computes and compares multiple index-data pairs at a time with no violation of index order. We evaluate our work with matrices of various sizes from the SuiteSparse data set. PrGEMM achieves a 3.3× improvement in execution cycles compared to the state-of-the-art SpGEMM scheme.
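The merge-style reduction PrGEMM targets can be illustrated as follows: several sorted (column index, value) streams of partial products for one output row are combined without expanding to dense, accumulating duplicate indices on the fly. This sketch models neither the lookahead width nor the hardware parallelism of the paper.

```python
import heapq

def merge_row(streams):
    """streams: list of sorted [(col, val), ...]; return the merged row."""
    merged = []
    for col, val in heapq.merge(*streams):
        if merged and merged[-1][0] == col:
            merged[-1][1] += val          # same column index: reduce in place
        else:
            merged.append([col, val])
    return [(c, v) for c, v in merged]

# Example: partial products from three rows of B contributing to one row of C.
print(merge_row([[(0, 1.0), (3, 2.0)], [(0, 0.5)], [(3, 1.5), (7, 1.0)]]))
# -> [(0, 1.5), (3, 3.5), (7, 1.0)]
```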
This work addresses the 2019 Sparse Deep Neural Network Graph Challenge with an implementation using the GraphBLAS programming model. We demonstrate our solution with GraphBLAST, a GraphBLAS implementation on the GPU, and compare it to SuiteSparse, a GraphBLAS implementation on the CPU. The GraphBLAST implementation is 1.94× faster than SuiteSparse; the primary opportunity to increase performance on the GPU is a higher-performance sparse-matrix-times-sparse-matrix (SpGEMM) kernel.
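As a sketch of the kernel in question, one Sparse DNN Challenge layer is an SpGEMM followed by a bias and a clipped ReLU. Plain SciPy stands in for GraphBLAS here; applying the bias only to stored entries is the usual sparse shortcut, valid because the challenge's bias is negative, so untouched zeros stay zero under ReLU.

```python
import numpy as np
import scipy.sparse as sp

def sparse_dnn_layer(Y: sp.csr_matrix, W: sp.csr_matrix, bias: float) -> sp.csr_matrix:
    """One layer Y' = min(ReLU(Y W + bias), 32), per the challenge spec."""
    Z = (Y @ W).tocsr()                                  # the SpGEMM kernel
    Z.data += bias                                       # bias on stored entries only
    Z.data = np.minimum(np.maximum(Z.data, 0.0), 32.0)   # ReLU with clip at 32
    Z.eliminate_zeros()                                  # keep the result sparse
    return Z
```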
