Sparse matrix-matrix multiplication (SpMM) is a critical computational kernel in numerous scientific and machine learning applications. SpMM involves massive irregular memory accesses and poses great challenges to conventional cache-based computer architectures. Recently dedicated SpMM accelerators have been proposed to enhance SpMM performance. However, current SpMM accelerators still face challenges in adapting to varied sparse patterns, fully exploiting inherent parallelism, and optimizing cache performance. To address these issues, we introduce ACES, a novel SpMM accelerator in this study. First, ACES features an adaptive execution flow that dynamically adjusts to diverse sparse patterns. The adaptive execution flow balances parallel computing efficiency and data reuse. Second, ACES incorporates locality-concurrency co-optimizations within the global cache. ACES utilizes a concurrency-aware cache management policy, which considers data locality and concurrency for optimal replacement decisions. Additionally, the integration of a non-blocking buffer with the global cache enhances concurrency and reduces computational stalls. Third, the hardware architecture of ACES is designed to integrate all innovations. The architecture ensures efficient support across the adaptive execution flow, advanced cache optimizations, and fine-grained parallel processing. Our performance evaluation demonstrates that ACES significantly outperforms existing solutions, providing a 2.1× speedup and marking a substantial advancement in SpMM acceleration.
more »
« less
Rethinking Tiling and Dataflow for SpMM Acceleration: A Graph Transformation Framework
Sparse Matrix Dense Matrix Multiplication (SpMM) is a fundamental computation kernel across various domains, including scientific computing, machine learning, and graph processing. Despite extensive research, existing approaches optimize SpMM using loop transformations and linear algebra principles, which (1) poorly handle unstructured sparsity patterns, (2) rely on empirical methods to explore data reuse opportunities, and (3) enforce rigid coordinate alignment, compromising data locality. In this paper, we demonstrate that these limitations stem from the fundamental matrix representation and traditional dataflows of SpMM (e.g., inner-product, outer-product, and Gustavson). We propose Aquila, a graph transformation framework that reformulates SpMM computations as a graph optimization problem, leveraging graph theory to reinterpret tiling and dataflow. First, on the theoretical side, we introduce vertex decomposition and adaptive depth traversal (ADT) to enable non-contiguous tiling, where nonzero elements from discontinuous rows and columns are clustered by connectivity rather than following matrix dimensionality. This approach quantifies data reuse and improves data locality beyond traditional loop transformations while maintaining output equivalence. Second, on the algorithm side, we develop a pull-after-push (PaP) dataflow that simultaneously enhances the dense matrix data reuse while eliminating synchronization issues in output matrix accumulation. Third, building on our theoretical approach and dataflow, we present a versatile accelerator architecture that handles a variety of SpMM kernels with diverse data sizes and sparsity patterns in a unified architecture. Additionally, we introduce a bidirectional fiber tree (BFT) format to support the proposed graph-oriented dataflow in contrast to traditional column or row-major access. Evaluation across diverse sparse datasets shows Aquila achieves speedups of 4.3 ×, 3.4 ×, 3.7 ×, 2.9 ×, and 2.7 × in execution time and up to 4.8 × improvements in energy efficiency compared to state-of-the-art accelerators.
more »
« less
- Award ID(s):
- 2441973
- PAR ID:
- 10668333
- Publisher / Repository:
- ACM
- Date Published:
- Page Range / eLocation ID:
- 1535 to 1548
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Sparse matrix dense matrix multiplication (SpMM) is commonly used in applications ranging from scientific computing to graph neural networks. Typically, when SpMM is executed in a distributed platform, communication costs dominate. Such costs depend on how communication is scheduled. If it is scheduled in a sparsity-unaware manner, such as with collectives, execution is often inefficient due to unnecessary data transfers. On the other hand, if communication is scheduled in a fine-grained sparsity-aware manner, communicating only the necessary data, execution can also be inefficient due to high software overhead. We observe that individual sparse matrices often contain regions that are denser and regions that are sparser. Based on this observation, we develop a model that partitions communication into sparsity-unaware and sparsity-aware components. Leveraging the partition, we develop a new algorithm that performs collective communication for the denser regions, and fine-grained, one-sided communication for the sparser regions. We call the algorithm Two-Face. We show that Two-Face attains an average speedup of 2.11x over prior work when evaluated on a 4096-core supercomputer. Additionally, Two-Face scales well with the machine size.more » « less
-
An important sparse tensor computation is sparse-tensor-dense-matrix multiplication (SpTM), which is used in tensor decomposition and applications. SpTM is a multi-dimensional analog to sparse-matrix-dense-matrix multiplication (SpMM). In this article, we employ a hierarchical tensor data layout that can unfold a multidimensional tensor to derive a 2D matrix, making it possible to compute SpTM using SpMM kernel implementations for GPUs. We compare two SpMM implementations to the state-of-the-art PASTA sparse tensor contraction implementation using: (1) SpMM with hierarchical tensor data layout; and, (2) unfolding followed by an invocation of cuSPARSE’s SpMM. Results show that SpMM can outperform PASTA 70.9% of the time, but none of the three approaches is best overall. Therefore, we use a decision tree classifier to identify the best performing sparse tensor contraction kernel based on precomputed properties of the sparse tensor.more » « less
-
We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory access pattern that allows efficient access into both input and output matrices that is crucial to getting excellent performance on SpMM. By combining these two ingredients---(i) merge-based load-balancing and (ii) row-major coalesced memory access---we demonstrate a 4.1x peak speedup and a 31.7% geomean speedup over state-of-the-art SpMM implementations on real-world datasets.more » « less
-
We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored for TS SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by incorporating both local and remote computations. On average, our TSSpGEMM algorithm attains 5x performance gains over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 Nodes (or 65,536 cores) on NERSC Perlmutter.more » « less
An official website of the United States government

