We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored for TS SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by incorporating both local and remote computations. On average, our TSSpGEMM algorithm attains 5x performance gains over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 Nodes (or 65,536 cores) on NERSC Perlmutter.
more »
« less
Improving Performance and Scalability of Algebraic Multigrid through a Specialized MATVEC
Algebraic Multigrid (AMG) is an extremely popular linear system solver and/or preconditioner approach for matrices obtained from the discretization of elliptic operators. However, its performance and scalability for large systems obtained from unstructured discretizations seem less consistent than for geometric multigrid (GMG). To a large extent, this is due to loss of sparsity at the coarser grids and the resulting increased cost and poor scalability of the matrix-vector multiplication. While there have been attempts to address this concern by designing sparsification algorithms, these affect the overall convergence. In this work, we focus on designing a specialized matrix-vector multiplication (matvec) that achieves high performance and scalability for a large variation in the levels of sparsity. We evaluate distributed and shared memory implementations of our matvec operator and demonstrate the improvements to its scalability and performance in AMG hierarchy and finally, we compare it with PETSc.
more »
« less
- PAR ID:
- 10067892
- Date Published:
- Journal Name:
- IEEE High Performance Extreme Computing Conference
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Multigrid is one of the most effective methods for solving elliptic PDEs. It is algorithmically optimal and is robust when combined with Krylov methods. Algebraic multigrid is especially attractive due to its blackbox nature. This however comes at the cost of increased setup costs that can be significant in case of systems where the system matrix changes frequently making it difficult to amortize the setup cost. In this work, we investigate several strategies for performing lazy updates to the multigrid hierarchy corresponding to changes in the system matrix. These include delayed updates, value updates without changing structure, process local changes, and full updates. We demonstrate that in many cases, the overhead of building the AMG hierarchy can be mitigated for rapidly changing system matrices.more » « less
-
Abstract Problems arising in Earth's mantle convection involve finding the solution to Stokes systems with large viscosity contrasts. These systems contain localized features which, even with adaptive mesh refinement, result in linear systems that can be on the order of 109or more unknowns. One common approach for preconditioning to the velocity block of these systems is to apply an Algebraic Multigrid (AMG) V‐cycle (as is done in the ASPECT software, for example), however, we find that AMG is lacking robustness with respect to problem size and number of parallel processes. Additionally, we see an increase in iteration counts with refinement when using AMG. In contrast, the Geometric Multigrid (GMG) method, by using information about the geometry of the problem, should offer a more robust option.Here we present a matrix‐free GMG V‐cycle which works on adaptively refined, distributed meshes, and we will compare it against the current AMG preconditioner (Trilinos ML) used in theASPECT1software. We will demonstrate the robustness of GMG with respect to problem size and show scaling up to 114,688 cores and 217 billion unknowns. All computations are run using the open‐source, finite element librarydeal.II.2more » « less
-
Sparse matrices are very common types of information used in scientific and machine learning applications including deep neural networks. Sparse data representations lead to storage efficiencies by avoiding storing zero values. However, sparse representations incur metadata computational overheads – soft- ware first needs to find row/column locations of non-zero val- ues before performing necessary computations. Such metadata accesses involve indirect memory accesses (of the form a[b[i]] where a[.] and b[.] are large arrays) and they are cache and prefetch-unfriendly, resulting in frequent load stalls. In this paper, we will explore a dedicated hardware for a memory-side accelerator called Hardware Helper Thread (HHT) that performs all the necessary index computations to fetch only the nonzero elements from sparse matrix and sparse vector and supply those values to the primary core, creating heterogeneity within a single CPU core. We show both performance gains and energy savings of HHT for sparse matrix-dense vector multiplication (SpMV) and sparse matrix- sparse vector multiplication (SpMSpV). The ASIC HHT shows average performance gains ranging between 1.7 and 3.5 de- pending on the sparsity levels, vector-widths used by RISCV vector instructions and if the Vector (in Matrix-Vector multi- plication) is sparse or dense. We also show energy savings of 19% on average when ASIC HHT is used compared to baseline (for SpMV), and the HHT requires 38.9% of a RISCV core areamore » « less
-
Simulating the dynamics of discretized interacting structures whose relationship is dictated by a kernel function gives rise to a large dense matrix. We propose a multigrid solver for such a matrix that exploits not only its data-sparsity resulting from the decay of the kernel function but also the regularity of the geometry of the structures and the quantities of interest distributed on them. Like the well-known multigrid method for large sparse matrices arising from boundary-value problems, our method requires a smoother for removing high-frequency terms in solution errors, a strategy for coarsening a grid, and a pair of transfer operators for exchanging information between two grids. We develop new techniques for these processes that are tailored to a kernel function acting on discretized interacting structures. They are matrix-free in the sense that there is no need to construct the large dense matrix. Numerical experiments on a variety of bio-inspired microswimmers immersed in a Stokes flow demonstrate the effectiveness and efficiency of the proposed multigrid solver. In the case of free swimmers that must maintain force and torque balance, additional sparse rows and columns need to be appended to the dense matrix above. We develop a matrix-free fast solver for this bordered matrix as well, in which the multigrid method is a key component.more » « less
An official website of the United States government

