skip to main content


Title: Addressing Irregular Patterns of Matrix Computations on GPUs and Their Impact on Applications Powered by Sparse Direct Solvers
Many scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs. This is due to workloads of small dense matrices that are different in size. Matrix decompositions on such irregular workloads are rarely addressed on GPUs. This paper addresses irregular workloads of matrix computations on GPUs, and their application to accelerate sparse direct solvers. We design an interface for the basic matrix operations supporting problems of different sizes. The interface enables us to develop irrLU-GPU, an LU decomposition on matrices of different sizes. We demonstrate the impact of irrLU-GPU on sparse direct LU solvers using NVIDIA and AMD GPUs. Experimental results are shown for a sparse direct solver based on a multifrontal sparse LU decomposition applied to linear systems arising from the simulation, using finite element discretization on unstructured meshes, of a high-frequency indefinite Maxwell problem.  more » « less
Award ID(s):
1740250
NSF-PAR ID:
10393116
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
International Conference for High Performance Computing Networking Storage and Analysis
ISSN:
2167-4337
Page Range / eLocation ID:
354-367
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The use of batched matrix computations recently gained a lot of interest for applications, where the same operation is applied to many small independent matrices. The batched computational pattern is frequently encountered in applications of data analytics, direct/iterative solvers and preconditioners, computer vision, astrophysics, and more, and often requires specific designs for vectorization and extreme parallelism to map well on today's high-end many-core architectures. This has led to the development of optimized software for batch computations, and to an ongoing community effort to develop standard interfaces for batched linear algebra software. Furthering these developments, we present GPU design and optimization techniques for high-performance batched one-sided factorizations of millions of tiny matrices (of size 32 and less). We quantify the effects and relevance of different techniques in order to select the best-performing LU, QR, and Cholesky factorization designs. While we adapt common optimization techniques, such as optimal memory traffic, register blocking, and concurrency control, we also show that a different mindset and techniques are needed when matrices are tiny, and in particular, sub-vector/warp in size. The proposed routines are part of the MAGMA library and deliver significant speedups compared to their counterparts in currently available vendor-optimized libraries. Notably, we tune the developments for the newest V100 GPU from NVIDIA to show speedups of up to 11.8×. 
    more » « less
  2. SPEX Left LU is a software package for exactly solving unsymmetric sparse linear systems. As a component of the sparse exact (SPEX) software package, SPEX Left LU can be applied to any input matrix, A , whose entries are integral, rational, or decimal, and provides a solution to the system \( Ax = b \) , which is either exact or accurate to user-specified precision. SPEX Left LU preorders the matrix A with a user-specified fill-reducing ordering and computes a left-looking LU factorization with the special property that each operation used to compute the L and U matrices is integral. Notable additional applications of this package include benchmarking the stability and accuracy of state-of-the-art linear solvers and determining whether singular-to-double-precision matrices are indeed singular. Computationally, this article evaluates the impact of several novel pivoting schemes in exact arithmetic, benchmarks the exact iterative solvers within Linbox, and benchmarks the accuracy of MATLAB sparse backslash. Most importantly, it is shown that SPEX Left LU outperforms the exact iterative solvers in run time on easy instances and in stability as the iterative solver fails on a sizeable subset of the tested (both easy and hard) instances. The SPEX Left LU package is written in ANSI C, comes with a MATLAB interface, and is distributed via GitHub, as a component of the SPEX software package, and as a component of SuiteSparse. 
    more » « less
  3. LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13--32.65X. Further, our out-of-core implementation achieves a speedup of 1.2--2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective. 
    more » « less
  4. LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13--32.65X. Further, our out-of-core implementation achieves a speedup of 1.2--2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective. 
    more » « less
  5. We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups versus CUBLAS of up to 6× for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU. 
    more » « less