Title: A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped into uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, yet portable, implementations of algorithms on high-performance many-core platforms, including multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute capability. In addition to the standard single and double precision types, the standard also covers half and quadruple precision; half precision in particular is used in many very large scale applications, such as those associated with machine learning.
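To make the group-based interface concrete, here is a minimal C sketch of the semantics of a grouped batched GEMM. The function and argument names are illustrative only, not the standard's own bindings; each group g holds group_size[g] matrices that share the dimensions m[g], n[g], k[g] and the scalars alpha[g], beta[g]:

    #include <stddef.h>

    /* Reference semantics for a group-based batched GEMM, in the spirit of
     * the Batched BLAS proposal (names are illustrative).  Computes
     * C_i = alpha*A_i*B_i + beta*C_i for every matrix in every group,
     * column-major, no transposition, for brevity. */
    static void dgemm_one(int m, int n, int k, double alpha,
                          const double *A, int lda,
                          const double *B, int ldb,
                          double beta, double *C, int ldc)
    {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i) {
                double s = 0.0;
                for (int p = 0; p < k; ++p)
                    s += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
                C[i + (size_t)j * ldc] = alpha * s + beta * C[i + (size_t)j * ldc];
            }
    }

    void batched_dgemm_groups(int group_count, const int *group_size,
                              const int *m, const int *n, const int *k,
                              const double *alpha,
                              const double *const *A, const int *lda,
                              const double *const *B, const int *ldb,
                              const double *beta,
                              double *const *C, const int *ldc)
    {
        size_t idx = 0;  /* flat index into the per-matrix pointer arrays */
        for (int g = 0; g < group_count; ++g)
            for (int s = 0; s < group_size[g]; ++s, ++idx)
                dgemm_one(m[g], n[g], k[g], alpha[g],
                          A[idx], lda[g], B[idx], ldb[g],
                          beta[g], C[idx], ldc[g]);
    }

A single group recovers the fixed-size batched case; the per-matrix iterations are independent, which is what high-performance implementations exploit for vectorization and parallelism.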
Award ID(s):
1740250
NSF-PAR ID:
10289431
Author(s) / Creator(s):
Date Published:
Journal Name:
ACM Transactions on Mathematical Software
Volume:
47
Issue:
3
ISSN:
0098-3500
Page Range / eLocation ID:
1 to 23
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from mixing precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation: during packing and/or accumulation, as needed. Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
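    As a rough illustration of the typecast-on-pack idea (a standalone sketch, not BLIS code or its API): A and B are stored in single precision while the product and accumulation run in double precision, so the casts happen once per element during packing and once at the final accumulation, leaving the inner loops a pure double-precision computation:

    #include <stdlib.h>

    /* Illustrative only: single-precision storage, double-precision
     * computation.  Computes C += A*B, column-major; error handling and
     * alpha/beta scaling omitted for brevity. */
    void sgemm_compute_in_double(int m, int n, int k,
                                 const float *A, int lda,
                                 const float *B, int ldb,
                                 float *C, int ldc)
    {
        /* "Packing": copy-convert the operands to double precision. */
        double *Ad = malloc((size_t)m * k * sizeof *Ad);
        double *Bd = malloc((size_t)k * n * sizeof *Bd);
        for (int p = 0; p < k; ++p)
            for (int i = 0; i < m; ++i)
                Ad[i + (size_t)p * m] = (double)A[i + (size_t)p * lda];
        for (int j = 0; j < n; ++j)
            for (int p = 0; p < k; ++p)
                Bd[p + (size_t)j * k] = (double)B[p + (size_t)j * ldb];

        /* Double-precision product; the cast back to the storage precision
         * of C happens only at the final accumulation. */
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i) {
                double s = 0.0;
                for (int p = 0; p < k; ++p)
                    s += Ad[i + (size_t)p * m] * Bd[p + (size_t)j * k];
                C[i + (size_t)j * ldc] = (float)((double)C[i + (size_t)j * ldc] + s);
            }
        free(Ad);
        free(Bd);
    }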
  2. Batched matrix computations have recently gained considerable interest for applications where the same operation is applied to many small independent matrices. The batched computational pattern is frequently encountered in data analytics, direct/iterative solvers and preconditioners, computer vision, astrophysics, and more, and often requires specific designs for vectorization and extreme parallelism to map well onto today's high-end many-core architectures. This has led to the development of optimized software for batch computations, and to an ongoing community effort to develop standard interfaces for batched linear algebra software. Furthering these developments, we present GPU design and optimization techniques for high-performance batched one-sided factorizations of millions of tiny matrices (of size 32 and less). We quantify the effects and relevance of different techniques in order to select the best-performing LU, QR, and Cholesky factorization designs. While we adapt common optimization techniques, such as optimal memory traffic, register blocking, and concurrency control, we also show that a different mindset and different techniques are needed when matrices are tiny, and in particular, sub-vector/warp in size. The proposed routines are part of the MAGMA library and deliver significant speedups compared to their counterparts in currently available vendor-optimized libraries. Notably, we tune the developments for the newest V100 GPU from NVIDIA to show speedups of up to 11.8×.
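    For context, the batched pattern being optimized looks as follows on a CPU; the names are hypothetical, and the GPU kernels the abstract describes would instead map each tiny matrix to a (sub-)warp and keep it in registers. One tiny LU factorization with partial pivoting is applied independently to each matrix in the batch:

    #include <math.h>

    /* Unblocked LU with partial pivoting, column-major, in place.
     * Assumes a nonsingular matrix; ipiv stores 0-based pivot rows. */
    static void lu_tiny(int n, double *A, int lda, int *ipiv)
    {
        for (int j = 0; j < n; ++j) {
            int piv = j;  /* pivot search in column j */
            for (int i = j + 1; i < n; ++i)
                if (fabs(A[i + j * lda]) > fabs(A[piv + j * lda]))
                    piv = i;
            ipiv[j] = piv;
            if (piv != j)  /* swap rows j and piv */
                for (int c = 0; c < n; ++c) {
                    double t = A[j + c * lda];
                    A[j + c * lda] = A[piv + c * lda];
                    A[piv + c * lda] = t;
                }
            /* scale column j and rank-1 update of the trailing submatrix */
            for (int i = j + 1; i < n; ++i) {
                A[i + j * lda] /= A[j + j * lda];
                for (int c = j + 1; c < n; ++c)
                    A[i + c * lda] -= A[i + j * lda] * A[j + c * lda];
            }
        }
    }

    void lu_batched(int n, double *const *A_array, int lda,
                    int *const *ipiv_array, int batch_count)
    {
        for (int b = 0; b < batch_count; ++b)  /* independent, trivially parallel */
            lu_tiny(n, A_array[b], lda, ipiv_array[b]);
    }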
  3. We vary the inflow properties in a finite-volume solver to investigate their effects on the computed cyclonic motion in a right-cylindrical vortex chamber. The latter comprises eight tangential injectors through which steady-state air is introduced under incompressible and inviscid conditions. To minimize cell skewness around the injectors, a fine tetrahedral mesh is implemented first and then converted into polyhedral elements to improve convergence characteristics and precision. Once convergence is achieved, our principal variables are evaluated and compared using a range of inflow parameters, including the tangential injector speed, count, diameter, and elevation. The resulting computations show that well-resolved numerical simulations can properly predict the forced vortex behavior that dominates in the core region as well as the free vortex tail that prevails radially outward, beyond the point of peak tangential speed. It is also shown that augmenting the mass influx by increasing the number of injectors, injector size, or average injection speed further amplifies the vortex strength and all peak velocities while shifting the mantle radially inward. Overall, the axial velocity is found to be the most sensitive to vertical displacements of the injection plane. By raising the injection plane to the top half of the chamber, the flow character is markedly altered, and an axially unidirectional vortex is engendered, with no upward motion or mantle formation. Conversely, the tangential and radial velocities are found to be axially independent and, together with the pressure distribution, prove to be the least sensitive to injection plane relocations.
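    For reference, the forced-core/free-tail structure described above matches the classical Rankine vortex profile (a textbook model, stated here for orientation rather than taken from the paper). With circulation \Gamma and core radius a,

    \[
    v_\theta(r) =
    \begin{cases}
    \dfrac{\Gamma}{2\pi a^2}\, r, & r \le a \quad \text{(forced vortex: solid-body rotation in the core)} \\[1ex]
    \dfrac{\Gamma}{2\pi r}, & r > a \quad \text{(free vortex tail)}
    \end{cases}
    \]

    so the tangential speed peaks at r = a, the crossover point of peak tangential speed that the abstract refers to.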
  4. We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve speedups over CUBLAS of up to 6× for the factorizations, using double-precision arithmetic on an NVIDIA Pascal P100 GPU.
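    As a companion to the batched LU sketch above, here is a CPU reference for the batched Cholesky pattern (hypothetical names; the tuned GPU kernels described in the abstract would instead hold each tiny matrix in registers and run many thousands of factorizations concurrently):

    #include <math.h>

    /* Unblocked Cholesky A = L*L^T, lower triangle, column-major, in place.
     * Returns 0 on success, or j+1 if the leading minor of order j+1 is
     * not positive definite (mirroring the LAPACK info convention). */
    static int cholesky_tiny(int n, double *A, int lda)
    {
        for (int j = 0; j < n; ++j) {
            double d = A[j + j * lda];
            for (int p = 0; p < j; ++p)
                d -= A[j + p * lda] * A[j + p * lda];
            if (d <= 0.0)
                return j + 1;
            d = sqrt(d);
            A[j + j * lda] = d;
            for (int i = j + 1; i < n; ++i) {
                double s = A[i + j * lda];
                for (int p = 0; p < j; ++p)
                    s -= A[i + p * lda] * A[j + p * lda];
                A[i + j * lda] = s / d;
            }
        }
        return 0;
    }

    void cholesky_batched(int n, double *const *A_array, int lda,
                          int *info_array, int batch_count)
    {
        for (int b = 0; b < batch_count; ++b)  /* each matrix is independent */
            info_array[b] = cholesky_tiny(n, A_array[b], lda);
    }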