NSF PAR Search | NSF Public Access Repository

PAQR: Pivoting Avoiding QR factorization

https://doi.org/10.1109/IPDPS54959.2023.00040

Sid-Lakhdar, Wissam; Cayrols, Sebastien; Bielich, Daniel; Abdelfattah, Ahmad; Luszczek, Piotr; Gates, Mark; Tomov, Stanimire; Johansen, Hans; Williams-Young, David; Davis, Timothy; et al (May 2023, 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS))
Evolution of the SLATE linear algebra library

https://doi.org/10.1177/10943420241286531

Gates, Mark; Abdelfattah, Ahmad; Akbudak, Kadir; Al Farhan, Mohammed; Alomairy, Rabab; Bielich, Daniel; Burgess, Treece; Cayrols, Sébastien; Lindquist, Neil; Sukkari, Dalal; et al (September 2024, The International Journal of High Performance Computing Applications)

SLATE (Software for Linear Algebra Targeting Exascale) is a distributed, dense linear algebra library targeting both CPU-only and GPU-accelerated systems, developed over the course of the Exascale Computing Project (ECP). While it began with several documents setting out its initial design, significant design changes occurred throughout its development. In some cases, these were anticipated: an early version used a simple consistency flag that was later replaced with a full-featured consistency protocol. In other cases, performance limitations and software and hardware changes prompted a redesign. Sequential communication tasks were parallelized; host-to-host MPI calls were replaced with GPU device-to-device MPI calls; more advanced algorithms such as Communication Avoiding LU and the Random Butterfly Transform (RBT) were introduced. Early choices that turned out to be cumbersome, error prone, or inflexible have been replaced with simpler, more intuitive, or more flexible designs. Applications have been a driving force, prompting a lighter weight queue class, nonuniform tile sizes, and more flexible MPI process grids. Of paramount importance has been building a portable library that works across several different GPU architectures – AMD, Intel, and NVIDIA – while keeping a clean and maintainable codebase. Here we explore the evolving design choices and their effects, both in terms of performance and software sustainability.
Matrix multiplication on batches of small matrices in half and half-complex precisions

https://doi.org/10.1016/j.jpdc.2020.07.001

Abdelfattah, Ahmad; Tomov, Stanimire; Dongarra, Jack (November 2020, Journal of Parallel and Distributed Computing)

Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs

https://doi.org/10.1109/HPEC43674.2020.9286214

Brown, Cade; Abdelfattah, Ahmad; Tomov, Stanimire; Dongarra, Jack (September 2020, 2020 IEEE High Performance Extreme Computing Conference (HPEC))
Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is both because DLA is critical to the accuracy and performance of so many different types of applications and because DLA routines have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use automated tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems, and eigenproblem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparisons with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements.
A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines

https://doi.org/10.1145/3431921

Abdelfattah, Ahmad; Costa, Timothy; Dongarra, Jack; Gates, Mark; Haidar, Azzam; Hammarling, Sven; Higham, Nicholas J.; Kurzak, Jakub; Luszczek, Piotr; Tomov, Stanimire; et al (June 2021, ACM Transactions on Mathematical Software)
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large scale applications, such as those associated with machine learning.
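The grouping idea in this abstract can be illustrated with a minimal pure-Python sketch. It is not the Batched BLAS API itself: the `Group` container and the `gemm` and `batched_gemm` functions are hypothetical names chosen for illustration, and the assumption that each group shares one `alpha`/`beta` pair is made here only to keep the example short.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]  # dense, row-major


def gemm(alpha: float, a: Matrix, b: Matrix, beta: float, c: Matrix) -> Matrix:
    """Return alpha * (a @ b) + beta * c for one small problem."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


@dataclass
class Group:
    """Hypothetical container: every problem in a group has the same sizes.

    Sharing alpha and beta across the group is an assumption of this sketch.
    """
    alpha: float
    beta: float
    a: List[Matrix]
    b: List[Matrix]
    c: List[Matrix]


def batched_gemm(groups: List[Group]) -> List[List[Matrix]]:
    """Run GEMM on every problem of every group.

    The problems are independent, so a real implementation is free to
    process them concurrently; this reference version simply loops.
    """
    return [
        [gemm(g.alpha, a, b, g.beta, c) for a, b, c in zip(g.a, g.b, g.c)]
        for g in groups
    ]
```

If all matrices had the same size, a single `Group` would hold the whole batch; matrices of different sizes go into separate groups, each uniform internally.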
A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations

Haidar, Azzam; Abdelfattah, Ahmad; Zounon, Mawussi; Tomov, Stanimire; Dongarra, Jack (May 2018, IEEE Transactions on Parallel and Distributed Systems)

We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups versus CUBLAS of up to 6× for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU.
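For readers unfamiliar with the operation being accelerated, the following is a minimal CPU reference sketch of a batched Cholesky factorization: an unblocked factorization applied independently to each small matrix. It does not reproduce the GPU kernel design from the paper (register blocking, memory-traffic tuning, and concurrency control are all omitted), and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # dense, row-major


def cholesky(a: Matrix) -> Matrix:
    """Return lower-triangular L with L @ L^T == a for a symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already placed in row j.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def batched_cholesky(batch: List[Matrix]) -> List[Matrix]:
    """Factor every matrix in the batch; each factorization is independent."""
    return [cholesky(a) for a in batch]
```

Because each matrix is factored independently, the outer loop is the natural place for parallelism, which is the property the batched GPU kernels described above exploit.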