skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: GreenMD: Energy-efficient Matrix Decomposition on Heterogeneous Multi-GPU Systems
The current trend of performance growth in HPC systems is accompanied by a massive increase in energy consumption. In this article, we introduce GreenMD, an energy-efficient framework for heterogeneous systems for LU factorization utilizing multi-GPUs. LU factorization is a crucial kernel from the MAGMA library, which is highly optimized. Our aim is to apply DVFS to this application by leveraging slacks intelligently on both CPUs and multiple GPUs. To predict the slack times, accurate performance models are developed separately for both CPUs and GPUs based on the algorithmic knowledge and manufacturer’s specifications. Since DVFS does not reduce static energy consumption, we also develop undervolting techniques for both CPUs and GPUs. Reducing voltage below threshold values may give rise to errors; hence, we extract the minimum safe voltages ( V safeMin ) for the CPUs and GPUs utilizing a low overhead profiling phase and apply them before execution. It is shown that GreenMD improves the CPU, GPU, and total energy about 59%, 21%, and 31%, respectively, while delivering similar performance to the state-of-the-art linear algebra MAGMA library.  more » « less
Award ID(s):
1907401
PAR ID:
10439767
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
ACM Transactions on Parallel Computing
Volume:
10
Issue:
2
ISSN:
2329-4949
Page Range / eLocation ID:
1 to 23
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs.We use auto tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparison with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements. 
    more » « less
  2. LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13--32.65X. Further, our out-of-core implementation achieves a speedup of 1.2--2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective. 
    more » « less
  3. LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13--32.65X. Further, our out-of-core implementation achieves a speedup of 1.2--2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective. 
    more » « less
  4. Dynamic voltage and frequency scaling (DVFS) is a well-known technique to reduce the power and/or energy consumption of various applications. While most processors provide chip-level DVFS, where the frequency and voltage of the cores in a chip can only be changed all together; core-level DVFS, where each core can be controlled independently, requires core-level voltage regulators in hardware and only is supported in production in Haswell generation among Intel processors. The finer grained control that per-core DVFS provides can lead to higher energy efficiency compared to chip-level DVFS especially for the unsynchronized, unstructured parallel applications when carefully applied. Ability to do per-core DVFS opens up new doors for different optimizations within runtime systems. We implement an intelligent energy efficient runtime module which uses a fine-grained function level per-core DVFS approach. Our module finds the energy-optimal frequency for each phase/function/kernel of the application over the first few iterations and applies the optimal frequency for each function. We test our implementation on Haswell processors and show that our algorithm enables 4% to 35% energy reduction over chip-level DVFS with as much as performance. 
    more » « less
  5. The end of Moore’s Law and Dennard scaling has driven the proliferation of heterogeneous systems with accelerators, including CPUs, GPUs, and FPGAs, each with distinct architectures, compilers, and programming environments. GPUs excel at massively parallel processing for tasks like deep learning training and graphics rendering, while FPGAs offer hardware-level flexibility and energy efficiency for low-latency, high-throughput applications. In contrast, CPUs, while general-purpose, often fall short in high-parallelism or power-constrained applications. This architectural diversity makes it challenging to compare these accelerators effectively, leading to uncertainty in selecting optimal hardware and software tools for specific applications. To address this challenge, we introduce HeteroBench, a versatile benchmark suite for heterogeneous systems. HeteroBench allows users to evaluate multi-compute kernel applications across various accelerators, including CPUs, GPUs (from NVIDIA, AMD, Intel), and FPGAs (AMD), supporting programming environments of Python, Numba-accelerated Python, serial C++, OpenMP (both CPUs and GPUs), OpenACC and CUDA for GPUs, and Vitis HLS for FPGAs. This setup enables users to assign kernels to suitable hardware platforms, ensuring comprehensive device comparisons. What makes HeteroBench unique is its vendor-agnostic, cross-platform approach, spanning diverse domains such as image processing, machine learning, numerical computation, and physical simulation, ensuring deeper insights for HPC optimization. Extensive testing across multiple systems provides practical reference points for HPC practitioners, simplifying hardware selection and performance tuning for both developers and end-users alike. This suite may assist to make more informed decision on AI/ML deployment and HPC development, making it an invaluable resource for advancing academic research and industrial applications. 
    more » « less