Title: OpenMP Kernel Language Extensions for Performance Portable GPU Codes
In contemporary high-performance computing architectures, GPU accelerators have become increasingly prevalent. To harness the full potential of these accelerators, developers often resort to vendor-specific kernel languages, such as CUDA. While this approach ensures optimal efficiency, it inherently compromises portability and creates vendor dependency. Existing portable programming models, such as OpenMP, are promising but demand extensive code rewriting because of their fundamental differences from kernel languages. In this work, we introduce extensions to LLVM OpenMP that transform it into a versatile, performance-portable kernel language for GPU programming. These extensions allow programs to be ported from kernel languages to high-performance OpenMP GPU programs with minimal modifications. To evaluate our extensions, we implemented a proof-of-concept prototype that contains a subset of the extensions we propose. We ported six established CUDA proxy and benchmark applications and evaluated their performance on both AMD and NVIDIA platforms. Comparing against native versions (HIP and CUDA), our results show that OpenMP, augmented with our extensions, can not only match but in some cases exceed the performance of kernel languages, thereby offering performance portability with minimal effort from application developers.
Award ID(s):
2113996
PAR ID:
10534558
Author(s) / Creator(s):
; ; ;
Corporate Creator(s):
Editor(s):
Badia, Rosa M; Mohror, Kathryn
Publisher / Repository:
Association for Computing Machinery
Date Published:
Edition / Version:
SC-W '23
ISSN:
9798400707858
ISBN:
9798400707858
Page Range / eLocation ID:
876–883
Subject(s) / Keyword(s):
CUDA, GPU, HIP, OpenMP
Format(s):
Medium: X Size: 0.5
Size(s):
0.5
Location:
Denver, CO, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Achieving high performance on NVIDIA GPUs with CUDA is known to be challenging. A GPU has hundreds or thousands of cores, so a program must exhibit sufficient parallelism to achieve maximum GPU utilization. A system with GPU accelerators has a heterogeneous and deep memory system that programmers must use effectively and correctly to take full advantage of the GPU's parallelism. In this paper, we present CUDAMicroBench, a collection of fourteen microbenchmarks that demonstrate performance challenges in CUDA programming and techniques for optimizing CUDA programs to address these challenges. It also includes examples and techniques for using advanced CUDA features, such as data shuffling between threads and dynamic parallelism, that can help users optimize CUDA programs for performance. The microbenchmarks can be used to evaluate the performance of GPU architectures, the memory systems of the GPU itself and of the whole system architecture, and the effectiveness of compilers and performance analysis tools. They can also help users understand the complexity of heterogeneous GPU-accelerator systems through examples and guide users in performance optimization. CUDAMicroBench is released as BSD-licensed open source at https://github.com/passlab/CUDAMicroBench.git.
  2. While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model (CUDA) into another (CPU threads) based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer, making use of transpiled CUDA PyTorch kernels, outperforms the PyTorch CPU native backend by 2.7×.
  3. As FPGAs and GPUs continue to make inroads into high-performance computing (HPC), the need for languages and frameworks that offer performance, productivity, and portability across heterogeneous platforms, such as FPGAs and GPUs, continues to grow. OpenCL and SYCL have emerged as frameworks that offer cross-platform functional portability between FPGAs and GPUs. While functional portability across a diverse set of platforms is an important feature of portable frameworks, achieving performance portability often requires vendor and platform-specific optimizations. Achieving performance portability, therefore, comes at the expense of productivity. This paper presents a quantification of the tradeoffs between performance, portability, and productivity of OpenCL and SYCL. It extends and complements our prior work on quantifying performance-productivity tradeoffs between Verilog and OpenCL for the FPGA. In addition to evaluating the performance-productivity tradeoffs between OpenCL and SYCL, this work quantifies the performance portability (PP) of OpenCL and SYCL as well as their code convergence (CC), i.e., a measure of productivity across different platforms (e.g., FPGA and GPU). Using two applications as case studies (i.e., edge detection using the Sobel filter, and graph link prediction using the Jaccard similarity index), we characterize the tradeoffs between performance, portability, and productivity. Our results show that OpenCL and SYCL offer complementary tradeoffs. While OpenCL delivers better performance portability than SYCL, SYCL offers better code convergence and a 1.6× improvement in source lines of code over OpenCL. 
  4. CUDA is designed specifically for NVIDIA GPUs and is not compatible with non-NVIDIA devices. Enabling CUDA execution on alternative backends could greatly benefit the hardware community by fostering a more diverse software ecosystem. To address the need for portability, our objective is to develop a framework that meets key requirements, such as extensive coverage, comprehensive end-to-end support, superior performance, and hardware scalability. Existing solutions that translate CUDA source code into other high-level languages, however, fall short of these goals. In contrast to these source-to-source approaches, we present a novel framework, CuPBoP, which treats CUDA as a portable language in its own right. Compared to two commercial source-to-source solutions, CuPBoP offers broader coverage and superior performance for CUDA-to-CPU migration. Additionally, we evaluate the performance of CuPBoP against manually optimized CPU programs, highlighting the differences between CPU programs derived from CUDA and those that are manually optimized. Furthermore, we demonstrate the hardware scalability of CuPBoP by showcasing its successful migration of CUDA to AMD GPUs. To promote further research in this field, we have released CuPBoP as an open-source resource.
  5. Experience shows that on today's high-performance systems it is difficult to utilize different acceleration cards while also keeping all other parts of the system highly utilized. Future architectures, like exascale clusters, are expected to aggravate this issue as the number of cores is expected to increase and memory hierarchies are expected to become deeper. A major concern for distributed applications is guaranteeing high utilization of all available resources, including local or remote acceleration cards on a cluster, while fully using all available CPU resources and integrating the GPU work into the overall programming model. To integrate CUDA code, we extended HPX, a general-purpose C++ runtime system for parallel and distributed applications of any scale, and enabled asynchronous data transfers to and from the GPU device and the asynchronous invocation of CUDA kernels on this data. Both operations are well integrated into the general programming model of HPX, which allows any GPU operation to be seamlessly overlapped with work on the main cores. Any user-defined CUDA kernel can be launched on any (local or remote) GPU device available to the distributed application. We present asynchronous implementations of the data transfers and kernel launches for CUDA code as part of an HPX asynchronous execution graph. Using this approach, we can combine all remotely and locally available acceleration cards on a cluster to utilize its full performance capabilities. Overhead measurements show that integrating the asynchronous operations (data transfers and kernel launches) into the HPX execution graph imposes no additional computational overhead and significantly eases orchestrating coordinated and concurrent work on the main cores and the GPU devices in use.