skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on June 2, 2026

Title: Unlocking the power of parallel computing: GPU technologies for ocean forecasting
Operational ocean forecasting systems (OOFSs) are complex engines that must execute ocean models with high performance to provide timely products and datasets. Significant computational resources are then needed to run high-fidelity models, and, historically, the technological evolution of microprocessors has constrained data-parallel scientific computation. Today, graphics processing units (GPUs) offer a rapidly growing and valuable source of computing power rivaling the traditional CPU-based machines: the exploitation of thousands of threads can significantly accelerate the execution of many models, ranging from traditional HPC workloads of finite difference, finite volume, and finite element modelling through to the training of deep neural networks used in machine learning (ML) and artificial intelligence. Despite the advantages, GPU usage in ocean forecasting is still limited due to the legacy of CPU-based model implementations and the intrinsic complexity of porting core models to GPU architectures. This review explores the potential use of GPU in ocean forecasting and how the computational characteristics of ocean models can influence the suitability of GPU architectures for the execution of the overall value chain: it discusses the current approaches to code (and performance) portability, from CPU to GPU, including tools that perform code transformation, easing the adaptation of Fortran code for GPU execution (like PSyclone), the direct use of OpenACC directives (like ICON-O), the adoption of specific frameworks that facilitate the management of parallel execution across different architectures, and the use of new programming languages and paradigms.  more » « less
Award ID(s):
2103942
PAR ID:
10635289
Author(s) / Creator(s):
;
Publisher / Repository:
Copernicus
Date Published:
Journal Name:
State of the planet
ISSN:
2752-0706
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Experience shows that on today's high performance systems the utilization of different acceleration cards in conjunction with a high utilization of all other parts of the system is difficult. Future architectures, like exascale clusters, are expected to aggravate this issue as the number of cores are expected to increase and memory hierarchies are expected to become deeper. One big aspect for distributed applications is to guarantee high utilization of all available resources, including local or remote acceleration cards on a cluster while fully using all the available CPU resources and the integration of the GPU work into the overall programming model. For the integration of CUDA code we extended HPX, a general purpose C++ run time system for parallel and distributed applications of any scale, and enabled asynchronous data transfers from and to the GPU device and the asynchronous invocation of CUDA kernels on this data. Both operations are well integrated into the general programming model of HPX which allows to seamlessly overlap any GPU operation with work on the main cores. Any user defined CUDA kernel can be launched on any (local or remote) GPU device available to the distributed application. We present asynchronous implementations for the data transfers and kernel launches for CUDA code as part of a HPX asynchronous execution graph. Using this approach we can combine all remotely and locally available acceleration cards on a cluster to utilize its full performance capabilities. Overhead measurements show, that the integration of the asynchronous operations (data transfer + launches of the kernels) as part of the HPX execution graph imposes no additional computational overhead and significantly eases orchestrating coordinated and concurrent work on the main cores and the used GPU devices. 
    more » « less
  2. Abstract Cloud microphysics is one of the most time‐consuming components in a climate model. In this study, we port the cloud microphysics parameterization in the Community Atmosphere Model (CAM), known as Parameterization of Unified Microphysics Across Scales (PUMAS), from CPU to GPU to seek a computational speedup. The directive‐based methods (OpenACC and OpenMP target offload) are determined as the best fit specifically for our development practices, which enable a single version of source code to run either on the CPU or GPU, and yield a better portability and maintainability. Their performance is first examined in a PUMAS stand‐alone kernel and the directive‐based methods can outperform a CPU node as long as there is enough computational burden on the GPU. A consistent behavior is observed when we run PUMAS on the GPU in a practical CAM simulation. A 3.6× speedup of the PUMAS execution time, including data movement between CPU and GPU, is achieved at a coarse horizontal resolution (8 NVIDIA V100 GPUs against 36 Intel Skylake CPU cores). This speedup further increases up to 5.4× at a high resolution (24 NVIDIA V100 GPUs against 108 Intel Skylake CPU cores), which highlights the fact that GPU favors larger problem size. This study demonstrates that using GPU in a CAM simulation can save noticeable computational costs even with a small portion of code being GPU‐enabled. Therefore, we are encouraged to port more parameterizations to GPU to take advantage of its computational benefit. 
    more » « less
  3. Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU-GPU) architectures, which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD’s ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application’s performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work. 
    more » « less
  4. Alternating Least Square (ALS) is a classic algorithm to solve matrix factorization widely used in recommendation systems. Existing efforts focus on parallelizing ALS on multi-/many-core platforms to handle large datasets. Recently, an optimized ALS variant called eALS was proposed, and it yields significantly lower time complexity and higher recommending accuracy than ALS. However, it is challenging to parallelize eALS on modern parallel architectures (e.g., CPUs and GPUs) because: 1) eALS’ data dependence prevents it from fine-grained parallel execution, thus eALS cannot fully utilize GPU's massive parallelism, 2) the sparsity of input data causes poor data locality and unbalanced workload, and 3) its large memory usage cannot fit into GPU's limited on-device memory, particularly for real-world large datasets. This paper proposes an efficient CPU/GPU heterogeneous recommendation system based on fast eALS for the first time (called HEALS) that consists of a set of system optimizations. HEALS employs newly designed architecture-adaptive data formats to achieve load balance and good data locality on CPU and GPU. HEALS also presents a CPU/GPU collaboration model that can explore both task parallelism and data parallelism. HEALS also optimizes this collaboration model with data communication overlapping and dynamic workload partition between CPU and GPU. Moreover, HEALS is further enhanced by various parallel techniques (e.g., loop unrolling, vectorization, and GPU parallel reduction). Evaluation results show that HEALS outperforms other state-of-the-art baselines in both performance and recommendation quality. Particularly, HEALS achieves up to 5.75 x better performance than a state-of-the-art ALS GPU library. This work also demonstrates the possibility of conducting fast recommendations on large datasets with constrained (or relaxed) hardware resources, e.g, a single CPU/GPU node. 
    more » « less
  5. Integrated CPU-GPU architecture provides excellent acceleration capabilities for data parallel applications on embedded platforms while meeting the size, weight and power (SWaP) requirements. However, sharing of main memory between CPU applications and GPU kernels can severely affect the execution of GPU kernels and diminish the performance gain provided by GPU. For example, in the NVIDIA Jetson TX2 platform, an integrated CPU-GPU architecture, we observed that, in the worst case, the GPU kernels can suffer as much as 3X slowdown in the presence of co-running memory intensive CPU applications. In this paper, we propose a software mechanism, which we call BWLOCK++, to protect the performance of GPU kernels from co-scheduled memory intensive CPU applications. 
    more » « less