Accurate simulation of earthquake scenarios is essential for advancing seismic hazard analysis and risk mitigation strategies. At the San Diego Supercomputer Center (SDSC), our research focuses on optimizing the performance and reliability of large-scale earthquake simulations using the AWP-ODC software. By implementing GPU-aware MPI calls, we let MPI operate directly on data in GPU memory, eliminating the need for explicit data transfers between CPU and GPU. The GPU-aware code achieves nearly ideal parallel efficiency at full scale on both Nvidia and AMD GPUs, leveraging MVAPICH-Plus support on Frontier at Oak Ridge National Laboratory and Vista at the Texas Advanced Computing Center. We used MVAPICH-Plus 4.0 to enable ZFP compression, which significantly improves inter-node communication efficiency, a critical improvement given the communication bottleneck inherent in large-scale simulations. Our GPU-aware AWP-ODC versions include linear forward, topography, and nonlinear Iwan-type solvers with discontinuous mesh support. On the Frontier system with MVAPICH 4.0, HIP-aware MPI calls on MI250X GPUs deliver nearly ideal weak-scaling speedup up to 8,192 nodes for both the linear and topography versions. On TACC’s Vista system, CUDA-aware MPI calls on GH200 GPUs substantially outperform their non-GPU-aware counterparts across all three solver versions. This poster will present a detailed evaluation of GPU-aware AWP-ODC using MVAPICH, including the impact of ZFP message compression compared to the native versions. Our results highlight the importance of MVAPICH support for GPU-aware MPI and on-the-fly compression techniques for accelerating and scaling earthquake simulations.
This content will become publicly available on August 20, 2026
Optimization of a Finite Difference Wave Propagation code with GPU-aware MPI support using MVAPICH2
We integrate GPU-aware MVAPICH2 in AWP-ODC, a scalable finite difference code for wave propagation in nonlinear media. On OLCF Frontier, HIP-aware MVAPICH2 yields a 17.8% time-to-solution (T2S) improvement over the non-GPU-aware version and achieves 95% parallel efficiency on 65,536 AMD MI250X GCDs. On TACC Vista, CUDA-aware MVAPICH2 delivers a 3.5% performance gain across 2 to 256 Nvidia GH200 GPUs, with parallel efficiencies of 82% in the linear case and 92% in the computationally more intensive nonlinear case. We deploy the code for production-scale earthquake simulations on leadership-class systems.
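The parallel-efficiency and time-to-solution figures quoted above can be read as simple ratios of measured wall-clock times. The sketch below shows one common way to compute them. It assumes weak scaling (fixed work per GPU, so the ideal runtime stays constant as GPUs are added) and assumes that the improvement is the fractional reduction in runtime; the record does not state which definitions the authors used. The function names and all timings are hypothetical placeholders, not measurements from AWP-ODC.

```python
def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Weak-scaling parallel efficiency in percent.

    Work per GPU is held fixed, so an ideal run takes the same time at
    every scale and efficiency is the baseline time over the scaled time.
    """
    if t_base <= 0 or t_scaled <= 0:
        raise ValueError("timings must be positive")
    return 100.0 * t_base / t_scaled


def relative_improvement(t_reference: float, t_optimized: float) -> float:
    """Percent reduction in time-to-solution relative to a reference run
    (an assumed definition, not one confirmed by the source)."""
    if t_reference <= 0 or t_optimized <= 0:
        raise ValueError("timings must be positive")
    return 100.0 * (t_reference - t_optimized) / t_reference


# Hypothetical wall-clock seconds per run, keyed by node count.
timings = {8: 100.0, 64: 101.0, 512: 103.0, 4096: 105.0}

base_time = timings[min(timings)]  # smallest run is the baseline
efficiencies = {
    nodes: weak_scaling_efficiency(base_time, seconds)
    for nodes, seconds in timings.items()
}

for nodes, eff in sorted(efficiencies.items()):
    print(f"{nodes:>5} nodes: {eff:6.1f}% parallel efficiency")
```

Under this definition an efficiency above 100% is possible whenever a larger run happens to finish faster than the baseline run.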
- Award ID(s):
- 2311833
- PAR ID:
- 10630251
- Publisher / Repository:
- Annual MVAPICH USER Group (MUG) Meeting
- Date Published:
- Format(s):
- Medium: X
- Location:
- Columbus
- Sponsoring Org:
- National Science Foundation
More Like this
We have ported and verified the topography version of AWP-ODC, with the discontinuous mesh feature enabled, to HIP so that it runs on AMD MI250X GPUs. We benchmarked 103.3% parallel efficiency on Frontier between 8 and 4,096 nodes, or up to 32,768 GCDs. Frontier, a Leadership Computing Facility system at Oak Ridge National Laboratory (ORNL), is a two exaflop/s computing system based on AMD Radeon Instinct GPUs and EPYC CPUs. This HIP topography code has been used in production runs on Frontier, a primary computing engine for the 2024 SCEC INCITE allocation, a 700K node-hour supercomputing time award. Furthermore, we implemented ROCm-aware GPU-direct support in the topography code and demonstrated an additional 14% reduction in time-to-solution up to 4,096 nodes. The AWP-ODC-Topo code is also tuned on TACC Vista, an Arm-based system built on the NVIDIA GH200 Grace Hopper Superchip, with excellent performance demonstrated. This poster will present weak-scaling studies and performance characteristics on GPUs. We discuss the efforts to verify the ROCm-aware development and to use high-performance MVAPICH libraries with on-the-fly compression on modern GPU clusters.
-
AWP-ODC is a 4th-order finite difference code used for linear wave propagation, Iwan-type nonlinear dynamic rupture and wave propagation, and Strain Green Tensor simulation. We have ported and verified the linear and topography versions of AWP-ODC, with discontinuous mesh support, to HIP so that they can also run on AMD GPUs. The topography code achieved 99.6% parallel efficiency on 4,096 nodes on Frontier, a Leadership Computing Facility system at Oak Ridge National Laboratory. We have also implemented CUDA-aware features and on-the-fly GDR compression in the linear version of the ported HIP code. These enhancements significantly improve data transfer efficiency between GPUs, reducing communication overhead and boosting overall performance. We have also extended CUDA-aware features to the topography version and are actively working on incorporating GDR compression into this version as well. We see a 154% benefit over IMPI with MVAPICH2-GDR, using CUDA-aware support and on-the-fly compression, for linear AWP-ODC on Lonestar-6 A100 nodes. Furthermore, we have successfully integrated a checkpointing feature into the nonlinear Iwan version of AWP-ODC, in preparation for future extreme-scale simulations during Texascale Days on Frontera at TACC.
-
Pellizzoni, Rodolfo (Ed.) Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much research has been conducted in the real-time research community, several limitations persist, including the absence or limited availability of GPU-level preemption, extended blocking times, and/or the need for extensive modifications to program code. In this paper, we propose GCAPS, a GPU Context-Aware Preemptive Scheduling approach for real-time GPU tasks. Our approach exerts control over GPU context scheduling at the device driver level and enables preemption of GPU execution based on task priorities by simply adding one-line macros to GPU segment boundaries. In addition, we provide a comprehensive response time analysis of GPU-using tasks for both our proposed approach as well as the default Nvidia GPU driver scheduling that follows a work-conserving round-robin policy. Through empirical evaluations and case studies, we demonstrate the effectiveness of the proposed approaches in improving taskset schedulability and response time. The results highlight significant improvements over prior work as well as the default scheduling approach, with up to 40% higher schedulability, while also achieving predictable worst-case behavior on Nvidia Jetson embedded platforms.
-
Python's ease of use and rich collection of numeric libraries make it an excellent choice for rapidly developing scientific applications. However, composing these libraries to take advantage of complex heterogeneous nodes is still difficult. To simplify writing multi-device code, we created Parla, a heterogeneous task-based programming framework that fully supports Python's scientific programming stack. Parla's API is based on Python decorators and allows users to wrap code in Parla tasks for parallel execution. Parla arrays enable automatic movement of data between devices. The Parla runtime handles resource-aware mapping, scheduling, and execution of tasks. Compared to other Python tasking systems, Parla is unique in its parallelization of tasks within a single process, its GPU context and resource-aware runtime, and its design around gradual adoption to provide easy migration of and integration into existing Python applications. We show that Parla can achieve performance competitive with hand-optimized code while improving ease of development.
