skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads
The engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchips were tested using different benchmarks and scientific applications. The benchmarks include HPCC and HPCG. The real application-based benchmark includes AI-Benchmark-Alpha (a TensorFlow benchmark), Gromacs, OpenFOAM, and ROMS. The performance was compared to multiple Intel, AMD, ARM CPUs and several x86 with NVIDIA GPU systems. A brief energy efficiency estimate was performed based on TDP values. We found that in HPCC benchmark tests, the per-core performance of Grace is similar to or faster than AMD Milan cores, and the high core count often allows NVIDIA Grace CPU Superchip to have per-node performance similar to Intel Sapphire Rapids with High Bandwidth Memory: slower in matrix multiplication (by 17%) and FFT (by 6%), faster in Linpack (by 9%)). In scientific applications, the NVIDIA Grace CPU Superchip performance is slower by 6% to 18% in Gromacs, faster by 7% in OpenFOAM, and right between HBM and DDR modes of Intel Sapphire Rapids in ROMS. The combined CPU-GPU performance in Gromacs is significantly faster (by 20% to 117% faster) than any tested x86-NVIDIA GPU system. Overall, the new NVIDIA Grace Hopper Superchip and NVIDIA Grace CPU Superchip Superchip are high-performance and most likely energy-efficient solutions for HPC centers.  more » « less
Award ID(s):
2137603
PAR ID:
10498735
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400716522
Page Range / eLocation ID:
36 to 44
Format(s):
Medium: X
Location:
Nagoya Japan
Sponsoring Org:
National Science Foundation
More Like this
  1. This study presents a comprehensive benchmarking analysis of the Arm-based AmpereOne A192-32X CPU, a high-performance but low power processor designed for cloud-native workloads characterized by high core occupancy, imperfectly-vectorized or even pure scalar software, limited need for high floating-point performance, and, increasingly, AI inference. These traits also characterize much of academic research computing. Hence a thorough investigation of this novel CPU seeking to characterize its strengths and weaknesses on academic workloads, including traditional HPC codes for which it was not designed, will shed light on its relevance in a research setting. We report comparative analyses with contemporary CPUs (Intel Sapphire Rapids, AMD EPYC, NVIDIA Grace-Grace) and illustrate AmpereOne’s architectural advantages in handling parallel workloads and optimizing power consumption. The CPUs are compared in terms of performance and power consumption using a wide range of applications covering different workloads and disciplines. 
    more » « less
  2. Abstract The landscape of high performance computing (HPC) has witnessed exponential growth in processor diversity, architectural complexity, and performance scalability. With an ever-increasing demand for faster and more efficient computing solutions to address an array of scientific, engineering, and societal challenges, the selection of processors for specific applications becomes paramount. Achieving optimal performance requires a deep understanding of how diverse processors interact with diverse workloads, making benchmarking a fundamental practice in the field of HPC. Here, we present preliminary results observed over such benchmarks and applications and a comparison of Intel Sapphire Rapids and Skylake-X, AMD Milan, and Fujitsu A64FX processors in terms of runtime performance, memory bandwidth utilization, and energy consumption. The examples focus specifically on the Sapphire Rapids processor with and without high-bandwidth memory (HBM). An additional case study reports the performance gains from using Intel’s Advanced Matrix Extensions (AMX) instructions, and how they along with HBM can be leveraged to accelerate AI workloads. These initial results aim to give a rough comparison of the processors rather than a detailed analysis and should prove timely and relevant for researchers who may be interested in using Sapphire Rapids for their scientific workloads. 
    more » « less
  3. Modern High Performance Computing (HPC) systems are built with innovative system architectures and novel programming models to further push the speed limit of computing. The increased complexity poses challenges for performance portability and performance evaluation. The Standard Performance Evaluation Corporation (SPEC) has a long history of producing industry-standard benchmarks for modern computer systems. SPEC’s newly released SPEChpc 2021 benchmark suites, developed by the High Performance Group, are a bold attempt to provide a fair and objective benchmarking tool designed for stateof-the-art HPC systems. With the support of multiple host and accelerator programming models, the suites are portable across both homogeneous and heterogeneous architectures. Different workloads are developed to fit system sizes ranging from a few compute nodes to a few hundred compute nodes. In this work we present our first experiences in performance benchmarking the new SPEChpc2021 suites and evaluate their portability and basic performance characteristics on various popular and emerging HPC architectures, including x86 CPU, NVIDIA GPU, and AMD GPU. This study provides a first-hand experience of executing the SPEChpc 2021 suites at scale on production HPC systems, discusses real-world use cases, and serves as an initial guideline for using the benchmark suites. 
    more » « less
  4. Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU-GPU) architectures, which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD’s ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application’s performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work. 
    more » « less
  5. We have ported and verified the topography version of AWP-ODC, with discontinuous mesh feature enabled, to HIP so that it runs on AMD MI250X GPUs. 103.3% parallel efficiency was benchmarked on Frontier between 8 and 4,096 nodes or up to 32,768 GCDs. Frontier is a two exaflop/s computing system based on the AMD Radeon Instinct GPUs and EPYC CPUs, a Leadership Computing Facility at Oak Ridge National Laboratory (ORNL). This HIP topography code has been used in the production runs on Frontier, a primary computing engine currently utilizing the 2024 SCEC INCITE allocation, a 700K node-hours supercomputing time award. Furthermore, we implemented ROCm-Aware GPU direct support in the topo code, and demonstrated 14% additional reduction in time-to-solution up to 4,096 nodes. The AWP-ODC-Topo code is also tuned on TACC Vista, an Arm-based NVIDIA GH200 Grace Hopper Superchip, with excellent performance demonstrated. This poster will demonstrate the studies of weak scaling and the performance characteristics on GPUs. We discuss the efforts of verifying the ROCm-Aware development, and utilizing high-performance MVAPICH libraries with the on-the-fly compression on modern GPU clusters. 
    more » « less