State-of-the-art distributed-memory computer clusters contain multicore CPUs with 16 or more cores. The second generation of the Intel Xeon Phi many-core processor has more than 60 cores with 16 GB of high-performance on-chip memory. We contrast the performance of the second-generation Intel Xeon Phi, code-named Knights Landing (KNL), with 68 computational cores against the latest 18-core multicore Intel Skylake CPU. A special-purpose code solving a system of nonlinear reaction-diffusion partial differential equations with several thousand point sources, modeled mathematically by Dirac delta distributions, serves as a realistic test bed. The system is discretized in space by the finite volume method and advanced by fully implicit time-stepping, with a matrix-free implementation that gives the complex model an extremely small memory footprint. The sample application is a seven-variable model of calcium-induced calcium release (CICR) that captures the interplay between electrical excitation, calcium signaling, and mechanical contraction in a heart cell. The results demonstrate that excellent parallel scalability is possible on both hardware platforms, but that modern multicore CPUs outperform the specialized many-core Intel Xeon Phi KNL architecture for a large class of problems, such as systems of parabolic partial differential equations.
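The matrix-free approach mentioned above can be sketched in a few lines: instead of assembling the Jacobian of the implicit time step, its action on a vector is recomputed on the fly inside the linear solver. The sketch below is illustrative only; the names (NX, D, DT, react_deriv) and the toy scalar reaction term are assumptions, not the paper's seven-variable CICR model.

```c
/* Minimal sketch of a matrix-free operator application for implicit
 * time-stepping of u_t = D u_xx + f(u); names and the reaction term
 * are illustrative, not taken from the paper. */
#define NX 1024
static const double D = 1.0e-3, DT = 1.0e-2, DX = 1.0 / NX;

/* Derivative of a toy reaction term f(u) = u(1 - u). */
static double react_deriv(double u) { return 1.0 - 2.0 * u; }

/* Apply y = (I - DT*J(u)) x without assembling the Jacobian J:
 * every entry is recomputed on the fly, so memory stays O(NX). */
void apply_matrix_free(const double *u, const double *x, double *y)
{
    const double c = D * DT / (DX * DX);
    for (int i = 0; i < NX; i++) {
        double xl = (i > 0)      ? x[i - 1] : 0.0;  /* homogeneous Dirichlet */
        double xr = (i < NX - 1) ? x[i + 1] : 0.0;
        double lap = xl - 2.0 * x[i] + xr;          /* 3-point stencil */
        y[i] = x[i] - c * lap - DT * react_deriv(u[i]) * x[i];
    }
}
```

Because only the current state and a few scalars are stored, the memory footprint is linear in the number of unknowns, which is what makes the approach attractive on both platforms.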
Exploring source-to-source compiler transformation of OpenMP SIMD constructs for Intel AVX and Arm SVE vector architectures
Over the past decade, SIMD (single instruction, multiple data) or vector architectures have made significant advances and now exist across a wide range of devices, from commodity CPUs to high performance computing (HPC) cores. Intel's AVX (Advanced Vector Extensions) architecture has been one of the most popular SIMD extensions in Intel's commodity and HPC CPUs. Over the past few years, Arm has made significant inroads with its new SVE (Scalable Vector Extension), used in the top-ranked supercomputer on the Top500 list. As SIMD has become more advanced and more important, it has become equally important that compilers support these architecture extensions. In this paper, we present our approach to source-to-source compiler transformation of explicit vectorization constructs using the OpenMP SIMD directive. We present the design of a unified IR that is easily translated to the AVX and SVE vector architectures. Finally, we conduct performance evaluations on Intel AVX and Arm SVE to demonstrate how this method of vectorization can bridge the gap between auto- and manual vectorization.
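To make the transformation concrete, the sketch below contrasts the two forms such a source-to-source tool bridges: a loop annotated with the OpenMP SIMD directive and an explicit AVX2 intrinsics version of the kind a backend could emit. This is an illustrative SAXPY kernel, not the paper's IR or generated code; compile the intrinsics with AVX2/FMA support (e.g. -mavx2 -mfma).

```c
/* Two forms a source-to-source vectorizer bridges; illustrative only. */
#include <immintrin.h>

void saxpy_omp_simd(int n, float a, const float *x, float *y)
{
    /* Portable form: the directive asserts the loop is safe to vectorize. */
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void saxpy_avx2(int n, float a, const float *x, float *y)
{
    /* Explicit form: 8 floats per 256-bit register, one FMA per step. */
    __m256 va = _mm256_set1_ps(a);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(&x[i]);
        __m256 vy = _mm256_loadu_ps(&y[i]);
        _mm256_storeu_ps(&y[i], _mm256_fmadd_ps(va, vx, vy));
    }
    for (; i < n; i++)  /* scalar remainder loop */
        y[i] = a * x[i] + y[i];
}
```

An SVE backend would emit predicated loads and stores (svwhilelt_b32/svld1) instead of a fixed 8-wide loop, which eliminates the scalar remainder and keeps the code vector-length agnostic.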
- Award ID(s): 2015254
- PAR ID: 10394960
- Date Published:
- Journal Name: PMAM '22: Proceedings of the Thirteenth International Workshop on Programming Models and Applications for Multicores and Manycores
- Page Range / eLocation ID: 11 to 20
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Bhatele, A.; Hammond, J.; Baboulin, M.; Kruse, C. (Eds.) The reactive force field (ReaxFF) interatomic potential is a powerful tool for simulating the behavior of molecules in a wide range of chemical and physical systems at the atomic level. Unlike traditional classical force fields, ReaxFF employs dynamic bonding and polarizability to enable the study of reactive systems. Over the past couple of decades, highly optimized parallel implementations have been developed for ReaxFF to efficiently utilize modern hardware such as multi-core processors and graphics processing units (GPUs). However, the complexity of the ReaxFF potential poses challenges in terms of portability to new architectures (AMD and Intel GPUs, RISC-V processors, etc.) and limits the ability of computational scientists to tailor its functional form to their target systems. In this regard, the convergence of cyber-infrastructure for high performance computing (HPC) and machine learning (ML) presents new opportunities for customization, programmer productivity, and performance portability. In this paper, we explore the benefits and limitations of JAX, a modern ML library in Python and a prime example of the convergence of HPC and ML software, for implementing ReaxFF. We demonstrate that by leveraging the auto-differentiation, just-in-time compilation, and vectorization capabilities of JAX, one can attain portable, performant, and easy-to-maintain ReaxFF software. Beyond enabling MD simulations, the end-to-end differentiability of trajectories produced by ReaxFF implemented with JAX makes it possible to perform related tasks such as force field parameter optimization and meta-analysis without any significant software development. We also discuss scalability limitations of the current version of JAX for ReaxFF simulations.
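The key enabler in the item above is that forces follow from the energy by differentiation, F = -dE/dr, which JAX's auto-differentiation supplies exactly and automatically. As a language-neutral illustration of the relation being automated, here is a hedged sketch that approximates the force from a toy Lennard-Jones pair energy with central finite differences; the potential and all parameters are placeholders, not ReaxFF terms.

```c
/* Illustration of the energy-to-force relation F = -dE/dr that
 * auto-differentiation computes exactly; approximated here with
 * central finite differences for a toy Lennard-Jones pair energy. */
#include <stdio.h>
#include <math.h>

/* Toy Lennard-Jones pair energy with unit epsilon and sigma. */
static double pair_energy(double r)
{
    double s6 = pow(1.0 / r, 6.0);
    return 4.0 * (s6 * s6 - s6);
}

int main(void)
{
    double r = 1.2, h = 1e-6;
    /* Central difference: the quantity jax.grad would return analytically. */
    double force = -(pair_energy(r + h) - pair_energy(r - h)) / (2.0 * h);
    printf("force at r=%.2f: %.6f\n", r, force);
    return 0;
}
```

The same differentiation, applied end-to-end through a trajectory, is what allows force field parameters to be fit by gradient-based optimization without extra code.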
- Today, isolated trusted computation and code execution is of paramount importance to protect sensitive information and workflows from other malicious privileged or unprivileged software. Intel Software Guard Extensions (SGX) is a set of security architecture extensions, first introduced in the Skylake microarchitecture, that enables a Trusted Execution Environment (TEE). It provides an 'inverse sandbox' for sensitive programs and guarantees the integrity and confidentiality of secure computations, even against the most privileged malicious software (e.g., OS, hypervisor). SGX-capable CPUs only became available in production systems in Q3 2015, and they are not yet fully supported and adopted. Besides the capability in the CPU, the BIOS also needs to provide support for enclaves, and not many vendors have released the required updates for system support. This has led to many wrong assumptions being made about the capabilities, features, and ultimately dangers of secure enclaves. By having access to resources and publications such as white papers, patents, and the actual SGX-capable hardware and software development environment, we are in a privileged position to investigate and demystify SGX. In this paper, we first review previous trusted execution technologies, such as Arm TrustZone and Intel TXT, to better understand and appreciate the new innovations of SGX. Then, we look at the details of SGX technology, its cryptographic primitives, and the underlying concepts that power it, namely sealing, attestation, and the Memory Encryption Engine (MEE). We also consider use cases such as trusted and secure code execution on an untrusted cloud platform, and digital rights management (DRM). This is followed by an overview of the software development environment and the available libraries.
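For a sense of what the development environment described above looks like in practice, the following is a minimal hedged sketch of host-side enclave setup with the Intel SGX SDK's untrusted runtime (sgx_urts.h). The enclave image name is hypothetical, and a real project would declare its trusted entry points (ECALLs) in an EDL file from which the SDK generates call stubs.

```c
/* Hedged sketch of host-side enclave setup with the Intel SGX SDK.
 * "enclave.signed.so" and process_secret() are hypothetical placeholders. */
#include <stdio.h>
#include "sgx_urts.h"

int main(void)
{
    sgx_enclave_id_t eid = 0;
    sgx_launch_token_t token = {0};
    int updated = 0;

    /* Load and initialize the signed enclave image; the untrusted OS maps
     * it, but enclave memory is protected by the Memory Encryption Engine. */
    sgx_status_t ret = sgx_create_enclave("enclave.signed.so", SGX_DEBUG_FLAG,
                                          &token, &updated, &eid, NULL);
    if (ret != SGX_SUCCESS) {
        fprintf(stderr, "enclave creation failed: 0x%x\n", ret);
        return 1;
    }

    /* process_secret(eid, ...) would be an SDK-generated ECALL stub here. */

    sgx_destroy_enclave(eid);
    return 0;
}
```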
- The engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip were tested using different benchmarks and scientific applications. The benchmarks include HPCC and HPCG. The application-based benchmarks include AI-Benchmark-Alpha (a TensorFlow benchmark), Gromacs, OpenFOAM, and ROMS. The performance was compared to multiple Intel, AMD, and Arm CPUs and several x86 systems with NVIDIA GPUs. A brief energy efficiency estimate was performed based on TDP values. We found that in the HPCC benchmark tests, the per-core performance of Grace is similar to or faster than that of AMD Milan cores, and the high core count often allows the NVIDIA Grace CPU Superchip to reach per-node performance similar to Intel Sapphire Rapids with High Bandwidth Memory: slower in matrix multiplication (by 17%) and FFT (by 6%), faster in Linpack (by 9%). In scientific applications, the NVIDIA Grace CPU Superchip is slower by 6% to 18% in Gromacs, faster by 7% in OpenFOAM, and falls between the HBM and DDR modes of Intel Sapphire Rapids in ROMS. The combined CPU-GPU performance in Gromacs is significantly faster (by 20% to 117%) than that of any tested x86-NVIDIA GPU system. Overall, the new NVIDIA Grace Hopper Superchip and NVIDIA Grace CPU Superchip are high-performance and most likely energy-efficient solutions for HPC centers.
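The TDP-based efficiency estimate mentioned above amounts to dividing a measured benchmark score by the vendor's rated package power. The toy calculation below shows the form of that estimate; both numbers are placeholders, not results from the paper.

```c
/* Toy TDP-based efficiency estimate: performance per watt from a measured
 * benchmark score and a rated TDP. Numbers are placeholders only. */
#include <stdio.h>

int main(void)
{
    double hpl_gflops = 3000.0;  /* hypothetical Linpack score, GFLOP/s */
    double tdp_watts  = 500.0;   /* hypothetical package TDP, W */

    /* TDP bounds sustained power from above, so this is a rough
     * lower-bound estimate of achieved energy efficiency. */
    printf("efficiency >= %.2f GFLOP/s per watt\n", hpl_gflops / tdp_watts);
    return 0;
}
```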
- Scientific computing increasingly relies on multi-scale, multi-physics simulations to enhance predictive capabilities by replacing suites of stand-alone simulation codes that independently simulate various physical phenomena. Inevitably, multi-physics simulation demands high performance computing (HPC) through advanced hardware and software acceleration due to its intensive computing workload and run-time communication needs, and its research has become a hotspot across different disciplines. However, most benchmarks used in the evaluation of related work are commercial or in-house codes, and the lack of accessible open-source multi-physics benchmark suites has made it challenging to evaluate simulation performance uniformly across related disciplines. This work proposes the first open-source benchmark suite for research in multi-physics simulation, the Clarkson Open-Source Multi-physics Benchmark Suite (COMBS), with 12 selected benchmarks. Multiple metrics have been gathered for these benchmarks, such as instructions per second and memory usage. Build and benchmark scripts are also provided to improve usability. Additionally, the source codes and installation guides are available for download from a GitHub repository built by the authors. The selected benchmarks come from key applications of multi-physics simulation and highly cited publications. We believe this benchmark suite will help harness the full potential of HPC research in the field of multi-physics simulation.
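As a hedged sketch of how a harness can gather one of the metrics named above (memory usage), the snippet below records peak resident memory on Linux/POSIX via getrusage; run_benchmark() is a hypothetical placeholder for a suite kernel, and this is one possible approach, not necessarily how COMBS implements it.

```c
/* Hedged sketch: recording peak resident memory around a benchmark kernel
 * on Linux/POSIX. run_benchmark() is a hypothetical placeholder. */
#include <stdio.h>
#include <sys/resource.h>

static void run_benchmark(void) { /* ... benchmark kernel ... */ }

int main(void)
{
    run_benchmark();

    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        /* On Linux, ru_maxrss is reported in kilobytes. */
        printf("peak RSS: %ld KiB\n", ru.ru_maxrss);
    }
    return 0;
}
```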