Title: Octo-Tiger’s New Hydro Module and Performance Using HPX+CUDA on ORNL’s Summit
Octo-Tiger is a code for modeling three-dimensional self-gravitating astrophysical fluids. It was designed in particular for the study of dynamical mass transfer between interacting binary stars. Octo-Tiger is parallelized for distributed systems using HPX, an asynchronous many-task runtime system that implements the C++ standard library for parallelism and concurrency, and it uses CUDA for its gravity solver. Recently, we have remodeled Octo-Tiger’s hydro solver to use a three-dimensional reconstruction scheme. In addition, we have ported the hydro solver to GPUs using CUDA kernels. We present scaling results for the new hydro kernels on ORNL’s Summit machine using a Sedov-Taylor blast wave problem. We also compare Octo-Tiger’s new hydro scheme with its old hydro scheme, using a rotating star as a test problem.
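To make the GPU-porting pattern concrete, below is a minimal CUDA sketch of the general idea the abstract describes: a per-cell conservative hydro update launched asynchronously on a stream, so that the host-side task runtime (HPX, in Octo-Tiger's case) can overlap other work, such as ghost-zone exchange, with the kernel. The kernel name (hydro_update), the 1D flux-difference formula, and all sizes are illustrative assumptions of mine, not Octo-Tiger's actual three-dimensional reconstruction kernels.

// Minimal sketch (not Octo-Tiger's actual kernel): a per-cell conservative
// update u_new[i] = u_old[i] - dt/dx * (flux[i+1] - flux[i]) applied to one
// 1D subgrid, launched asynchronously on a CUDA stream.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void hydro_update(const double* u_old, const double* flux,
                             double* u_new, int n, double dt_over_dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Interior cells only; flux[i] is the value on the left face of cell i.
    if (i > 0 && i < n - 1) {
        u_new[i] = u_old[i] - dt_over_dx * (flux[i + 1] - flux[i]);
    }
}

int main()
{
    const int n = 1 << 20;              // number of cells (illustrative)
    const double dt_over_dx = 0.1;      // CFL-limited time step ratio (illustrative)

    double *u_old, *flux, *u_new;
    cudaMalloc(&u_old, n * sizeof(double));
    cudaMalloc(&flux, (n + 1) * sizeof(double));
    cudaMalloc(&u_new, n * sizeof(double));
    cudaMemset(u_old, 0, n * sizeof(double));
    cudaMemset(flux, 0, (n + 1) * sizeof(double));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Launch asynchronously; the calling task is free to do other work
    // until it synchronizes on the stream.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    hydro_update<<<blocks, threads, 0, stream>>>(u_old, flux, u_new, n, dt_over_dx);
    cudaStreamSynchronize(stream);

    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(u_old); cudaFree(flux); cudaFree(u_new);
    cudaStreamDestroy(stream);
    return 0;
}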
Award ID(s):
1814967
PAR ID:
10380000
Author(s) / Creator(s):
Date Published:
Journal Name:
2021 IEEE International Conference on Cluster Computing (CLUSTER)
Page Range / eLocation ID:
204 to 214
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Octo-Tiger is an astrophysics code that simulates the evolution of self-gravitating and rotating systems of arbitrary geometry based on the fast multipole method, using adaptive mesh refinement. Octo-Tiger is currently optimized to simulate the merger of well-resolved stars that can be approximated by barotropic structures, such as white dwarfs (WDs) or main-sequence stars. The gravity solver conserves angular momentum to machine precision, thanks to a ‘correction’ algorithm. The code uses HPX parallelization, which allows work and communication to overlap, leading to excellent scaling properties and making large problems computable in reasonable wall-clock times. In this paper, we investigate the code’s performance and precision by running benchmarking tests. These include simple problems, such as the Sod shock tube, as well as sophisticated, full WD binary simulations. Results are compared to analytical solutions, when known, and to other grid-based codes such as FLASH. We also compute the interaction between two WDs from the early mass transfer through to the merger and compare with past simulations of similar systems. We measure Octo-Tiger’s scaling properties up to a core count of ∼80 000, showing excellent performance for large problems. Finally, we outline the current and planned areas of development aimed at tackling a number of physical phenomena connected to observations of transients.
  2. Motivated by the increasing importance of general-purpose graphics processing unit (GPGPU) programming, exemplified by NVIDIA’s CUDA framework, as well as the difficulty, especially for novice programmers, of reasoning about performance in GPGPU kernels, we introduce a novel quantitative program logic for CUDA kernels. The logic allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of common but CUDA-specific performance bottlenecks: warp divergences, uncoalesced memory accesses, and bank conflicts. The logic is proved sound with respect to a novel operational cost semantics for CUDA kernels. The semantics, logic, and soundness proofs are formalized in Coq. An inference algorithm based on LP solving automatically synthesizes symbolic resource bounds by generating derivations in the logic. This algorithm is the basis of RaCUDA, an end-to-end resource-analysis tool for kernels, which has been implemented using an existing resource-analysis tool for imperative programs. An experimental evaluation on a suite of benchmarks shows that the analysis is effective in aiding the detection of performance bugs in CUDA kernels.
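    The bottlenecks named in this abstract are easiest to see in code. The sketch below is my own illustration, not taken from the paper: two CUDA kernels that compute reductions over a row-major matrix, one with strided (uncoalesced) global-memory accesses within a warp and one with consecutive (coalesced) accesses. This is the kind of access-pattern difference a resource analysis such as RaCUDA aims to flag; all names and sizes are hypothetical.

// Illustration of coalesced vs. uncoalesced global-memory access (mine, not
// the paper's benchmark code). The matrix a is rows x cols, row-major.
#include <cuda_runtime.h>

// Each thread sums one row: thread t reads a[t*cols + j]. At a fixed j, the
// addresses touched by a warp are `cols` doubles apart -> uncoalesced.
__global__ void sum_rows_strided(const double* a, double* out, int rows, int cols)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < rows) {
        double s = 0.0;
        for (int j = 0; j < cols; ++j)
            s += a[t * cols + j];
        out[t] = s;
    }
}

// Each thread sums one column: thread t reads a[j*cols + t]. At a fixed j,
// consecutive threads read consecutive addresses -> coalesced transactions.
__global__ void sum_cols_coalesced(const double* a, double* out, int rows, int cols)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < cols) {
        double s = 0.0;
        for (int j = 0; j < rows; ++j)
            s += a[j * cols + t];
        out[t] = s;
    }
}

int main()
{
    const int rows = 1024, cols = 1024;   // illustrative sizes
    double *a, *out;
    cudaMalloc(&a, rows * cols * sizeof(double));
    cudaMalloc(&out, rows * sizeof(double));
    cudaMemset(a, 0, rows * cols * sizeof(double));

    int threads = 256;
    sum_rows_strided<<<(rows + threads - 1) / threads, threads>>>(a, out, rows, cols);
    sum_cols_coalesced<<<(cols + threads - 1) / threads, threads>>>(a, out, rows, cols);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(out);
    return 0;
}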
  3.
    General-purpose programming on GPUs (GPGPU) is becoming increasingly in vogue as applications such as machine learning and scientific computing demand high throughput in vector-parallel applications. NVIDIA's CUDA toolkit seeks to make GPGPU programming accessible by allowing programmers to write GPU functions, called kernels, in a small extension of C/C++. However, due to CUDA's complex execution model, the performance characteristics of CUDA kernels are difficult to predict, especially for novice programmers. This paper introduces a novel quantitative program logic for CUDA kernels, which allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of common but CUDA-specific performance bottlenecks. The logic is proved sound with respect to a novel operational cost semantics for CUDA kernels. The semantics, logic and soundness proofs are formalized in Coq. An inference algorithm based on LP solving automatically synthesizes symbolic resource bounds by generating derivations in the logic. This algorithm is the basis of RaCuda, an end-to-end resource-analysis tool for kernels, which has been implemented using an existing resource-analysis tool for imperative programs. An experimental evaluation on a suite of CUDA benchmarks shows that the analysis is effective in aiding the detection of performance bugs in CUDA kernels. 
  4. In recent years, solvers for finite-element discretizations of linear or linearized saddle-point problems, like the Stokes and Oseen equations, have become well established. There are two main classes of preconditioners for such systems: those based on a block-factorization approach and those based on monolithic multigrid. Both classes of preconditioners have several critical choices to be made in their composition, such as the selection of a suitable relaxation scheme for monolithic multigrid. From existing studies, some insight can be gained as to what options are preferable in low-performance computing settings, but there are very few fair comparisons of these approaches in the literature, particularly for modern architectures, such as GPUs. In this paper, we perform a comparison between a Block-Triangular preconditioner and monolithic multigrid methods with the three most common choices of relaxation scheme – Braess-Sarazin, Vanka, and Schur-Uzawa. We develop a performant Vanka relaxation algorithm for structured-grid discretizations, which takes advantage of memory efficiencies in this setting. We detail the behavior of the various CUDA kernels for the multigrid relaxation schemes and evaluate their individual arithmetic intensity, performance, and runtime. Running a preconditioned FGMRES solver for the Stokes equations with these preconditioners allows us to compare their efficiency in a practical setting. We show that monolithic multigrid can outperform Block-Triangular preconditioning, and that using Vanka or Braess-Sarazin relaxation is most efficient. Even though multigrid with Vanka relaxation exhibits reduced performance on the CPU (up to 100% slower than Braess-Sarazin), it is able to outperform Braess-Sarazin by more than 20% on the GPU, making it a competitive algorithm, especially given the high amount of algorithmic tuning needed for effective Braess-Sarazin relaxation.

     
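    As a rough illustration of the block-factorization approach compared in this paper, the sketch below (my own, assuming a standard Stokes-type block structure rather than the paper's exact implementation) applies a block upper-triangular preconditioner: an approximate Schur-complement solve for the pressure, followed by an approximate velocity-block solve with the pressure coupling subtracted. The inner solves are left as callbacks, which in practice might be a multigrid V-cycle and a pressure mass-matrix solve; all identifiers are placeholders.

// Sketch of one application of a block upper-triangular preconditioner for a
// saddle-point system [A B^T; B 0][u; p] = [f; g]:
//   p_hat = solveS(r_p)               (approximate Schur-complement solve)
//   u_hat = solveA(r_u - B^T p_hat)   (approximate velocity-block solve)
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using ApplyOp     = std::function<Vec(const Vec&)>;  // y = Op * x
using ApproxSolve = std::function<Vec(const Vec&)>;  // y ~= Op^{-1} * x

struct BlockTriangularPrec {
    ApplyOp     BT;       // applies B^T (pressure-to-velocity coupling)
    ApproxSolve solveA;   // approximate velocity-block solve
    ApproxSolve solveS;   // approximate (negative) Schur-complement solve

    // Apply the preconditioner to a residual split into (r_u, r_p).
    void apply(const Vec& r_u, const Vec& r_p, Vec& u_hat, Vec& p_hat) const {
        p_hat = solveS(r_p);               // pressure correction
        Vec coupled = BT(p_hat);           // B^T p_hat
        Vec rhs(r_u.size());
        for (std::size_t i = 0; i < rhs.size(); ++i)
            rhs[i] = r_u[i] - coupled[i];  // remove pressure coupling
        u_hat = solveA(rhs);               // velocity correction
    }
};

int main() {
    BlockTriangularPrec prec;
    // Identity placeholders standing in for real multigrid / mass-matrix solves.
    prec.BT     = [](const Vec& x) { return x; };
    prec.solveA = [](const Vec& x) { return x; };
    prec.solveS = [](const Vec& x) { return x; };
    Vec u, p;
    prec.apply(Vec{1.0, 2.0}, Vec{0.5, 0.5}, u, p);
    return 0;
}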
  5. This paper introduces a new weighting scheme for particle-grid transfers that generates hybrid Lagrangian/Eulerian fluid simulations with uniform particle distributions and precise volume control. At its core, our approach reformulates the construction of Power Particles [de Goes et al. 2015] by computing volume-constrained density kernels. We employ these optimized kernels as particle domains within the Generalized Interpolation Material Point method (GIMP) in order to incorporate Power Particles into the Particle-In-Cell framework, hence the name the Power Particle-In-Cell method. We address the construction of volume-constrained density kernels as a regularized optimal transportation problem and describe an iterative solver based on localized Gaussian convolutions that leads to a significant performance speedup compared to [de Goes et al. 2015]. We also present novel extensions for handling free surfaces and solid obstacles that bypass the need for cell clipping and ghost particles. We demonstrate the advantages of our transfer weights by improving hybrid schemes for fluid simulation such as the Fluid Implicit Particle (FLIP) method and the Affine Particle-In-Cell (APIC) method with volume preservation and robustness to varying particle-per-cell ratio, while retaining low numerical dissipation, conserving linear and angular momenta, and avoiding particle reseeding or post-process relaxations. 
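    For context on what a particle-grid transfer does, here is a minimal, generic particle-to-grid (P2G) sketch of my own. It uses ordinary 1D linear hat weights, not the paper's volume-constrained density kernels, and is only meant to show the scatter step that FLIP/APIC/GIMP-style schemes share: particles deposit mass and momentum onto nearby grid nodes, and grid velocity is recovered as momentum divided by mass. All values and names are illustrative.

// Generic 1D particle-to-grid transfer with linear hat weights (illustrative,
// not the Power Particle-In-Cell weighting of the paper).
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int    n_nodes = 8;
    const double dx = 1.0;

    std::vector<double> px = {1.3, 2.7, 2.9};   // particle positions
    std::vector<double> pv = {0.5, -1.0, 2.0};  // particle velocities
    std::vector<double> pm = {1.0, 1.0, 1.0};   // particle masses

    std::vector<double> grid_mass(n_nodes, 0.0);
    std::vector<double> grid_mom(n_nodes, 0.0);

    // Scatter: each particle contributes to its two neighbouring nodes.
    for (std::size_t p = 0; p < px.size(); ++p) {
        int    i = static_cast<int>(std::floor(px[p] / dx));
        double t = px[p] / dx - i;     // fractional position within the cell
        double w[2] = {1.0 - t, t};    // linear hat weights, summing to 1
        for (int k = 0; k < 2; ++k) {
            grid_mass[i + k] += w[k] * pm[p];
            grid_mom[i + k]  += w[k] * pm[p] * pv[p];
        }
    }

    // Grid velocity = momentum / mass (only where mass was received).
    for (int i = 0; i < n_nodes; ++i) {
        double v = grid_mass[i] > 0.0 ? grid_mom[i] / grid_mass[i] : 0.0;
        std::printf("node %d: mass %.3f velocity %.3f\n", i, grid_mass[i], v);
    }
    return 0;
}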