skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on June 9, 2026

Title: Statistical Treatment of Variable MPI Latencies and MPI-Communication Hiding for Matrix-Free FEM
We consider large-scale implicit solvers for the numerical solution of partial differential equations (PDEs). The solvers require the high-bandwith networks of an HPC system for a fast time to solution. The increasing variability in performance of the HPC systems, most likely caused by variable communication latencies and network congestion, however, makes the execution time of solver algorithms unpredictable and hard to measure. In particular, the performance variability of the underlying system makes the reliable comparison of different algorithms and implementations difficult or impossible on HPC. We propose the use of statistical methods relying on hidden Markov models (HMM) to separate variable performance data into regimes corresponding to different levels of system latency. This allows us to, for ex- ample, identify and remove time periods when extremely high system latencies throttle application performance and distort performance measurements. We apply HMM to the careful analysis of implicit conjugate gradient solvers for finite-element discretized PDE, in particular comparing several new communication hiding methods for matrix-free operators of a PDE, which are critical for achieving peak performance in state-of-the-art PDE solvers. The HMM analysis allows us to overcome the strong performance variability in the HPC system. Our performance results for a model PDE problem discretized with 135 million degrees of freedom parallelized over 7168 cores of the Anvil supercomputer demonstrate that the communication hiding techniques can achieve up to a 10% speedup for the matrix-free matrix-vector product.  more » « less
Award ID(s):
2343865
PAR ID:
10620818
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
Format(s):
Medium: X
Location:
ACM International Conference on Supercomputing 2025, Salt Lake City, UT, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Fractional PDEs have recently found several geophysics and imaging science applications due to their nonlocal nature and their flexibility in capturing sharp transitions across interfaces.However, this nonlocality makes it challenging to design efficient solvers for such problems.In this paper, we introduce a spectral method based on an ultraspherical polynomial discretization of the Caffarelli–Silvestre extension to solve such PDEs on rectangular and disk domains.We solve the discretized problem using tensor equation solvers and thus can solve higher-dimensional PDEs.In addition, we introduce both serial and parallel domain decomposition solvers.We demonstrate the numerical performance of our methods on a 3D fractional elliptic PDE on a cube as well as an application to optimization problems with fractional PDE constraints. 
    more » « less
  2. Quantum computing has the potential to solve certain compute-intensive problems faster than classical computing by leveraging the quantum mechanical properties of superposition and entanglement. This capability can be particularly useful for solving Partial Differential Equations (PDEs), which are challenging to solve even for High-Performance Computing (HPC) systems, especially for multidimensional PDEs. This led researchers to investigate the usage of Quantum-Centric High-Performance Computing (QC-HPC) to solve multidimensional PDEs for various applications. However, the current quantum computing-based PDE-solvers, especially those based on Variational Quantum Algorithms (VQAs) suffer from limitations, such as low accuracy, long execution times, and limited scalability. In this work, we propose an innovative algorithm to solve multidimensional PDEs with two variants. The first variant uses Finite Difference Method (FDM), Classical-to-Quantum (C2Q) encoding, and numerical instantiation, whereas the second variant utilizes FDM, C2Q encoding, and Column-by-Column Decomposition (CCD). We evaluated the proposed algorithm using the Poisson equation as a case study and validated it through experiments conducted on noise-free and noisy simulators, as well as hardware emulators and real quantum hardware from IBM. Our results show higher accuracy, improved scalability, and faster execution times in comparison to variational-based PDE-solvers, demonstrating the advantage of our approach for solving multidimensional PDEs. 
    more » « less
  3. This 3 hour course provides a detailed overview of grid-free Monte Carlo methods for solving partial differential equations (PDEs) based on the walk on spheres (WoS) algorithm, with a special emphasis on problems with high geometric complexity. PDEs are a basic building block of models and algorithms used throughout science, engineering and visual computing. Yet despite decades of research, conventional PDE solvers struggle to capture the immense geometric complexity of the natural world. A perennial challenge is spatial discretization: traditionally, one must partition the domain into a high-quality volumetric mesh—a process that can be brittle, memory intensive, and difficult to parallelize. WoS makes a radical departure from this approach, by reformulating the problem in terms of recursive integral equations that can be estimated using the Monte Carlo method, eliminating the need for spatial discretization. Since these equations strongly resemble those found in light transport theory, one can leverage deep knowledge from Monte Carlo rendering to develop new PDE solvers that share many of its advantages: no meshing, trivial parallelism, and the ability to evaluate the solution at any point without solving a global system of equations. The course is divided into two parts. Part I will cover the basics of using WoS to solve fundamental PDEs like the Poisson equation. Topics include formulating the solution as an integral equation, generating samples via recursive random walks, and employing accelerated distance and ray intersection queries to efficiently handle complex geometries. Participants will also gain experience setting up demo applications involving data interpolation, heat transfer, and geometric optimization using the open-source “Zombie” library, which implements various grid-free Monte Carlo PDE solvers. Part II will feature a mini-panel of academic and industry contributors covering advanced topics including variance reduction, differentiable and multi-physics simulation, and applications in industrial design and robust geometry processing. 
    more » « less
  4. Summary To accelerate the communication between nodes, supercomputers are now equipped with multiple network adapters per node, also referred to as HCAs (Host Channel Adapters), resulting in a “multi‐rail”/“multi‐HCA” network. For example, the ThetaGPU system at Argonne National Laboratory (ANL) has eight adapters per node; with this many networking resources available, utilizing all of them becomes non‐trivial. The Message Passing Interface (MPI) is a dominant model for high‐performance computing clusters. Not all MPI collectives utilize all resources, and this becomes more apparent with advances in bandwidth and adapter count in a given cluster. In this work, we provide a thorough performance analysis of existing multirail solutions and their implications on collectives and present the necessity for further enhancement. Specifically, we propose novel designs for hierarchical, multi‐HCA‐aware Allgather. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter‐node and intra‐node communication. At the micro‐benchmark level, we see large inter‐node improvements up to 62% and 61% better than HPC‐X and MVAPICH2‐X for 1024 processes. Because Allgather is used in Ring‐Allreduce, our designs also improve its performance by 56% and 44% compared to HPC‐X and MVAPICH2‐X, respectively. At the application level, our enhanced Allgather shows and improvement in a matrix‐vector multiplication kernel when compared to HPC‐X and MVAPICH2‐X, and Allreduce performs up to 7.83% better in deep learning training against MVAPICH2‐X. 
    more » « less
  5. Summation-by-parts (SBP) finite difference methods are widely used in scientific applications alongside a special treatment of boundary conditions through the simultaneous-approximate-term (SAT) technique which enables the valuable proof of numerical stability. Our work is motivated by multi-scale earthquake cycle simulations described by partial differential equations (PDEs) whose discretizations lead to huge systems of equations and often rely on iterative schemes and parallel implementations to make the nu- merical solutions tractable. In this study, we consider 2D, variable coefficient elliptic PDEs in complex geometries discretized with the SBP-SAT method. The multigrid method is a well-known, efficient solver or preconditioner for traditional numerical discretizations, but they have not been well-developed for SBP-SAT methods on HPC platforms. We propose a custom geometric-multigrid pre- conditioned conjugate-gradient (MGCG) method that applies SBP- preserving interpolations. We then present novel, matrix-free GPU kernels designed specifically for SBP operators whose differences from traditional methods make this task nontrivial but that perform 3× faster than SpMV while requiring only a fraction of memory. The matrix-free GPU implementation of our MGCG method per- forms 5× faster than the SpMV counterpart for the largest problems considered (67 million degrees of freedom). When compared to off- the-shelf solvers in the state-of-the-art libraries PETSc and AmgX, our implementation achieves superior performance in both itera- tions and overall runtime. The method presented in this work offers an attractive solver for simulations using the SBP-SAT method. 
    more » « less