This content will become publicly available on March 1, 2026

Title: Proceedings of the 2025 SIAM International Meshing Roundtable; Parallelization of the Finite Element-based Mesh Warping Algorithm Using Hybrid Parallel Programming
Warping large volume meshes has applications in biomechanics, aerodynamics, image processing, and cardiology. However, warping large, real-world meshes is computationally expensive. Existing parallel implementations of mesh warping algorithms do not take advantage of shared-memory and one-sided communication features available in the MPI-3 standard. We describe our parallelization of the finite element-based mesh warping algorithm for tetrahedral meshes. Our implementation is portable across shared and distributed memory architectures, as it takes advantage of shared memory and one-sided communication to precompute neighbor lists in parallel. We then deform a mesh by solving a Poisson boundary value problem and the resulting linear system, which has multiple right-hand sides, in parallel. Our results demonstrate excellent efficiency and strong scalability on up to 32 cores on a single node. Furthermore, we show a 33.9% increase in speedup with 256 cores distributed uniformly across 64 nodes versus our largest single node speedup while observing sublinear speedups overall.
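The abstract leans on two MPI-3 mechanisms: node-local shared-memory windows and one-sided (RMA) communication. The following is a minimal C sketch of those two mechanisms only; it is not the authors' implementation, and the "neighbor counts" being exchanged are a hypothetical stand-in for whatever data a neighbor-list precomputation would actually share.

```c
/* Sketch of MPI-3 shared-memory windows and one-sided communication.
 * Assumptions: illustrative data, ring-style target choice. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Group ranks by node so the window below is backed by real shared memory. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Each rank contributes N ints to a shared-memory window on its node. */
    const int N = 4;
    int *local;
    MPI_Win shm_win;
    MPI_Win_allocate_shared(N * sizeof(int), sizeof(int), MPI_INFO_NULL,
                            node_comm, &local, &shm_win);
    for (int i = 0; i < N; ++i)
        local[i] = 100 * rank + i;          /* hypothetical neighbor counts */

    /* Ranks on the same node can read each other's segments with plain loads. */
    MPI_Win_fence(0, shm_win);
    if (node_size > 1 && node_rank == 0) {
        MPI_Aint seg_bytes; int disp; int *peer;
        MPI_Win_shared_query(shm_win, 1, &seg_bytes, &disp, &peer);
        printf("rank %d reads %d directly from its node peer\n", rank, peer[0]);
    }
    MPI_Win_fence(0, shm_win);

    /* Across nodes, expose the same buffer in a world-wide window and pull a
       remote segment with a one-sided MPI_Get (no matching send is needed). */
    MPI_Win rma_win;
    MPI_Win_create(local, N * sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &rma_win);
    int remote[4];
    int target = (rank + 1) % size;
    MPI_Win_fence(0, rma_win);
    MPI_Get(remote, N, MPI_INT, target, 0, N, MPI_INT, rma_win);
    MPI_Win_fence(0, rma_win);
    printf("rank %d fetched %d from rank %d via MPI_Get\n", rank, remote[0], target);

    MPI_Win_free(&rma_win);
    MPI_Win_free(&shm_win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

In the paper's setting such a window would presumably hold the precomputed neighbor lists; here it only demonstrates the allocation, intra-node load access, and inter-node one-sided access patterns.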
Award ID(s):
2245153 1808553 2117449
PAR ID:
10621870
Author(s) / Creator(s):
;
Editor(s):
Si, Hang; Shepherd, Kendrick M; Zhang, Yongjie Jessica
Publisher / Repository:
Society for Industrial and Applied Mathematics
Date Published:
ISBN:
978-1-61197-857-5
Page Range / eLocation ID:
140-153
Subject(s) / Keyword(s):
mesh warping, parallel computing, hybrid parallel programming, MPI-3, shared and distributed memory, strong scalability
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. There are numerous large-scale applications requiring mesh adaptivity, e.g., computational fluid dynamics and weather prediction. Parallel processing is needed for simulations involving large-scale adaptive meshes. In this paper, we propose a parallel variational mesh quality improvement algorithm for use with distributed memory machines. Our method parallelizes the serial variational mesh quality improvement method by Huang and Kamenski. Their approach uses the Moving Mesh PDE method to adapt the mesh by minimizing an energy functional for mesh equidistribution and alignment. This leads to a system of ordinary differential equations (ODEs) whose solution determines where to move the interior mesh nodes. An efficient solution is obtained by solving the ODEs on subregions of the mesh with overlapped communication and computation (a minimal sketch of this overlap pattern appears after this list). Strong and weak scaling experiments on up to 128 cores for meshes with up to 160M elements demonstrate excellent results.
  2. State-of-the-art synchronous graph processing frameworks face both inefficiency and imbalance issues that leave their performance suboptimal. These issues include inefficient communication and imbalanced graph computation/communication costs within an iteration. We propose to replace their conventional two-sided communication model with its one-sided counterpart. Accordingly, we design SHMEMGraph, an efficient and balanced graph processing framework that is formulated across a global memory space and takes advantage of the flexibility and efficiency of one-sided communication for graph processing. Through an efficient one-sided communication channel, SHMEMGraph exploits high-performance RDMA operations while minimizing resource contention within a compute node. In addition, SHMEMGraph synthesizes a number of optimizations to address both computation imbalance and communication imbalance. Using a graph of 1 billion edges, our evaluation shows that, compared to the state-of-the-art Gemini framework, SHMEMGraph achieves an average improvement of 35.5% in job completion time across five representative graph algorithms.
  3. Constructing k-nearest neighbor (kNN) graphs is a fundamental component in many machine learning and scientific computing applications. Despite its prevalence, efficiently building all-nearest-neighbor graphs at scale on distributed heterogeneous HPC systems remains challenging, especially for large sparse non-integer datasets. We introduce optimizations for algorithms based on forests of random projection trees. Our novel GPU kernels for batched, within-leaf, exact searches achieve 1.18× speedup over sparse reference kernels with less peak memory, and up to 19× speedup over CPU for memory-intensive problems. Our library, PyRKNN, implements distributed randomized projection forests for approximate kNN search. Optimizations to reduce and hide communication overhead allow us to achieve 5× speedup, in per-iteration performance, relative to GOFMM (another projection tree, MPI-based kNN library), for a 64M 128d dataset on 1,024 processes. On a single node we achieve speedup over FAISS-GPU for dense datasets and up to 10× speedup over CPU-only libraries. PyRKNN uniquely supports distributed memory kNN graph construction for both dense and sparse coordinates on CPU and GPU accelerators.
  4. We present shared-memory parallel methods for Maximal Clique Enumeration (MCE) from a graph. MCE is a fundamental and well-studied graph analytics task, and is a widely used primitive for identifying dense structures in a graph. Due to its computationally intensive nature, parallel methods are imperative for dealing with large graphs. However, surprisingly, there do not yet exist scalable and parallel methods for MCE on a shared-memory parallel machine. In this work, we present efficient shared-memory parallel algorithms for MCE, with the following properties: (1) the parallel algorithms are provably work-efficient relative to a state-of-the-art sequential algorithm, (2) the algorithms have a provably small parallel depth, showing that they can scale to a large number of processors, and (3) our implementations on a multicore machine show good speedup and scaling behavior with increasing number of cores, and are substantially faster than prior shared-memory parallel algorithms for MCE.
  5. Simulations to calculate a single gravitational waveform (GW) can take several weeks. Yet, thousands of such simulations are needed for the detection and interpretation of gravitational waves. Future detectors will require even more accurate waveforms than those currently used. We present here the first large scale, adaptive mesh, multi-GPU numerical relativity (NR) code together with performance analysis and benchmarking. While comparisons are difficult to make, our GPU extension of the Dendro-GR NR code achieves a 6x speedup over existing state-of-the-art codes. We achieve 800 GFlops/s on a single NVIDIA A100 GPU with an overall 2.5x speedup over a two-socket, 128-core AMD EPYC 7763 CPU node with an equivalent CPU implementation. We present detailed performance analyses, parallel scalability results, and accuracy assessments for GWs computed for mass ratios q=1,2,4. We also present strong scalability up to 8 A100s and weak scaling up to 229,376 x86 cores on the Texas Advanced Computing Center's Frontera system.
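Related item 1 above rests on a standard overlap idiom: post nonblocking exchanges of boundary node data, advance the nodes that need no remote information while the messages are in flight, then finish the dependent nodes once the exchange completes. The C sketch below illustrates only that pattern; the ring topology, node counts, and the toy forward-Euler update are assumptions made for the sake of a short runnable example, not the discretization used by Huang and Kamenski.

```c
/* Sketch of overlapping halo exchange with interior computation (MPI).
 * All sizes, the ring neighbor layout, and node_velocity() are assumed. */
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical velocity from a moving-mesh ODE right-hand side. */
static double node_velocity(double x) { return -0.1 * x; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_owned = 1000;   /* mesh nodes owned by this rank (assumed)   */
    const int n_bnd   = 50;     /* owned nodes on the partition boundary     */
    const double dt   = 1e-3;

    double *x     = malloc(n_owned * sizeof(double));
    double *ghost = malloc(n_bnd * sizeof(double)); /* neighbor-owned copies */
    for (int i = 0; i < n_owned; ++i) x[i] = (double)i;

    int left  = (rank - 1 + size) % size;   /* illustrative ring of neighbors */
    int right = (rank + 1) % size;

    /* 1. Start the exchange of boundary node positions (nonblocking). */
    MPI_Request reqs[2];
    MPI_Irecv(ghost, n_bnd, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(x,     n_bnd, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    /* 2. Overlap: advance interior nodes, which need no remote data. */
    for (int i = n_bnd; i < n_owned; ++i)
        x[i] += dt * node_velocity(x[i]);

    /* 3. Finish the exchange, then update nodes that depend on ghost data. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    for (int i = 0; i < n_bnd; ++i)
        x[i] += dt * 0.5 * (node_velocity(x[i]) + node_velocity(ghost[i]));

    free(x);
    free(ghost);
    MPI_Finalize();
    return 0;
}
```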