Line segment intersection is one of the elementary operations in computational geometry. Complex problems in Geographic Information Systems (GIS), such as computing map overlays or spatial joins over polygonal data, require solving segment intersections. The plane sweep paradigm finds geometric intersections efficiently, but it is difficult to parallelize because it processes spatial events in order. We present a new fine-grained parallel algorithm for geometric intersection and its CPU and GPU implementations using OpenMP and OpenACC. To the best of our knowledge, this is the first work demonstrating an effective parallelization of plane sweep on GPUs. We chose a compiler-directive-based approach because it makes parallelizing sequential code simple. On an Nvidia Tesla P100 GPU, our implementation achieves around 40X speedup on the line segment intersection problem for 40K and 80K data sets compared to the sequential CGAL library.
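As a point of reference for the intersection primitive mentioned above, here is a minimal Python sketch of the standard orientation-based segment test, wrapped in a brute-force all-pairs loop. It is not the plane sweep or the parallel algorithm from the abstract; the function names (`orientation`, `segments_intersect`, `brute_force_intersections`) and the tuple-based segment representation are assumptions made for illustration only.

```python
from itertools import combinations

def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p).

    Returns 1 for a counter-clockwise turn, -1 for clockwise, 0 for collinear.
    """
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (cross > 0) - (cross < 0)

def on_segment(p, q, r):
    """Given that p, q, r are collinear, check whether r lies on segment pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))

def segments_intersect(s1, s2):
    """True if the closed segments s1 = (a, b) and s2 = (c, d) share a point."""
    (a, b), (c, d) = s1, s2
    o1, o2 = orientation(a, b, c), orientation(a, b, d)
    o3, o4 = orientation(c, d, a), orientation(c, d, b)
    # General case: the endpoints of each segment lie on opposite sides of the other.
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint of one segment lies on the other segment.
    if o1 == 0 and on_segment(a, b, c):
        return True
    if o2 == 0 and on_segment(a, b, d):
        return True
    if o3 == 0 and on_segment(c, d, a):
        return True
    if o4 == 0 and on_segment(c, d, b):
        return True
    return False

def brute_force_intersections(segments):
    """O(n^2) baseline: index pairs (i, j), i < j, of intersecting segments."""
    return [(i, j)
            for (i, s), (j, t) in combinations(enumerate(segments), 2)
            if segments_intersect(s, t)]

if __name__ == "__main__":
    segs = [((0, 0), (2, 2)), ((0, 2), (2, 0)), ((3, 3), (4, 4))]
    print(brute_force_intersections(segs))
```

Each pair test in the brute-force loop is independent of the others, which is what makes this baseline easy to parallelize; a plane sweep does less work overall but, as the abstract notes, processes events in order.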
This content will become publicly available on November 5, 2026
Efficient GPU Parallelization of Electronic Transport and Nonequilibrium Dynamics from Electron-Phonon Interactions in the Perturbo Code
The Boltzmann transport equation (BTE) with electron-phonon (e-ph) interactions computed from first principles is widely used to study electronic transport and nonequilibrium dynamics in materials. Calculating the e-ph collision integral is the most important step in the BTE, but it remains computationally costly, even with current MPI+OpenMP parallelization. This challenge makes it difficult to study materials with large unit cells and to achieve high resolution in momentum space. Here, we show acceleration of BTE calculations of electronic transport and ultrafast dynamics using graphical processing units (GPUs). We implement a novel data structure and algorithm, optimized for GPU hardware and developed using OpenACC, to process scattering channels and efficiently compute the collision integral. This approach significantly reduces the overhead for data referencing, movement, and synchronization. Relative to the efficient CPU implementation in the open-source package Perturbo (v2.2.0), used as a baseline, this approach achieves a speed-up of 40 times for both transport and nonequilibrium dynamics on GPU hardware, and achieves nearly linear scaling up to 100 GPUs. The novel data structure can be generalized to other electron interactions and scattering processes. We released this GPU implementation in the latest public version (v3.0.0) of Perturbo. The new MPI+OpenMP+GPU parallelization enables sweeping studies of e-ph physics and electron dynamics in conventional and quantum materials, and prepares Perturbo for exascale supercomputing platforms.
- Award ID(s): 2209262
- PAR ID: 10650576
- Publisher / Repository: arXiv
- Date Published:
- Journal Name: arXiv.org
- ISSN: 2331-8422
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Geographic information systems deal with spatial data and its analysis. Spatial data contains many attributes with location information. Spatial autocorrelation is a fundamental concept in spatial analysis. It suggests that similar objects tend to cluster in geographic space. Hotspots, an example of autocorrelation, are statistically significant clusters of spatial data. Other autocorrelation measures, such as Moran’s I, are used to quantify spatial dependence. Large-scale spatial autocorrelation methods are compute-intensive. Fast methods for hotspot detection and analysis have been crucial during the COVID-19 pandemic. Therefore, we have developed parallelization methods for heterogeneous CPU and GPU environments. To the best of our knowledge, this is the first GPU- and SIMD-based design and implementation of autocorrelation kernels. Earlier methods in the literature introduced cluster-based and MapReduce-based parallelization. We used intrinsics to exploit SIMD parallelism on the x86 CPU architecture, and MPI Graph Topology to minimize inter-process communication. Our benchmarks for CPU/GPU optimizations gain up to 750X relative speedup with an 8-GPU setup compared to the baseline sequential implementation. Compared to the best implementation using OpenMP with an R-tree data structure on a single compute node, our accelerated hotspots benchmark gains a 25X speedup. For real-world US counties and COVID data evolution calculated over 500 days, we gain up to 110X speedup, reducing the time from 33 minutes to 0.3 minutes.
- The Boltzmann Transport Equation (BTE) for phonons is often used to predict thermal transport at submicron scales in semiconductors. The BTE is a seven-dimensional nonlinear integro-differential equation, which makes it difficult to solve even after linearization under the single relaxation time approximation. Furthermore, parallelization and load balancing are challenging, given the high dimensionality and the variability of the linear systems' conditioning. This work presents a 'synthetic' scalable parallelization method for solving the BTE on large-scale systems. The method includes cell-based parallelization, combined band+cell-based parallelization, and a batching technique. The essential computational ingredient of cell-based parallelization is a sparse matrix-vector product (SpMV) that can be integrated with an existing linear algebra library such as PETSc. The combined approach enhances the cell-based method by further parallelizing the band dimension to take advantage of low inter-band communication costs. For the batched approach, we developed a batched SpMV that enables multiple linear systems to be solved simultaneously, merging many MPI messages to reduce communication costs and thus maintaining scalability when the grain size becomes very small. We present numerical experiments to demonstrate our method's excellent speedups and scalability up to 16384 cores for a problem with 12.6 billion unknowns.
- Accurate simulation of earthquake scenarios is essential for advancing seismic hazard analysis and risk mitigation strategies. At the San Diego Supercomputer Center (SDSC), our research focuses on optimizing the performance and reliability of large-scale earthquake simulations using the AWP-ODC software. By implementing GPU-aware MPI calls, we enable direct data processing within GPU memory, eliminating the need for explicit data transfers between CPU and GPU. This GPU-aware MPI achieves nearly ideal parallel efficiency at full scale across both Nvidia and AMD GPUs, leveraging the MVAPICH-Plus support on Frontier at Oak Ridge National Laboratory and Vista at the Texas Advanced Computing Center. We utilized the MVAPICH-Plus 4.0 compiler to enable ZFP compression, which significantly enhances inter-node communication efficiency, a critical improvement given the communication bottleneck inherent in large-scale simulations. Our GPU-aware AWP-ODC versions include linear forward, topography, and nonlinear Iwan-type solvers with discontinuous mesh support. On the Frontier system with MVAPICH 4.0, HIP-aware MPI calls on MI250X GPUs deliver nearly ideal weak-scaling speedup up to 8,192 nodes for both the linear and topography versions. On TACC’s Vista system, CUDA-aware MPI calls on GH200 GPUs substantially outperform their non-GPU-aware counterparts across all three solver versions. This poster will present a detailed evaluation of GPU-aware AWP-ODC using MVAPICH, including the impact of ZFP message compression compared to the native versions. Our results highlight the importance of MVAPICH support for GPU-aware MPI and on-the-fly compression techniques for accelerating and scaling earthquake simulations.
- GPU-accelerated on-the-fly nonadiabatic dynamics is enabled by interfacing the linearized semiclassical dynamics approach with the TeraChem electronic structure program. We describe the computational workflow of the “PySCES” code interface, a Python code for semiclassical dynamics with on-the-fly electronic structure, including parallelization over multiple GPU nodes. We showcase the abilities of this code and present timings for two benchmark systems: fulvene solvated in acetonitrile and a charge transfer system in which a photoexcited zinc-phthalocyanine donor transfers charge to a fullerene acceptor through multiple electronic states on an ultrafast timescale. Our implementation paves the way for an efficient semiclassical approach to model the nonadiabatic excited state dynamics of complex molecules, materials, and condensed phase systems.
