NSF PAR Search | NSF Public Access Repository

A Survey on Heterogeneous Computing Using SmartNICs and Emerging Data Processing Units (Expanded Preprint)

Tibbetts, Nathan; Ibtisum, Sifat; Puri, Satish (March 2025, arxiv)

The emergence of new, off-path smart network cards (SmartNICs), known generally as Data Processing Units (DPUs), has opened a wide range of research opportunities. Of particular interest is the use of these and related devices in tandem with their host’s CPU, creating a heterogeneous computing system with new properties and strengths to be explored, capable of accelerating a wide variety of workloads. This survey begins by providing background on this new field: its origins, motivations, and challenges; a few of the current market offerings for DPUs; and brief information about the major programming languages and frameworks for using them. We then review and categorize a number of recent works in the field, covering a wide variety of studies, benchmarks, and application areas such as data center infrastructure, commercial uses, and AI and ML acceleration.
Free, publicly-accessible full text available March 3, 2026
Extending Segment Tree for Polygon Clipping and Parallelizing using OpenMP and OpenACC Directives

https://doi.org/10.1145/3673038.3673141

Ashan, Buddhi; Puri, Satish; Prasad, Sushil (August 2024, ACM)

Full Text Available
Geospatial Filter and Refine Computations on NVidia Bluefield Data Processing Units (DPU)

Kaymak, Derda; Puri, Satish (November 2023, Poster Session at Supercomputing Conference 23)

Full Text Available
Geospatial Filter and Refine Computations on NVidia Bluefield Data Processing Units (DPU)

Kaymak, Derda; Puri, Satish (November 2023, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'23))

In this poster, we show how to leverage NVidia’s Bluefield Data Processing Unit (DPU) in geospatial systems. Existing work in the literature has explored DPUs in the context of machine learning, compression, and MPI acceleration. We show our designs for integrating DPUs into existing high performance geospatial systems like MPI-GIS. The workflow of a typical spatial computing workload consists of two phases: filter and refine. First, we used the DPU as a target to offload spatial computations from the host CPU, and we show the performance improvements due to this offload. Next, we used the DPU for network I/O processing. In the network I/O case, the query data first arrives at the DPU for filtering, and the query then goes to the CPU for refinement. A DPU-based filter and refine system can be useful in other domains like physics, where an FPGA is used to perform the filter step to handle Big Data. We used Bluefield-2 and Bluefield-3 in our experiments. For the scalability study, we used up to 16 DPUs.
Full Text Available
Message from Workshop Chairs

https://doi.org/10.1109/HiPCW61695.2023.00007

Ghafoor, Sheikh; Prasad, Sushil K; Kuvelkar, Ashish; Sinha, Sharad; Puri, Satish (December 2023, IEEE)

Full Text Available
Fine-grained dynamic load balancing in spatial join by work stealing on distributed memory

https://doi.org/10.1145/3557915.3560936

Yang, Jie; Puri, Satish; Zhou, Hui (November 2022, Proceedings of the 30th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL))

Spatial join is an important operation for combining spatial data. Parallelization is essential for improving spatial join performance. However, load imbalance due to data skew limits the scalability of parallel spatial join. There are many work sharing techniques to address this problem in a parallel environment. One technique is to use data and space partitioning and then schedule the partitions among threads/processes with the goal of minimizing workload differences across threads/processes. However, load imbalance still exists due to differences in the join costs of different pairs of input geometries in the partitions. To address the load imbalance problem, we have designed a work stealing spatial join system (WSSJ-DM) for a distributed memory environment. Work stealing is an approach to dynamic load balancing in which an idle processor steals computational tasks from other processors. This is the first work that uses the work stealing concept (instead of work sharing) to parallelize spatial join computation on a large compute cluster. We have evaluated the scalability of the system on shared and distributed memory. Our experimental evaluation shows that work stealing is an effective strategy. We compared WSSJ-DM with work sharing implementations of spatial join in a high performance computing environment using partitioned and un-partitioned datasets. Static and dynamic load balancing approaches were used for comparison. We study the effect of memory affinity on the work stealing operations involved in spatial join on a multi-core processor. WSSJ-DM performed spatial join using ST_Intersection on Lakes (8.4M polygons) and Parks (10M polygons) in 30 seconds using 35 compute nodes on a cluster (1260 CPU cores). In contrast, a work sharing Master-Worker implementation took 160 seconds.
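The work stealing idea described in this abstract can be illustrated with a small shared-memory sketch. This is not the WSSJ-DM implementation, which targets distributed memory; it only shows the general pattern of an idle worker taking tasks from a busier peer. The function name `run_work_stealing`, the block-wise initial assignment, and the "steal from the longest queue" heuristic are all assumptions made for illustration.

```python
import threading
from collections import deque


def run_work_stealing(task_costs, num_workers):
    """Run every task exactly once; idle workers steal from busier peers.

    task_costs[i] is a hypothetical stand-in for the join cost of task i.
    Returns one list of completed task ids per worker.
    """
    # Static initial assignment in contiguous blocks (may be badly skewed).
    block = max(1, (len(task_costs) + num_workers - 1) // num_workers)
    queues = [deque() for _ in range(num_workers)]
    for task_id in range(len(task_costs)):
        queues[task_id // block].append(task_id)
    locks = [threading.Lock() for _ in range(num_workers)]
    done = [[] for _ in range(num_workers)]

    def take_local(w):
        # The owner takes work from the tail of its own queue.
        with locks[w]:
            return queues[w].pop() if queues[w] else None

    def steal(w):
        # Heuristic (assumption): try victims in order of queue length.
        # The unlocked length read is only a hint; the pop is locked.
        victims = sorted(
            (v for v in range(num_workers) if v != w),
            key=lambda v: len(queues[v]),
            reverse=True,
        )
        for v in victims:
            with locks[v]:
                if queues[v]:
                    return queues[v].popleft()  # thief takes from the head
        return None

    def worker(w):
        while True:
            task = take_local(w)
            if task is None:
                task = steal(w)
            if task is None:
                return  # no task left anywhere; tasks are never re-added
            sum(range(task_costs[task]))  # simulated work
            done[w].append(task)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    # Skewed workload: the first block holds the expensive tasks.
    per_worker = run_work_stealing([200_000] * 5 + [100] * 15, num_workers=4)
    print([len(tasks) for tasks in per_worker])
```

Because no task spawns new tasks in this sketch, a worker can safely exit once it finds every queue empty. A distributed-memory version would replace the locked queues with message passing between processes.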
Full Text Available
Accelerating Spatial Autocorrelation Computation with Parallelization, Vectorization and Memory Access Optimization

Paudel, Anmol; Puri, Satish (January 2022, 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Italy, May 2022)

Geographic information systems deal with spatial data and its analysis. Spatial data contains many attributes with location information. Spatial autocorrelation is a fundamental concept in spatial analysis. It suggests that similar objects tend to cluster in geographic space. Hotspots, an example of autocorrelation, are statistically significant clusters of spatial data. Other autocorrelation measures like Moran’s I are used to quantify spatial dependence. Large scale spatial autocorrelation methods are compute-intensive. Fast methods for hotspot detection and analysis have become crucial during the COVID-19 pandemic. Therefore, we have developed parallelization methods for heterogeneous CPU and GPU environments. To the best of our knowledge, this is the first GPU and SIMD-based design and implementation of autocorrelation kernels. Earlier methods in the literature introduced cluster-based and MapReduce-based parallelization. We have used intrinsics to exploit SIMD parallelism on the x86 CPU architecture. We have used MPI Graph Topology to minimize inter-process communication. Our benchmarks for CPU/GPU optimizations gain up to 750X relative speedup with an 8-GPU setup when compared to a baseline sequential implementation. Compared to the best implementation using OpenMP + R-tree data structure on a single compute node, our accelerated hotspots benchmark gains a 25X speedup. For real world US counties and COVID data evolution calculated over 500 days, we gain up to 110X speedup, reducing time from 33 minutes to 0.3 minutes.
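For readers unfamiliar with Moran’s I, the measure mentioned in this abstract, a minimal sequential sketch follows. It uses the standard global definition I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2, where m is the mean of the values and W is the sum of all weights. The function name `morans_i` and the dense weight-matrix layout are assumptions for illustration; this is a plain baseline, not the paper's GPU or SIMD kernel.

```python
def morans_i(values, weights):
    """Global Moran's I for N values and an N x N spatial weight matrix.

    weights[i][j] > 0 means locations i and j are neighbours
    (hypothetical dense layout; large datasets would use a sparse form).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]

    total_weight = sum(sum(row) for row in weights)
    variance_sum = sum(d * d for d in dev)
    if total_weight == 0 or variance_sum == 0:
        raise ValueError("Moran's I is undefined for zero weights or constant values")

    # Cross-product of deviations over all neighbouring pairs.
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    return (n / total_weight) * (cross / variance_sum)


if __name__ == "__main__":
    # Four locations on a line; each is a neighbour of the next one.
    chain = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    print(morans_i([1, 1, 0, 0], chain))  # clustered values: positive
    print(morans_i([1, 0, 1, 0], chain))  # alternating values: negative
```

The double loop over all pairs is the part that dominates the running time on large inputs, which is why it is a natural target for parallelization and vectorization.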
Full Text Available
Efficient Filters for Geometric Intersection Computations using GPU

Liu, Yiming; Puri, Satish (November 2020, ACM SIGSPATIAL 2020)
Geometric intersection algorithms are fundamental to spatial analysis in Geographic Information Systems (GIS). Applying high performance computing to perform geometric intersection on huge amounts of spatial data is necessary to get real-time results. Given two input geometries (polygon or polyline) of a candidate pair, we introduce a new two-step geospatial filter that first creates sketches of the geometries and uses them to detect workload, and then refines the sketches by their common areas to decrease the overall computation in the refine phase. We call this filter the PolySketch-based CMBR (PSCMBR) filter. We show the application of this filter in speeding up the line segment intersection (LSI) reporting task, which is a basic computation in a variety of geospatial applications like polygon overlay and spatial join. We also developed a parallel PolySketch-based point-in-polygon (PNP) filter to perform PNP tests on GPU, which reduces the computational workload of PNP tests. Finally, we integrated these new filters into the hierarchical filter and refinement (HiFiRe) system to solve the geometric intersection problem. We have implemented the new filter and refine system on GPU using CUDA. The new filters introduced in this paper remove more computational workload than existing filters. As a result, we get on average a 7.96X speedup compared to our prior version of the HiFiRe system.
Full Text Available
Efficient Parallel and Adaptive Partitioning for Load-balancing in Spatial Join

Yang, Jie; Puri, Satish (May 2020, 34th IEEE International Parallel & Distributed Processing Symposium)

Owing to developments in topographic techniques, clear satellite imagery, and various means of collecting information, geospatial datasets are growing in volume, complexity, and heterogeneity. For efficient execution of spatial computations and analytics on large spatial data sets, parallel processing is required. To exploit fine-grained parallel processing in large scale compute clusters, partitioning in a load-balanced way is necessary for skewed datasets. In this work, we focus on the spatial join operation, where the inputs are two layers of geospatial data. Our partitioning method for spatial join uses the Adaptive Partitioning (ADP) technique, which is based on Quadtree partitioning. Unlike existing partitioning techniques, ADP partitions the spatial join workload instead of partitioning the individual datasets separately, which provides better load-balancing. Based on our experimental evaluation, ADP partitions spatial data in a more balanced way than Quadtree partitioning and Uniform grid partitioning. ADP uses an output-sensitive duplication avoidance technique which minimizes duplication of geometries that are not part of the spatial join output. In a distributed memory environment, this technique can reduce data communication and storage requirements compared to traditional methods. To improve the performance of ADP, an MPI+Threads based parallelization (ParADP) is presented. With ParADP, a pair of real world datasets, one with 717 million polylines and another with 10 million polygons, is partitioned into 65,536 grid cells within 7 seconds. ParADP shows good weak scaling and good strong scaling up to 4,032 CPU cores.
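The idea of partitioning the join workload, rather than each dataset on its own, can be sketched with a toy Quadtree-style splitter. This is an illustrative assumption, not the ADP algorithm from the paper: the cost model (number of layer-A rectangles times number of layer-B rectangles overlapping a cell), the names `partition_by_workload` and `max_cost`, and the depth limit are all hypothetical. The sketch also lets a rectangle appear in several cells, so it does not attempt the duplication avoidance the abstract describes.

```python
def overlaps(a, b):
    """True if rectangles (xmin, ymin, xmax, ymax) a and b overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def partition_by_workload(cell, layer_a, layer_b, max_cost, max_depth=8):
    """Split `cell` into quadrants until the estimated join cost is small.

    Returns a list of (cell, estimated_cost) leaves. The cost estimate is
    a stand-in: |A in cell| * |B in cell| candidate pairs.
    """
    a_in = [r for r in layer_a if overlaps(r, cell)]
    b_in = [r for r in layer_b if overlaps(r, cell)]
    cost = len(a_in) * len(b_in)
    if cost <= max_cost or max_depth == 0:
        return [(cell, cost)]

    xmin, ymin, xmax, ymax = cell
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    quadrants = [
        (xmin, ymin, xmid, ymid),
        (xmid, ymin, xmax, ymid),
        (xmin, ymid, xmid, ymax),
        (xmid, ymid, xmax, ymax),
    ]
    leaves = []
    for quad in quadrants:
        # Only geometries overlapping the parent can overlap a child.
        leaves.extend(
            partition_by_workload(quad, a_in, b_in, max_cost, max_depth - 1)
        )
    return leaves


if __name__ == "__main__":
    # Both layers are clustered near the origin of an 8 x 8 extent.
    layer_a = [(i * 0.1, i * 0.1, i * 0.1 + 0.05, i * 0.1 + 0.05) for i in range(10)]
    layer_b = [(i * 0.1 + 0.02, i * 0.1, i * 0.1 + 0.07, i * 0.1 + 0.05) for i in range(10)]
    for cell, cost in partition_by_workload((0, 0, 8, 8), layer_a, layer_b, max_cost=25):
        print(cell, cost)
```

A cell dense in only one layer carries no join work under this cost model and is never split, which is the intuition behind partitioning the workload instead of the individual datasets.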
Full Text Available
Hierarchical Filter and Refinement System Over Large Polygonal Datasets on CPU-GPU

https://doi.org/10.1109/HiPC.2019.00027

Liu, Yiming; Yang, Jie; Puri, Satish (December 2019, 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC))

In this paper, we introduce the hierarchical filter and refinement technique that we have developed for parallel geometric intersection operations involving large polygons and polylines. The inputs are two layers of large polygonal datasets, and the computation is spatial intersection on pairs of cross-layer polygons. These intersections are the compute-intensive spatial data analytic kernels in spatial join and map overlay computations. We have extended the classical filter and refine algorithms using the PolySketch filter to improve the performance of geospatial computations. In addition to filtering polygons by their Minimum Bounding Rectangle (MBR), our hierarchical approach explores further filtering using tiles (smaller MBRs) to increase the effectiveness of filtering and decrease the computational workload in the refinement phase. We have implemented this filter and refine system on CPU and GPU using OpenMP and OpenACC. After using R-tree, on average, our filter technique can still discard 69% of polygon pairs which do not have segment intersection points. The PolySketch filter removes on average 99.77% of the workload of finding line segment intersections. PNP based task reduction and Striping algorithms filter out on average 95.84% of the workload of Point-in-Polygon tests. Our CPU-GPU system performs spatial join on two shapefiles, namely USA Water Bodies and USA Block Group Boundaries with 683K polygons, in about 10 seconds using NVidia Titan V and Titan Xp GPUs.
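The classical filter and refine pattern that this abstract builds on can be sketched in a few lines: a cheap MBR overlap test discards most candidate pairs, and only the survivors reach the exact (expensive) segment intersection test. This is a minimal sequential sketch of the classical two-phase scheme only; it does not implement the PolySketch tiles, the R-tree, or the GPU parts described above, and the function names are assumptions for illustration.

```python
def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a polyline."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def orient(p, q, r):
    """Cross product sign: >0 left turn, <0 right turn, 0 collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def on_segment(p, q, r):
    """Given r collinear with segment pq, test whether r lies on it."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(p1, p2, p3, p4):
    d1, d2 = orient(p3, p4, p1), orient(p3, p4, p2)
    d3, d4 = orient(p1, p2, p3), orient(p1, p2, p4)
    if d1 * d2 < 0 and d3 * d4 < 0:
        return True  # proper crossing
    # Touching or collinear-overlap cases.
    return ((d1 == 0 and on_segment(p3, p4, p1))
            or (d2 == 0 and on_segment(p3, p4, p2))
            or (d3 == 0 and on_segment(p1, p2, p3))
            or (d4 == 0 and on_segment(p1, p2, p4)))


def polylines_intersect(a, b):
    return any(
        segments_intersect(a[i], a[i + 1], b[j], b[j + 1])
        for i in range(len(a) - 1)
        for j in range(len(b) - 1)
    )


def filter_and_refine(layer_a, layer_b):
    """Return (number of pairs surviving the filter, intersecting index pairs)."""
    boxes_a = [mbr(g) for g in layer_a]
    boxes_b = [mbr(g) for g in layer_b]
    # Filter phase: cheap rectangle test on every cross-layer pair.
    candidates = [
        (i, j)
        for i, box_a in enumerate(boxes_a)
        for j, box_b in enumerate(boxes_b)
        if mbrs_overlap(box_a, box_b)
    ]
    # Refine phase: exact geometry test only on surviving pairs.
    results = [(i, j) for i, j in candidates
               if polylines_intersect(layer_a[i], layer_b[j])]
    return len(candidates), results


if __name__ == "__main__":
    layer_a = [[(0, 0), (4, 4)], [(10, 10), (11, 11)]]
    layer_b = [[(0, 4), (4, 0)], [(3, 0), (4, 1)]]
    print(filter_and_refine(layer_a, layer_b))
```

In the usage example, the pair formed by the first polyline of each layer crosses, while the second polyline of `layer_b` has an overlapping MBR but is parallel to the first polyline of `layer_a`. That false positive is exactly what finer-grained filters aim to remove before the refine phase.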
Full Text Available
