Persistent homology (PH) matrix reduction is an important tool for data analytics in many application areas. Because its execution patterns are highly irregular, it is challenging to parallelize efficiently for increasingly large data sets. In this paper, we introduce HYPHA, a HYbrid Persistent Homology matrix reduction Accelerator, to make parallel processing highly efficient on both GPU and multicore. The foundation of our algorithm design and implementation is the separation of SIMT and MIMD parallelism in PH matrix reduction. With this separation, we can perform massively parallel scanning operations on the GPU very quickly; the scan also collects rich information from the input boundary matrix, which is then used for efficient parallel reduction operations on multicore. The HYPHA framework may provide general-purpose guidance for high-performance computing on multiple hardware accelerators. To the best of our knowledge, HYPHA achieves the highest performance in PH matrix reduction execution. Our experiments show speedups of up to 116x against the standard PH algorithm. Compared to state-of-the-art parallel PH software packages such as PHAT and DIPHA, HYPHA outperforms their fastest PH matrix reduction algorithms by a factor of up to 2.3x.
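For readers unfamiliar with the baseline, the following is a minimal sketch of the textbook column reduction of a boundary matrix over Z/2, which is the kind of "standard PH algorithm" the abstract compares against. It is not HYPHA's hybrid GPU/multicore method. The function name `reduce_boundary_matrix`, the sparse-column representation as sets of row indices, and the small filtered-triangle input are all assumptions made for illustration.

```python
def reduce_boundary_matrix(columns):
    """Textbook left-to-right column reduction over Z/2.

    columns[j] is the set of row indices with a nonzero entry in column j.
    Returns the reduced columns and the (birth, death) index pairs.
    """
    reduced = [set(c) for c in columns]
    pivot_owner = {}  # lowest nonzero row index -> column that owns it

    for j, col in enumerate(reduced):
        while col:
            low = max(col)  # lowest nonzero entry of the column
            if low not in pivot_owner:
                pivot_owner[low] = j
                break
            # Adding two columns over Z/2 is a symmetric difference.
            col ^= reduced[pivot_owner[low]]

    pairs = sorted(pivot_owner.items())
    return reduced, pairs


# Hypothetical input: a filtered triangle.
# Simplices 0, 1, 2 are vertices; 3 = (0,1), 4 = (1,2), 5 = (0,2) are edges;
# 6 is the triangle bounded by edges 3, 4, 5.
boundary = [set(), set(), set(), {0, 1}, {1, 2}, {0, 2}, {3, 4, 5}]
reduced, pairs = reduce_boundary_matrix(boundary)
```

The inner `while` loop is the source of the irregularity the abstract mentions: how many column additions a given column needs depends on the data, so the work per column is hard to predict in advance.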
RXMesh: a GPU mesh data structure
We propose a new static high-performance mesh data structure for triangle surface meshes on the GPU. Our data structure is carefully designed for parallel execution while capturing mesh locality and confining data access, as much as possible, within the GPU's fast shared memory. We achieve this by subdividing the mesh into patches and representing these patches compactly using a matrix-based representation. Our patching technique is decorated with ribbons, thin mesh strips around patches that eliminate the need to communicate between different computation thread blocks, resulting in consistent high throughput. We call our data structure RXMesh: Ribbon-matriX Mesh. We hide the complexity of our data structure behind a flexible but powerful programming model that helps deliver high performance by inducing load balance even in highly irregular input meshes. We show the efficacy of our programming model on common geometry processing applications: mesh smoothing and filtering, geodesic distance, and vertex normal computation. For evaluation, we benchmark our data structure against well-optimized GPU and (single and multi-core) CPU data structures and show significant speedups.
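To make the patch-and-ribbon idea concrete, here is a small CPU-side sketch under a stated assumption: a patch's ribbon is taken to be the set of faces outside the patch that share at least one vertex with a face inside it. The abstract only describes ribbons as thin mesh strips around patches, so this definition, the function name `patch_ribbons`, and the toy mesh are illustrative guesses, not RXMesh's actual construction or API.

```python
from collections import defaultdict


def patch_ribbons(faces, patch_of_face):
    """For each patch, list the faces of other patches that touch it.

    faces[f] is a tuple of three vertex indices.
    patch_of_face[f] is the patch id assigned to face f.
    Assumption: "touching" means sharing at least one vertex.
    """
    # Map every vertex to the faces incident to it.
    vertex_faces = defaultdict(set)
    for f, tri in enumerate(faces):
        for v in tri:
            vertex_faces[v].add(f)

    ribbons = defaultdict(set)
    for f, tri in enumerate(faces):
        p = patch_of_face[f]
        for v in tri:
            for g in vertex_faces[v]:
                if patch_of_face[g] != p:
                    ribbons[p].add(g)
    return {p: sorted(r) for p, r in ribbons.items()}


# Hypothetical input: a strip of four triangles split into two patches.
faces = [(0, 1, 2), (1, 3, 2), (2, 3, 4), (3, 5, 4)]
patch_of_face = [0, 0, 1, 1]
ribbons = patch_ribbons(faces, patch_of_face)
```

Storing a copy of the ribbon faces with each patch is what would let a thread block answer neighborhood queries for its own faces from local data alone, at the cost of some duplicated storage.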
- Award ID(s): 1637442
- PAR ID: 10603813
- Publisher / Repository: Association for Computing Machinery (ACM)
- Date Published:
- Journal Name: ACM Transactions on Graphics
- Volume: 40
- Issue: 4
- ISSN: 0730-0301
- Format(s): Medium: X Size: p. 1-16
- Size(s): p. 1-16
- Sponsoring Org: National Science Foundation
More Like this
-
Whenever high-performance cloth simulation is brought up, GPU acceleration is almost always the first thing that comes to mind. Leveraging immense parallelization, GPU algorithms have demonstrated significant success recently, whereas CPU methods are somewhat overlooked. Yet the need for an efficient CPU simulator is evident and pressing. In many scenarios, high-end GPUs may be unavailable or already allocated to other tasks, such as rendering and shading. A high-performance CPU alternative can greatly boost the overall system capability and user experience. Inspired by this demand, this paper proposes a CPU algorithm for high-resolution cloth simulation. By partitioning the garment model into multiple (but not massive) sub-meshes or domains, we assign per-domain computations to individual CPU processors. Borrowing the idea from projective dynamics of breaking the computation into global and local steps, our key contribution is a new parallelization paradigm at the domain level for both global and local steps, so that domain-level calculations are sequential and lightweight. The CPU has far fewer processing units than a GPU. Our algorithm mitigates this disadvantage by carefully balancing the scale of the parallelization against convergence. We validate our method on a wide range of simulation problems involving high-resolution garment models. Performance-wise, our method is at least one order of magnitude faster than existing CPU methods, and it delivers performance similar to state-of-the-art GPU algorithms in many examples, but without using a GPU.
-
This paper describes a numerically robust data structure for encoding intrinsic triangulations of polyhedral surfaces. Many applications demand a correspondence between the intrinsic triangulation and the input surface, but existing data structures either rely on floating point values to encode correspondence, or do not support remeshing operations beyond basic edge flips. We instead provide an integer-based data structure that guarantees valid correspondence, even for meshes with near-degenerate elements. Our starting point is the framework of normal coordinates from geometric topology, which we extend to the broader set of operations needed for mesh processing (vertex insertion, edge splits, etc.). The resulting data structure can be used as a drop-in replacement for earlier schemes, automatically improving reliability across a wide variety of applications. As a stress test, we successfully compute an intrinsic Delaunay refinement and associated subdivision for all manifold meshes in the Thingi10k dataset. In turn, we can compute reliable and highly accurate solutions to partial differential equations even on extremely low-quality meshes.
-
Flat surfaces captured by 3D point clouds are often used for localization, mapping, and modeling. Dense point cloud processing has high computation and memory costs, making low-dimensional representations of flat surfaces such as polygons desirable. We present Polylidar3D, a non-convex polygon extraction algorithm which takes as input unorganized 3D point clouds (e.g., LiDAR data), organized point clouds (e.g., range images), or user-provided meshes. Non-convex polygons represent flat surfaces in an environment with interior cutouts representing obstacles or holes. The Polylidar3D front-end transforms input data into a half-edge triangular mesh. This representation provides a common level of abstraction for subsequent back-end processing. The Polylidar3D back-end is composed of four core algorithms: mesh smoothing, dominant plane normal estimation, planar segment extraction, and finally polygon extraction. Polylidar3D is shown to be quite fast, making use of CPU multi-threading and GPU acceleration when available. We demonstrate Polylidar3D’s versatility and speed with real-world datasets including aerial LiDAR point clouds for rooftop mapping, autonomous driving LiDAR point clouds for road surface detection, and RGBD cameras for indoor floor/wall detection. We also evaluate Polylidar3D on a challenging planar segmentation benchmark dataset. Results consistently show excellent speed and accuracy.
-
Many popular vetting tools for Android applications use static code analysis techniques. In particular, Inter-procedural Data-Flow Graph (IDFG) construction is the computation at the core of Android static data-flow analysis and consumes most of the analysis time. Many analysis tools use a worklist algorithm, an iterative fixed-point approach, to construct the IDFG. In this paper, we observe that a straightforward GPU parallelization of the worklist algorithm leads to significant underutilization of the GPU resources. We identify four performance bottlenecks, namely, frequent dynamic memory allocations, high branch divergence, workload imbalance, and irregular memory access patterns. Accordingly, we propose GDroid, a GPU-based worklist algorithm implementation with multiple fine-grained optimizations tailored to common characteristics of Android applications. The optimizations considered are: matrix-based data structure, memory access-based node grouping, and worklist merging. Our experimental evaluation, performed on 1000 Android applications, shows that the proposed optimizations are beneficial to performance, and GDroid can achieve up to 128X speedups against a plain GPU implementation.
