skip to main content

Title: A General Novel Parallel Framework for SPH-centric Algorithms
To date, large-scale fluid simulation with more details employing the Smooth Particle Hydrodynamics (SPH) method or its variants is ubiquitous in computer graphics and digital entertainment applications. Higher accuracy and faster speed are two key criteria evaluating possible improvement of the underlying algorithms within any available framework. Such requirements give rise to high-fidelity simulation with more particles and higher particle density that will unavoidably increase computational cost significantly. In this paper, we develop a new general GPGPU acceleration framework for SPH-centric simulations founded upon a novel neighbor traversal algorithm. Our novel parallel framework integrates several advanced characteristics of GPGPU architecture (e.g., shared memory and register memory). Additionally, we have designed a reasonable task assignment strategy, which makes sure that all the threads from the same CTA belong to the same cell of the grid. With this organization, big bunches of continuous neighboring data can be loaded to the shared memory of a CTA and used by all its threads. Our method has thus low global-memory bandwidth consumption. We have integrated our method into both WCSPH and PCISPH, that are two improved variants in recent years, and demonstrated its performance with several scenarios involving multiple-fluid interaction, dam break, and elastic solid. Through comprehensive tests validated in practice, our work can exhibit up to 2.18x speedup when compared with other state-of-the-art parallel frameworks.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the ACM on Computer Graphics and Interactive Techniques
Page Range / eLocation ID:
1 to 16
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect the application output quality. Understanding the resilience of general purpose GPU applications is the purpose of this study. To this end, it is imperative to explore the range of application output by injecting faults at all the potential fault sites. This problem is especially challenging because unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space - in the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space aiming to dramatically reduce the number of fault injections such that assessment for GPGPU application error resilience can be practical. The key insight behind our proposed methodology stems from the fact that GPGPU applications spawn a lot of threads, however, many of them execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from Rodinia and Polybench suites and conclude that threads can be first classified based on the number of the dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying and analyzing: a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, b) a subset of loop iterations within the representative threads, and c) a subset of destination register bit positions. The above steps result in a tremendous reduction of fault sites by up to seven orders of magnitude. Yet, this reduced fault site space accurately captures the error resilience profile of GPGPU applications. 
    more » « less
  2. We present a multi-scale mathematical model and a novel numerical solver to study blood plasma flow and oxygen concentration in a prototype model of an implantable Bioartificial Pancreas (iBAP) that operates under arteriovenous pressure differential without the need for immunosuppressive therapy. The iBAP design consists of a poroelastic cell scaffold containing the healthy transplanted cells, encapsulated between two semi-permeable nano-pore size membranes to prevent the patient’s own immune cells from attacking the transplant. The device is connected to the patient’s vascular system via an anastomosis graft bringing oxygen and nutrients to the transplanted cells of which oxygen is the limiting factor for long-term viability. Mathematically, we propose a (nolinear) fluid–poroelastic structure interaction model to describe the flow of blood plasma through the scaffold containing the cells, and a set of (nonlinear) advection–reaction–diffusion equations defined on moving domains to study oxygen supply to the cells. These macro-scale models are solved using finite element method based solvers. One of the novelties of this work is the design of a novel second-order accurate fluid–poroelastic structure interaction solver, for which we prove that it is unconditionally stable. At the micro/nano-scale, Smoothed Particle Hydrodynamics (SPH) simulations are used to capture the micro/nano-structure (architecture) of cell scaffolds and obtain macro-scale parameters, such as hydraulic conductivity/permeability, from the micro-scale scaffold-specific architecture. To avoid expensive micro-scale simulations based on SPH simulations for every new scaffold architecture, we use Encoder–Decoder Convolution Neural Networks. Based on our numerical simulations, we propose improvements in the current prototype design. For example, we show that highly elastic scaffolds have a higher capacity for oxygen transfer, which is an important finding considering that scaffold elasticity can be controlled during their fabrication, and that elastic scaffolds improve cell viability. The mathematical and computational approaches developed in this work provide a benchmark tool for computational analysis of not only iBAP, but also, more generally, of cell encapsulation strategies used in the design of devices for cell therapy and bio-artificial organs. 
    more » « less
  3. Advances in biomolecular simulation methods and access to large scale computer resources have led to a massive increase in the amount of data generated. The key enablers have been optimization and parallelization of the simulation codes. However, much of the software used to analyze trajectory data from these simulations is still run in serial, or in some cases many threads via shared memory. Here, we describe the addition of multiple levels of parallel trajectory processing to the molecular dynamics simulation analysis software CPPTRAJ. In addition to the existing OpenMP shared‐memory parallelism, CPPTRAJ now has two additional levels of message passing (MPI) parallelism involving both across‐trajectory processing and across‐ensemble processing. All three levels of parallelism can be simultaneously active, leading to significant speed ups in data analysis of large datasets on the NCSA Blue Waters supercomputer by better leveraging the many available nodes and its parallel file system. © 2018 Wiley Periodicals, Inc.

    more » « less
  4. Finding connected components in a graph is a fundamental problem in graph analysis. In this work, we present a novel minimum-mapping based Contour algorithm to efficiently solve the connectivity problem. We prove that the Contour algorithm with two or higher order operators can identify all connected components of an undirected graph within O(log d_max) iterations, with each iteration involving O(m) work, where d_max represents the largest diameter among all components in the given graph, and m is the total number of edges in the graph. Importantly, each iteration is highly parallelizable, making use of the efficient minimum-mapping operator applied to all edges. To further enhance its practical performance, we optimize the Contour algorithm through asynchronous updates, early convergence checking, eliminating atomic operations, and choosing more efficient mapping operators. Our implementation of the Contour algorithm has been integrated into the open-source framework Arachne. Arachne extends Arkouda for large-scale interactive graph analytics, providing a Python API powered by the high-productivity parallel language Chapel. Experimental results on both real-world and synthetic graphs demonstrate the superior performance of our proposed Contour algorithm compared to state-of-the-art large-scale parallel algorithm FastSV and the fastest shared memory algorithm ConnectIt. On average, Contour achieves a speedup of 7.3x and 1.4x compared to FastSV and ConnectIt, respectively. All code for the Contour algorithm and the Arachne framework is publicly available on GitHub {, ensuring transparency and reproducibility of our work. 
    more » « less
  5. null (Ed.)
    Timing side channels have been used to extract cryptographic keys and sensitive documents even from trusted enclaves. Specifically, cache side channels created by reuse of shared code or data in the memory hierarchy have been exploited by several known attacks, e.g., evict+reload for recovering an RSA key and Spectre variants for leaking speculatively loaded data.In this paper, we present TimeCache, a cache design that incorporates knowledge of prior cache line access to eliminate cache side channels due to reuse of shared software (code and data). Our goal is to retain the benefits of a shared cache of allowing each process access to the entire cache and of cache occupancy by a single copy of shared software. We achieve our goal by implementing per-process cache line visibility so that the processes do not benefit from cached data brought in by another process until they have incurred a corresponding miss penalty. Our design achieves low overhead by using a novel combination of timestamps and a hardware design to allow efficient parallel comparisons of the timestamps. The solution works at all the cache levels without the need to limit the number of security domains, and defends against an attacker process running on the same core, on a another hyperthread, or on another core.Our implementation in the gem5 simulator demonstrates that the system is able to defend against RSA key extraction. We evaluate performance using SPEC2006 and PARSEC and observe the overhead of TimeCache to be 1.13% on average. Delay due to first access misses adds the majority of the overhead, with the security context bookkeeping incurred at the time of a context switch contributing 0.02% of the 1.13%. 
    more » « less