This paper documents development of a multiple-Graphics Processing Unit (GPU) version of FUNWAVE-Total Variation Diminishing (TVD), an open-source model for solving the fully nonlinear Boussinesq wave equations using a high-order TVD solver. The numerical schemes of FUNWAVE-TVD, including Cartesian and spherical coordinates, are rewritten using CUDA Fortran, with inter-GPU communication facilitated by the Message Passing Interface. Since FUNWAVE-TVD involves the discretization of high-order dispersive derivatives, the on-chip shared memory is utilized to reduce global memory access. To further optimize performance, the batched tridiagonal solver is scheduled simultaneously in multiple GPU streams, which can reduce the GPU execution time by 20–30%. The GPU version is validated through a benchmark test for wave runup on a complex shoreline geometry, as well as a basin-scale tsunami simulation of the 2011 Tohoku-oki event. Efficiency evaluation shows that, in comparison with the CPU version running on a 36-core HPC node, speedup ratios of 4–7 and above 10 can be observed for single- and double-GPU runs, respectively. The performance of configurations with more than two GPUs remains to be evaluated.
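The batched tridiagonal solves that FUNWAVE-TVD schedules across GPU streams each amount to one Thomas-algorithm sweep per system. As an illustrative sketch only (plain NumPy on the CPU, not the paper's CUDA Fortran implementation), a batched solver of this kind can be written as:

```python
import numpy as np

def thomas_batched(a, b, c, d):
    """Solve a batch of tridiagonal systems with the Thomas algorithm.

    a, b, c: sub-, main-, and super-diagonals, each shape (batch, n)
    (a[:, 0] and c[:, -1] are ignored); d: right-hand sides, shape (batch, n).
    Returns x with the same shape as d.
    """
    a, b, c, d = (np.array(v, dtype=float) for v in (a, b, c, d))
    n = b.shape[1]
    cp = np.empty_like(b)   # modified super-diagonal
    dp = np.empty_like(b)   # modified right-hand side
    cp[:, 0] = c[:, 0] / b[:, 0]
    dp[:, 0] = d[:, 0] / b[:, 0]
    for i in range(1, n):                       # forward elimination
        m = b[:, i] - a[:, i] * cp[:, i - 1]
        cp[:, i] = c[:, i] / m
        dp[:, i] = (d[:, i] - a[:, i] * dp[:, i - 1]) / m
    x = np.empty_like(b)
    x[:, -1] = dp[:, -1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[:, i] = dp[:, i] - cp[:, i] * x[:, i + 1]
    return x
```

On a GPU, each system (or each batch slice) would be assigned to its own stream or thread block; libraries such as cuSPARSE provide batched variants of exactly this primitive.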
Real-time tracking of multiple particles is key for quantitative analysis of dynamic biophysical processes and materials science via time-lapse microscopy image data, especially for single-molecule biophysical techniques such as magnetic tweezers and centrifugal force microscopy. However, real-time multiple-particle tracking with high resolution is limited by current imaging processes and tracking algorithms. Here, we demonstrate 1 nm resolution in three dimensions in real time using a graphics processing unit (GPU) under the compute unified device architecture (CUDA) parallel computing framework, rather than a central processing unit (CPU) alone. We also explore the trade-offs between processing speed and the size of the regions of interest used, achieving a maximum GPU-over-CPU speedup of 137. Moreover, we apply this method with our recently built centrifugal force microscope (CFM) in experiments that track multiple DNA-tethered particles. Our approach paves the way for high-throughput single-molecule techniques with high resolution and efficiency.
Particles are widely used as probes in the life sciences, where their motions report on the underlying processes. In single-molecule techniques such as optical tweezers and magnetic tweezers, microbeads are tracked to study intermolecular or intramolecular interactions. Tracking the motions of multiple beads also enables the study of cell–cell or cell–ECM interactions in traction force microscopy. Particle tracking is therefore of key importance in this research. However, parallel 3D multiple-particle tracking in real time with high resolution remains a challenge, limited by either the algorithm or its implementation. Here, we combine a CPU with a CUDA-based GPU in a hybrid implementation of particle tracking, obtaining a speedup of 137 over the previous CPU-only program without loss of accuracy. Moreover, we improve and build a new centrifugal force microscope for parallel multiple single-molecule force spectroscopy, and employ our program in it for DNA stretching studies. Our results not only demonstrate the application of this program in single-molecule techniques but also indicate the capability of centrifugal force microscopy for multiple single-molecule studies.
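As a rough illustration of the per-particle work such a tracker performs on each frame, the sketch below localizes a bead inside a region of interest by an intensity-weighted center of mass after background subtraction. This is a generic sub-pixel localization method, not necessarily the algorithm used in the paper, and the function name is hypothetical:

```python
import numpy as np

def subpixel_centroid(roi):
    """Sub-pixel (row, col) centroid of a particle image via
    intensity-weighted center of mass after background subtraction."""
    roi = np.asarray(roi, dtype=float)
    bg = np.median(roi)                  # crude background estimate
    w = np.clip(roi - bg, 0.0, None)     # suppress background pixels
    total = w.sum()
    if total == 0:
        return (np.nan, np.nan)          # no signal above background
    ys, xs = np.indices(roi.shape)
    return ((w * ys).sum() / total, (w * xs).sum() / total)
```

In a GPU implementation, many such regions of interest are processed in parallel, one per thread block, which is where the reported speedup comes from.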
- Journal Name: Journal of Microscopy
- Page Range / eLocation ID: p. 178-188
- Sponsoring Org: National Science Foundation
More Like this
Due to their high spatial resolution and precise application of force, optical traps are widely used to study the mechanics of biomolecules and biopolymers at the single-molecule level. Recently, core–shell particles with optical properties that enhance their trapping ability have emerged as promising candidates for high-force experiments. To fully harness their properties, methods for functionalizing these particles with biocompatible handles are required. Here, a straightforward synthesis is provided for producing functional titania core–shell microparticles with proteins and nucleic acids by adding a silane–thiol chemical group to the shell surface. These particles display higher trap stiffness compared to the conventional plastic beads featured in optical tweezers experiments. These core–shell microparticles are also utilized in biophysical assays such as amyloid fiber pulling and actin rupturing to demonstrate their high-force applications. It is anticipated that the functionalized core–shells can be used to probe the mechanics of stable protein structures that are inaccessible using current trapping techniques.
Off-axis digital holographic microscopy (DHM) provides both amplitude and phase images, and so it may be used for label-free 3D tracking of micro- and nano-sized particles of different compositions, including biological cells, strongly absorbing particles, and strongly scattering particles. Contrast is provided by differences in either the real or imaginary parts of the refractive index (phase contrast and absorption) and/or by scattering. While numerous studies have focused on phase contrast and improving resolution in DHM, particularly axial resolution, absent have been studies quantifying the limits of detection for unresolved particles. This limit has important implications for microbial detection, including in life-detection missions for space flight. Here we examine the limits of detection of nanosized particles as a function of particle optical properties, microscope optics (including camera well depth and substrate), and data processing techniques and find that DHM provides contrast in both amplitude and phase for unresolved spheres, in rough agreement with Mie theory scattering cross-sections. Amplitude reconstructions are more useful than phase for low-index spheres and should not be neglected in DHM analysis.
Solving the shallow water equations efficiently is critical to the study of natural hazards induced by tsunamis and storm surges, since it provides more response time in an early warning system and allows more runs for probabilistic assessment, where thousands of runs may be required. Adaptive mesh refinement speeds up the process by greatly reducing computational demands, while accelerating the code on a graphics processing unit (GPU) does so through faster hardware. Combining both, we present an efficient CUDA implementation of GeoClaw, an open-source Godunov-type high-resolution finite volume numerical scheme on adaptive grids for the shallow water system with varying topography. The use of adaptive mesh refinement and spherical coordinates allows modeling of transoceanic tsunamis. Numerical experiments on the 2011 Japan tsunami and a local tsunami triggered by a hypothetical Mw 7.3 earthquake on the Seattle Fault illustrate the correctness and efficiency of the code, which implements a simplified dimensionally split version of the algorithms. Both simulations are conducted on subregions of a sphere with adaptive grids that adequately resolve the propagating waves. The implementation is shown to be accurate and faster than the original when using central processing units (CPUs) alone. Running on a single GPU, it is observed to be 3.6 to 6.4 times faster than the original model running in parallel on a 16-core CPU. Three metrics are proposed to evaluate the relative performance of the model, showing efficient usage of hardware resources.
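To make the underlying system concrete, the sketch below advances the 1D shallow water equations one explicit finite-volume step using a simple local Lax-Friedrichs (Rusanov) flux with periodic boundaries. GeoClaw itself uses more sophisticated Godunov-type wave-propagation Riemann solvers on adaptive grids, so this is only a minimal illustration of the equations being solved:

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def sw_flux(h, hu):
    """Physical flux of the 1D shallow water equations (flat bottom)."""
    u = hu / h
    return np.array([hu, hu * u + 0.5 * G * h ** 2])

def lax_friedrichs_step(h, hu, dx, dt):
    """One finite-volume step with the local Lax-Friedrichs (Rusanov)
    flux. h: water depth, hu: momentum; periodic boundaries."""
    q = np.vstack([h, hu])
    qr = np.roll(q, -1, axis=1)          # right neighbor of each cell
    fl = sw_flux(q[0], q[1])
    fr = sw_flux(qr[0], qr[1])
    # local wave-speed bound |u| + sqrt(g h) on each side of the interface
    s = np.maximum(np.abs(q[1] / q[0]) + np.sqrt(G * q[0]),
                   np.abs(qr[1] / qr[0]) + np.sqrt(G * qr[0]))
    f_iface = 0.5 * (fl + fr) - 0.5 * s * (qr - q)   # flux at i+1/2
    qn = q - dt / dx * (f_iface - np.roll(f_iface, 1, axis=1))
    return qn[0], qn[1]
```

Each cell update reads only its two neighbors, which is why schemes of this family parallelize so well on GPUs: one thread per cell, with halo exchange at patch boundaries.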
Outlier detection (OD) is a key machine learning task for finding rare and deviant data samples, with many time-critical applications such as fraud detection and intrusion detection. In this work, we propose TOD, the first tensor-based system for efficient and scalable outlier detection on distributed multi-GPU machines. A key idea behind TOD is decomposing complex OD applications into a small collection of basic tensor algebra operators. This decomposition enables TOD to accelerate OD computations by leveraging recent advances in deep learning infrastructure in both hardware and software. Moreover, to deploy memory-intensive OD applications on modern GPUs with limited on-device memory, we introduce two key techniques. First, provable quantization speeds up OD computations and reduces their memory footprint by automatically performing specific floating-point operations in lower precision while provably guaranteeing no accuracy loss. Second, to exploit the aggregated compute resources and memory capacity of multiple GPUs, we introduce automatic batching, which decomposes OD computations into small batches for both sequential execution on a single GPU and parallel execution across multiple GPUs. TOD supports a diverse set of OD algorithms. Evaluation on 11 real-world and 3 synthetic OD datasets shows that TOD is on average 10.9X faster than the leading CPU-based OD system PyOD (with a maximum speedup of 38.9X), and can handle much larger datasets than existing GPU-based OD systems. In addition, TOD allows easy integration of new OD operators, enabling fast prototyping of emerging and yet-to-be-discovered OD algorithms.
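The decomposition TOD relies on can be illustrated with a classic distance-based OD score: the distance to the k-th nearest neighbor reduces to a batched matrix multiply plus a partial sort, exactly the kind of tensor primitive that maps onto GPU hardware. A hypothetical NumPy sketch of this idea (not TOD's actual API):

```python
import numpy as np

def knn_outlier_scores(X, k=5, batch=256):
    """Distance to the k-th nearest neighbor as an outlier score,
    computed batch-by-batch with dense tensor algebra (the Q @ X.T
    below is the GEMM a GPU system would accelerate)."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)            # squared norms, reused per batch
    scores = np.empty(len(X))
    for start in range(0, len(X), batch):
        Q = X[start:start + batch]
        # squared pairwise distances via one matrix multiply
        d2 = (Q ** 2).sum(1)[:, None] - 2.0 * Q @ X.T + sq[None, :]
        d2 = np.maximum(d2, 0.0)         # clamp tiny negative round-off
        # index k of the partition skips the point itself (distance 0)
        kth = np.partition(d2, k, axis=1)[:, k]
        scores[start:start + batch] = np.sqrt(kth)
    return scores
```

Batching keeps the working set (one `batch x n` distance tile at a time) bounded, which mirrors how automatic batching lets memory-intensive OD workloads fit within limited on-device GPU memory.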