Title: Highly-parallelized simulation of a pixelated LArTPC on a GPU
Abstract: The rapid development of general-purpose computing on graphics processing units (GPGPU) is allowing the implementation of highly-parallelized Monte Carlo simulation chains for particle physics experiments. This technique is particularly suitable for the simulation of a pixelated charge readout for time projection chambers, given the large number of channels that this technology employs. Here we present the first implementation of a full microphysical simulator of a liquid argon time projection chamber (LArTPC) equipped with light readout and pixelated charge readout, developed for the DUNE Near Detector. The software is implemented with an end-to-end set of GPU-optimized algorithms. The algorithms have been written in Python and translated into CUDA kernels using Numba, a just-in-time compiler for a subset of Python and NumPy instructions. The GPU implementation achieves a speed-up of four orders of magnitude compared with the equivalent CPU version: the simulation of the current induced on 10^3 pixels takes around 1 ms on the GPU, compared with approximately 10 s on the CPU. The results of the simulation are compared against data from a pixel-readout LArTPC prototype.
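For readers unfamiliar with the Numba workflow the abstract describes, here is a minimal sketch of the Python-to-CUDA pattern. The toy Gaussian charge-spread model, array names, and parameters below are invented for illustration and are not taken from the paper's simulator; running it requires an NVIDIA GPU with CUDA.

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def induced_charge_kernel(pix_x, pix_y, seg_x, seg_y, seg_q, sigma, out):
    # One thread per pixel: accumulate Gaussian-smeared charge from
    # every track segment (toy model, not the paper's physics).
    i = cuda.grid(1)
    if i < pix_x.size:
        total = 0.0
        for j in range(seg_x.size):
            dx = pix_x[i] - seg_x[j]
            dy = pix_y[i] - seg_y[j]
            total += seg_q[j] * math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
        out[i] = total

# Toy inputs: 10^3 pixels and a few hundred track segments (hypothetical values).
n_pix, n_seg = 1000, 256
rng = np.random.default_rng(0)
pix_x = rng.random(n_pix).astype(np.float32)
pix_y = rng.random(n_pix).astype(np.float32)
seg_x = rng.random(n_seg).astype(np.float32)
seg_y = rng.random(n_seg).astype(np.float32)
seg_q = rng.random(n_seg).astype(np.float32)
out = cuda.device_array(n_pix, dtype=np.float32)

threads = 128
blocks = (n_pix + threads - 1) // threads
induced_charge_kernel[blocks, threads](
    cuda.to_device(pix_x), cuda.to_device(pix_y),
    cuda.to_device(seg_x), cuda.to_device(seg_y),
    cuda.to_device(seg_q), np.float32(0.1), out)
print(out.copy_to_host()[:5])
```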
Award ID(s):
2112727 2013217 2209601 2047665
NSF-PAR ID:
10416547
Author(s) / Creator(s):
Date Published:
Journal Name:
Journal of Instrumentation
Volume:
18
Issue:
04
ISSN:
1748-0221
Page Range / eLocation ID:
P04034
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, I.; Selesnik, I.; Picone, J. (Eds.)
    The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer-grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process.

    Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment: performance metrics such as error rates should be identical, and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) the same job run on the same processor should produce the same results each time it is run; (2) a job run on a CPU and a GPU should produce identical results; and (3) a job should produce comparable results if the data is presented in a different order. System optimization requires an ability to directly compare error rates for algorithms evaluated under comparable operating conditions. However, it is difficult to exactly reproduce the results for large, complex deep learning systems that often require more than a trillion calculations per experiment [5]. This is a fairly well-known issue and one we explore in this poster.

    Researchers must be able to replicate results on a specific data set to establish the integrity of an implementation. They can then use that implementation as a baseline for comparison purposes. A lack of reproducibility makes it very difficult to debug algorithms and validate changes to the system. Equally important, since many results in deep learning research depend on the order in which the system is exposed to the data, the specific processors used, and even the order in which those processors are accessed, comparing two algorithms becomes challenging, since each system must be individually optimized for a specific data set or processor. This is extremely time-consuming for algorithm research in which a single run often taxes a computing environment to its limits. Well-known techniques such as cross-validation [5,6] can mitigate these effects, but they are also computationally expensive. These issues are further compounded by the fact that most deep learning algorithms are susceptible to the way computational noise propagates through the system. GPUs are particularly notorious for this because, in a clustered environment, it becomes more difficult to control which processors are used at various points in time. Another equally frustrating issue is that upgrades to the deep learning package, such as the transition from TensorFlow v1.9 to v1.13, can also result in large fluctuations in error rates when re-running the same experiment. Since TensorFlow is constantly updating functions to support GPU use, maintaining a historical archive of experimental results that can be used to calibrate algorithm research is quite a challenge. This makes it very difficult to optimize the system or select the best configurations.

    The overall impact of the issues described above is significant, as error rates can fluctuate by as much as 25% due to these types of computational effects. Cross-validation is one technique used to mitigate this, but it is expensive since it requires multiple runs over the data, which further taxes a computing infrastructure already running at maximum capacity. GPUs are preferred when training a large network since they train at least two orders of magnitude faster than CPUs [7]; large-scale experiments are simply not feasible without them. However, there is a tradeoff to gain this performance. Since all our GPUs use the NVIDIA CUDA® Deep Neural Network library (cuDNN) [8], a GPU-accelerated library of primitives for deep neural networks, an element of randomness enters the experiment. When a GPU is used to train a network in TensorFlow, it automatically searches for a cuDNN implementation. NVIDIA's cuDNN implementation provides algorithms that increase performance and help the model train more quickly, but they are non-deterministic [9,10]. Since our networks have many complex layers, there is no easy way to avoid this randomness. Instead of comparing each epoch, we compare the average performance of the experiment, because it gives us a sense of how our model is performing per experiment and whether the changes we make are effective.

    In this poster, we discuss a variety of issues related to reproducibility and introduce ways we mitigate their effects. For example, TensorFlow uses a random number generator (RNG) which is not seeded by default; TensorFlow uses the RNG to determine the initialization point and how certain functions execute. The solution is to seed all the necessary components before training the model. This forces TensorFlow to use the same initialization point and fixes how certain layers (e.g., dropout layers) behave. However, seeding all the RNGs will not guarantee a controlled experiment: other variables can affect the outcome, such as training on GPUs, allowing multi-threading on CPUs, or using certain layers. To further mitigate reproducibility problems, we first make sure that the data is processed in the same order during training, saving the data ordering from the last experiment so that the newer experiment follows the same order. If we allow the data to be shuffled, performance can change because of how the model was exposed to the data. We also specify the float data type to be 32-bit, since Python defaults to 64-bit; we try to avoid 64-bit precision because the numbers produced by a GPU can vary significantly depending on the GPU architecture [11-13]. Controlling precision somewhat reduces differences due to computational noise, even though technically it increases the amount of computational noise. We are currently developing more advanced techniques for preserving the efficiency of our training process while also maintaining the ability to reproduce models. In our poster presentation we will demonstrate these issues using some novel visualization tools, present several examples of the extent to which they influence research results on electroencephalography (EEG) and digital pathology experiments, and introduce new ways to manage such computational issues.
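    As a concrete illustration of the seeding strategy described above (a sketch only, not the poster's actual configuration), the snippet below pins the common sources of randomness in a TensorFlow 2.x workflow. The TF_DETERMINISTIC_OPS variable requests deterministic cuDNN kernels where TensorFlow supports them.

```python
import os
import random

# Request deterministic GPU ops where available (TF 2.x, best effort).
# Must be set before TensorFlow initializes its GPU kernels.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import numpy as np
import tensorflow as tf

SEED = 1337

# Seed every RNG TensorFlow may consult: Python's, NumPy's, and TF's own.
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Use 32-bit floats throughout, as discussed above.
tf.keras.backend.set_floatx("float32")

# On newer TF (>= 2.8) one can instead request determinism explicitly:
# tf.config.experimental.enable_op_determinism()
```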
  2. Abstract The MicroBooNE liquid argon time projection chamber located at Fermilab is a neutrino experiment dedicated to the study of short-baseline oscillations, the measurement of neutrino cross sections in liquid argon, and the research and development of this novel detector technology. Accurate and precise calorimetric measurements are essential to event reconstruction and are achieved by leveraging the TPC to measure deposited energy per unit length along the particle trajectory, with mm resolution. We describe the non-uniform calorimetric reconstruction performance in the detector, showing its dependence on the angle of the particle trajectory. Such non-uniform reconstruction directly affects the performance of the particle identification algorithms which infer particle type from calorimetric measurements. This work presents a new particle identification method which accounts for and effectively addresses this non-uniformity. The newly developed method shows improved performance compared to previous algorithms, illustrated by a 93.7% proton selection efficiency and a 10% muon mis-identification rate with a fairly loose selection of tracks performed on beam data. The performance is further demonstrated by identifying exclusive final states in νμ CC interactions. While developed using MicroBooNE data and simulation, this method is readily applicable to future LArTPC experiments such as SBND, ICARUS, and DUNE.
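    The abstract does not spell out the angle-corrected algorithm itself. Purely as background, the sketch below illustrates the generic chi-square style of calorimetric particle identification that such methods build on, comparing measured dE/dx along a track with the expectation for each particle hypothesis; all arrays and numbers are invented toy values, not MicroBooNE data.

```python
import numpy as np

def pid_chi2(dedx_meas, dedx_hypo, sigma):
    # Mean chi-square between measured dE/dx points along a track
    # and the expectation under one particle hypothesis.
    return np.mean(((dedx_meas - dedx_hypo) / sigma) ** 2)

# Toy track: measured dE/dx (MeV/cm) at points approaching the track end.
dedx_meas = np.array([4.0, 5.2, 7.1, 9.8, 14.5])
proton_hypo = np.array([4.2, 5.0, 7.0, 10.0, 15.0])  # invented rising profile
muon_hypo = np.full_like(dedx_meas, 2.1)             # roughly flat, MIP-like

best = min([("proton", pid_chi2(dedx_meas, proton_hypo, 0.5)),
            ("muon", pid_chi2(dedx_meas, muon_hypo, 0.5))],
           key=lambda t: t[1])
print(best)  # the hypothesis with the smaller chi-square wins
```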
  3. Abstract

    The conductor-like polarizable continuum model (C-PCM) with switching/Gaussian smooth discretization is a widely used implicit solvation model in quantum chemistry. We have previously implemented C-PCM solvation for Hartree-Fock (HF) and density functional theory (DFT) on graphics processing units (GPUs), enabling the quantum mechanical treatment of large solvated biomolecules. Here, we first propose a GPU-based algorithm for the PCM conjugate gradient linear solver that greatly improves the performance for very large molecules. The overhead for PCM-related evaluations now consumes less than 15% of the total runtime for DFT calculations on large molecules. Second, we demonstrate that our algorithms tailored for ground state C-PCM are transferable to excited state properties. Using a single GPU, our method evaluates the analytic gradient of the linear response PCM time-dependent density functional theory energy up to 80× faster than a conventional central processing unit (CPU)-based implementation. In addition, our C-PCM algorithms are transferable to other methods that require electrostatic potential (ESP) evaluations. For example, we achieve speed-ups of up to 130× for restricted ESP-based atomic charge evaluations, when compared to CPU-based codes. We also summarize and compare the different PCM cavity discretization schemes used in some popular quantum chemistry packages as a reference for both users and developers.
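    For reference, the conjugate gradient iteration itself, which the authors moved to the GPU, is compact. The generic NumPy sketch below is the textbook algorithm for a symmetric positive-definite system, not the paper's batched, ESP-aware GPU implementation.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    # Solve A x = b for a symmetric positive-definite matrix A.
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Usage on a toy SPD system.
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)   # symmetric positive definite by construction
b = rng.standard_normal(50)
x = conjugate_gradient(A, b)
print(np.allclose(A @ x, b, atol=1e-6))
```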

     
  4. A novel green granular neural network (GGNN) with new fast software-FPGA co-designed learning is developed to reduce both CO2 emissions and energy consumption more effectively than popular neural networks with traditional software-CPU-GPU-based learning. Unlike traditional, tedious CPU-GPU-based training algorithms using gradient descent and other methods such as genetic algorithms, the software-FPGA co-designed training algorithm may quickly solve a system of linear equations to directly calculate optimal values of the GGNN's hyperparameters. Initial simulation results indicate that the FPGA equation-solver code ran faster than the Python equation-solver code; implementing the GGNN with software-FPGA co-designed learning is therefore feasible. In addition, the shallow high-speed GGNN is explainable because it can generate interpretable granular If-Then rules. In the future, the GGNN will be evaluated against other machine learning models with traditional software-based learning in terms of speed, model size, accuracy, CO2 emissions, and energy consumption, using popular datasets. New algorithms will be created to divide the inputs into different input groups that will be used to build different small-size GGNNs, to address the curse of dimensionality. Additionally, an explainable green granular convolutional neural network will be developed using GGNNs as basic building blocks to efficiently solve image recognition problems.
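    The abstract does not detail how a linear-equation solve replaces gradient descent. Purely as a labeled assumption, the sketch below shows one well-known scheme the description resembles: extreme-learning-machine-style training, where random hidden weights stay fixed and the output weights come from a single least-squares solve. None of the names or values below come from the paper.

```python
import numpy as np

# Hypothetical reading of "training by solving linear equations":
# fix random hidden-layer weights, then obtain output weights from
# one least-squares solve instead of iterative gradient descent.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))              # toy inputs
y = rng.standard_normal((200, 1))              # toy targets
W_in = rng.standard_normal((8, 32))            # random, never trained
H = np.tanh(X @ W_in)                          # hidden activations
W_out, *_ = np.linalg.lstsq(H, y, rcond=None)  # single linear solve
print(((H @ W_out - y) ** 2).mean())           # training MSE
```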
  5. Summary / Lay Description

    Particles are widely used as probes in the life sciences through their motions. In single-molecule techniques such as optical tweezers and magnetic tweezers, microbeads are used to study intermolecular or intramolecular interactions via bead tracking, and tracking the motions of multiple beads can probe cell–cell or cell–ECM interactions in traction force microscopy. Particle tracking is therefore of key importance in this research. However, parallel 3D multiple-particle tracking in real time with high resolution remains a challenge, whether due to the algorithm or its implementation. Here, we combine CPU and CUDA-based GPU processing in a hybrid implementation for particle tracking. In this way, a speedup of 137 is obtained compared with the previous CPU-only program, without loss of accuracy. Moreover, we build an improved centrifugal force microscope for parallel multiple single-molecule force spectroscopy. We then employed our program in the centrifugal force microscope for a DNA stretching study. Our results not only demonstrate the application of this program in single-molecule techniques, but also indicate the capability of multiple single-molecule studies with centrifugal force microscopy.
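    As an illustration of the per-particle parallelism such a hybrid CPU/GPU tracker can exploit (a toy sketch with invented shapes, not the authors' implementation), the Numba CUDA kernel below assigns one GPU thread to each particle's image patch and computes an intensity-weighted centroid; it requires an NVIDIA GPU.

```python
import numpy as np
from numba import cuda

@cuda.jit
def centroid_kernel(patches, cx, cy):
    # One thread per particle patch: intensity-weighted 2D centroid.
    p = cuda.grid(1)
    if p < patches.shape[0]:
        w_sum = 0.0
        x_sum = 0.0
        y_sum = 0.0
        for i in range(patches.shape[1]):
            for j in range(patches.shape[2]):
                w = patches[p, i, j]
                w_sum += w
                x_sum += w * j
                y_sum += w * i
        if w_sum > 0.0:
            cx[p] = x_sum / w_sum
            cy[p] = y_sum / w_sum

# Toy data: 512 particles, 15x15-pixel patches (hypothetical sizes).
patches = np.random.rand(512, 15, 15).astype(np.float32)
cx = cuda.device_array(512, dtype=np.float32)
cy = cuda.device_array(512, dtype=np.float32)
centroid_kernel[(512 + 127) // 128, 128](cuda.to_device(patches), cx, cy)
print(cx.copy_to_host()[:3], cy.copy_to_host()[:3])
```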

     