Title: Profiling NVIDIA Jetson Embedded GPU Devices for Autonomous Machines
This paper presents two methods, the tegrastats GUI version jtop and Nsight Systems, for profiling NVIDIA Jetson embedded GPU devices on a model race car, a convenient platform for prototyping and field-testing autonomous driving algorithms. The two profilers analyze the power consumption, CPU/GPU utilization, and run time of CUDA C threads on the Jetson TX2 in its five working modes. The performance differences among the five modes are demonstrated using three example programs: vector add in C and CUDA C, a simple ROS (Robot Operating System) package implementing a wall-follow algorithm in Python, and a more complex ROS package implementing a particle filter algorithm for SLAM (Simultaneous Localization and Mapping). The results show that these tools are effective for selecting the operating mode of embedded GPU devices.
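For reference, a minimal CUDA C vector-add program of the kind used as an example workload here might look like the sketch below. The data size, launch configuration, and use of unified memory are illustrative assumptions, not details taken from the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    // Unified memory keeps the example short; the paper's exact allocation
    // scheme is not specified, so this is only one possible choice.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Running such a program under jtop or Nsight Systems in each of the TX2's working modes is what exposes the power and run-time differences the paper reports.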
Award ID(s): 1853257
PAR ID: 10208378
Author(s) / Creator(s):
Date Published:
Journal Name: International Conference on Signal, Image Processing and Embedded Systems
Volume: 10
Issue: 18
Page Range / eLocation ID: 133 to 144
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. General-purpose programming on GPUs (GPGPU) is becoming increasingly in vogue as applications such as machine learning and scientific computing demand high throughput in vector-parallel applications. NVIDIA's CUDA toolkit seeks to make GPGPU programming accessible by allowing programmers to write GPU functions, called kernels, in a small extension of C/C++. However, due to CUDA's complex execution model, the performance characteristics of CUDA kernels are difficult to predict, especially for novice programmers. This paper introduces a novel quantitative program logic for CUDA kernels, which allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of common but CUDA-specific performance bottlenecks. The logic is proved sound with respect to a novel operational cost semantics for CUDA kernels. The semantics, logic and soundness proofs are formalized in Coq. An inference algorithm based on LP solving automatically synthesizes symbolic resource bounds by generating derivations in the logic. This algorithm is the basis of RaCuda, an end-to-end resource-analysis tool for kernels, which has been implemented using an existing resource-analysis tool for imperative programs. An experimental evaluation on a suite of CUDA benchmarks shows that the analysis is effective in aiding the detection of performance bugs in CUDA kernels. 
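As an illustration of the kind of CUDA-specific bottleneck such an analysis targets, the sketch below contrasts an uncoalesced and a coalesced global-memory access pattern when reducing a row-major matrix. It is an illustrative example under my own assumptions, not a kernel taken from the paper or from RaCuda's benchmark suite.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread sums one row of a row-major matrix. On every loop iteration,
// adjacent threads read addresses 'width' floats apart, so the loads within
// a warp cannot be coalesced into a few wide memory transactions.
__global__ void rowSumsUncoalesced(const float *m, float *sums, int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height) {
        float s = 0.0f;
        for (int col = 0; col < width; ++col)
            s += m[row * width + col];
        sums[row] = s;
    }
}

// Each thread sums one column instead. Adjacent threads now read adjacent
// addresses on every iteration, so the loads within a warp coalesce.
__global__ void colSumsCoalesced(const float *m, float *sums, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < width) {
        float s = 0.0f;
        for (int row = 0; row < height; ++row)
            s += m[row * width + col];
        sums[col] = s;
    }
}

int main() {
    const int width = 1024, height = 1024;   // square, so one output buffer serves both kernels
    float *m, *sums;
    cudaMallocManaged(&m, width * height * sizeof(float));
    cudaMallocManaged(&sums, width * sizeof(float));
    for (int i = 0; i < width * height; ++i) m[i] = 1.0f;

    rowSumsUncoalesced<<<(height + 255) / 256, 256>>>(m, sums, width, height);
    cudaDeviceSynchronize();
    colSumsCoalesced<<<(width + 255) / 256, 256>>>(m, sums, width, height);
    cudaDeviceSynchronize();

    printf("sums[0] = %f\n", sums[0]);
    cudaFree(m); cudaFree(sums);
    return 0;
}
```

Both kernels compute the same kind of reduction; a resource analysis that models memory-transaction costs would assign the first a much higher bound, which is exactly the sort of performance bug the logic is meant to surface.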
  2. Experience shows that on today's high performance systems it is difficult to use acceleration cards while keeping all other parts of the system highly utilized. Future architectures, such as exascale clusters, are expected to aggravate this issue as the number of cores increases and memory hierarchies become deeper. A key challenge for distributed applications is to guarantee high utilization of all available resources, including local and remote acceleration cards on a cluster, while fully using all available CPU resources and integrating the GPU work into the overall programming model. To integrate CUDA code, we extended HPX, a general-purpose C++ runtime system for parallel and distributed applications of any scale, to enable asynchronous data transfers to and from the GPU device and asynchronous invocation of CUDA kernels on that data. Both operations are well integrated into HPX's general programming model, which allows any GPU operation to be seamlessly overlapped with work on the main cores. Any user-defined CUDA kernel can be launched on any (local or remote) GPU device available to the distributed application. We present asynchronous implementations of the data transfers and kernel launches for CUDA code as part of an HPX asynchronous execution graph. Using this approach, all remotely and locally available acceleration cards on a cluster can be combined to exploit their full performance. Overhead measurements show that integrating the asynchronous operations (data transfers and kernel launches) into the HPX execution graph imposes no additional computational overhead and significantly eases orchestrating coordinated, concurrent work on the main cores and the GPU devices in use.
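The plain CUDA building blocks of such an integration are asynchronous host-device transfers and kernel launches enqueued on a stream, so the host stays free for other work. The sketch below shows only those primitives; the HPX-specific layer (futures, executors, remote launches) is not shown, and the kernel and sizes are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h;                     // pinned host buffer so copies can be truly asynchronous
    cudaMallocHost(&h, bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue copy-in, kernel, and copy-out on the stream; none of these block the host.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 2.0f, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

    // The host (or, in HPX, other tasks on the main cores) can do useful work here
    // while the GPU operations proceed.

    cudaStreamSynchronize(stream);   // wait only when the result is actually needed
    printf("h[0] = %f\n", h[0]);

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

In the HPX integration described above, the synchronization point would be exposed as a future attached to the execution graph rather than an explicit stream wait.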
  3. This paper describes a novel framework for executing a network of trained deep neural network (DNN) models on commercial off-the-shelf devices deployed in an IoT environment. The scenario consists of two devices connected by a wireless network: a user-end device (U), which is a low-end, energy- and performance-limited processor, and a cloudlet (C), which is a substantially higher-performance and energy-unconstrained processor. The goal is to distribute the computation of the DNN models between U and C to minimize the energy consumption of U while taking into account the variability in the wireless channel delay and the performance overhead of executing models in parallel. The proposed framework was implemented using an NVIDIA Jetson Nano as U and a Dell workstation with a Titan Xp GPU as C. Experiments demonstrate significant improvements in both the energy consumption of U and the processing delay.
  4. MagmaDNN [17] is a deep learning framework driven by the highly optimized MAGMA dense linear algebra package. The library offers performance comparable to other popular frameworks, such as TensorFlow, PyTorch, and Theano. The framework is implemented in C++, providing fast memory operations, direct CUDA access, and compile-time errors. Common neural network layers such as Fully Connected, Convolutional, Pooling, Flatten, and Dropout are included. Hyperparameter tuning is performed with a parallel grid search engine. MagmaDNN uses several techniques to accelerate network training. For instance, convolutions are performed using the Winograd algorithm and FFTs. Other techniques include MagmaDNN's custom memory manager, which reduces expensive memory transfers, and accelerated training by distributing batches across GPU nodes. This paper provides an overview of the MagmaDNN framework and how it leverages the MAGMA library to attain speedups. It also discusses how deep networks are accelerated by training in parallel and the remaining challenges of parallelization.
  5. Integrated CPU-GPU architectures provide excellent acceleration capabilities for data-parallel applications on embedded platforms while meeting size, weight, and power (SWaP) requirements. However, sharing main memory between CPU applications and GPU kernels can severely affect the execution of GPU kernels and diminish the performance gain the GPU provides. For example, on the NVIDIA Jetson TX2, an integrated CPU-GPU platform, we observed that in the worst case GPU kernels can suffer as much as a 3X slowdown in the presence of co-running memory-intensive CPU applications. In this paper, we propose a software mechanism, which we call BWLOCK++, to protect the performance of GPU kernels from co-scheduled memory-intensive CPU applications.
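One way to observe this kind of interference is to time a memory-bound kernel with CUDA events, once on an idle system and once while a memory-intensive CPU workload runs on the other cores. The sketch below is an illustrative measurement harness under my own assumptions; it is not part of BWLOCK++ or the paper's evaluation setup.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A deliberately memory-bound kernel: it streams a large buffer through
// global memory, so its run time is sensitive to available DRAM bandwidth.
__global__ void streamCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 24;
    size_t bytes = n * sizeof(float);

    float *in, *out;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the kernel with CUDA events. Comparing this number with and without
    // a memory-intensive CPU co-runner exposes the contention described above.
    cudaEventRecord(start);
    streamCopy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```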