<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Re-Thinking CNN Frameworks for Time-Sensitive Autonomous-Driving Applications: Addressing an Industrial Challenge</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>04/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10108220</idno>
					<idno type="doi">10.1109/RTAS.2019.00033</idno>
					<title level='j'>2019 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS)</title>

					<author>Ming Yang</author><author>Shige Wang</author><author>Joshua Bakita</author><author>Thanh Vu</author><author>F. Donelson Smith</author><author>James H. Anderson</author><author>Jan-Michael Frahm</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Vision-based perception systems are crucial for profitable autonomous-driving vehicle products. High accuracy in such perception systems is being enabled by rapidly evolving convolutional neural networks (CNNs). To achieve a better understanding of its surrounding environment, a vehicle must be provided with full coverage via multiple cameras. However, when processing multiple video streams, existing CNN frameworks often fail to provide enough inference performance, particularly on embedded hardware constrained by size, weight, and power limits. This paper presents the results of an industrial case study that was conducted to re-think the design of CNN software to better utilize available hardware resources. In this study, techniques such as parallelism, pipelining, and the merging of per-camera images into a single composite image were considered in the context of a Drive PX2 embedded hardware platform. The study identifies a combination of techniques that can be applied to increase throughput (number of simultaneous camera streams) without significantly increasing per-frame latency (camera to CNN output) or reducing per-stream accuracy.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>This paper reports on the results of an industrial automotive design study focusing on the application of convolutional neural networks (CNNs) in computer vision (CV) for autonomous or automated-driving cars. CNNs are essential building blocks for CV applications that process camera images in diverse uses including object detection and recognition (cars, people, bicycles, signs), scene segmentation (finding roads, sidewalks, buildings), localization (generating and following a map), etc. CV applications based on CNN technology are essential because cameras deployed on cars are much less costly than other sensors such as Lidar.</p><p>CV in a resource-constrained setting. While cameras are cheap, CV computational costs are not. Keeping overall monetary cost low is essential for consumer-market cars and requires effective management of computing hardware. Because state-of-the-art CV approaches have been shown to take over 94% of the computational capacity in a simulated vehicle <ref type="bibr">[18]</ref>, optimized implementations can significantly reduce computer hardware costs. Existing CNN frameworks employ some parallelism, but focus largely on processing a single stream of images.</p><p>The solutions considered in this study re-think the way CNN software is implemented so that effective resource management can be applied to CNN applications. Our study focuses on one of the most demanding CV applications in cars: the timely detection/recognition of objects in the surrounding environment.</p><p>Industrial challenge. The major challenge we seek to address is to resolve the inherent tensions among requirements for throughput, response time, and accuracy, which are difficult to satisfy simultaneously.</p><p>The sheer number and diversity of cameras necessary to provide adequate coverage and redundancy (for safety) drives throughput requirements. 
A typical configuration in today's experimental automated-driving vehicle includes ten or more cameras that cover different fields of view, orientations, and near/far ranges. Each camera generates a stream of images at rates ranging from 10 to 40 frames per second (FPS) depending on its function (e.g., lower rates for side-facing cameras, higher rates for forward-facing ones). All streams must be processed simultaneously by the object detection/recognition application.</p><p>For this application, response time (latency from image capture to recognition completion) is time-sensitive because of its position on the critical path to control and actuation of the vehicle. Every millisecond that elapses during object recognition is a millisecond not available for effecting safe operation. Ideally, the camera-to-recognition latency per frame should not substantially exceed the inter-frame time of the input images (e.g., 25 milliseconds for a 40 FPS camera).</p><p>Accuracy requirements for detection/recognition depend on a number of factors specific to the functions supported by a given camera. For example, accuracy may be much more important in streams from front-facing cameras than those from side-facing cameras. For most of the design alternatives we consider, our modifications have no effect on the accuracy of the CV CNN. However, we do explore one particular instance of a throughput vs. accuracy trade-off of relevance in systems with many cameras (see Sec. V).</p><p>To summarize, our challenge is to re-think the design of CNN software to better utilize hardware resources and achieve increased throughput (number of simultaneous camera streams) without any appreciable increase in per-frame latency (camera to CNN output) or reduction of per-stream accuracy.</p><p>Our hardware and software environment. Cars are manufactured by integrating components (including computing hardware and software) from a supply chain of various vendors ("Tier-1" and "Tier-2" suppliers). 
To meet the computational demands of CV applications, supplier offerings are usually based on application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or graphics processing units (GPUs). FPGAs and DSPs have advantages in power and density over GPUs, but require specialized knowledge for using the hardware and its tooling environment. Moreover, integrating CV applications supplied by different sources using proprietary black-box solutions built on FPGAs or DSPs can increase product cost. Because of these complicating factors, we consider GPUs as the primary solution for CV applications. GPUs have more familiar development tools and a large number of open-source software choices, TensorFlow <ref type="bibr">[1]</ref> being a notable example.</p><p>In this study, we use the NVIDIA DRIVE PX Parker AutoChauffeur (previously called the 'DRIVE PX2' or just 'PX2') <ref type="bibr">[20]</ref>. This embedded computing platform is marketed by NVIDIA as the "first artificial-intelligence (AI) supercomputer for autonomous and semi-autonomous vehicles." The PX2 incorporates multicore processors along with multiple GPUs to accelerate CV applications. With respect to such applications, we focus on a variant of the open-source CNN framework 'Darknet' configured to implement an object detection/recognition program called 'Tiny YOLO' (version 2, 'YOLO' for short) <ref type="bibr">[22,</ref><ref type="bibr">23,</ref><ref type="bibr">24]</ref>. While our study targets specific hardware and CNN software, our hardware and software choices are representative of the multi-core/multi-GPU hardware and CNN designs that are used and will continue to be used for automated driving.</p><p>Our design study. As stated earlier, the fundamental challenge we consider is to design an object-detection application that can support independent streams of images from several cameras. 
Due to the constraining nature of our hardware, effective utilization of every part of the system is essential. Our preliminary measurements (provided in Sec. V) revealed that GPU execution time dominates the time spent processing each image in YOLO, yet one of the PX2's two GPUs<ref type="foot">foot_0</ref> was left completely idle and the second was utilized at less than 40%. This result motivated us to explore system-level optimizations of YOLO as a path to higher throughputs.</p><p>Many CNN implementations similarly seem to have limited throughput and poor GPU utilization. As explained further in Sec. III, the root of the problem lies in viewing processing as a single run-through of all the processing layers comprising a CNN. However, in processing an image sequence from one camera, each layer's processing of image i is independent of its processing of image i + 1. If the intermediate per-image data created at each layer is buffered and sequenced, there is no reason why successive images from the same camera cannot be processed in different layers concurrently. This motivates us to consider the classic pipelining solution to sharing resources for increased throughput. <ref type="foot">2</ref> While pipelining offers the potential for greater parallelism, concurrent executions of the same layer on different images are still precluded. While allowing this may be of limited utility with one camera, it has the potential of further increasing throughput substantially when processing streams of images from multiple cameras (all of which are also independent). This motivates us to extend pipelining by enabling each layer to execute in parallel on multiple images.</p><p>We altered the design of Darknet to enable pipelined execution, with parallel layer execution as an option, in the YOLO CNN. 
We also implemented a technique whereby different camera image streams are combined into a single image stream by combining separate per-camera images into a single composite image. This compositing technique can greatly increase throughput at the expense of some accuracy loss. We also enabled the balancing of GPU work across the PX2's two GPUs, which have differing processing capabilities. The YOLO variants arising from these options and various other implementation details are discussed in Sec. IV. The options themselves are evaluated in Sec. V.</p><p>Contribution. This paper provides an effective solution for a challenging industry problem: designing CNN systems that can provide the throughput, timeliness, and accuracy for CV applications in automated and autonomous driving. Through re-thinking how the YOLO CNN is executed, we show that its throughput can be improved by more than twofold with only a minor latency increase and no change in accuracy. We also show that throughput can be further improved with proper GPU load balancing. Finally, we show that the compositing technique can enable very high throughputs of up to 250 FPS on the PX2. While this technique does incur some accuracy loss, we show that such loss can be mitigated by redoing the "training" CNNs require. We claim these results are sufficiently fundamental to guide CNN construction for similar automotive CV applications on other frameworks or hardware.</p><p>In order to understand these contributions in full, an understanding of relevant background information and related work is required. This we provide in the next two sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND</head><p>In this section, we provide detailed descriptions of our hardware platform, the DRIVE PX2 <ref type="bibr">[20]</ref>, the Darknet CNN framework, and the object-detection application ("YOLO") that runs using this framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. NVIDIA DRIVE PX2</head><p>The PX2 is embedded hardware specifically designed to provide high-performance computing resources for autonomous operation on a platform with small size, weight, and power characteristics. It couples multicore CPUs with two GPUs of differing capabilities: an integrated GPU (iGPU) and a more powerful discrete GPU (dGPU). The exact performance differences between the iGPU and the dGPU depend on the characteristics of GPU programs used in a specific application. We characterize these differences for YOLO in Sec. V.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. YOLO: You Only Look Once</head><p>In this section, we describe Tiny YOLOv2 (simply "YOLO" in the remainder of this paper). We use YOLO as an exemplar for a wide class of CNN implementations that can be optimized using the designs presented here. We also use it for empirical evaluations of the new CNN framework.</p><p>Convolutional neural networks. YOLO is a CNN that localizes and classifies multiple objects in an image during one forward pass through the CNN. The localization operation predicts the position and dimensions (in pixel units) of a rectangle ("bounding box") in the image that contains an object of interest. The classification operation predicts with a probability what specific class (car, bicycle, sign, etc.) the object belongs to. Collectively, localization and classification are here referred to as object detection.</p><p>A neural network can "learn" to localize and classify objects in images via a process called supervised training in which a large number of images labeled with "ground-truth" bounding boxes and class identifications are input to the CNN running in training mode. Training typically results in computing a large set of parameters (or "weights") defining how the CNN will respond when given images not in the training set (e.g., images from cameras mounted on a car). For most of the experiments in Sec. V, we used the YOLO CNN as already trained on a widely used set of relevant training images (VOC 2007+12), so we did not modify the YOLO network architecture or training approach. This means that most of our work applies to any CNN constructed using Darknet, from YOLOv3 to AlexNet. This is extremely important as noted in <ref type="bibr">[18]</ref> because the state of the art rapidly changes for autonomous vehicles. Only one optional enhancement we consider, namely image compositing, requires retraining.</p><p>Tiny YOLO implementation. 
Tiny YOLOv2 is a lower-accuracy, faster derivative of YOLOv2 by the YOLOv2 authors <ref type="bibr">[22,</ref><ref type="bibr">24]</ref>. This version of YOLO runs roughly five times faster than the standard version of YOLO (&#8805; 200 FPS vs. &#8805; 40 FPS on a high-end GPU) <ref type="bibr">[24]</ref>. The network architecture is shown in Fig. <ref type="figure">2</ref>, as it would be configured by default for C cameras (denoted 0 through C - 1). In this default system, each camera uses a separate "private" instantiation of the network. Each instantiation contains nine convolutional layers interleaved with six maxpool layers and a region layer at the end. All YOLO networks run on the Darknet <ref type="bibr">[23]</ref> CNN framework. A convolutional layer convolves over the input by computing the dot product between input and filters to produce an activation map of features. A maxpool layer applies a non-linear transform to the input by picking maximum values out of non-overlapping regions. YOLO's region layer uses the activation maps of the last convolutional layer alongside precalculated "anchor" values to predict the output locations and bounding boxes of objects of interest. These anchors characterize the common shapes and sizes of bounding boxes in the training data and help improve the prediction accuracy. All of our modifications and benchmarks are based on a proprietary fork of Darknet, so frame rates in Sec. V are not directly comparable to those from the open-source version. <ref type="foot">3</ref> Unfortunately, we are unable to open-source our fork due to intellectual property management issues.</p><p>The Darknet framework is programmed in C and CUDA. CUDA is an API library and set of C language extensions provided by NVIDIA for general-purpose programming of applications that use a GPU for acceleration. Programs in CUDA run on the CPU and can request execution of programs called "kernels" that are written to run on the GPU. 
Kernel execution requests are effectively placed in FIFO CUDA stream queues managed by GPU hardware and its device driver. The GPU scheduler executes kernels from multiple queues in a complicated, semi-fair order but always respects the FIFO ordering within a stream queue. When the CPU requests the execution of a kernel on the GPU, control returns to the CPU shortly after the request is enqueued, but before the kernel is executed. The CPU program must use an explicit synchronization operation to determine when all its requested kernel executions have completed.</p><p>Darknet uses a common CUDA programming pattern in its layers. The pattern is: (a) if necessary, copy data from CPU memory to GPU memory; (b) start one or more kernels; (c) synchronize on the GPU's completion of the kernels; and (d) copy data from GPU memory to CPU memory. Requests for copy operations use specialized hardware on the GPU but are placed in the FIFO stream queues in order with the kernels. The number of kernels run by YOLO varies between layers, but in total may be hundreds of kernels per image.</p></div>
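To make the layer descriptions above concrete, the maxpool operation can be sketched as a host-side C reference. This is our own minimal, single-channel illustration; Darknet's actual layer runs as a CUDA kernel over many channels and also supports strides and padding.

```c
#include <float.h>

/* Minimal single-channel maxpool: picks the maximum value out of each
 * non-overlapping `size` x `size` region of a w x h input, writing a
 * (w/size) x (h/size) output.  A host-side sketch only. */
static void maxpool(const float *in, int w, int h, int size, float *out)
{
    int ow = w / size, oh = h / size;           /* output dimensions */
    for (int oy = 0; oy < oh; oy++)
        for (int ox = 0; ox < ow; ox++) {
            float m = -FLT_MAX;
            for (int dy = 0; dy < size; dy++)
                for (int dx = 0; dx < size; dx++) {
                    float v = in[(oy * size + dy) * w + (ox * size + dx)];
                    if (v > m) m = v;
                }
            out[oy * ow + ox] = m;              /* max of the region */
        }
}
```

Pooling a 4x4 input with size 2, for example, yields a 2x2 output containing each quadrant's maximum.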
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. RELATED WORK</head><p>YOLO comes from a family of "single-shot" detection approaches. These architectures, pioneered by Redmon and Liu <ref type="bibr">[19,</ref><ref type="bibr">21]</ref>, perform all the steps of detection in one run of a single CNN. This allows for higher throughput while competing with the state of the art on accuracy. This work builds on several foundational CV CNN works <ref type="bibr">[10,</ref><ref type="bibr">11,</ref><ref type="bibr">17]</ref>.</p><p>CNNs inherently consist largely of parallel math, and thus GPUs serve as the ideal execution platform. Unfortunately, the leading GPU manufacturer (NVIDIA) presently holds its architecture and software stacks extremely close. This secrecy has even resulted in some publicly documented synchronization behaviors of their hardware being incorrect <ref type="bibr">[27]</ref>. These factors have previously made it very challenging to build CV CNNs that consistently or properly utilize the GPU. (In this paper, we show how inadequate the best existing efforts have been.)</p><p>Fortunately, recent work has begun to address this problem. While AMD's GPUOpen effort <ref type="bibr">[2]</ref> helped unveil aspects of GPU design, Amert et al. <ref type="bibr">[3]</ref> and Capodieci et al. <ref type="bibr">[5]</ref> recently presented more relevant findings. These works uncover details on how CUDA threads are scheduled inside the same process and dispatched to the GPU cores on the NVIDIA Parker SoC, highlighting a set of time-sliced FIFO queues at the core. While some other works <ref type="bibr">[16]</ref> have critiqued Amert et al.'s efforts for ambiguous applicability to other NVIDIA GPUs, our platform (the PX2) shares an SoC (the NVIDIA Parker chip) with their work. This, and some limited validation work by Bakita et al. 
in <ref type="bibr">[4]</ref>, allows us to safely apply the findings of <ref type="bibr">[3]</ref> here.</p><p>In the area of parallel CNN scheduling, frameworks such as TensorFlow <ref type="bibr">[1]</ref> already support intra-model CNN graphs. These graphs serve to enable "parallelism, distributed execution, compilation, and portability" in their applications (per their website). However, neither the original paper <ref type="bibr">[1]</ref> nor their survey of CV CNNs <ref type="bibr">[12]</ref> seems to consider anything besides parallelism inside one serialized execution of a model on static data. Their graphs seem mostly to serve to formalize the implicit dataflow graph present in any CNN. In contrast, our work deals with synthesizing an extra-model shared CNN by which to collapse multiple CNN instances into a single shared program. It is worth emphasizing that our efforts are designed to largely work alongside the vast body of existing CNN optimization and compression efforts such as DeepMon <ref type="bibr">[14]</ref>.</p><p>One other work similar to ours, S3DNN, applied techniques such as kernel-level deadline-aware scheduling and batch processing (data fusion) that require additional middleware <ref type="bibr">[28]</ref>. We focus on higher-level parallelism by applying a graph-based architecture. Such a graph-based architecture was also applied in <ref type="bibr">[26]</ref>, which focuses on separating CPU and GPU workloads at the kernel level to guarantee response-time bounds. Some of our work here also considers ways to optimally partition YOLO between GPUs of different capabilities. The partitioning concept is not new; Kang et al. used it in Neurosurgeon <ref type="bibr">[15]</ref> to dynamically balance CNN layers between a compute-impoverished system and a networked server. We, however, use the technique to optimize solely for performance without the networking constraints that dominate that work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. SYSTEM DESIGN</head><p>In this section, we discuss some of our ideas for how to implement this shared CNN. First, we describe our overarching approach, which involves generalizing the concept of a CV layer by combining groups of consecutive layers into "stages," examining opportunities for parallelism across stages ("pipelining"), and considering opportunities for parallelism inside each stage ("parallel-pipelining"). Second, our hardware platform provides an unusual opportunity to examine GPU work-assignment strategies across asymmetric processors, so we briefly consider that issue. Third, we propose a novel technique for realizing additional throughput gains in high-camera-count environments. Along the way, we also briefly touch on several other issues that naturally arise with the concept of a shared CNN. These include data handling, communication issues, and implicit synchronization. Before delving too deeply into the details, however, we first motivate why we are interested in this shared approach in the first place.</p><p>The obvious solution to support multiple cameras is simple: run multiple instances of the application - one process for each camera - and have them share the GPU(s), as shown in Fig. <ref type="figure">2</ref>. We reject this approach for two reasons. First, the high memory requirements of each network instance heavily limit the total number of instances (a maximum of six on our hardware platform). Second, running the models independently does not easily allow for fast synchronization across multiple cameras. Ideally, we would like each network and camera to process frames at the same rate, prioritizing work for networks that are lagging behind. However, this proves extremely challenging in practice as it requires modifying GPU scheduling behavior - a task still not satisfactorily solved by any research that we are aware of.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Enabling Parallelism</head><p>As noted in Sec. I, CNN layers and models are inherently independent and thus sharable - the only major complexities for us stem from Darknet's limited "bookkeeping" side-effects and from the need to correctly pipe intermediate data between layers. In this section, we explore ways of enabling parallelism in such a shared computation structure.</p><p>Decomposition into stages. As a precursor to introducing parallelism, we decompose a shared CNN into a succession of computational stages. Each stage is simply a subsequence of layers from the YOLO CNN shown in Fig. <ref type="figure">2</ref>, as illustrated in Fig. <ref type="figure">3</ref>. To correctly manage both frame data and the bookkeeping data required by YOLO, we buffer indices to this data using queues between stages as shown in Fig. <ref type="figure">3</ref>. Note that these queues are not necessarily index-ordered. Conceptually, we can view the data being indexed as being stored in infinite arrays, but in reality, these arrays are "wrapped" to form ring buffers. The size of the frame ring buffer is large enough to prevent any blocking when adding another frame to it. However, memory constraints limit the ring buffer for bookkeeping data to have only ten entries. This implies that only ten frames can exist concurrently in the shared CNN at any time. Note that Darknet ensures that the original layers execute correctly without further modification by us when provided with the appropriate bookkeeping data. Data buffering and synchronization in our pipelined implementation are provided by PGM^RT, a library created as part of prior work at UNC <ref type="bibr">[7]</ref>.</p><p>Serialized execution. Before enabling parallelism, we first look at the scheduling of the default serialized execution of the unmodified YOLO through an example in Fig. <ref type="figure">4(a</ref>). We will explain Fig. <ref type="figure">4</ref>(b) and (c) shortly. Fig. 
<ref type="figure">4</ref>(a) shows a single three-stage CNN that processes frames serially. At time t0, the first stage (Stage 0) starts the initial preparation work on the CPU for frame f. This initial preparation work (for example, resizing the frame) requires Stage 0 to execute longer on the CPU than the following stages. Stage 0 launches its kernel, indicated by S0, at time t1 into the single CUDA stream queue belonging to this CNN instance. Stage 1 launches another kernel, S1, at time t2, and S2 is launched at time t3. These kernels S0, S1, and S2 are executed in FIFO order and are completed at times t4, t5, and t6, respectively. Since the intermediate data produced by the first kernels S0 and S1 are not needed for CPU computations, only the last stage suspends to wait for all of its requested kernel executions to complete.</p><p>Pipelined execution. We now explore different methods for enabling the parallel processing of stages. The first method we consider is to enable pipelined execution, i.e., we allow successive stages to execute in parallel on different frames. We implement pipelined execution by defining a thread per stage, which processes one frame by performing the computational steps illustrated in Fig. <ref type="figure">5</ref> for a stage comprised of one convolution layer followed by one maxpool layer. Each stage's thread starts by waiting on the arrival of an index from the previous stage to signal that stage's completion. Once this index becomes available, the thread starts processing by calling the original CNN layer functions with the appropriate bookkeeping data. After these layer functions return, the thread notifies the next stage of completion by pushing the index it just finished into the next queue. It then returns to waiting for input data again. This approach requires no changes to the original YOLO layer functions.</p><p>The resulting parallelism is illustrated in Fig. <ref type="figure">4(b)</ref>. 
For conciseness, we henceforth refer to this pipelined-execution variant of YOLO as simply "PIPELINE" and refer to the original unmodified YOLO as "SERIAL." In PIPELINE, multiple per-frame queues are created and pipeline stages can execute concurrently on CPU cores. This enables kernels from different frames to be queued and executed concurrently on the GPU when enough of its resources are available. One example of the concurrent GPU execution of kernels is noted in Fig. <ref type="figure">4(b)</ref>. The threads for stages are scheduled by the Linux SCHED_NORMAL scheduler. Threads may migrate between CPU cores or block when cores are not available. The concurrent execution of threads on the CPUs and kernels on the GPU should enable PIPELINE to provide higher throughput or lower latency. We evaluate this in Sec. V. While the longer first stage causes some overall pipeline imbalance, we believe this is largely confined to the CPU. We further mitigate the influence of a pipeline imbalance by enabling thread parallelism within each stage, as described next.</p><p>Parallel-pipeline execution. As seen in Fig. <ref type="figure">4</ref>, PIPELINE enables parallelism across stages but not within a stage. Given the independence properties mentioned in Sec. I, there is no reason why the same stage cannot be executed on different frames at the same time. To enable this, we consider a second parallel YOLO variant, called parallel-pipeline execution, or simply "PARALLEL" for brevity, that extends PIPELINE by allowing parallel executions of the same stage. The PARALLEL YOLO variant is realized by allocating a configurable number of "worker" threads in each stage, and a "manager" thread that is responsible for dispatching worker threads. The worker threads can process different frames concurrently. This is illustrated in Fig. <ref type="figure">6</ref> (once again) for a stage comprised of one convolution layer and one maxpool layer. 
The difference between PIPELINE and PARALLEL is illustrated in Fig. <ref type="figure">4(c</ref>). In PARALLEL, more parallelism is enabled on both the CPU cores and the GPU by allowing multiple threads of the same stage to execute in parallel. Recalling the scheduling rules revealed in <ref type="bibr">[3]</ref>, kernels are scheduled in order of their arrival times at the heads of their stream queues. In Fig. <ref type="figure">4</ref>(c), these arrival times are such that kernel executions from different frames are interleaved, as shown.</p><p>With the scheduling shown in Fig. <ref type="figure">4</ref>, we can see that PIPELINE and PARALLEL come much closer to fully utilizing the GPU (much less GPU idle time than under SERIAL). We quantify the performance in Sec. V. </p></div>
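The stage-per-thread structure just described can be sketched in plain C with POSIX threads. This is entirely our own simplification - hypothetical names, a fixed frame count, and a mutex/condition-variable FIFO standing in for the PGM^RT queues - but it shows how buffered indices let different frames occupy different stages concurrently while each queue preserves FIFO order, as in PIPELINE.

```c
#include <pthread.h>

#define QCAP    16      /* queue capacity */
#define NSTAGES 3       /* pipeline stages */
#define NFRAMES 10      /* frames pushed through the pipeline */

/* Blocking FIFO of frame indices between stages: a simplified stand-in
 * for the PGM^RT queues used in the paper (all names are hypothetical). */
typedef struct {
    int buf[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} idx_queue;

static void q_init(idx_queue *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void q_push(idx_queue *q, int v)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int q_pop(idx_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return v;
}

static idx_queue queues[NSTAGES + 1];   /* queues[s] feeds stage s */

/* One thread per stage: wait for a frame index from the previous stage,
 * run this stage's layers on that frame (elided), pass the index on.
 * Because each stage is a single thread, per-queue FIFO order is kept;
 * this models PIPELINE, while PARALLEL would add workers per stage. */
static void *stage_main(void *arg)
{
    int s = *(int *)arg;
    for (int i = 0; i < NFRAMES; i++) {
        int frame = q_pop(&queues[s]);
        /* ... call the original layer functions for stage s here ... */
        q_push(&queues[s + 1], frame);
    }
    return NULL;
}
```

PARALLEL would additionally allow several worker threads to pop from the same queue, at the cost of no longer guaranteeing per-stage FIFO completion order (which is why the paper notes that the inter-stage queues are not necessarily index-ordered).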
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Multi-GPU Execution</head><p>Recall from Sec. II that the PX2 has an iGPU and a dGPU with different capabilities. As seen in Fig. <ref type="figure">12</ref>, which we cover in detail later, the iGPU tends to execute layers substantially slower than the dGPU. However, notable exceptions (maxpool layers and the region layer) do exist, so the iGPU is still a useful computational resource. Given this observation, we enhanced our PIPELINE and PARALLEL implementations to allow a configurable distribution of stages across GPUs. While utilizing both GPUs may be beneficial, additional data-transfer and synchronization overheads can cause dual-GPU configurations to sometimes under-perform single-GPU ones. We analyze this trade-off experimentally in Sec. V-B.</p></div>
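The stage-assignment trade-off can be illustrated by a toy selection routine. This is our own sketch, not the paper's method: the timings below are invented rather than PX2 measurements, and a real candidate evaluation would also charge each split for the inter-GPU data-transfer and synchronization overheads mentioned above. The idea is simply to pick the prefix/suffix split of stages whose slower side - the pipeline bottleneck - is minimized.

```c
/* Toy two-GPU stage assignment: stages [0, k) go to the dGPU and stages
 * [k, n) to the iGPU.  Pick the k minimizing the pipeline bottleneck,
 * i.e., the larger of the two per-GPU time sums.  Timings are invented
 * per-stage execution times in milliseconds, not measurements. */
static int best_split(const double *dgpu_ms, const double *igpu_ms,
                      int n, double *bottleneck)
{
    int best_k = 0;
    double best = 1e300;
    for (int k = 0; k <= n; k++) {
        double d = 0.0, i = 0.0;
        for (int s = 0; s < k; s++) d += dgpu_ms[s];    /* dGPU load */
        for (int s = k; s < n; s++) i += igpu_ms[s];    /* iGPU load */
        double b = d > i ? d : i;                       /* bottleneck */
        if (b < best) { best = b; best_k = k; }
    }
    if (bottleneck) *bottleneck = best;
    return best_k;
}
```

Note that with overheads included, k = 0 or k = n (a single-GPU configuration) can win, which is exactly the possibility the experiments in Sec. V-B probe.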
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Accelerating Multiple Streams with Multi-Camera Composite Images</head><p>What if we could combine several physical cameras into a single "virtual" one by combining independent per-camera images into a single composite image, as illustrated in Fig. <ref type="figure">7</ref>? Such an approach could have a tremendous impact on throughput. For example, if all physical cameras were combined in groups of four, throughput could potentially be increased fourfold. Whether this is acceptable would depend on how latency and accuracy are impacted.</p><p>To enable this approach to be evaluated, we added flexible support for compositing physical cameras to the YOLO variants already described. Note that these variants themselves are not impacted by this approach: the fact that the images to be processed are obtained by compositing does not change the processing to be done. Because of this, it is possible to have both composite and full images supported in the same system, as illustrated in Fig. <ref type="figure">7</ref>. The compositing approach works by exploiting the flexibility of YOLO's neural network to process multiple frames as a single frame. This requires taking k&#178; frames for some k, tiling them in a grid, and then passing that combined image through YOLO as a single input, as shown in Fig. <ref type="figure">7</ref>. For best accuracy, this approach requires re-training the YOLO network on a set of suitably tiled images to produce new model weights.</p><p>While compositing should yield a throughput improvement that is linear in the number of images in the grid, there are several additional factors to consider. First, frame arrival times from different cameras must be synchronized, or the combining process will block waiting for all the images required to fill the grid. This could increase latency to the inter-frame time of the camera with the lowest frame rate. 
Second, this approach will reduce accuracy. Because our configuration of YOLO accepts only 416-by-416-pixel frames, each frame in the grid must be downscaled by a factor proportional to the grid size. For objects that are large relative to the frame size (e.g., close vehicles), this compromise may be permissible; however, that will not be true in all cases.</p><p>Additionally, we must stress that this approach works only for CNNs that reason locally. Using it in global-classification systems, such as those used in scene understanding, will yield incorrect results. While the YOLO authors believe their network learned to use global features, we did not observe this behavior in our analysis.</p></div>
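To make the tiling step concrete, the following sketch composites k^2 equally sized frames into one k-by-k grid. This is a minimal NumPy illustration of the idea only; the frame dimensions and function name are our own, and the real framework additionally handles per-tile downscaling so that the grid matches YOLO's 416x416 input.

```python
import numpy as np

def composite_frames(frames, k):
    """Tile k*k equally sized H x W x C frames into a single k-by-k grid image."""
    assert len(frames) == k * k, "need exactly k^2 frames"
    rows = [np.concatenate(frames[i * k:(i + 1) * k], axis=1)  # frames side by side
            for i in range(k)]
    return np.concatenate(rows, axis=0)  # stack the rows vertically

# Four (hypothetical) 208x208 RGB frames -> one 416x416 composite, a 2x2 grid.
frames = [np.zeros((208, 208, 3), dtype=np.uint8) for _ in range(4)]
grid = composite_frames(frames, k=2)
```

Because the grid is passed through YOLO as one input, the per-inference cost is paid once for all k^2 streams, which is the source of the throughput gain discussed above.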
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Avoiding Implicit Synchronization</head><p>The original Darknet framework uses CUDA's single default FIFO queue for kernel requests. This works flawlessly for serialized execution by a single-threaded program, but falls apart when concurrent parallel-pipeline instances are desired. Specifically, use of only the default queue causes destructive implicit synchronization on the GPU, which can significantly degrade performance. Following prior work <ref type="bibr">[27]</ref>, we mitigate this by using CUDA's stream-management functions to give each parallel-pipeline instance its own FIFO queue for kernel executions and memory transfers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. EVALUATION</head><p>Our evaluation experiments were designed to answer a number of questions related to our challenge: How many cameras of differing frame rates (FPS) can be supported with the PX2 running a shared CNN? Can acceptable latency be maintained for each camera supported? How can we use the configuration options in PIPELINE and PARALLEL to increase the number of cameras supported? Are multi-camera composite images a viable option for supporting significant numbers of cameras?</p><p>The evaluation of multi-camera composite images must explicitly consider the effects on object detection/recognition accuracy. In all other evaluations of PIPELINE and PARALLEL using single images, we verified that accuracy remained unchanged (as expected, since the CNN layer-processing programs were unchanged).</p><p>We next describe aspects of the experimental methodology common to all experiments and then proceed to answer our questions. We first evaluate the frame rates enabled by the three basic execution models on single and multiple cameras. Then, we explore the effects of the configurable options: stage granularity, the number of threads per stage, and the assignment of stages to GPUs. Finally, we explore the trade-off between throughput and accuracy with multi-camera composite images.</p><p>General experiment set-up. In each experiment, the CNN processes an aggregate of at least 10,000 frames. Latency is measured as the difference between the time a frame arrives at the first CNN layer and the time the object-detection result for that frame is ready. Because the system runs Linux and all its services, worst-case latency outliers can occur. Occasionally dropping a small number of frames whose processing time exceeds a latency threshold is acceptable for our automated-driving platform. 
Therefore, latencies are reported as the 95th percentile of all measured values.</p><p>Unless otherwise stated, the default CNN configuration has one CNN layer per pipeline stage for both PIPELINE and PARALLEL, and ten threads per stage for PARALLEL. In these default configurations, only the PX2's dGPU is used. The choice of these defaults is explained as part of the configuration-option evaluation. All experiments were conducted on the NVIDIA Drive PX 2 running Linux kernel 4.9.38-rt25-tegra with the PREEMPT_RT patch, NVIDIA Driver 390.00, and CUDA 9.0, V9.0.225. CPU threads were scheduled under SCHED_NORMAL.</p></div>
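As a sketch of this reporting convention, the function below computes the per-frame latencies and the reported percentile; the function name and array inputs are our own illustration, not code from the framework.

```python
import numpy as np

def reported_latency_ms(arrival_ts_ms, completion_ts_ms, pct=95.0):
    """Per-frame latency is completion time minus arrival time at the first
    CNN layer; the pct-th percentile is reported, and frames above it are
    treated as droppable worst-case outliers."""
    latencies = np.asarray(completion_ts_ms, float) - np.asarray(arrival_ts_ms, float)
    return float(np.percentile(latencies, pct))
```

For a steady workload where every frame completes in 36 ms, `reported_latency_ms` simply returns 36.0; outliers above the 95th percentile do not affect the reported value.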
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. How Many Cameras can be Supported?</head><p>To answer this question, we first consider processing one stream of images with the shared CNN. Cameras are characterized by their frame rates (FPS), and we emulate camera inputs to the CNN at rates common to cameras used in automated driving (10, 15, 30, and 40 FPS). We also consider streams with much higher frame rates (up to 100 FPS) to represent aggregate frame rates from multiple cameras sharing the CNN. Our emulated camera is a program running on a separate thread that generates input frames to the shared CNN at a defined inter-frame interval corresponding to a given FPS (e.g., 25 milliseconds for 40 FPS). Because latency is a critical metric, we present results as plots of latency at different input camera rates and compare execution under SERIAL, PIPELINE, and PARALLEL. PIPELINE supports 63 FPS; PARALLEL supports up to 71 FPS. Fig. <ref type="figure">8</ref> shows the latency results. Our main criterion for supporting a frame rate is that latency does not become unbounded at that rate. By this criterion, SERIAL (essentially the original YOLO) can support rates up to 28 FPS and PIPELINE up to 63 FPS. PARALLEL can also support 63 FPS with a negligible latency difference, and up to 71 FPS before latency becomes unbounded. Note that the latency at rates lower than 63 FPS is essentially constant, indicating that 36 milliseconds is the minimum latency for YOLO (as seen on the y-axis of Fig. <ref type="figure">8</ref>). Latency becomes unbounded in any of these scenarios when either the GPU becomes overloaded or the CNN execution model cannot further utilize the available GPU capacity. The latency increase of PARALLEL at 71 FPS over PIPELINE at 63 FPS is 20 milliseconds, which may be acceptable in scenarios such as maneuvering in a parking lot. 
However, running multiple SERIAL instances (at most six, due to memory constraints) can support at most 53 FPS, with a latency of 114 milliseconds, before latency becomes unbounded.</p><p>We attribute these results to the inter-stage GPU dispatch parallelism of PIPELINE and the additional intra-stage GPU dispatch parallelism of PARALLEL. These simultaneous dispatches allow us to more fully utilize the GPU and achieve a higher frame rate. Although some dispatch parallelism can be achieved by running multiple SERIAL instances, GPU resource usage is still limited because kernels from different process contexts cannot run concurrently in CUDA. Concurrent kernel execution from different processes can be enabled with NVIDIA's Multi-Process Service (MPS); however, NVIDIA has not implemented MPS on the PX2.</p><p>These results are perhaps most useful as a way to estimate how many cameras of differing rates can share the CNN. As long as the aggregate rate from all cameras is less than the supported total rate, latency should not be compromised. For example, the results show that a single, shared SERIAL YOLO could not support even a single 30 FPS camera on the PX2 hardware, while PIPELINE and PARALLEL should easily support two 30 FPS cameras. Similarly, PIPELINE can support up to four 15 FPS cameras, or two at 15 FPS and one at 30 FPS, etc.</p><p>Note that all these results were obtained using only one of the two independent computing systems on the PX2 board. Using both systems to run two independent CNN configurations should double the number of cameras supported by one PX2. Multiple cameras are supported at aggregate frame rates. We verified that estimates based on the FPS of a single image stream remain valid for multiple streams. This experiment investigated whether the additional competition from multiple input streams to the shared CNN changed latencies. The results are shown in Fig. <ref type="figure">9</ref> for one, two, and four cameras generating input to PARALLEL YOLO. 
The supported total camera frame rate is determined by the hardware capacity and the CNN framework's execution model; it therefore remains constant whether the frames come from one camera or several. That is, this data corroborates the estimate given above that supporting one 60 FPS camera is equivalent to supporting two 30 FPS cameras, four 15 FPS cameras, etc. The increased latencies for two and four cameras seen on the y-axis in Fig. <ref type="figure">9</ref> are an artifact of the experiment. Input images are processed in sequence, so a frame incurs a longer measured latency when its processing must wait behind frames that arrived at the same time from other cameras. These in-phase frame arrivals can be avoided in a real system by setting a different phasing for each camera (a topic for future work).</p></div>
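The aggregate-rate rule of thumb above can be phrased as a simple admission test. The function is our own sketch; the single-stream maxima are those reported in this section.

```python
# Maximum sustainable aggregate frame rates (FPS) measured in Sec. V-A.
MAX_FPS = {"SERIAL": 28, "PIPELINE": 63, "PARALLEL": 71}

def cameras_supported(camera_fps, model):
    """A set of cameras is supportable if its aggregate frame rate does not
    exceed the maximum rate sustained by the chosen execution model."""
    return sum(camera_fps) <= MAX_FPS[model]

cameras_supported([30, 30], "PIPELINE")   # two 30 FPS cameras -> True
cameras_supported([30], "SERIAL")         # one 30 FPS camera  -> False
```

This captures the estimate that one 60 FPS camera, two 30 FPS cameras, and four 15 FPS cameras all place the same aggregate load on the shared CNN.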
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Can Our CNN Implementations Be Better Configured?</head><p>PIPELINE can be configured with multiple CNN layers aggregated into each pipeline stage, i.e., the granularity of each stage is configurable. PARALLEL can be configured in the same way, but also adds the option to adjust the number of threads assigned per stage, i.e., the degree of parallelism within each stage. In addition, sequences of stages can be assigned to either the iGPU or the dGPU on the PX2 to avoid wasting any available GPU capacity. We evaluate these configuration options next. Stage granularity improves latency under heavy loads for PARALLEL. In Fig. <ref type="figure">10</ref>, we compare latency results at various frame rates from a single camera stream for PARALLEL with different stage granularities. We did not observe any impact of granularity on PIPELINE, so those results are not shown here.</p><p>Note that pipelining is effectively disabled by configuring a single stage containing all 16 layers. This approximates multiple instances of SERIAL running in one process context, a configuration not supported in the original framework.</p><p>As shown in Fig. <ref type="figure">10</ref>, latencies for different granularities are very close at frame rates up to 63 FPS (the maximum load before latencies increase significantly, i.e., the maximum steady state). While all granularities can support a heavier processing load of up to 71 FPS with an increase in latency, finer-granularity stages have smaller latencies in this region. Because each stage is implemented by one manager thread and multiple worker threads, more stages of finer granularity enable more parallelism among stages on the CPU. Our design for data communication between stages introduces small overheads, which is reflected in the very small differences in performance across granularities when the frame rate is low. 
A granularity of one layer per stage is a good configuration choice. More threads per stage improve latency for PARALLEL under heavy loads. In Fig. <ref type="figure">11</ref>, we compare latency results at various frame rates from a single camera stream for PARALLEL with different numbers of threads per stage. More threads per stage enable more parallelism, so more frames can be processed concurrently by a stage. Having only one thread per stage is essentially the same as pipelined execution, so latency becomes unbounded at 63 FPS, as shown previously. Between 63 FPS and 71 FPS, latencies increase substantially, but higher thread counts yield somewhat lower latencies and a slower rate of latency increase with load. A larger number of threads, around ten, is a good configuration choice. Overhead analysis and performance bottlenecks. Enabling PIPELINE and PARALLEL naturally incurs additional overhead. We evaluate the computational and spatial overheads of different configurations by comparing the CPU and main-memory usage of SERIAL against PIPELINE and PARALLEL. In each experiment, process or thread instances share the six active CPU cores on the PX2. The maximum CPU usage we observed is reported in Fig. <ref type="figure">4</ref>. This demonstrates that using more CPU resources in order to fully utilize the scarce resources of the GPU is a reasonable strategy.</p></div>
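The stage structure just described can be sketched on the CPU side as follows: each stage owns an input queue and a pool of worker threads, so raising the thread count lets a stage keep several frames in flight at once. This is a toy Python illustration with arithmetic stand-ins for layer computations; the real stages dispatch GPU kernels and use a manager thread per stage.

```python
import queue
import threading

def make_stage(layer_fn, inbox, outbox, n_workers):
    """One pipeline stage: n_workers threads pull frames from inbox, apply
    the stage's layer function, and push results to outbox."""
    def worker():
        while True:
            item = inbox.get()
            if item is None:        # shutdown signal
                inbox.put(None)     # propagate to sibling workers
                break
            outbox.put(layer_fn(item))
    return [threading.Thread(target=worker, daemon=True) for _ in range(n_workers)]

# Two toy stages with two worker threads each, chained by queues.
q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
stages = make_stage(lambda x: x + 1, q0, q1, 2) + make_stage(lambda x: x * 2, q1, q2, 2)
for t in stages:
    t.start()
for frame in range(8):
    q0.put(frame)
results = sorted(q2.get() for _ in range(8))  # completion order may vary
q0.put(None)
```

Because several workers can be inside a stage simultaneously, a slow stage no longer caps throughput at one frame per stage traversal, which mirrors the benefit PARALLEL shows over PIPELINE at high frame rates.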
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Multi-GPU execution improves latency under heavy loads</head><p>Next, we evaluate the latency of different GPU-assignment strategies to understand the impact of assigning some layers to the iGPU. Layer performance on both GPUs. We first measured the performance of each layer on both GPUs by executing SERIAL on each GPU individually and recording per-layer response times and per-layer CPU computation times. Results are shown in Fig. <ref type="figure">12</ref>, which consists of two parts sharing a common x-axis; the top part depicts per-layer dGPU-iGPU performance ratios (obtained by dividing a layer's iGPU response time by its dGPU response time), and the bottom part depicts the per-layer data just mentioned. As seen, convolution layers typically take two to three times longer on the iGPU. Convolution layers contain heavy matrix computations that complete much faster with more CUDA cores. The maxpool and region layers have lower performance ratios because they have lighter computation loads; these layers are good candidates for moving to the iGPU. GPU assignment strategy. To introduce as little latency degradation as possible while using both GPUs, our strategy is to move layers with low dGPU-iGPU performance ratios to the iGPU. Given the performance ratios at the top of Fig. <ref type="figure">12</ref>, we evaluate the GPU-assignment strategies shown in TABLE <ref type="table">III</ref>. Each strategy is characterized by the sequence of layer numbers assigned to each GPU. For example, strategy S1 assigns layers 0 to 13 to the dGPU and layers 14 and 15 to the iGPU. Strategy S1 substantially reduces latency for PARALLEL under heavy loads. Fig. <ref type="figure">13</ref> shows that, for PARALLEL, moving the last two CNN layers to the iGPU under strategy S1 substantially improves latency under heavy loads.</p></div>
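The ratio-driven idea can be sketched as below. The ratio values and threshold are hypothetical, purely for illustration; note also that the strategies in TABLE III assign contiguous sequences of layers to each GPU (to limit cross-GPU data transfers), whereas this sketch classifies each layer independently.

```python
def assign_layers(igpu_over_dgpu_ratios, threshold=1.5):
    """Mark layers with low iGPU/dGPU response-time ratios as iGPU
    candidates; keep the rest on the faster dGPU."""
    return ["iGPU" if r < threshold else "dGPU"
            for r in igpu_over_dgpu_ratios]

# Hypothetical ratios: convolution-like layers ~2-3x slower on the iGPU,
# maxpool/region-like layers close to 1x.
assignment = assign_layers([2.8, 1.1, 2.5, 1.2, 2.9, 0.9])
# -> ['dGPU', 'iGPU', 'dGPU', 'iGPU', 'dGPU', 'iGPU']
```

In practice the candidate set would then be restricted to a suffix or prefix of the layer sequence, as in strategy S1, so that data crosses between GPUs only once.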
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Are Multi-Camera Composite Images Effective?</head><p>No combination of the features and execution models above provides enough throughput at the desired latencies for a vehicle equipped with more than four 30 FPS cameras, even when using both systems on the PX2 board. We propose to tackle this by compositing frames from multiple cameras into one image for processing. Although compositing may introduce a small loss of accuracy, because image resolution is reduced, this can be acceptable for low-criticality cameras in certain circumstances. We devote this section to evaluating the effectiveness of this technique.</p><p>With a multi-camera composite, throughput is multiplied by the number of images in the composite image. However, frame concatenation introduces overhead. Also, because the frames are down-sampled as explained in Sec. IV-C, object-detection accuracy could be reduced. We evaluate the throughput, latency, and accuracy of the multi-camera composite-image technique. Experimental set-up. We focus on comparing YOLO's performance on full-size images and 2x2 composite images. The network is evaluated on a standard object-detection dataset, PASCAL VOC <ref type="bibr">[8]</ref>, using two types of tests: one with full-size images and one with composite images. For the full-size test, the original VOC test data is used. For the composite test, we generate test images by randomly choosing four full-size VOC test images and combining them into a composite image with a 2x2 grid. Each VOC test image is used exactly once as a component of a composite image; hence, the composite test set is one quarter the size of the full-size one. This ensures that the total number of objects to be detected is the same in both tests. 
At runtime, full-size and composite images are both resized by YOLO to 416x416 during processing.</p><p>Evaluation of detection results. The network's output is evaluated using mean average precision (mAP), a metric commonly used to evaluate object-detection algorithms <ref type="bibr">[9]</ref>. In loose terms, mAP characterizes a model's ability to achieve both high recall and high precision: higher mAP means more objects are detected, and the detections and localizations are more accurate. A detailed explanation of mAP is provided in Appendix A.</p><p>Like other state-of-the-art detection networks, YOLO is designed to be scale-invariant, meaning it can detect objects at different scales and resolutions. However, when image resolution decreases significantly (in our case, by a factor of four), the network's performance can start to degrade. To mitigate this, we retrained the network on the VOC dataset using the same number of training images for both full-size and composite data. The resulting mAPs are shown in TABLE IV. Moreover, since our primary application is autonomous driving, we also show in TABLE V the detection results for relevant object classes. <ref type="foot">4</ref> Overall, our results consistently indicate that with YOLO retrained on both composite and full-size data, the detection accuracy on composite images is improved without reducing that on full-size images. With the retrained composite network, we can trade a 10% accuracy drop for a fourfold increase in throughput.</p><p>Although mAP decreases with composite images, it is important to remember the following. First, the focus of our work is on providing the option to trade off higher throughput against acceptably lower accuracy, not on the magnitude of the accuracy itself. Second, YOLO represents a state-of-the-art detection network, but we are tackling a very challenging CV problem with strictly limited resources. 
Third, real-life autonomous-driving systems usually deploy tracking algorithms, which can use information from multiple frames to improve detection accuracy. Finally, as shown in <ref type="bibr">[22]</ref>, YOLO tends to perform better for bigger objects than for smaller ones. Thus, we expect the mAPs to be higher for nearby objects, which are usually prioritized in automated driving. A four-camera composite image supports 250 FPS (!) at a latency of 48 milliseconds. This is shown in Fig. <ref type="figure">15</ref>. Examining the y-axis of Fig. <ref type="figure">15</ref>, observe that at lower respective frame rates, latency is higher in the four-camera scenario than in the one-camera scenario (about 45 vs. 36 milliseconds). This is due to the additional overhead incurred in combining multiple frames.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSIONS</head><p>In this paper, we presented the results of an industrial design study that addresses the challenge of re-designing CNN software to support multiple cameras with better throughput for automated-driving systems. We showed how to transform a traditional CNN framework to obtain pipelined and parallel-pipelined variants that better utilize hardware resources. We also examined the impact of properly load-balancing GPU work across the GPUs of a multi-GPU hardware platform. These basic techniques yielded an improvement in throughput of over twofold with only minor latency increases and no accuracy loss. To accommodate more cameras than these basic techniques allow, we proposed the technique of compositing multiple image streams from different cameras into a single stream. We showed that this technique can greatly boost throughput (up to 250 FPS) at the expense of some accuracy loss, and we demonstrated that this loss can be partially mitigated through network retraining.</p><p>The results of this study open up many interesting avenues for future work. For example, we would like to determine whether the compositing technique can be dynamically applied at runtime for input streams whose criticality may change. For example, when driving forward, side-facing cameras are less critical and thus could reasonably be composited; however, when turning, these cameras become more critical, so disabling compositing for them might be appropriate. In multi-mode systems, compositing might also be mode-dependent, in which case it becomes an issue that mode-change protocols must consider. In addition, we would like to explore whether the CNN model can be further optimized to increase detection precision near the interior boundaries of a composite image. Embedded hardware platforms other than the PX2 and CV applications other than YOLO also certainly warrant attention.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX A: MEAN AVERAGE PRECISION (MAP)</head><p>This appendix provides an intuitive understanding of the mean average precision (mAP) metric, a popular metric for evaluating object-detection accuracy. For further details on the precision-recall curve and mAP, we refer the reader to <ref type="bibr">[13]</ref> and Section 4.2 of <ref type="bibr">[9]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>1) IoU, Precision, and Recall</head><p>To determine the precision of object detection, we first need to measure the quality of individual detections, which is done through the Intersection over Union (IoU) measure. IoU is defined as the area of overlap between the predicted and expected regions of an object divided by the area of their union: IoU = (PredictedArea ∩ ExpectedArea) / (PredictedArea ∪ ExpectedArea). If the IoU of a prediction exceeds a chosen threshold (e.g., 0.5, as specified for the PASCAL VOC dataset), we consider it a "correct" detection, or true positive.</p><p>Being able to identify correct detections for a given class (e.g., car) enables us to measure the precision and recall for that class, which are defined as follows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Precision = TruePositive / (TruePositive + FalsePositive)</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Recall = TruePositive / (TruePositive + FalseNegative)</head><p>Precision measures the proportion of correct detections among all predictions, while recall measures the proportion of test objects that are correctly detected. A good object detector should detect most of the objects in the scene (high recall/completeness) and do so accurately (high precision/correctness). Note that there is typically a trade-off: improving precision usually hurts recall, and vice versa.</p></div>
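The definitions above can be made concrete for axis-aligned bounding boxes; this is a minimal sketch of our own, not the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter)

def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, and
    false-negative counts."""
    return tp / (tp + fp), tp / (tp + fn)

# Half-overlapping boxes: intersection 50, union 150, so IoU = 1/3 < 0.5,
# i.e., this prediction would not count as a correct detection.
iou((0, 0, 10, 10), (5, 0, 15, 10))
```

With correct detections identified by the IoU threshold, the per-class true-positive, false-positive, and false-negative counts feed directly into the two formulas above.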
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2) Precision-Recall Curve and Mean Average Precision</head><p>A precision-recall curve (see Fig. <ref type="figure">16</ref>) is used to evaluate these trade-offs between precision and recall, or equivalently between false positives and false negatives <ref type="bibr">[1,</ref><ref type="bibr">3]</ref>. For a given class, Average Precision (AP) is a numerical value that summarizes the shape of the precision-recall curve. Specifically, it is calculated by averaging the precision at 11 equally spaced recall levels {0, 0.1, ..., 1} <ref type="bibr">[9]</ref>. A method with high AP has relatively high precision at all levels of recall and can thus perform accurate detection for the given class. Mean Average Precision (mAP) is the mean AP over all classes (e.g., car, person, dog, etc.).</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Technically, the PX2 provides four GPUs split between two systems on a chip (SoCs). We considered only one of these SoCs in our evaluation.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>For example, instruction pipelining in processors can fully utilize data-path elements and increase throughput.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Specifically, our base version of Darknet pre-loads frames and performs frame resizing in the main thread whereas the open-source version uses a separate thread.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>Among 20 object classes, we deem the following eight to be irrelevant to the task of autonomous driving: airplane, boat, bottle, chair, dining table, potted plant, sofa, and TV monitor.</p></note>
		</body>
		</text>
</TEI>
