<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Making OpenVX Really "Real Time"</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/01/2018</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10108191</idno>
					<idno type="doi">10.1109/RTSS.2018.00018</idno>
					<title level='j'>2018 IEEE Real-Time Systems Symposium (RTSS)</title>
					<author>Ming Yang</author><author>Tanya Amert</author><author>Kecheng Yang</author><author>Nathan Otterness</author><author>James H. Anderson</author><author>F. Donelson Smith</author><author>Shige Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[OpenVX is a recently ratified standard that was expressly proposed to facilitate the design of computer-vision (CV) applications used in real-time embedded systems. Despite its real-time focus, OpenVX presents several challenges when validating real-time constraints. Many of these challenges are rooted in the fact that OpenVX only implicitly defines any notion of a schedulable entity. Under OpenVX, CV applications are specified in the form of processing graphs that are inherently considered to execute monolithically end-to-end. This monolithic execution hinders parallelism and can lead to significant processing-capacity loss. Prior work partially addressed this problem by treating graph nodes as schedulable entities, but under OpenVX, these nodes represent rather coarse-grained CV functions, so the available parallelism that can be obtained in this way is quite limited. In this paper, a much more fine-grained approach for scheduling OpenVX graphs is proposed. This approach was designed to enable additional parallelism and to eliminate schedulability-related processing-capacity loss that arises when programs execute on both CPUs and graphics processing units (GPUs). Response-time analysis for this new approach is presented and its efficacy is evaluated via a case study involving an actual CV application.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The push towards deploying autonomous-driving capabilities in vehicles is happening at breakneck speed. Semiautonomous features are becoming increasingly common, and fully autonomous vehicles at mass-market scales are on the horizon. In realizing these features, computer-vision (CV) techniques have loomed large. Looking forward, such techniques will continue to be of importance as cameras are both cost-effective sensors (an important concern in mass-market vehicles) and a rich source of environmental perception.</p><p>To facilitate the development of CV techniques, the Khronos Group has put forth a ratified standard called OpenVX <ref type="bibr">[42]</ref>. Although initially released only four years ago, OpenVX has quickly emerged as the CV API of choice for real-time embedded systems, which are the standard's intended focus. Under OpenVX, CV computations are represented as directed graphs, where graph nodes represent high-level CV functions and graph edges represent precedence and data dependencies across functions. OpenVX can be applied across a diversity of hardware platforms. In this paper, we consider its use on platforms where graphics processing units (GPUs) are used to accelerate CV processing.</p><p>Unfortunately, OpenVX's alleged real-time focus reveals a disconnect between CV researchers and the needs of the real-time applications where their work would be applied. In particular, OpenVX lacks concepts relevant to real-time analysis such as priorities and graph invocation rates, so it is debatable whether it really does target real-time systems. More troublingly, OpenVX implicitly treats entire graphs as monolithic schedulable entities. 
This inhibits parallelism <ref type="foot">1</ref> and can result in significant processing-capacity loss in settings (like autonomous vehicles) where many computations must be multiplexed onto a common hardware platform.</p><p>In prior work, our research group partially addressed these issues by proposing a new OpenVX variant in which individual graph nodes are treated as schedulable entities <ref type="bibr">[23,</ref><ref type="bibr">51]</ref>. This variant allows greater parallelism and enables end-to-end graph response-time bounds to be computed. However, graph nodes remain as high-level CV functions, which is problematic for (at least) two reasons. First, these high-level nodes still execute sequentially, so some parallelism is still potentially inhibited. Second, such a node will typically involve executing on both a CPU and a GPU. When a node accesses a GPU, it suspends from its assigned CPU. Suspensions are notoriously difficult to handle in schedulability analysis without inducing significant capacity loss.</p><p>Contributions. In this paper, we show that these problems can be addressed through more fine-grained scheduling of OpenVX graphs. Our specific contributions are threefold.</p><p>First, we show how to transform the coarse-grained OpenVX graphs proposed in our group's prior work <ref type="bibr">[23,</ref><ref type="bibr">51]</ref> to fine-grained variants in which each node accesses either a CPU or a GPU (but not both). Such transformations eliminate suspension-related analysis difficulties at the expense of (minor) overheads caused by the need to manage data sharing. Additionally, our transformation process exposes new potential parallelism at many levels. For example, because we decompose a coarse-grained OpenVX node into finer-grained schedulable entities, portions of such a node can now execute in parallel. 
Also, we allow not only successive invocations of the same graph to execute in parallel but even successive invocations of the same (fine-grained) node.</p><p>Second, we explain how prior work on scheduling processing graphs and determining end-to-end graph response-time bounds can be adapted to apply to our fine-grained OpenVX graphs. The required adaptation requires new analysis for determining response-time bounds for GPU computations. We show how to compute such bounds for recent NVIDIA GPUs by leveraging recent work by our group on the functioning of these GPUs <ref type="bibr">[1]</ref>. Our analysis shows that allowing invocations of the same graph node to execute in parallel is crucial in avoiding extreme capacity loss.</p><p>Third, we present the results of case-study experiments conducted to assess the efficacy of our fine-grained graph-scheduling approach. In these experiments, we considered six instances of an OpenVX-implemented CV application called HOG (histogram of oriented gradients), which is used in pedestrian detection, as scheduled on a multicore+GPU platform. These instances reflect a scenario where multiple camera feeds must be supported. We compared both analytical response-time bounds and observed response times for HOG under coarse- vs. fine-grained graph scheduling. We found that bounded response times could be guaranteed for all six camera feeds only under fine-grained scheduling. In fact, under coarse-grained scheduling, just one camera could (barely) be supported. We also found that observed response times were substantially lower under fine-grained scheduling. Additionally, we found that the overhead introduced by converting from a coarse-grained graph to a fine-grained one had modest impact. These results demonstrate the importance of enabling fine-grained scheduling in OpenVX if real time is really a first-class concern.</p><p>Organization. In the rest of the paper, we provide needed background (Sec. 
2), describe our new fine-grained scheduling approach (Sec. 3), present the above-mentioned GPU response-time analysis (Sec. 4) and case study (Sec. 5), discuss related work (Sec. 6), and conclude (Sec. 7).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head><p>In this section, we review prior relevant work on the real-time scheduling of DAGs and explain how this work was applied previously for coarse-grained OpenVX graph scheduling <ref type="bibr">[23,</ref><ref type="bibr">51]</ref>. The prior scheduling work of relevance takes considerable space to cover, so for the time being, we focus on generic (perhaps non-OpenVX) DAGs. Our review of this prior work draws heavily from a previous paper by three of the authors <ref type="bibr">[52]</ref>. We specifically consider a system of N DAGs G 1 , . . . , G N .</p><p>Each DAG G i consists of n i nodes corresponding to sporadic tasks &#964; i 1 , . . . , &#964; i n i . The edges in G i reflect precedence relationships. A particular task &#964; i v 's predecessors are those tasks with outgoing edges directed to &#964; i v , and its successors are those with incoming edges directed from &#964; i v . The j th job of task &#964; i v , &#964; i v,j , cannot commence execution until the j th jobs of all of its predecessors finish. Such dependencies only exist for the same invocation of a DAG, not across invocations. That is, while jobs are sequential, intra-task parallelism (i.e., parallel node invocation) is possible: successive jobs of a task are allowed to execute in parallel.</p><p>Ex. 1. Consider DAG G 1 in Fig. <ref type="figure">1</ref>. Task &#964; 1 4 's predecessors are tasks &#964; 1 2 and &#964; 1 3 , i.e., for any j, job &#964; 1 4,j waits for jobs &#964; 1 2,j and &#964; 1 3,j to finish. If intra-task parallelism is allowed, then &#964; 1 4,j and &#964; 1 4,j+1 could execute in parallel.</p><p>For simplicity, we assume that each DAG G i has exactly one source task &#964; i 1 , with only outgoing edges, and one sink task &#964; i n i , with only incoming edges. Multi-source/multi-sink DAGs can be supported with the addition of singular "virtual" sources and sinks that connect multiple sources and sinks, respectively. 
Virtual sources and sinks have a worst-case execution time (WCET) of zero.</p><p>Source tasks are released sporadically, i.e., for the DAG G i , the job releases of &#964; i 1 have a minimum separation time, or period, denoted T i . A non-source task &#964; i v (v &gt; 1) releases its j th job &#964; i v,j after the j th jobs of all its predecessors in G i have completed. That is, letting r i v,j and f i v,j denote the release and finish times of &#964; i v,j , respectively, r i v,j = max{f i u,j : &#964; i u is a predecessor of &#964; i v }.</p><p>We define the response time of job &#964; i v,j as f i v,j -r i v,j , and the end-to-end response time of the j th invocation of the DAG G i as f i n i ,j -r i 1,j . Deriving response-time bounds. An end-to-end response-time bound can be computed inductively for a DAG G i by scheduling its nodes in a way that allows them to be viewed as sporadic tasks and by then leveraging response-time bounds applicable to such tasks. When viewing nodes as sporadic tasks, precedence constraints must be respected. This can be ensured by assigning an offset &#934; i v to each task &#964; i v based on the response-time bounds applicable to "upstream" tasks in G i , and by requiring the j th job of &#964; i v to be released exactly &#934; i v time units after the release time of the j th job of the source task &#964; i 1 , i.e., r i v,j = r i 1,j +&#934; i v , where &#934; i 1 = 0. With offsets so defined, every task &#964; i v in G i (not just the source) has a period of T i . Also, letting C i v denote the WCET of &#964; i v , its utilization can be defined as u i v = C i v /T i .</p><p>Ex. 1 (cont'd). Fig. <ref type="figure">2</ref> depicts an example schedule for the DAG G 1 in Fig. <ref type="figure">1</ref>. The first (resp., second) job of each task has a lighter (resp., darker) shading to make them easier to distinguish. Assume that the tasks have deadlines as shown, and response-time bounds of R 1 1 = 9, R 1 2 = 5, R 1 3 = 7, and R 1 4 = 9, respectively. 
Based on these bounds, we define corresponding offsets &#934; 1 1 = 0, &#934; 1 2 = 9, &#934; 1 3 = 9, and &#934; 1 4 = 16, respectively. With these response-time bounds, the end-to-end response-time bound that can be guaranteed is determined by R 1 1 , R 1 3 , and R 1 4 and is given by R 1 = 25. The task response-time bounds used here depend on the scheduler employed. For example, if all tasks are scheduled via the global earliest-deadline-first (G-EDF) scheduler, then per-task response-time bounds can be determined from tardiness analysis for G-EDF <ref type="bibr">[20,</ref><ref type="bibr">24]</ref>. In fact, this statement applies to any G-EDF-like (GEL) scheduler <ref type="bibr">[25]</ref>.<ref type="foot">2</ref> Such schedulers will be our focus. (Under such a scheduler, each job has a priority point within a constant distance of its release; an earliest-priority-point-first order is assumed.) Recall that, according to the DAG-based task model introduced here, successive jobs of the same task might execute in parallel. We see this with jobs &#964; 1 4,1 and &#964; 1 4,2 in the interval [23, 24). Such jobs could even finish out of release-time order due to execution-time variations. &#9830;</p><p>Early releasing. Using offsets may cause non-work-conserving behavior: a given job may be unable to execute even though all of its predecessor jobs have completed. Under any GEL scheduler, work-conserving behavior can be restored in such a case without altering response-time bounds <ref type="bibr">[20,</ref><ref type="bibr">24,</ref><ref type="bibr">25]</ref> via a technique called early releasing <ref type="bibr">[2]</ref>, which allows a job to execute "early," before its "actual" release time.</p><p>Schedulability. For DAGs as considered here, schedulability conditions for ensuring bounded response times hinge on conditions for ensuring bounded tardiness under GEL scheduling. 
Assuming a CPU-only platform with M processors, if intra-task parallelism is forbidden, then the required conditions are u i v &#8804; 1 for each task &#964; i v and U sum &#8804; M <ref type="bibr">[20,</ref><ref type="bibr">25]</ref>. On the other hand, if arbitrary intra-task parallelism is allowed, then only U sum &#8804; M is required and per-task utilizations can exceed 1.0 <ref type="bibr">[24,</ref><ref type="bibr">25]</ref>. These conditions remain unaltered if arbitrary early releasing is allowed.</p><p>Coarse-grained OpenVX graphs. In two prior papers by our group <ref type="bibr">[23,</ref><ref type="bibr">51]</ref>, the techniques described above, but without intra-task parallelism, are proposed for scheduling acyclic <ref type="foot">3</ref> OpenVX graphs using G-EDF, <ref type="foot">4</ref> with graph nodes implicitly defined by high-level OpenVX CV functions. We call OpenVX graphs so scheduled coarse-grained graphs.</p><p>Given the nature of high-level CV functions, the nodes of a coarse-grained graph will typically involve executing both CPU code and GPU code. Executing GPU code can introduce task suspensions, and under G-EDF schedulability analysis, suspensions are typically dealt with using suspension-oblivious analysis <ref type="bibr">[15]</ref>. This entails analytically viewing suspension time as CPU computation time and can result in significant processing-capacity loss.</p></div>
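The offset-based bound derivation reviewed above can be made concrete in code. The following Python sketch is our own illustration (the function name and graph encoding are hypothetical, not from the cited works); it computes offsets and an end-to-end bound from given per-task response-time bounds, reproducing the numbers of Ex. 1.

```python
# Sketch: offsets and an end-to-end response-time bound for a DAG,
# assuming per-task response-time bounds R are already known.
# The encoding (predecessor lists keyed by task index) is our own.

def offsets_and_bound(preds, R, source, sink):
    """Phi[v] = max over predecessors u of (Phi[u] + R[u]); Phi[source] = 0.
    The end-to-end bound is Phi[sink] + R[sink]."""
    phi = {source: 0}
    pending = [t for t in preds if t != source]
    while pending:
        for t in list(pending):
            if all(u in phi for u in preds[t]):
                phi[t] = max(phi[u] + R[u] for u in preds[t])
                pending.remove(t)
    return phi, phi[sink] + R[sink]

# DAG G1 of Ex. 1: task 1 precedes tasks 2 and 3, which both precede task 4.
preds = {1: [], 2: [1], 3: [1], 4: [2, 3]}
R = {1: 9, 2: 5, 3: 7, 4: 9}            # response-time bounds from Ex. 1
phi, bound = offsets_and_bound(preds, R, source=1, sink=4)
# phi == {1: 0, 2: 9, 3: 9, 4: 16} and bound == 25, matching Ex. 1.
```

As in Ex. 1, the bound of 25 is determined by the chain R 1 1 + R 1 3 + R 1 4 = 9 + 7 + 9.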
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Fine-Grained OpenVX Graph Scheduling</head><p>In this section, we propose a fine-grained scheduling approach for OpenVX graphs obtained by applying four techniques. First, to eliminate suspension-based capacity loss, we treat CPU code and GPU code as separate graph nodes. Second, to reduce response-time bounds, we allow intra-task parallelism. Third, to avoid non-work-conserving behavior and enable better observed response times, we allow early releasing. Finally, we use a scheduler (namely, G-FL; see below) that offers advantages over G-EDF. We elaborate on these techniques in turn below after first providing a brief introduction to GPU programming using CUDA.</p><p>CUDA basics. The general structure of a CUDA program is as follows: (i) allocate necessary memory on the GPU; (ii) copy input data from the CPU to the GPU; (iii) execute a GPU program called a kernel<ref type="foot">5</ref> to process the data; (iv) copy the results from the GPU back to the CPU; (v) free unneeded memory. To handle data dependencies, CUDA provides a set of synchronization functions. For example, such a function would be invoked between steps (iii) and (iv). These functions are configured on a per-device basis to wait via spinning or suspending. In this paper, we consider only waiting by suspending because the kernel executions in the workloads of interest are too long for spinning to be viable.</p><p>DAG nodes as CPU or GPU nodes. In our fine-grained scheduling approach, we avoid suspension-related capacity loss due to kernel executions by more finely decomposing an OpenVX graph so that each of its nodes is either a CPU node or a GPU node that executes a kernel. Additionally, we distinguish between regular CPU nodes and the necessary CPU work to launch a GPU kernel and await its results. In this paper, we assume that copy operations are included in CPU nodes. In the workloads of interest to us, copies are short, so any resulting suspension-based capacity loss is minor. 
More lengthy copies could instead be handled as separate nodes, similarly to how we handle kernels.</p><p>In the rest of this section, we use a continuing example to illustrate various nuances of our fine-grained approach.</p><p>Ex. 2. Fig. <ref type="figure">3</ref>(a) depicts a simple coarse-grained graph comprised of two tasks, &#964; 2 x and &#964; 2 y . Fig. <ref type="figure">3</ref>(b) shows a fine-grained representation of this same graph. Task &#964; 2 x is a simple CPU-only CV function and is represented by one fine-grained CPU task, &#964; 2 1 . &#964; 2 y is more complex, and its fine-grained representation consists of six tasks, &#964; 2 2 , . . . , &#964; 2 7 , where &#964; 2 5 is a GPU task, and &#964; 2 4 and &#964; 2 6 are CPU tasks that launch the GPU kernel and await its completion, respectively.<ref type="foot">6</ref> &#9830; (Footnote 5: "Kernel" is unfortunate terminology, not to be confused with an OS kernel. Footnote 6: The synchronization call to await results may be launched before the GPU kernel has completed, but this overhead is extremely short.) (Fig. 3 caption: &#964; 2 x is simple sequential CPU code, so it is represented by one fine-grained task; &#964; 2 y is more complex and consists of both CPU and GPU parts, some of which can execute in parallel.)</p><p>An end-to-end response-time bound for a fine-grained graph can be obtained from per-node bounds as discussed in Sec. 2, with the copy operations in CPU nodes dealt with using suspension-oblivious analysis. For GPU nodes, new analysis is needed, which we provide for NVIDIA GPUs in Sec. 4.<ref type="foot">7</ref> Note that, in work on the prior coarse-grained approach <ref type="bibr">[23,</ref><ref type="bibr">51]</ref>, locking protocols were used to preclude concurrent GPU access, obviating the need for such analysis.</p><p>Ex. 2 (cont'd). Possible schedules for the graphs in Fig. <ref type="figure">3</ref> are depicted in Fig. <ref type="figure">4</ref>. As before, successive jobs of the same task are shaded differently to make them easier to distinguish. Recall from Sec. 
2 that all tasks in a graph share the same period; in these schedules, all periods are 5 time units, shown as the time between successive job release times (up arrows).</p><p>Fig. <ref type="figure">4</ref>(a) depicts the graph's schedule as a single monolithic entity, as implied by the OpenVX standard. OpenVX lacks any notion of real-time deadlines or phases, so these are excluded here, as is a response-time bound. The depicted schedule is a bit optimistic because the competing workload does not prevent the graph from being scheduled continuously. Under monolithic scheduling, the entire graph must complete before a new invocation can begin. As a result, the second invocation does not finish until just before time 28.</p><p>Fig. <ref type="figure">4</ref>(b) depicts coarse-grained scheduling as proposed in prior work <ref type="bibr">[23,</ref><ref type="bibr">51]</ref>, where graph nodes correspond to high-level CV functions, as in Fig. <ref type="figure">3(a)</ref>. Nodes can execute in parallel. For example, &#964; 2 y,1 and &#964; 2 x,2 do so in the interval [5, 6). However, intra-task parallelism is not allowed: &#964; 2 y,2 cannot begin until &#964; 2 y,1 completes, even though its predecessor (&#964; 2 x,2 ) is finished. Note that, under coarse-grained scheduling, GPU execution time is also analytically viewed as CPU execution time using suspension-oblivious analysis. This analytical impact is not represented in the figure.</p><p>Fig. <ref type="figure">4</ref>(c) depicts a fine-grained schedule for the graph in Fig. <ref type="figure">3(b</ref>), but without intra-task parallelism. In comparing insets (b) and (c), the difference is that nodes are now more fine-grained, enabling greater concurrency. As a result, &#964; 2 7,2 completes earlier, at time 25. The detriments of suspension-oblivious analysis for GPU kernels are also now avoided. &#9830;</p><p>Intra-task parallelism. 
Our notion of fine-grained graph scheduling allows intra-task parallelism, i.e., consecutive jobs of the same task may execute in parallel. (Footnote 7: Response times for copies, if handled as separate nodes, are trivial to bound because they are FIFO-scheduled.) Such parallelism can cause successive invocations of the same graph to complete out of order. This can be rectified via buffering.<ref type="foot">8</ref></p><p>Ex. 2 (cont'd). Lower response times are enabled by intra-task parallelism, as depicted in Fig. <ref type="figure">4(d)</ref>. In this schedule, &#964; 2 3,2 and &#964; 2 7,2 are able to execute in parallel with their predecessors &#964; 2 3,1 and &#964; 2 7,1 , respectively, reducing the completion time of &#964; 2 7,2 to time 23. Observe that &#964; 2 7,2 completes before &#964; 2 7,1 , so some output buffering would be needed here. &#9830; Allowing intra-task parallelism has an even greater impact on analytically derived response-time bounds <ref type="bibr">[24]</ref>, and as noted earlier, enables task utilizations to exceed 1.0.</p><p>Early releasing. Although omitted in Fig. <ref type="figure">4</ref> for clarity, early releasing can decrease observed response times without affecting analytical bounds.</p><p>Ex. 2 (cont'd). Allowing &#964; 2 7,1 to be early released once &#964; 2 6,1 completes in Fig. <ref type="figure">4</ref>(d) reduces the overall graph's completion time to just under 23 time units. &#9830;</p><p>G-FL scheduling. The approach in Sec. 2 applies to any GEL scheduler. As shown in <ref type="bibr">[25]</ref>, the global fair-lateness (G-FL) scheduler is the "best" GEL scheduler with respect to tardiness bounds. We therefore perform CPU scheduling using G-FL instead of G-EDF.<ref type="foot">9</ref></p><p>Periods. An additional benefit of fine-grained scheduling is that it allows for shorter periods.</p><p>Ex. 2 (cont'd). The period used in Fig. 
<ref type="figure">4</ref> seems reasonable in insets (c) and (d): notice that each job finishes before or close to its task's next job release. In contrast, in insets (a) and (b), response times could easily be unbounded. &#9830;</p><p>Recently proposed OpenVX extensions. The Khronos Group recently released the OpenVX Graph Pipelining, Streaming, and Batch Processing Extension <ref type="bibr">[43]</ref>, which enables greater parallelism in OpenVX graph executions. However, this extension is not available in any current OpenVX implementation and still lacks concepts necessary for ensuring real-time schedulability. While we have not specifically targeted this extension, an ancillary contribution of our work is to provide these needed concepts. In particular, the parallelism enabled by this extension's pipelining feature is actually subsumed by that allowed in our fine-grained graphs. Furthermore, the batching feature allows a node to process multiple frames instead of just one, potentially increasing computation cost; this could increase the node's utilization, possibly even exceeding 1.0. Introducing intra-task parallelism as we have done enables such nodes to be supported while still ensuring schedulability.</p><p>Rest of the paper. Despite the potential benefits of fine-grained scheduling described above, additional issues remain. First, as noted earlier, response-time bounds for GPU nodes are needed in order to compute end-to-end response-time bounds. We derive such bounds for NVIDIA GPUs in Sec. 4. Second, decomposing a coarse-grained graph node into a set of fine-grained ones can introduce additional overheads; we assess the impact of such overheads as part of our case study in Sec. 5.</p></div>
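To make the node-decomposition step concrete, here is a minimal Python sketch, entirely our own (the dictionary encoding and the helper name decompose are hypothetical, not part of OpenVX or the paper's implementation): a GPU-using coarse-grained node is split into launch, kernel, and await nodes, mirroring how the kernel portion of &#964; 2 y maps to &#964; 2 4 (launch), &#964; 2 5 (kernel), and &#964; 2 6 (await) in Fig. 3(b).

```python
# Sketch: decomposing a coarse-grained node into fine-grained CPU and GPU
# nodes. The dictionary representation is hypothetical, for illustration only.

def decompose(node):
    """Split a node that uses a GPU into launch/kernel/await nodes;
    leave CPU-only nodes unchanged."""
    if not node["uses_gpu"]:
        return [{"name": node["name"], "kind": "cpu"}]
    base = node["name"]
    return [
        {"name": base + "_launch", "kind": "cpu"},   # launches the kernel
        {"name": base + "_kernel", "kind": "gpu"},   # the kernel itself
        {"name": base + "_await",  "kind": "cpu"},   # awaits completion
    ]

fine = decompose({"name": "tau_y", "uses_gpu": True})
# yields three fine-grained nodes: one GPU node bracketed by two CPU nodes
```

Because each resulting node uses only one resource type, the GPU node can be analyzed with the GPU-specific bounds of Sec. 4, while the CPU nodes fall under the analysis of Sec. 2.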
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">GPU Response-Time Bound</head><p>In this section, we derive a response-time bound for tasks executing on NVIDIA GPUs. To facilitate this, we first introduce additional background relating to NVIDIA GPUs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">NVIDIA GPU Details</head><p>The compute units of NVIDIA GPUs are streaming multiprocessors (SMs), typically comprised of 64 or 128 physical GPU cores. The SMs together can be logically viewed as an execution engine (EE). Execution on these GPUs is constrained by the number of available GPU threads, which we call G-threads to distinguish from CPU threads; on current NVIDIA GPUs, there are 2,048 G-threads per SM. CUDA programs submit work to a GPU as kernels. A kernel is run on the GPU as a series of thread blocks. These thread blocks, or simply blocks, are each comprised of a number of G-threads. The number of blocks and G-threads per block (i.e., the block size) are set at runtime when a kernel is launched. The GPU scheduler uses these values to assign work to the GPU's SMs. Blocks are the schedulable entities on the GPU. All G-threads in a block are always scheduled on the same SM, and execute non-preemptively. In prior work, we documented scheduling rules used by NVIDIA GPUs when either all GPU work is submitted from the same address space or NVIDIA's multi-process service (MPS) is used, which we assume <ref type="bibr">[1]</ref>. For simplicity, we restate here only the rules needed for our purposes. These rules govern how kernels are enqueued on and dequeued from a FIFO EE queue, as depicted in Fig. <ref type="figure">5</ref>. CUDA also provides a concept called a CUDA stream that adds an additional layer of queueing in the form of stream queues prior to the EE queue. However, as explained later, we assume streams are used in a way that effectively obviates these additional queues. (Our statement of Rule G2 has been simplified from the original to reflect this assumption.) The following terminology is used in the rules below. A block is assigned when it is scheduled for execution on an SM. A kernel is dispatched when one or more of its blocks are assigned. 
A kernel is fully dispatched when its last block is assigned.</p><p>G2: A kernel is enqueued on the EE queue when launched.</p><p>G3: A kernel at the head of the EE queue is dequeued from that queue once it becomes fully dispatched.</p><p>X1: Only blocks of the kernel at the head of the EE queue are eligible to be assigned.</p><p>R1: A block of the kernel at the head of the EE queue is eligible to be assigned if its resource constraints are met.</p><p>Constrained resources include shared memory, registers, and (of course) G-threads. We assume that an NVIDIA-provided CUDA compiler option limiting register usage is applied to obviate blocking for registers. We consider techniques for handling shared-memory-induced blocking later.</p><p>Ex. 3. In the simple case depicted in Fig. <ref type="figure">5</ref>, the GPU is comprised of two SMs. Two tasks submit one kernel each, and these are immediately enqueued on the EE queue upon launch (Rule G2). Kernel K1 is comprised of two blocks of 1,024 G-threads each; K2 is comprised of six blocks of 512 G-threads each. K1 is fully dispatched, so it has been dequeued from the EE queue (Rule G3). The remaining two blocks of K2 do not fit on either SM, and thus are not yet assigned (Rule R1); K2 is not fully dispatched, so it is still at the head of the EE queue. Any new kernel K3 would be behind K2 in the EE queue, so its blocks would be ineligible to be assigned until K2 is fully dispatched (Rule X1). &#9830;</p></div>
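The rules above are amenable to simple simulation. The following Python sketch is our own toy model of Rules G2, G3, X1, and R1; the block-to-SM placement policy (pick the SM with the most free G-threads) is an assumption made for illustration, since the rules themselves do not dictate placement. Run on the kernels of Ex. 3, it reproduces the state described there.

```python
# Toy model of the EE-queue rules (our own sketch; placement policy assumed).
from collections import deque

def dispatch(ee_queue, free_threads):
    """Assign blocks of the head kernel while resources permit (X1, R1);
    dequeue a kernel once it is fully dispatched (G3)."""
    assigned = []
    while ee_queue:
        kernel = ee_queue[0]
        while kernel["blocks"]:
            size = kernel["block_size"]
            # assumed policy: place on the SM with the most free G-threads
            sm = max(range(len(free_threads)), key=lambda s: free_threads[s])
            if free_threads[sm] >= size:
                free_threads[sm] -= size
                kernel["blocks"].pop()
                assigned.append((kernel["name"], sm))
            else:
                # head kernel is blocked; nothing behind it may run (X1)
                return assigned
        ee_queue.popleft()   # fully dispatched (G3)
    return assigned

# Ex. 3: two SMs of 2,048 G-threads; K1 = 2 blocks x 1,024; K2 = 6 blocks x 512.
ee = deque([
    {"name": "K1", "block_size": 1024, "blocks": [0, 1]},
    {"name": "K2", "block_size": 512, "blocks": [0, 1, 2, 3, 4, 5]},
])
assigned = dispatch(ee, [2048, 2048])
# K1 is dequeued; K2 remains at the head with two unassigned blocks.
```

As in Ex. 3, both blocks of K1 and four blocks of K2 are assigned, after which K2 stalls at the head of the EE queue with two blocks remaining.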
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">System Model</head><p>One of the virtues of the DAG scheduling approach we proposed in Sec. 3 is that concerns pertaining to CPU and GPU work can be considered separately. Fig. <ref type="figure">6</ref> shows the subset of Fig. <ref type="figure">4</ref>(d) involving GPU work; using our approach, GPU kernels are just sporadic tasks that can be analyzed independently from CPU tasks. In deriving a response-time bound, we therefore restrict our attention to a set &#964; of n independent sporadic GPU tasks {&#964; 1 , &#964; 2 , &#8226; &#8226; &#8226; , &#964; n }, which are scheduled via the rules in Sec. 4.1 on a single GPU with multiple SMs. Each task &#964; i has a period T i , defined as before.</p><p>With our focus on GPU-using tasks, additional notation (to be illustrated shortly) is needed to express execution requirements. Each job of task &#964; i consists of B i blocks, each of which is executed in parallel by exactly<ref type="foot">10</ref> H i G-threads. H i is called the block size<ref type="foot">11</ref> of &#964; i , and H max = max i {H i } denotes the maximum block size in the system. We denote by C i the per-block worst-case execution workload of a block of &#964; i , where one unit of execution workload is defined by the work completed by one G-thread in one time unit. In summary, a GPU task &#964; i is specified by the parameters B i , H i , C i , and T i .</p><p>Note that C i corresponds to an amount of execution workload instead of execution time. As each block of task &#964; i requires H i threads concurrently executing in parallel, the worst-case execution time of a block of &#964; i is given by C i /H i .</p><p>Def. 1. (block length) For each task &#964; i , its maximum block length is defined as L i = C i /H i . 
The maximum block length of tasks in &#964; is defined as L max = max &#964;i&#8712;&#964; {L i }.</p><p>The utilization of task &#964; i is given by u i = C i &#8226; B i /T i , and the total system utilization by U sum = &#8721; &#964;i&#8712;&#964; u i .</p><p>Let &#964; i,j denote the j th (j &#8805; 1) job of &#964; i . The release time <ref type="foot">12</ref> of job &#964; i,j is denoted by r i,j , its (absolute) deadline by d i,j = r i,j + T i , and its completion time by f i,j ; its response time is defined as f i,j -r i,j . A task's response time is the maximum response time of any of its jobs.</p><p>SM constraints. We consider a single GPU platform consisting of g identical SMs, each of which consists of m G-threads (for NVIDIA GPUs, m = 2048). A single block of &#964; i must execute on H i G-threads that belong to the same SM. That is, as long as there are fewer than H i available G-threads on each SM, a block of &#964; i,j cannot commence execution even if the total number of available G-threads (from multiple SMs) in the GPU exceeds H i . On the other hand, different blocks may be assigned to different SMs for execution even if these blocks are from the same job.</p><p>Similar to G-thread limitations, there are per-SM and per-block limits on shared-memory usage on NVIDIA GPUs. In experimental work involving CV workloads on NVIDIA GPUs spanning several years, we have never observed blocking due to shared-memory limits on any platform for any workload. Thus, in deriving our response-time bound in Sec. 4.4, we assume that such blocking does not occur. After deriving the bound, we discuss ways in which shared-memory blocking can be addressed if it becomes an issue.</p><p>Ex. 4. Our GPU task model is illustrated in Fig. <ref type="figure">7</ref>. There are two SMs, and B 1 = 2 and B 2 = 6. The height of a rectangle denoting a block is given by H i and its length, which denotes its runtime duration, by L i ; the area is bounded by C i . &#9830;</p><p>Intra-task parallelism. 
We assume that intra-task parallelism is allowed: consecutive jobs of the same task can execute in parallel if both are pending and sufficient G-threads are available. This is often the case in CV processing pipelines, where each video frame is processed independently. Additionally, Thm. 1 below shows that severe schedulability-related consequences exist if intra-task parallelism is forbidden. Practically speaking, intra-task parallelism can be enabled by assuming per-job streams. A stream is a FIFO queue of operations, so two kernels submitted to a single stream cannot execute concurrently. Thus, the alternative of using per-task streams would preclude intra-task parallelism. Note that, with each job issuing one kernel at a time, any actual stream queueing is obviated.</p></div>
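<p xmlns="http://www.tei-c.org/ns/1.0">To make the preceding definitions concrete, the following Python sketch (our own illustration; the task parameters are hypothetical, not taken from the case study) computes block lengths, utilizations, and the unit block size of Def. 2 for a small GPU task set, with each task specified as (C i , B i , H i , T i ):</p>

```python
from dataclasses import dataclass
from functools import reduce
from math import gcd

@dataclass
class GpuTask:
    C: float  # per-block worst-case execution workload (G-thread time units)
    B: int    # blocks per job
    H: int    # block size: G-threads per block
    T: float  # period: minimum job inter-release separation

    @property
    def L(self) -> float:
        # Def. 1 (block length): L_i = C_i / H_i
        return self.C / self.H

    @property
    def u(self) -> float:
        # utilization: u_i = C_i * B_i / T_i
        return self.C * self.B / self.T

g, m = 2, 2048  # SM count (hypothetical) and G-threads per SM (NVIDIA)
tasks = [GpuTask(C=256.0, B=2, H=256, T=33.0),
         GpuTask(C=64.0, B=6, H=64, T=33.0)]

U_sum = sum(t.u for t in tasks)            # total utilization
H_max = max(t.H for t in tasks)            # maximum block size
L_max = max(t.L for t in tasks)            # maximum block length
h = reduce(gcd, [t.H for t in tasks], m)   # Def. 2 (unit block size)
```

<p xmlns="http://www.tei-c.org/ns/1.0">For this toy system, h = 64 and U sum &#8776; 27.2, well under the Sec. 4.3 utilization bound g &#8226; (m -H max + h) = 3712.</p>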
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Total Utilization Constraint</head><p>According to the rules in Sec. 4.1, idle G-threads can exist while the kernel at the head of the EE queue has unassigned blocks. In particular, this can happen when the number of idle threads on any one SM is insufficient for scheduling such a block. Such scenarios imply that some capacity loss is fundamental when seeking to ensure response-time bounds for GPUs. We express such loss by providing a total utilization bound and proving that any system with U sum at most that bound has bounded response times. The utilization bound we present relies on the following definition. Def. 2. (unit block size) The unit block size, denoted by h, is defined as the greatest common divisor (gcd) of all tasks' block sizes and m, i.e., h = gcd({H 1 , . . . , H n } &#8746; {m}). The theorem below shows that capacity loss can be extreme if intra-task parallelism is forbidden.</p><p>Theorem 1. With per-task streams, for any given g, m, H max , and h, there exists a task system &#964; with U sum greater than but arbitrarily close to h such that the response time of some task may increase without bound in the worst case. Proof. For any m and H max , m = Z &#8226; h for some integer Z &#8805; 1 and H max = K &#8226; h for some integer K &#8805; 1, because h is a common divisor of m and H max . Recall that there are g SMs. Consider the following task system, where each task is specified as &#964; i = (C i , B i , H i , T i ): &#964; 1 = (h, 1, h, 1), &#964; 2 = (2&#949;Kh, 1, Kh, 1 + &#949;), and &#964; 3 = (2&#949;h, g &#8226; Z -K, h, 1 + &#949;).</p><p>For this task system, u 1 = h, u 2 = 2H max &#949;/(1 + &#949;), and u 3 = 2(g &#8226; Z -K)h&#949;/(1 + &#949;), so U sum &#8594; h + as &#949; &#8594; 0 + . Now consider the following job execution pattern, which is illustrated in Fig. <ref type="figure">8</ref> for g = 2: &#964; 1 releases its first job at time 0, &#964; 2 and &#964; 3 release their first jobs at time 1 -&#949;, all three tasks continue to release jobs as early as possible, and every block executes for its worst-case execution workload. 
Assume that, whenever &#964; 2 and &#964; 3 simultaneously release a job, the job of &#964; 2 is enqueued on the EE queue first. Note that in Fig. <ref type="figure">8</ref>, block boundaries for &#964; 3 are omitted when possible for clarity.</p><p>At time 0, &#964; 1,1 is the only job in the EE queue and is therefore scheduled. Then, at time 1 -&#949;, &#964; 2,1 and (g &#8226; Z -K -1) blocks of &#964; 3,1 are scheduled for the interval [1 -&#949;, 1 + &#949;). As a result, all available G-threads are then occupied. Therefore, the one remaining block of &#964; 3,1 must wait until time 1, when &#964; 1,1 finishes. Note that, with per-task streams, a job cannot enter the EE queue until the prior job of the same task completes. &#964; 1,2 enters the EE queue at time 1, after &#964; 3,1 , which entered at time 1 -&#949;. Thus, &#964; 1,2 must wait to begin execution until the last block of &#964; 3,1 is assigned and sufficient G-threads become available, at time 1 + &#949;.</p><p>This pattern repeats, with task &#964; 1 releasing a job every time unit but finishing a job every 1 + &#949; time units. Thus, its response time increases without bound.</p><p>For example, on the test platform considered in Sec. 5, g = 80, m = 2048, and h can be as small as 32. Since a utilization bound of h = 32 is tiny compared with the platform's g &#8226; m = 163,840 G-threads, close to 99.98% of the hardware capacity may be wasted!</p><p>In contrast, as we show in Sec. 4.4, if intra-task parallelism is allowed, then any task set with U sum &#8804; g &#8226; (m -H max + h) has bounded response times. Furthermore, the following theorem shows that this utilization bound is tight (under our analysis assumptions).</p><p>Theorem 2. With per-job streams, for any given g, m, H max , and h, there exists a task system &#964; with U sum greater than but arbitrarily close to g &#8226; (m -H max + h) such that the response time of some task may increase without bound in the worst case.</p><p>Proof. 
For any m and H max , integers P and Q exist such that m = P &#8226; H max + Q, where P &#8805; 1 and 0 &#8804; Q &lt; H max . Furthermore, by the definition of h, H max = K &#8226; h for some integer K &#8805; 1, and m = Z &#8226; h for some integer Z &#8805; 1. Thus, m = P &#8226; K &#8226; h + Q. Consider the following task system, where each task is specified as &#964; i = (C i , B i , H i , T i ): &#964; 1 = (&#949;H max , g &#8226; P, H max , 1), &#964; 2 = (&#949;Q, g, Q, 1) (omitted if Q = 0), and &#964; 3 = (h, g &#8226; (Z -K + 1), h, 1). For this task system, U sum = g &#8226; &#949; &#8226; m + g &#8226; (m -H max + h), so U sum &#8594; g &#8226; (m -H max + h) + as &#949; &#8594; 0 + .</p><p>Now consider the following job execution pattern, which is illustrated in Fig. <ref type="figure">9</ref> for g = 2: all three tasks release jobs as soon as possible, i.e., at time instants 0, 1, 2, . . . , the EE enqueue order is always &#964; 1 , &#964; 2 , and then &#964; 3 , and every block executes for its worst-case execution workload.</p><p>At time 0, the g &#8226; P blocks of &#964; 1 are scheduled first, leaving Q available G-threads in each SM. Next, the g blocks of &#964; 2 are scheduled using the remaining Q G-threads on each SM. Thus, all G-threads are fully occupied in the time interval [0, &#949;). As we have often observed in experiments, the g &#8226; (Z -K + 1) blocks of &#964; 3 are distributed to the g SMs evenly, and are scheduled for the time interval [&#949;, 1 + &#949;). Note that, although we currently do not have sufficient evidence to guarantee this even distribution, it at least represents a potential worst case.</p><p>Notice that there are only m -(Z -K + 1) &#8226; h = (H max -h) G-threads available on each of the g SMs during the interval [&#949;, 1 + &#949;). Therefore, none of the blocks of &#964; 1,2 , which has a block size of H max , will be scheduled before time 1 + &#949;. As a result, no blocks of &#964; 2,2 or &#964; 3,2 will be scheduled before time 1 + &#949; either, because they are enqueued after &#964; 1,2 on the FIFO EE queue.</p><p>This pattern repeats, with each of the three tasks releasing a job every time unit but finishing a job every 1 + &#949; time units, so the response time of each task increases without bound.</p></div>
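<p xmlns="http://www.tei-c.org/ns/1.0">The construction in the proof of Thm. 2 can be checked numerically. The sketch below is our own code: the task system follows our reading of the proof, and the platform parameters are hypothetical. It builds the pathological task set for given g, m, H max , and h and confirms that U sum lies just above g &#8226; (m -H max + h):</p>

```python
def theorem2_tasks(g, m, H_max, h, eps):
    """Pathological task system from the proof of Thm. 2 (our reconstruction).

    Tasks are (C, B, H, T) tuples: per-block workload, block count,
    block size, and period. Assumes h divides both H_max and m.
    """
    K, Z = H_max // h, m // h
    P, Q = m // H_max, m % H_max                  # m = P * H_max + Q
    tasks = [(eps * H_max, g * P, H_max, 1.0),    # tau_1: fills P*H_max per SM
             (h * 1.0, g * (Z - K + 1), h, 1.0)]  # tau_3: blocks of length 1
    if Q > 0:
        tasks.append((eps * Q, g, Q, 1.0))        # tau_2: fills the leftover Q
    return tasks

g, m, H_max, h, eps = 2, 2048, 256, 32, 1e-6
tasks = theorem2_tasks(g, m, H_max, h, eps)
U_sum = sum(C * B / T for (C, B, _, T) in tasks)
bound = g * (m - H_max + h)   # utilization bound from Sec. 4.4
# U_sum = g*eps*m + bound: just above the bound, approaching it as eps -> 0
```

<p xmlns="http://www.tei-c.org/ns/1.0">For these parameters the bound equals 3648, and U sum exceeds it by exactly g &#8226; &#949; &#8226; m, matching the limit claimed in the proof.</p>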
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Response-Time Bound</head><p>In this section, we derive a response-time bound assuming per-job streams are used (i.e., intra-task parallelism is allowed) and the following holds: U sum &#8804; g &#8226; (m -H max + h).</p><p>Our derivation is based on the following key definition. Def. 3. (busy and non-busy) A time instant is called busy if and only if at most (H max -h) G-threads are idle in each of the g SMs; a time instant is called non-busy if and only if at least H max G-threads are idle in some of the g SMs. A time interval is busy if and only if every time instant in that interval is busy. By Def. 2, h is the minimum amount by which the number of idle G-threads can change, so "more than (H max -h) G-threads are idle" is equivalent to "at least H max G-threads are idle." Thus, busy and non-busy time instants are well-defined, i.e., a time instant is either busy or non-busy.</p><p>To derive response-time bounds for all tasks in the system, we bound the response time of an arbitrary job &#964; k,j . The following two lemmas bound the unfinished workload at certain time instants. In the first lemma, t 0 denotes the latest non-busy time instant at or before &#964; k,j 's release time r k,j , i.e., t 0 = r k,j or (t 0 , r k,j ] is a busy interval.</p><p>Lemma 1. At time t 0 , the total unfinished workload from jobs released at or before t 0 , denoted by W (t 0 ), satisfies W (t 0 ) &#8804; g &#8226; m &#8226; L max .</p><p>Proof. Suppose there are b blocks, &#946; 1 , &#946; 2 , . . . , &#946; b , that have been released but are unfinished at time t 0 . For each block &#946; i , let H(&#946; i ) denote its block size and let C(&#946; i ) denote its worst-case execution workload. By definition, t 0 is a non-busy time instant, so by Def. 3, at least H max G-threads are idle in some SM at time t 0 . Because this SM has enough available G-threads to schedule any of the b blocks, they all must be scheduled at time t 0 . These facts imply &#8721; b i=1 H(&#946; i ) &#8804; g &#8226; m.</p><p>Therefore, since each block's remaining workload is at most C(&#946; i ) = H(&#946; i ) &#8226; (C(&#946; i )/H(&#946; i )) &#8804; H(&#946; i ) &#8226; L max , we have W (t 0 ) &#8804; L max &#8226; &#8721; b i=1 H(&#946; i ) &#8804; g &#8226; m &#8226; L max .</p><p>The lemma follows.</p><p>Lemma 2. 
At time r k,j , the total unfinished workload from jobs released at or before r k,j , denoted by W (r k,j ), satisfies W (r k,j ) &#8804; g &#8226; m &#8226; L max + &#8721; &#964;i&#8712;&#964; C i &#8226; B i .</p><p>Proof. Let new(t 0 , r k,j ) denote the workload released during the time interval (t 0 , r k,j ], and let done(t 0 , r k,j ) denote the workload completed during the time interval (t 0 , r k,j ]. Then, W (r k,j ) = W (t 0 ) + new(t 0 , r k,j ) -done(t 0 , r k,j ). (3)</p><p>As each task &#964; i releases consecutive jobs with a minimum separation of T i , new(t 0 , r k,j ) can be upper bounded by new(t 0 , r k,j ) &#8804; &#8721; &#964;i&#8712;&#964; &#8968;(r k,j -t 0 )/T i &#8969; &#8226; C i &#8226; B i &#8804; (r k,j -t 0 ) &#8226; U sum + &#8721; &#964;i&#8712;&#964; C i &#8226; B i . (4)</p><p>By Def. 3, (t 0 , r k,j ] being a busy time interval implies that at most (H max -h) G-threads in each of the g SMs are idle at any time instant in this time interval. That is, at least g &#8226; (m -H max + h) G-threads are occupied executing work at any time instant in (t 0 , r k,j ]. Therefore, done(t 0 , r k,j ) &#8805; g &#8226; (m -H max + h) &#8226; (r k,j -t 0 ). (5)</p><p>By ( <ref type="formula">3</ref>), (4), and (5), together with Lemma 1 and U sum &#8804; g &#8226; (m -H max + h), W (r k,j ) &#8804; g &#8226; m &#8226; L max + &#8721; &#964;i&#8712;&#964; C i &#8226; B i .</p><p>The lemma follows.</p><p>The following theorem provides our response-time bound.</p><p>Theorem 3. &#964; k,j finishes the execution of all of its blocks by time r k,j + R k , where R k = (&#8721; &#964;i&#8712;&#964; C i &#8226; B i + g &#8226; m &#8226; L max -C k ) / (g &#8226; (m -H max + h)) + L k . (6)</p><p>Proof. Since the EE queue is FIFO, we omit all jobs released after r k,j in the analysis. Thus, any workload executed at or after r k,j is from W (r k,j ). We also assume each block of &#964; k,j executes for its worst-case workload C k (if any of its blocks executes for less, &#964; k,j 's completion is not delayed). Let &#946; * denote the last-finished block of &#964; k,j . Then, the workload from other blocks or jobs at r k,j is W (r k,j ) -C k . Let t * denote the time instant at which &#946; * starts execution. Then, [r k,j , t * ) is a busy interval (else &#946; * would have executed before time t * ). Let done(r k,j , t * ) denote the workload completed during the time interval [r k,j , t * ). Then, by Def. 3, done(r k,j , t * ) &#8805; g &#8226; (m -H max + h) &#8226; (t * -r k,j ). (7)</p><p>The workload C k from &#946; * executes beyond time t * , so done(r k,j , t * ) &#8804; W (r k,j ) -C k . By ( <ref type="formula">7</ref>), this implies t * &#8804; r k,j + (W (r k,j ) -C k )/(g &#8226; (m -H max + h)). 
From time t * , &#946; * executes continuously for at most L k time units. Thus, &#946; * finishes by time r k,j + (W (r k,j ) -C k )/(g &#8226; (m -H max + h)) + L k . By Lemma 2, the theorem follows.</p><p>Discussion. As noted in Sec. 4.2, the absence of shared-memory-induced blocking is assumed in the above analysis. This limitation could be eased by introducing blocking terms, but we leave this for future work. Alternatively, through offline profiling, one could restrict the per-SM G-thread count from m to some value m&#8242; &lt; m such that, if only m&#8242; G-threads are used per SM, no shared-memory-induced blocking ever occurs. The analysis above could then be applied with m replaced by m&#8242;. While one might expect this analysis to be sustainable in the sense that all m per-SM G-threads could really be used at runtime, we found a counterexample where increasing the per-SM G-thread count from m&#8242; to m causes response times to increase. Thus, the restricted G-thread count of m&#8242; would actually have to be enforced. This could potentially be done by creating a never-ending kernel per SM that monopolizes m -m&#8242; G-threads.</p><p>Early releasing (see Secs. 2 and 3) must be restricted for GPU tasks. Because the FIFO EE queue effectively prioritizes work by actual enqueueing times, uncontrolled early releasing can change priorities. Also, this can lead to a violation of the sporadic task model if consecutive jobs of the same task &#964; i have enqueueing times less than T i time units apart. Thus, the early releasing of GPU tasks must be guarded to ensure that this minimum separation is maintained.</p></div>
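<p xmlns="http://www.tei-c.org/ns/1.0">The bound of Thm. 3 is simple to evaluate. The following Python sketch is our own illustration: it assumes a closed form for R k consistent with Lemmas 1 and 2 (the exact published form of (6) should be checked against the original), and the example task set is hypothetical:</p>

```python
from functools import reduce
from math import gcd

def response_time_bound(tasks, k, g, m):
    """R_k of Thm. 3 for task k (0-indexed) on g SMs of m G-threads each.

    tasks: list of (C, B, H, T) tuples. Assumes the analysis
    precondition U_sum <= g*(m - H_max + h) holds, and assumes the
    closed form (our reconstruction from Lemmas 1 and 2):
      R_k = (sum_i C_i*B_i + g*m*L_max - C_k) / (g*(m - H_max + h)) + L_k
    """
    H_max = max(H for (_, _, H, _) in tasks)
    L_max = max(C / H for (C, _, H, _) in tasks)      # Def. 1
    h = reduce(gcd, [H for (_, _, H, _) in tasks], m) # Def. 2
    C_k, _, H_k, _ = tasks[k]
    L_k = C_k / H_k
    # Lemma 2's bound on the pending workload at the release of tau_{k,j}:
    W_bound = sum(C * B for (C, B, _, _) in tasks) + g * m * L_max
    return (W_bound - C_k) / (g * (m - H_max + h)) + L_k

g, m = 2, 2048
tasks = [(256.0, 2, 256, 33.0), (64.0, 6, 64, 33.0)]
bounds = [response_time_bound(tasks, k, g, m) for k in range(len(tasks))]
```

<p xmlns="http://www.tei-c.org/ns/1.0">For this toy system both bounds are a little over two time units; a full schedulability check would additionally verify U sum &#8804; g &#8226; (m -H max + h).</p>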
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Case Study</head><p>In this section, we detail the case-study experiments we performed, and compare our fine-grained DAG scheduling approach to monolithic and coarse-grained DAG scheduling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Experimental Evaluation</head><p>All of our case-study experiments focused on the Histogram of Oriented Gradients (HOG) algorithm, a well-known CV technique for recognizing pedestrians in input images <ref type="bibr">[19]</ref>. Why HOG? The HOG algorithm is required by the OpenVX 1.2 specification, 14 ensuring its relevance in real-world graph-based CV. HOG is inherently a multi-stage technique: it calculates a directional gradient for each pixel, sorts gradients into histograms, normalizes lighting and contrast, and then performs classification. These computations are performed at multiple image-scale levels, using successively smaller versions of the original image. These steps require both CPU and GPU computations, meaning that HOG fits naturally into a DAG-based model. HOG implementation and DAG variations. At the time of writing, no open-source implementations of version 1.2 of OpenVX exist, so we based our case-study experiments on the HOG implementation available in the open-source CV library OpenCV. 15 OpenCV provides CV functions, but does not structure computations as DAGs. To create DAGs, we split the OpenCV HOG code into separate functions, and designated each function as a DAG node. We compared the response times of successively finer-grained notions of DAG scheduling, corresponding to monolithic, coarse-grained, and fine-grained HOG DAGs. 16  The monolithic version of HOG corresponds to the type of DAG that one might specify using OpenVX, and consists of a single DAG of six types of nodes (three are replicated per scale level), as shown in Fig. <ref type="figure">10(a)</ref>. Implementing this DAG required the fewest changes to OpenCV code, as monolithic execution requires only a single thread to sequentially execute the six nodes' functions. Coarse-grained HOG uses the same DAG as monolithic HOG, but, as discussed in Sec. 
2, each of the six nodes is a schedulable entity, with scheduling via G-EDF with early releasing. We also used G-EDF, but without early releasing, as a monolithic DAG scheduler. 14 <ref type="url">https://www.khronos.org/registry/OpenVX/specs/1.2/OpenVX_Specification_1_2.pdf</ref>, Sec. 3.53.1.</p><p>15 <ref type="url">https://docs.opencv.org/3.4.1/d5/d33/structcv_1_1HOGDescriptor.html</ref>. 16 Our source code is available online at <ref type="url">https://github.com/Yougmark/opencv/tree/rtss18</ref>.</p><p>In fine-grained HOG, several of the coarse-grained nodes are refined, as shown in Fig. <ref type="figure">10(b)</ref>. This DAG reflects our new fine-grained approach, where nodes are treated as tasks and the techniques (early releasing, intra-task parallelism, and G-FL scheduling) in Sec. 3 are applied.</p><p>Fine-grained DAG implementation. Implementing fine-grained HOG introduced a series of challenges. For example, intra-task parallelism required multiple instances of each DAG so that each node can have multiple jobs executing in parallel. Other challenges included handling priority points that varied (for launching GPU kernels and awaiting results), implementing inter-process communication (IPC) between CPU and GPU nodes, enforcing guards on early releasing for GPU nodes, and computing task offsets from response-time bounds in order to run the experiments.</p><p>As in prior work <ref type="bibr">[23]</ref>, we used PGM RT <ref type="bibr">[21]</ref> to handle IPC in the coarse- and fine-grained HOG variants. PGM RT introduces producer/consumer buffers and mechanisms that enable producer nodes to write output data and consumer nodes to suspend until data is available on all inbound edges.</p><p>Test platform. Our evaluation platform was selected to over-approximate current NVIDIA embedded offerings for automotive systems, such as the Drive PX2. This platform features a single NVIDIA Titan V GPU, two eight-core Intel CPUs, and 32 GB of DRAM. 
Each core features a 32-KB L1 data cache, a 32-KB L1 instruction cache, and a 1-MB L2 cache, and all eight cores on a socket share an 11-MB L3 cache. The system was configured to run Ubuntu 16.04 as an OS, using version 2017.1 of the LITMUS RT kernel <ref type="bibr">[39]</ref>, with hardware multi-threading disabled. Overall computational workload. One would expect contention for hardware resources in many real-world use cases, such as an autonomous vehicle that processes data from multiple camera feeds. We approximated a "contentious" workload by executing six HOG processes on our hardware platform, the limit based on our platform's DRAM, CPU, and GPU capacity. This number of HOG processes makes ensuring bounded response times difficult without careful consideration of resource allocation. This scenario also reflects the very real possibility of executing at high platform utilization, as is commonly done in the automotive industry. To ensure consistency with the GPU scheduling rules in Sec. 4.1, we conducted all of our experiments using NVIDIA's MPS.</p><p>Video-frame processing can potentially experience cache-affinity-loss issues under global scheduling. We therefore considered two variants of both G-EDF and G-FL: a truly global variant where any of the six DAGs can be scheduled on any of our platform's 16 CPUs, and a "clustered" variant where the six DAGs are partitioned between the machine's two sockets, with scheduling being "global" only within a socket. We refer to the latter variants as C-EDF and C-FL, respectively, where the "C" prefix stands for "clustered."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Results</head><p>Our experiments were designed to examine analytical response-time bounds and observed response times under the considered scheduling approaches. We also sought to examine the overhead required to support fine-grained scheduling. Analytical bounds. To compute analytical response-time bounds, we first computed CPU WCETs and worst-case GPU workloads via a measurement process. All worst-case values were calculated as the 99th percentile of 30,000 samples obtained with all six DAGs executing together to cause contention. For each GPU task &#964; i , we used NVIDIA's profiling tool nvprof to measure B i and H i , and instrumented the CUDA kernels to measure L i on the GPU using the globaltimer performance-counter register. For HOG, H max = 256 and h = 64. We measured CPU WCETs using Feather-Trace <ref type="bibr">[14]</ref>. For all DAGs, T i = 33 ms.</p><p>We computed fine-grained response-time bounds by using Thm. 3 in Sec. 4.4 for GPU nodes and Thm. 2 from <ref type="bibr">[53]</ref> for CPU nodes and by then applying the techniques in Sec. 3 to obtain an overall bound. We tried computing analytical bounds for the coarse-grained (resp., monolithic) C-EDF and G-EDF variants using prior work <ref type="bibr">[51]</ref> (resp., <ref type="bibr">[20]</ref>), but found these variants to be unschedulable. 17  These results are summarized in the first row of Tbl. 1. Obs. 1. With respect to schedulability, the monolithic and coarse-grained variants could not even come close to supporting all six cameras (i.e., DAGs).</p><p>With respect to schedulability, the monolithic variants could not even support one camera, because the overall execution time of a single monolithic DAG far exceeds its period. The coarse-grained variants were only slightly better, being able to support just one camera (in which case the choice of variant, C-EDF vs. G-EDF, is of little relevance). In this case, adding a second HOG DAG increased GPU response times to the point of causing a CPU utilization-constraint violation. Note that the increase in CPU utilization is due to using suspension-oblivious analysis.</p><p>17 In the original coarse-grained work <ref type="bibr">[23,</ref><ref type="bibr">51]</ref>, locking protocols were used to preclude concurrent GPU accesses. We instead allowed concurrent accesses and used the analysis in Sec. 4, but the coarse-grained variants were still unschedulable.</p><p>Obs. 2. With respect to schedulability, both fine-grained variants were able to support all six cameras.</p><p>The better schedulability of the fine-grained variants largely resulted from scheduling shorter tasks with intra-task parallelism, though switching to a fair-lateness-based scheduler also proved beneficial. In particular, we found that the scheduler change reduced node response times by 0.1-9.9%. While this reduction is modest, it is still useful. Nonetheless, these reductions suggest that most of the schedulability improvements stemmed from increasing parallelism.</p><p>Observed response times. Fig. <ref type="figure">11</ref> plots DAG response-time distributions, which we computed for all tested variants from measurement data. Corresponding worst- and average-case times are also reported in Tbl. 1. Obs. 3. The average (resp., worst-case) observed response time under the fine-grained variants was around 66 ms (resp., 130 ms), which is substantially lower than under the non-fine-grained variants. (For reference, the response time of an alert driver is reported to be around 700 ms <ref type="bibr">[27]</ref>.)</p><p>This observation is supported by Fig. <ref type="figure">11</ref> and Tbl. 1. Note that the difference between clustered and global scheduling was not substantial. This is because the aggregate memory footprint of all frames concurrently being accessed under both variants tended to far exceed the size of the L3 cache. Obs. 4. 
The analytical fine-grained response-time bounds upper-bounded the observed worst-case times.</p><p>This observation is supported by Fig. <ref type="figure">11</ref> and Tbl. 1. While the listed bounds of 477.25 ms and 542.39 ms in Tbl. 1 may seem high, note that they are based on considering worst-case scenarios that may be unlikely to occur in practice (and they are still within the limit mentioned in <ref type="bibr">[27]</ref>). Moreover, the monolithic and coarse-grained variants were unable to guarantee any bounds when scheduling all six DAGs. Obs. 5. Observed response times exhibited lower variation under fine-grained scheduling.</p><p>This observation is supported by Fig. <ref type="figure">11</ref>. The fine-grained variances in this plot are several orders of magnitude smaller than the variances for the other variants. Obs. 6. Early releasing improved observed response times.</p><p>To verify this observation, we conducted additional experiments in which we disabled early releasing for the fine-grained G-FL variant. In these experiments, we found that early releasing reduced observed response times by 49%.</p><p>Overhead of DAG conversion. We estimated the overhead of converting from a coarse-grained DAG to a fine-grained one by comparing the computed WCET of every coarse-grained node with the sum of the computed WCETs of the fine-grained nodes that replace it. The total percentage increase across all nodes was deemed to be overhead. Obs. 7. The additional overhead introduced to support fine-grained scheduling had modest impact.</p><p>From our collected data, the total overhead was 14.15%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Related Work</head><p>This paper and the prior work it builds upon <ref type="bibr">[23,</ref><ref type="bibr">51]</ref> focus specifically on supporting GPU-using real-time CV workloads. The only other work on this topic known to us is a recent paper by Zhou et al. <ref type="bibr">[56]</ref> that proposes a technique based on reordering and batching (see Sec. 3) kernels to speed deep neural networks. However, they provided no schedulability analysis. More broadly, a large body of work of a general nature exists pertaining to GPU-enabled real-time systems. Much of this work focuses on either treating GPUs as non-shared devices to enable highly predictable GPU usage <ref type="bibr">[22,</ref><ref type="bibr">32,</ref><ref type="bibr">33,</ref><ref type="bibr">46,</ref><ref type="bibr">47,</ref><ref type="bibr">48,</ref><ref type="bibr">50]</ref> or seeking to improve schedulability by simulating preemptive execution <ref type="bibr">[7,</ref><ref type="bibr">32,</ref><ref type="bibr">34,</ref><ref type="bibr">57]</ref>. Other work has focused on timing analysis for GPU workloads <ref type="bibr">[8,</ref><ref type="bibr">9,</ref><ref type="bibr">10,</ref><ref type="bibr">11,</ref><ref type="bibr">12]</ref>, techniques for remedying performance bottlenecks <ref type="bibr">[28]</ref>, direct I/O communication <ref type="bibr">[3]</ref>, energy management <ref type="bibr">[45]</ref>, and techniques for managing or evaluating GPU hardware resources, notably cache and DRAM <ref type="bibr">[16,</ref><ref type="bibr">17,</ref><ref type="bibr">26,</ref><ref type="bibr">29,</ref><ref type="bibr">35,</ref><ref type="bibr">40,</ref><ref type="bibr">49]</ref>.</p><p>The scheduling rules discussed in Sec. 4.1 resulted from an effort by our group to develop a model of GPU execution, particularly for NVIDIA GPUs. 
This effort has delved into a number of aspects of NVIDIA GPUs marketed for embedded systems <ref type="bibr">[1,</ref><ref type="bibr">41,</ref><ref type="bibr">54]</ref>. Much of this work is rooted in the observation that GPU sharing will become essential for effectively utilizing less-capable embedded GPUs. GPU sharing has also been explored by others in the context of throughput-oriented systems <ref type="bibr">[55]</ref>.</p><p>There has been much prior work on scheduling real-time DAG-based multiprocessor task systems; representative papers include <ref type="bibr">[4,</ref><ref type="bibr">5,</ref><ref type="bibr">6,</ref><ref type="bibr">13,</ref><ref type="bibr">30,</ref><ref type="bibr">31,</ref><ref type="bibr">36,</ref><ref type="bibr">37,</ref><ref type="bibr">38,</ref><ref type="bibr">44]</ref>. However, this work is largely directed at verifying hard-real-time schedulability instead of merely deriving response-time bounds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusions</head><p>In this paper, we proposed a fine-grained approach for decomposing and scheduling acyclic OpenVX graphs. We also explained how to leverage prior work to compute end-toend response-time bounds for these graphs. For GPU-using workloads, end-to-end bounds require response-time bounds for GPU tasks. We presented the first ever such bounds for NVIDIA GPUs, and showed that these bounds are tight under certain assumptions. To illustrate the efficacy of our proposed fine-grained approach, we presented an experimental case study. We saw in this study that our fine-grained approach enabled response-time bounds to be guaranteed and observed response times to be reduced. A notable aspect of our fine-grained approach is its crucial reliance on allowing intra-task parallelism, a feature forbidden in most conventional real-time task models. This paper opens up many avenues for future work. First, methods for dealing with cycles in OpenVX graphs explored previously <ref type="bibr">[23,</ref><ref type="bibr">51]</ref> need to be incorporated into our finegrained approach. Second, although shared-memory-induced GPU blocking is exceedingly rare in our experience, our GPU response-time analysis needs to be extended to fully deal with its effects. Third, tools that automate the resourceallocation options considered in our case study would be useful. Fourth, it would be desirable to augment our case study with a schedulability study that examines general trends. Finally, while we have made a case herein for introducing real-time concepts and fine-grained scheduling into OpenVX, an actual OpenVX implementation that incorporates these elements has yet to be produced.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>As discussed in Sec. 
3, a recently proposed extension<ref type="bibr">[18]</ref> enables more parallelism, but this extension is directed at throughput, not real-time predictability, and is not available in any current OpenVX implementation.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>As described in these papers, cycles can be dealt with by relaxing graph constraints or by combining certain nodes into "super-nodes." Adapting these techniques to our context is beyond the scope of this paper.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>While G-EDF was the focus of<ref type="bibr">[51]</ref>, in experiments presented in<ref type="bibr">[23]</ref>, real-time work was limited to execute on one socket of a multi-socket machine and thus was only globally scheduled within a socket.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_3"><p>The buffer size can be determined based on the calculated responsetime bounds.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_4"><p>As explained in Sec. 5, we actually consider two variants of G-FL, a "clustered" variant in which G-FL is applied on a per-socket basis on our test platform, and a fully global variant that is applied across all sockets. due to data sharing. We examine this issue via a case study in Sec. 5. In this study, we also compare both analytical response-time bounds and observed response times under coarse-vs. fine-grained scheduling.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_5"><p>Blocks are executed in units of 32 G-threads called warps. Warp schedulers switch between warps to hide memory latency. This can create interference effects that must be factored into the timing analysis applied to blocks, which we assume is done in a measurement-based way. With warp-related interference incorporated into timing analysis, the G-threads in a block can be treated as executing simultaneously.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_6"><p>Current NVIDIA GPUs require block sizes to be multiples of 32, and the CUDA runtime rounds up accordingly. Additionally, the maximum possible block size is 1,024. A task's block size is determined offline.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_7"><p>For the time being, we assume that jobs of GPU tasks are not early released, but we will revisit this issue at the end of Sec. 4.4.</p></note>
		</body>
		</text>
</TEI>
