<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Modeling and analyzing evaluation cost of CUDA kernels</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/04/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10253158</idno>
					<idno type="doi">10.1145/3434306</idno>
					<title level='j'>Proceedings of the ACM on Programming Languages</title>
<idno>2475-1421</idno>
<biblScope unit="volume">5</biblScope>
<biblScope unit="issue">POPL</biblScope>					

					<author>Stefan K. Muller</author><author>Jan Hoffmann</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[General-purpose programming on GPUs (GPGPU) is becoming increasingly in vogue as applications such as machine learning and scientific computing demand high throughput in vector-parallel applications. NVIDIA's CUDA toolkit seeks to make GPGPU programming accessible by allowing programmers to write GPU functions, called kernels, in a small extension of C/C++. However, due to CUDA's complex execution model, the performance characteristics of CUDA kernels are difficult to predict, especially for novice programmers.            This paper introduces a novel quantitative program logic for CUDA kernels, which allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of common but CUDA-specific performance bottlenecks. The logic is proved sound with respect to a novel operational cost semantics for CUDA kernels. The semantics, logic and soundness proofs are formalized in Coq. An inference algorithm based on LP solving automatically synthesizes symbolic resource bounds by generating derivations in the logic. This algorithm is the basis of RaCuda, an end-to-end resource-analysis tool for kernels, which has been implemented using an existing resource-analysis tool for imperative programs. An experimental evaluation on a suite of CUDA benchmarks shows that the analysis is effective in aiding the detection of performance bugs in CUDA kernels.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Nevertheless, writing a kernel that executes efficiently on a GPU is not as simple as writing a C function: small changes to a kernel, which might be inconsequential for sequential CPU code, can have a drastic impact on its performance. The CUDA C Programming Guide [NVIDIA Corporation 2019] lists three particularly pervasive performance bottlenecks to avoid: divergent warps, uncoalesced memory accesses, and shared memory bank conflicts. Divergent warps result from CUDA's execution model: a group of threads (often 32 threads, referred to as a warp) executes the same instruction on possibly different data. C functions, however, can perform arbitrary branching that can cause different threads of a warp to diverge, i.e., take different branches. CUDA is able to compile such code and execute it on a GPU, but at a fairly steep performance cost, as the two branches must be executed sequentially. Even if a conditional only has one branch, there is nontrivial overhead associated with divergence <ref type="bibr">[Bialas and Strzelecki 2016]</ref>. The other two bottlenecks have to do with the CUDA memory model and will be discussed in detail in Section 2.</p><p>A number of static <ref type="bibr">[Alur et al. 2017;</ref><ref type="bibr">Li and Gopalakrishnan 2010;</ref><ref type="bibr">Li et al. 2012;</ref><ref type="bibr">Singhania 2018</ref>] and dynamic <ref type="bibr">[Boyer et al. 2008;</ref><ref type="bibr">Wu et al. 2019</ref>] tools, including several profiling tools distributed with CUDA, aim to help programmers identify performance bottlenecks such as the three mentioned above. However, all of these tools merely point out potential performance bugs, and occasionally estimate the frequency at which such a bug might occur. Such an analysis cannot guarantee the absence of bugs and gives only a partial picture of the performance impacts. 
For example, it is not sufficient to simply profile the number of diverging conditionals because it can be an optimization to factor out equivalent code in the two branches of a diverging conditional, resulting in two diverging conditionals but less overall sequentialization <ref type="bibr">[Han and Abdelrahman 2011]</ref>. In addition, there are strong reasons to prefer sound static analyses over dynamic analyses or profiling, particularly in applications (e.g., real-time machine learning systems such as those deployed in autonomous vehicles) where input instances that cause the system's performance to degrade outside expected bounds can be dangerous. Such instances are not merely hypothetical: recent work <ref type="bibr">[Shumailov et al. 2020]</ref> has developed ways of crafting so-called sponge examples that can exploit hidden performance bugs in neural networks to heavily increase energy and time consumption. For such systems, a static worst-case bound on resource usage could ensure safety.</p><p>In this paper, we present an automated amortized resource analysis (AARA) <ref type="bibr">[Hoffmann et al. 2011;</ref><ref type="bibr">Hofmann and Jost 2003</ref>] that statically analyzes the resource usage of CUDA kernels and derives worst-case bounds that are polynomials in the integer inputs of a kernel. Our analysis is parametric over a resource metric that specifies the abstract cost of certain operations or events. In general, a resource metric assigns a non-negative rational number to an operation that is an upper bound on the actual cost of that operation regardless of context. The assigned cost can depend on runtime parameters, which have to be approximated in a static resource analysis. For example, one metric we consider, "sectors", estimates the number of reads and writes of memory. 
In CUDA, fixed-size blocks of memory read and written by a single warp can be handled as one hardware operation, so the number of such operations depends on the footprint of memory locations accessed by a warp (which we estimate using a specialized abstract interpretation) and the amount of memory accessed in a single operation (which is a hardware-specific parameter).</p><p>The main challenge of reasoning statically about CUDA programs is that reasoning about the potential values of variables is central to most static analysis techniques; in CUDA, every program variable has potentially thousands of copies, one for each thread. Reasoning about the contents of variables then requires (1) reasoning independently about each thread, which is difficult to scale, or (2) reasoning statically about which threads are active at each point in a program. Some existing program logics for CUDA (e.g., <ref type="bibr">[Kojima and Igarashi 2017]</ref>) take the latter approach, but these are difficult to prove sound and not very amenable to automated inference. We take a different approach and develop a novel program logic for CUDA that is sound and designed with automated inference in mind. In addition, the logic is quantitative, allowing us to use it to reason about both functional and resource-usage properties of CUDA programs simultaneously.</p><p>We formalize our program logic in a core calculus miniCUDA that models a subset of CUDA sufficient to expose the three performance bugs listed above. The calculus is equipped with a novel cost semantics that formalizes the execution cost of a kernel under a given resource metric. A soundness theorem then shows that bounds derived with the program logic are sound with respect to this cost semantics. 
All of these results are formalized in the Coq Proof Assistant.</p><p>To automate the reasoning in our program logic, we have implemented the resource analysis tool RaCuda (Resource-aware CUDA). If provided with the C implementation of a CUDA kernel and a resource metric, RaCuda automatically derives a symbolic upper bound on the execution cost of the kernel as specified by the metric. We have implemented RaCuda on top of Absynth <ref type="bibr">[Carbonneaux et al. 2017;</ref><ref type="bibr">Ngo et al. 2018</ref>], a resource analysis tool for imperative programs.</p><p>Using the aforementioned metrics, we evaluated RaCuda for precision and performance on a number of CUDA kernels derived from various sources including prior work and sample code distributed with CUDA. The evaluation shows our tool to be useful in identifying the presence and quantifying the impact of performance bottlenecks in CUDA kernels, and suggests that RaCuda is a promising tool for novice and intermediate CUDA programmers to debug the performance of kernels.</p><p>The features of the analysis described so far are sufficient to analyze properties of a kernel such as numbers of divergent warps or memory accesses. Analyzing the execution time of a kernel requires more care because execution time does not compose in a straightforward way: threads are scheduled onto processors by a (deliberately) underspecified scheduling algorithm, which exploits thread-level parallelism of kernels to hide the latency of operations such as memory accesses. We do not aim to develop a precise analysis of execution time in this work (even for CPUs, where such worst-case execution time analyses are well studied, this is a separate and rich area of research). 
However, we do take steps toward such an analysis by showing how our analysis can compute the work and span of kernels, two metrics derived from the literature on parallel scheduling theory (e.g., <ref type="bibr">[Blelloch and Greiner 1995, 1996;</ref><ref type="bibr">Brent 1974;</ref><ref type="bibr">Eager et al. 1989;</ref><ref type="bibr">Muller and Acar 2016]</ref>) that can be used to abstractly approximate the running time of parallel algorithms.</p><p>The contributions of this paper include:</p><p>&#8226; Two sets of operational cost semantics for a core calculus for CUDA kernels, one for the lock-step evaluation of individual warps and one for the parallel evaluation of many warps. Both formalize the execution of kernels on a GPU under a given resource metric. &#8226; A novel Hoare-style logic for qualitative and quantitative properties of miniCUDA. &#8226; A Coq formalization of the cost semantics and soundness proofs of the program logic. &#8226; An analysis tool RaCuda that can parse kernels written in a sizable subset of CUDA C and analyze them with respect to resource metrics such as number of bank conflicts and number of divergent warps. &#8226; An empirical evaluation of our analysis tools on a suite of CUDA kernels.</p><p>The remainder of this paper is organized as follows. We begin with an introduction to the features of CUDA that will be relevant to this paper (Section 2). In Section 3, we introduce the miniCUDA calculus and the lock-step cost semantics. We use the latter to prove the soundness of the resource inference in Section 4. In Section 5, we present the parallel cost semantics, which models the work and span of executing a kernel in parallel on a GPU. We also show that it is approximated by the lock-step semantics and therefore by the resource analysis. Next, we describe in more detail our implementation of the analysis (Section 6) and evaluate it (Section 7). We conclude with a discussion of related work (Section 8). 
Fig. <ref type="figure">1</ref>. Four implementations addSub 0 , . . . , addSub 3 of a CUDA kernel that alternately adds to and subtracts from rows of a matrix.</p></div>
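The work and span metrics mentioned above can be illustrated with a small series-parallel cost-graph model. The sketch below is illustrative only: the names leaf, seq, and par, and the graph representation, are our own and are not part of miniCUDA or RaCuda.

```python
# Illustrative model of work and span over series-parallel cost graphs.
# "leaf", "seq", and "par" are hypothetical constructors, not RaCuda names.

def leaf(cost):
    return ("leaf", cost)

def seq(*children):
    return ("seq", list(children))

def par(*children):
    return ("par", list(children))

def work(g):
    """Total number of operations: the sum over all nodes."""
    tag, v = g
    if tag == "leaf":
        return v
    return sum(work(c) for c in v)

def span(g):
    """Critical-path length: sequencing adds, parallel composition takes the max."""
    tag, v = g
    if tag == "leaf":
        return v
    if tag == "seq":
        return sum(span(c) for c in v)
    return max(span(c) for c in v)

# A warp-like computation: 4 threads each doing 10 steps in parallel,
# followed by 2 sequential steps.
g = seq(par(*[leaf(10) for _ in range(4)]), leaf(2))
```

Under a greedy scheduler with p processors, Brent-style bounds limit running time to roughly work/p plus span, which is how these two metrics abstractly approximate execution time without fixing a scheduling algorithm.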
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">A BRIEF INTRODUCTION TO CUDA</head><p>In this section, we introduce some basic concepts of CUDA using a simple running example. We focus on the features of CUDA necessary to explain the performance bottlenecks targeted by our analysis. It should suffice to allow a reader unfamiliar with CUDA to follow the remainder of the paper and is by no means intended as a thorough guide to CUDA.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Kernels and Threads.</head><p>A kernel is invoked on the GPU by calling it much like a regular function with additional arguments specifying the number and layout of threads on which it should run. The number of threads running a kernel is often quite large and CUDA organizes them into a hierarchy. Threads are grouped into blocks and blocks form a grid.</p><p>Threads within a block and blocks within a grid may be organized in one, two or three dimensions, which are specified when the kernel is invoked. A thread running CUDA code may access the x, y and z coordinates of its thread index using the designated identifiers threadIdx.x, threadIdx.y and threadIdx.z. CUDA also defines the identifiers blockDim.(x|y|z), blockIdx.(x|y|z) and gridDim.(x|y|z) for accessing the dimensions of a block, the index of the current thread's block, and the dimensions of the grid, respectively. Most of the examples in this paper assume that blocks and the grid are one-dimensional (i.e., the y and z dimensions are 1), unless otherwise specified.</p><p>SIMT Execution. GPUs are designed to execute the same arithmetic or logical instruction on many threads at once. This is referred to as SIMT (Single Instruction, Multiple Thread) execution. To reflect this, CUDA threads are organized into groups called warps. The number of threads in a warp is defined by the identifier warpSize, but is generally set to 32. All threads in a warp must execute the same instruction (although some threads may be inactive).</p><p>SIMT execution leads to a potential performance bottleneck in CUDA code. If a branching operation such as a conditional is executed and two threads within a warp take different execution paths, the GPU must serialize the execution of that warp. It first deactivates the threads that took one execution path and executes the other, and then switches to executing the threads that took the second execution path. 
This is referred to as a divergence or divergent warp and can greatly reduce the parallelism of a CUDA kernel.</p><p>The functions addSub k in Figure <ref type="figure">1</ref> implement four versions of a kernel that adds the w-length array A pointwise to even rows of the w &#215; h matrix B and subtracts it from odd rows. The annotation __global__ is a CUDA extension indicating that addSub k is a kernel. To simplify some versions of the function, we assume that h is even. Otherwise, the code is similar to standard C code.</p><p>Consider first the function addSub 0 . The for loop iterates over the columns of the matrix, and each row is processed in parallel by separate threads. There is no need to iterate over the rows, because the main program instantiates the kernel for each row.</p><p>The implementation of addSub 0 contains a number of performance bugs. First, the conditional diverges at every iteration of the loop. The reason is that every warp contains thread identifiers (threadIdx) that result in both odd and even values for the variable j. A straightforward way to fix this bug is to remove the conditional, unroll the loop and perform both the addition of an even row and the subtraction of an odd row in one loop iteration. The resulting code, shown in addSub 1 , does not have more parallelism than the original (the addition and subtraction are still performed sequentially) but will perform better because it greatly reduces the overhead of branching.</p><p>Memory Accesses. The next performance bottleneck we discuss relates to the way CUDA handles global memory accesses. CUDA warps can access up to 128 consecutive bytes of such memory at once. When threads in a warp access memory, such as the accesses to arrays A and B in the example, CUDA attempts to coalesce these accesses together into as few separate accesses as possible. 
If a warp accesses four consecutive 32-bit elements of an array, the memory throughput of that instruction is four times higher than if it performs four non-consecutive reads.</p><p>The execution of the function addSub 1 is unable to coalesce accesses to B because, assuming w is larger than 32 and the arrays are stored in row-major order, no two threads within a warp access memory within 128 bytes of each other. This is fixed by instead iterating over the rows of the matrix and handling the columns in parallel. This way, all of the memory accesses by a warp are consecutive (e.g., threads 0 through 31 might access 32 consecutive elements of a row of B). The updated code is shown in the function body addSub 2 .</p><p>Shared Memory. In all of the kernels discussed so far, the arrays A and B reside in global memory, which is stored on the GPU and visible to all threads. CUDA also provides a separate shared memory space, which is shared only by threads within a block. Shared memory has a lower latency than global memory, so we can, for example, use it to store the values of A rather than access global memory every time. In the function addSub 3 , we declare a shared array As and copy values of A into As before their first use. Some care must be taken to ensure that the code of addSub 3 is performant because of how shared memory is accessed. Shared memory consists of a number, generally 32, of separate banks. Separate banks may be accessed concurrently, but multiple concurrent accesses to separate addresses in the same bank are serialized. It is thus important to avoid "bank conflicts". Most GPUs ensure that 32 consecutive 32-bit memory reads will not result in any bank conflicts. However, if a block accesses a shared array at a stride other than 1, bank conflicts can accumulate.</p></div>
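The coalescing and bank-conflict behavior described in this section can be approximated with a small model, assuming 128-byte transactions, 4-byte array elements, and 32 banks as stated in the text. The function names mem_reads and bank_conflicts are illustrative stand-ins for the MemReads and Conflicts parameters used by the analysis.

```python
# Sketch of coalescing and bank-conflict counting for one warp, under the
# assumptions in the text: 128-byte transactions, 4-byte elements (so 32
# elements per transaction), and 32 shared memory banks. Illustrative only.

WARP_SIZE = 32
ELEMS_PER_TRANSACTION = 128 // 4   # 32 four-byte elements per 128-byte read
NUM_BANKS = 32

def mem_reads(indices):
    """Number of global-memory transactions: distinct 32-element sectors touched."""
    return len({i // ELEMS_PER_TRANSACTION for i in indices})

def bank_conflicts(indices):
    """Worst-case serialization: max number of distinct addresses in one bank."""
    per_bank = {}
    for i in set(indices):
        per_bank.setdefault(i % NUM_BANKS, set()).add(i)
    return max(len(addrs) for addrs in per_bank.values())

# addSub1-style column iteration: thread t touches B[t*w + j] for a fixed j,
# so accesses are w elements apart. addSub2-style row iteration: thread t
# touches consecutive elements.
w = 1024
strided = [t * w for t in range(WARP_SIZE)]
coalesced = list(range(WARP_SIZE))
```

With the strided pattern, each thread's element lands in its own 128-byte sector and (since w is a multiple of 32) in the same bank, so a warp pays 32 transactions and a 32-way conflict; with stride 1 it pays one transaction and no conflicts.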
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">THE MINICUDA CORE CALCULUS</head><p>In this section, we present a core calculus, called miniCUDA, that captures the features of CUDA that are of primary interest in this paper: control flow (to allow for loops and to study the cost of divergent warps) and memory accesses. We will use this calculus to present the theory of our resource analysis for CUDA kernels.</p><p>In designing the miniCUDA calculus, we have made a number of simplifying assumptions that make the presentation cleaner and more straightforward. One notable simplification is that a miniCUDA program consists of a single kernel, without its function header. In addition, we collapse the structure of thread and block indices into a single thread ID, denoted tid. This loses no generality as a three-dimensional thread index can be converted in a straightforward way to a one-dimensional thread ID, given the block dimension. The resource analysis will be parametric over the block index and other parameters, and will estimate the maximum resource usage of any warp in any block.</p><p>Syntax. The syntax of the miniCUDA calculus is presented in Figure <ref type="figure">2</ref>. Two types of data are particularly important to the evaluation and cost analysis of the calculus: integers are used for both array indices (which determine the costs of memory accesses) and loop bounds (which are crucial for estimating the cost of loops), and booleans are used in conditionals. All other base types (e.g., float, string) are represented by an abstract base type B. We also include arrays of any type.</p><p>The terms of the calculus are divided into statements, which may affect control flow or the state of memory, and expressions, which do not have effects. 
We further distinguish operands, which consist of thread-local variables, parameters to the kernel (these include arguments passed to the kernel function as well as CUDA parameters such as the block index and warp size), constants (of type int, bool and B) and a designated variable tid. Additional expressions include o 1 op o 2 , which stands for an arbitrary binary operation, and array accesses A[o]. The metavariable A stands for a generic array. When relevant, we use metavariables that indicate whether the array is stored in (G)lobal or (S)hared memory. Note that subexpressions of expressions are limited to operands; more complex expressions must be broken down into binary ones by binding intermediate results to variables. This restriction simplifies reasoning about expressions without limiting expressivity.</p><p>Statements include two types of assignment: assignment to a local variable and to an array element. Statements also include conditionals and while loops. The keyword skip represents the "empty" statement. Statements may be sequenced with semicolons, e.g., s 1 ; s 2 .</p><p>Our later results require certain "sanity checks" on code, namely that array indices be integers. We enforce these with a type system for miniCUDA, which is straightforward and therefore omitted for brevity. We write &#931; &#8866; e : &#964; to indicate that e has type &#964; under a signature &#931; that gives the types of local variables, parameters, operators, functions and arrays. Statements do not have return values, but the judgment &#931; &#8866; s is used to indicate that s is well-formed in that all of its subexpressions have the expected type.</p><p>Costs and Resource Metrics. In the following, we present an operational cost semantics and then a quantitative program logic for miniCUDA kernels. Both the operational semantics and the logic are parametric over a resource metric, which specifies the exact resource being considered. 
A resource metric M is a function whose domain is a set of resource constants that specify particular operations performed by a CUDA program. The resource metric maps these constants to rational numbers, possibly taking an additional argument depending on the constant supplied. A resource metric applied to a constant rc is written M rc , and its application to an additional argument n, if required, is written M rc (n). The only resource constant that does not correspond to a syntactic operation is M div , which is the cost overhead of a divergent warp. The resource constants for miniCUDA, their types and the meanings of their additional argument (if any) are defined in Table <ref type="table">1</ref>.</p><p>The cost of accessing an array depends upon a parameter specifying the number of separate accesses required (for global memory) or the maximum number of threads attempting to access a single shared memory bank (for shared memory). These values are supplied by two additional parameters to the operational semantics and the resource analysis. Given a set of array indices R, the function MemReads(R) returns the number of separate reads (or writes) required to access all of the indices, and the function Conflicts(R) returns the maximum number of indices that map to the same shared memory bank. These parameters are separated from the resource metric because they do not depend on the resource, but on the details of the hardware (e.g., the size of reads and the number of shared memory banks). We discuss these functions more concretely in the next subsection. Resource metrics applied to the appropriate constants simply take the output of these functions and return the cost (in whatever resource) of performing that many memory accesses. We require only that this cost be monotonic, i.e. 
that if i &#8804; j, then M gread (i) &#8804; M gread (j), and similarly for M sread , M gwrite and M swrite .</p><p>The table also lists the concrete cost values for four resource metrics we consider in our evaluation:</p><p>&#8226; conflicts: Counts the cost of bank conflicts.</p><p>&#8226; sectors: Counts the total number of reads and writes to memory, including multiple requests needed to serve uncoalesced requests<ref type="foot">foot_3</ref>.</p><p>&#8226; divwarps: Counts the number of times a warp diverges.</p><p>&#8226; steps: Counts the total number of evaluation steps.</p><p>Lock-step Operational Semantics. We now define an operational semantics for evaluating miniCUDA kernels that also tracks the cost of evaluation given a resource metric. We use this semantic model as the basis for proving the soundness of our resource analysis in the next section. The operational semantics evaluates an expression or statement over an entire warp at a time to produce a result and the cost of evaluation. Because it only evaluates one warp and threads of a warp execute in lock-step, we refer to this as the "lock-step" semantics to differentiate it from the "parallel" semantics we will introduce in Section 5. Recall, however, that warps can diverge at conditionals, resulting in only some subset of the threads of a warp being active in each branch. We therefore track the set T of currently active threads in the warp as a parameter to the judgments.</p><p>Finally, the operational semantics also requires a store &#963; , representing the values stored in memory. Expressions (and operands) differ from statements in that expressions do not alter memory but simply evaluate to a value. Because every thread might compute on different values, the result is not a single value but rather a family of values, one per active thread, which we represent as a family R indexed by T . We write the result computed by thread t as R(t). 
In contrast, statements do not return values but simply change the state of memory; statement evaluation therefore produces a new store. These distinctions are evident in the two semantic judgments:</p><p>&#963; ; e &#8595; T M R; C indicates that, under store &#963; , the expression e evaluates on threads T to R with cost C. Statements are evaluated with the judgment &#963; ; s &#8659; T M &#963; &#8242; ; C. Selected evaluation rules for both judgments are presented in Figure <ref type="figure">3</ref>. Some rules are omitted for space reasons. We now discuss some representation details and selected rules in more detail. As suggested above, the elements of T are abstract thread identifiers t. The function Tid (t) converts such an identifier to an integer thread ID. The domain of a store &#963; is (Arrays &#215; Z) &#8746; (LocalVars &#215; Threads). For an array A (regardless of whether it is stored in global or shared memory), &#963; (A, n) returns the n-th element of A. Note that, for simplicity of presentation, we assume that out-of-bounds indices (including negative indices) map to some default value. For a local variable x, &#963; (x, t) returns the value of x for thread t.</p><p>The evaluation of expressions is relatively straightforward: we simply evaluate subexpressions in parallel and combine appropriately, but with careful accounting of the costs. The cost of execution is generally obtained by summing the costs of evaluating subexpressions with the cost of the head operation given by the resource metric M. For example, Rule EC:Op evaluates the two operands and combines the results using the concrete binary operation represented by op. The cost is the cost of the two subexpressions, C 1 and C 2 , plus M op . Array accesses evaluate the operand to a set of indices and read the value from memory at each index. The cost of these operations depends on MemReads(R) or Conflicts(R), where R is the set of indices. 
Recall that these functions give the number of global memory reads necessary to access the memory locations specified by R, and the number of bank conflicts resulting from simultaneously accessing the memory locations specified by R, respectively. As discussed before, we leave these functions as parameters because their exact definitions can change across versions of CUDA and hardware implementations. As examples of these functions, we give definitions consistent with common specifications in modern CUDA implementations <ref type="bibr">[NVIDIA Corporation 2019]</ref>: MemReads(R) = |{ &#8970;i/32&#8971; | i &#8712; R }| and Conflicts(R) = max b &#8712; {0, . . . , 31} |{ i &#8712; R | i mod 32 = b }|.</p><p>Above, we assume that global reads are 128 bytes in size and array elements are 4 bytes. In reality, and in our implementation, MemReads(R) depends on the type of the array.</p><p>Statement evaluation is slightly more complex, as statements can update the state of memory and also impact control flow: the former is represented by updating the store &#963; and the latter is represented by changing the thread set T when evaluating subexpressions. For assignment statements, the new state comes from updating the state with the new assignment. We write &#963; [(x, t) &#8594; (v) t | t &#8712; T ] to indicate the state &#963; updated so that for all t &#8712; T , the binding (x, t) now maps to (v) t . Array updates are written similarly.</p><p>Conditionals and while loops each have three rules. If a conditional evaluates to True for all threads (SC:IfT), we simply evaluate the "if" branch with the full set of threads T , and similarly if all threads evaluate to False. If, however, there are non-empty sets of threads where the conditional evaluates to True and False (T T and T F , respectively), we must evaluate both branches. We evaluate the "if" branch with T T and the "else" branch with T F . Note that the resulting state of the "if" branch is passed to evaluation of the "else" branch; this corresponds to CUDA executing the two branches in sequence. 
This rule, SC:IfD, also adds the cost M div of a divergent warp. The three rules for while loops similarly handle the cases in which all, some or none of the threads in T evaluate the condition to be True. The first two rules both evaluate the body under the set of threads for which the condition is true and then reevaluate the loop. Rule SC:WhileSome also indicates that we must pay M div because the warp diverges.</p></div>
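The lock-step treatment of conditionals just described can be sketched operationally. The model below uses a divwarps-style metric in which only divergence costs 1 and everything else is free; eval_if and exec_stmt are illustrative names, not the paper's formal rules.

```python
# Sketch of the lock-step rules for conditionals (SC:IfT / SC:IfD):
# if both branch thread-sets are non-empty, the warp diverges, we pay M_div,
# and the "else" branch is evaluated against the store produced by the "if"
# branch. Illustrative model only; names are not from the paper's formalism.

M_DIV = 1

def eval_if(store, active, cond, then_stmt, else_stmt, exec_stmt):
    """Evaluate a conditional over the active thread set; return (store, cost).
    cond(store, t) gives the branch taken by thread t."""
    t_true = {t for t in active if cond(store, t)}
    t_false = active - t_true
    cost = 0
    if t_true and t_false:
        cost += M_DIV                  # divergent warp: pay the overhead
    if t_true:
        store, c = exec_stmt(store, t_true, then_stmt)
        cost += c
    if t_false:                        # "else" sees the "if" branch's store
        store, c = exec_stmt(store, t_false, else_stmt)
        cost += c
    return store, cost

# Toy statements: a statement sets one variable per active thread, at no cost.
def exec_stmt(store, threads, stmt):
    var, fn = stmt
    new = dict(store)
    for t in threads:
        new[(var, t)] = fn(t)
    return new, 0

warp = set(range(32))
# addSub0-style bug: branching on the parity of the thread ID always diverges.
store, cost = eval_if({}, warp, lambda s, t: t % 2 == 0,
                      ("x", lambda t: 1), ("x", lambda t: -1), exec_stmt)
```

A uniform condition (all threads agree) takes the SC:IfT/SC:IfF path and incurs no divergence cost, which is why removing the parity branch in addSub 1 helps.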
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">QUANTITATIVE PROGRAM LOGIC</head><p>In this section, we present declarative rules for a Hoare-style logic that can be used to reason about the resource usage of a warp of a miniCUDA kernel. The analysis for resource usage is based on the ideas of automated amortized resource analysis (AARA) <ref type="bibr">[Hoffmann et al. 2011;</ref><ref type="bibr">Hofmann and Jost 2003]</ref>. The key idea of this analysis is to assign a non-negative numerical potential to states of computation. This potential must be sufficient to cover the cost of the following step and the potential of the next state. For imperative programs, the potential is generally a function of the values of local variables. Rules of a quantitative Hoare logic specify how this potential function changes during the execution of a statement. A derivation in the quantitative Hoare logic then builds a set of constraints on potential functions at every program point. In an implementation (Section 6), these constraints are converted to a linear program and solved with an LP solver.</p><p>As an example, consider the statement for (int i = N; i &gt;= 0; i--) { f(); } and suppose we wish to bound the number of calls to f. This corresponds to a resource metric in which the cost of a function call is 1 and all other operations are free. The potential function at each point should be a function of the value of i: it turns out that the correct solution is to set the potential to i+1 in the body of the loop before the call to f and to i after the call. This difference "pays for" the cost of 1 for the function call. It also sets up the proper potential for the next loop iteration: when i is decremented following the loop body, the potential once again becomes i+1.</p><p>The particular challenge of designing such a logic for CUDA is that each thread in a warp has a distinct local state. 
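The potential argument for the loop above can be checked dynamically. This is a sketch: the potential i + 1 is hard-coded for this one loop, whereas in the actual analysis it would be synthesized by LP solving.

```python
# Dynamic check of the potential argument for
#     for (int i = N; i >= 0; i--) { f(); }
# under the metric where each call to f costs 1 and everything else is free.
# The potential at the loop head is i + 1; each iteration pays 1 for the call
# and re-establishes the invariant for the next iteration.

def run_loop(N):
    calls = 0
    i = N
    while i >= 0:
        phi_before = i + 1                   # potential on entering the body
        calls += 1                           # the call to f() costs 1
        phi_after = i                        # potential left after the call
        assert phi_before == phi_after + 1   # potential exactly pays the cost
        i -= 1                               # decrement restores phi = i + 1
    return calls

# Total cost equals the initial potential: N + 1 calls to f.
```

The initial potential N + 1 is thus a sound (and here exact) bound on the number of calls, which is the symbolic bound the analysis would report for this metric.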
To keep inference tractable and scalable, we wish to reason about only one copy of each variable, but must then be careful about what exactly is meant by any function of a state, and in particular the resource functions: such a function on the values of local variables is not well-defined for CUDA local variables, which have a value for each thread. To solve this problem, we make an observation about CUDA programs: There is often a separation between local program variables that carry data (e.g., are used to store data loaded from memory or intermediate results of computation) and those that carry potential (e.g., are used as indices in for loops). To develop a sound and useful quantitative logic, it suffices to track potential for the latter set of variables, which generally hold the same value across all active threads.</p><p>Pre- and Post-Conditions. Conditions of our logic have the form {P; Q; X } and consist of the logical condition P and the potential function Q (both of which we describe below) as well as a set X of variables whose values are uniform across the warp and therefore can be used as potential-carrying variables as described above. We write &#963; , T &#8866; X to mean that for all x &#8712; X and all t 1 , t 2 &#8712; T , we have &#963; (x, t 1 ) = &#963; (x, t 2 ). The logical condition P is a reasonably standard Hoare logic pre- or post-condition and contains logical propositions over the state of the store. We write &#963; , T &#8872; P to indicate that the condition P is true under the store &#963; and values t &#8712; T for the thread identifier. If either the store or the set of threads is not relevant in a particular context, we may use the shorthand &#963; &#8872; P to mean that there exists some T such that &#963; , T &#8872; P or the shorthand T &#8872; P to mean that there exists some &#963; such that &#963; , T &#8872; P<ref type="foot">foot_4</ref>. 
We write P ⇒ P ′ to mean that P implies P ′ : that is, for all σ, T such that σ, T ⊨ P, it is the case that σ, T ⊨ P ′ .</p><p>The second component of the conditions is a potential function Q, a mapping from stores and sets of variables X as described above to non-negative rational potentials. We use the potential function to track potential through a kernel in order to analyze resource usage. If σ, T ⊢ X , then Q X (σ) refers to the potential of σ under function Q, taking into account only the variables in X . Formally, we require (as a property of potential functions Q) that if for all x ∈ X and t ∈ T , we have σ (x, t) = σ ′ (x, t), then Q X (σ) = Q X (σ ′ ): the potential depends only on the values of the variables in X .</p><p>For a nonnegative rational cost C, we use the shorthand Q + C to denote a potential function Q ′ such that for all σ and X , we have Q ′ X (σ) = Q X (σ) + C.</p><p>In this section, we leave the concrete representation of the logical condition and the potential function abstract. In Section 6, we describe our implementation, including the representation of these conditions. For now, we make the assumptions stated above, as well as that logical conditions obey the standard rules of Boolean logic. We also assume that logical conditions and potential functions are equipped with an "assignment" operation that updates a condition or potential function to reflect the assignment of an expression to a local variable.</p><p>For simplicity, we also assume that the potential function depends only on the values of local variables in the store and not on the values of arrays. This is sufficient to handle the benchmarks we studied. We write σ ; T ⊨ {P; Q; X } to mean σ, T ⊨ P and σ, T ⊢ X .</p><p>Cost of Expressions. Before presenting the Hoare-style logic for statements, we introduce a simpler judgment that we use for describing the resource usage of operands and expressions. The judgment is written P ⊢ M e : C and indicates that, under condition P, the evaluation of e costs at most C. 
The rules for this judgment are presented in Figure <ref type="figure">4</ref>. These rules are similar to those of Figure <ref type="figure">3</ref>, with the exception that we no longer know the exact store used to evaluate the expression and must conservatively estimate the cost of array accesses based on the possible set of stores. We write P ⇒ MemReads(o) ≤ n to mean that for all σ and all T such that σ, T ⊨ P, accessing the operand o requires at most n memory reads.</p><p>Fig. <ref type="figure">4</ref>. Hoare logic rules for resource analysis.</p><p>Inference Rules. Figure <ref type="figure">4</ref> presents the inference rules for the Hoare-style logic for resource usage of statements. The judgment for these rules is written ⊢ M {P; Q; X } s {P ′ ; Q ′ ; X ′ }, which states that if (1) P holds, (2) we have Q resources and (3) all variables in X are thread-invariant, then if s terminates, it ends in a state where (1) P ′ holds, (2) we have Q ′ resources left over and (3) all variables in X ′ are thread-invariant. The simplest cases (e.g., Q:Skip and Q:Seq) simply thread conditions through without altering them (note that Q:Seq feeds the post-conditions of s 1 into the pre-conditions of s 2 ). Most other rules require additional potential in the pre-condition (e.g. Q + C), which is then discarded because it is used to pay for an operation. For example, if s 1 uses C 1 resources and s 2 uses C 2 resources, we might start with potential Q + C 1 + C 2 , pass the intermediate condition Q + C 2 between s 1 and s 2 , and end with potential Q.</p><p>The most notable rules are for conditionals if e then s 1 else s 2 , which take into account the possibility of a divergent warp. There are four cases. First (Q:If1), we can statically determine that the conditional expression e does not vary across a warp: this is expressed with the premise P ⇒ e unif, which is shorthand for ∀σ, T . σ, T ⊨ P ⇒ ∃c. σ ; e ↓ T M (c) t ∈T ; C. That is, for any compatible store, e evaluates to a constant result family. 
In this case, only one branch is taken by the warp and the cost of executing the conditional is the maximum cost of executing the two branches (plus the cost M if of the conditional and the cost C of evaluating the expression, which are added to the precondition). This is expressed by using Q ′ as the potential function in the post-condition for both branches. If the two branches do not use equal potential, the one that has more potential "left over" may use rule Q:Weak (discussed in more detail later) to discard its extra potential and use Q ′ as a post-condition. We thus conservatively approximate the potential left over after executing one branch. In the next two cases (Q:If2 and Q:If3), we are able to statically determine that the conditional expression is either true or false in any compatible store (i.e., either P ⇒ e or P ⇒ ¬e), and we need only the "then" or "else" branch, so only the respective branch is considered in these rules.</p><p>In the final case (Q:If4), we consider the possibility that the warp may diverge. In addition to accounting for the case where we must execute s 1 followed by s 2 in sequence, this rule must also subsume the three previous cases, as it is possible that we were unable to determine statically that the conditional would not diverge (i.e., we were unable to derive the preconditions of Q:If1) but the warp does not diverge at runtime. To handle both cases, we require that the precondition of s 2 is implied by:</p><p>• P ∧ ¬e, the precondition of the conditional together with the information that e is false, so that s 2 can execute by itself if the conditional does not diverge, as well as by • P 1 , the postcondition of s 1 , so that s 2 can execute sequentially after s 1 .</p><p>In a similar vein, we require that the postcondition of the whole conditional is implied by the individual postconditions of both branches. 
In addition, we remove from X 1 the set of variables possibly written to by s 1 (denoted W (s 1 ); this set can be determined syntactically) because if the warp diverged, variables written to by s 1 no longer have consistent values across the entire warp. We similarly remove W (s 2 ) from X 2 .</p><p>Note that it is always sound to use rule Q:If4 to check a conditional. However, using this rule in all cases would produce a conservative over-estimate of the cost by assuming a warp diverges even if it can be shown that it does not. Our inference algorithm maximizes precision by choosing the most precise rule that it is able to determine to be sound.</p><p>The rules Q:While1 and Q:While2 charge the initial evaluation of the conditional (M if + C) to the precondition. For the body of the loop, as with other Hoare-style logics, we must derive a loop invariant: the condition P must hold at both the beginning and end of each iteration of the loop body (we additionally know that e holds at the beginning of the body). In addition, the potential after the loop body must be sufficient to "pay" M if + C for the next check of the conditional, and still have potential Q remaining to execute the next iteration if necessary. Recall that Q is a function of the store. This premise therefore requires that the value of a store element (e.g. a loop counter) change sufficiently that the corresponding decrease in potential Q X (σ) is able to pay the appropriate cost. The difference between the two rules is that Q:While1 assumes the warp does not diverge, so we need not pay M div and also need not remove variables assigned by the loop body from X .</p><p>The rules for local assignment are an extension of the standard rule for assignment in Hoare logic. If x ∈ X and P ⇒ e unif, we add a symmetric premise for the potential function. Otherwise, we cannot use x as a potential-carrying variable and only update the logical condition. 
The rules for array assignments are similar to those for array accesses, but additionally include the cost of the assigned expression e. </p><p>Fig. <ref type="figure">5</ref>. A derivation using the program logic. We define L to be 2M sread (1) + 2M gwrite (5).</p><p>Finally, as discussed above, Q:Weak allows us to strengthen the preconditions and weaken the postconditions of a derivation. If s can execute with precondition {P 2 ; Q 2 ; X } and postcondition {P ′ 2 ; Q ′ 2 ; X }, it can also execute with a precondition P 1 that implies P 2 and a potential function Q 1 that is always greater than Q 2 . In addition, it can guarantee any postcondition implied by P ′ 2 and any potential function Q ′ 1 that is always less than Q ′ 2 . We can also take subsets of X as necessary in derivations. The rule also allows us to add a constant potential to both the pre- and post-conditions.</p><p>Example Derivation. Figure <ref type="figure">5</ref> steps through a derivation for the addSub3 kernel from Section 2, with the pre- and post-conditions interspersed in red. The code is simplified to more closely resemble miniCUDA. For illustrative purposes, we consider only the costs of array accesses (writes and reads) and assume all other costs are zero. The potential annotation consists of two parts: the constant potential and a component that is proportional to the value of h − j (initially we write this as just h because j = 0). The initial constant potential is consumed by the write on line 5, which involves a global memory access with 4 separate reads (128 consecutive bytes with 32-byte reads<ref type="foot">foot_5</ref>) and a shared write with no bank conflicts. The information needed to determine the number of global memory sectors read and the number of bank conflicts is encoded in the logical conditions, which we leave abstract for now; we will discuss in Section 6 how this information is encoded and used. 
On line 6, we establish the invariant of the loop body. On line 7, we transfer L to the constant potential (this is accomplished by Rule Q:Weak). We then spend part of this on the assignment on line 7 and the rest on line 8. These require 5 global reads each because we read 128 bytes of consecutive memory with 32-byte reads and the first index is not aligned to a 32-byte boundary. This establishes the correct potential for the next iteration of the loop, in which the value of j will be incremented. After the loop, we conclude j ≥ h and have no remaining potential.</p><p>Soundness. We have proved that if there is a derivation under the analysis showing that a program can execute with precondition {P; Q; X }, then for any store σ and any set of threads T such that σ ; T ⊨ {P; Q; X }, the cost of executing the program under σ and threads T is at most Q X (σ). We first state the soundness result of the resource analysis for expressions.</p></div>
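The potential-accounting discipline at the heart of this logic can be illustrated concretely. The following Python sketch (our own illustration, not part of RaCuda or its Coq formalization) simulates the section's running example for (int i = N; i &gt;= 0; i--) { f(); } under the metric where a call to f costs 1 and everything else is free, and checks that the claimed potential i + 1 at the loop head covers the cost of each call plus the potential left for the remainder, so that the initial potential N + 1 bounds the total cost.

```python
# Illustrative sketch of the AARA invariant for the running example
#   for (int i = N; i >= 0; i--) { f(); }
# under a resource metric where each call to f() costs 1.

def loop_cost(N):
    """Count calls to f() by direct simulation of the loop."""
    calls = 0
    i = N
    while i >= 0:
        calls += 1  # the call to f() costs 1
        i -= 1
    return calls

def potential_is_sound(N):
    """Check that potential i + 1 at the loop head pays for the call
    (cost 1) plus the potential i needed for the rest of the loop,
    and that the initial potential N + 1 equals the total cost."""
    i = N
    while i >= 0:
        # potential before >= cost of step + potential after
        assert (i + 1) >= 1 + i
        i -= 1
    return N + 1 == loop_cost(N)
```

For N = 5 the simulated loop calls f six times, matching the initial potential 5 + 1 = 6, exactly the accounting the LP solver would certify for this loop.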
<div xmlns="http://www.tei-c.org/ns/1.0"><p>All proofs are by induction on the derivation in the logic and are formalized in Coq. The case for while loops also includes an inner induction on the evaluation of the loop.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">LATENCY AND THREAD-LEVEL PARALLELISM</head><p>The analysis developed in the previous section is useful for predicting cost metrics of CUDA kernels such as divergent warps, global memory accesses and bank conflicts, the three performance bottlenecks that are the primary focus of this paper: one can specify a resource metric that counts the appropriate operations, run the analysis to determine the maximum cost for a warp and multiply by the number of warps that will be spawned to execute the kernel (this is specified in the main program when the kernel is called). If we wish to work toward predicting the actual execution time of a kernel, the story becomes more complex; we begin to explore this question in this section. A first approach would be to determine, via profiling, the runtime cost of each operation and run the analysis with a cost metric that assigns appropriate costs. Such an approach might approximate the execution time of a single warp, but it is not immediately clear how to compose the results to account for multiple warps, unlike divergences or memory accesses, which we simply sum together.</p><p>Indeed, the question of how execution times of warps compose is a complex one because of the way in which GPUs schedule warps. Each Streaming Multiprocessor (SM), one of the computation units of the GPU, can execute instructions on several warps simultaneously, with the exact number dependent on the hardware. However, when a kernel is launched, CUDA assigns each SM a number of threads that is generally greater than the number it can simultaneously execute. It is profitable to do so because many instructions incur some amount of latency after they are issued. For example, if a warp executes a load from memory that takes 16 cycles, the SM can use those 16 cycles to execute instructions on other warps. 
At each cycle, the SM selects as many warps as possible that are ready to execute an instruction and issues instructions on them.</p><p>In order to predict the execution time of a kernel, we must therefore reason about both the number of instructions executed and their latency. In this section, we show how our existing analysis can be used to derive these quantities and, from them, approximate execution time bounds on a block of a CUDA kernel (we choose the block level for this analysis because it is the granularity at which synchronization occurs, and so composing execution times between blocks is more straightforward).</p><p>To derive such an execution time bound, we leverage a result from the field of parallel scheduling <ref type="bibr">[Muller and Acar 2016]</ref>, which predicts execution times of programs based on their work, the total computational cost (not including latency) of operations to be performed, and span, the time required to perform just the operations along the critical path (including latency). One can think of the work as the time required to execute a program running only one thread at a time, and the span as the time required to execute the program running all threads at the same time (assuming infinitely parallel hardware). In our CUDA setting, the work is the total number of instructions executed by the kernel across all warps and the span is the maximum time required to execute any warp from start to finish. Given these two quantities, we can bound the execution time of a block assuming that the SM schedules warps greedily, that is, it issues instructions on as many ready warps as possible.</p><p>Theorem 2. Suppose a block of a kernel has work W and span S and that the block is scheduled on an SM that can issue P instructions at a time. Then, the time required by the SM to execute the block is at most W/P + S.</p><p>Proof. 
This is a direct result of Theorem 1 of <ref type="bibr">[Muller and Acar 2016]</ref>, which shows the same bound for a general computation of work W and span S under a greedy schedule on P processors. □</p><p>We can, and will, use our analysis of the previous sections to independently calculate the work and span of a warp. From these, we can compose the work and span of warps to obtain the work and span of a block: we sum over the work of the warps and take the maximum over the spans. This approach is not sound in general because warps of a block can synchronize with each other using the __syncthreads() built-in function<ref type="foot">foot_7</ref> , which acts as a barrier forcing all warps to wait to proceed until all warps have reached the synchronization point. Consider the following code:</p><p>Assume that the latency of the write to global memory dominates the span of the computation. Each warp performs only one write, and so taking the maximum span would result in an assumption that the span of the block includes only one write. However, because of the synchronization in the middle, the span of the block must actually account for the latency of two writes: threads 32 and up must wait for threads 0-31 to perform their writes before proceeding.</p><p>Determining that, in the kernel above, each warp performs only one write would require a sophisticated analysis that tracks costs separately for each thread: this is precisely what our analysis, to retain scalability, does not do. As a result, it is sound to simply compose the predicted spans of each warp in a block by taking the maximum. The remainder of this section will be devoted to proving this fact. In order to do so, we develop another cost semantics, this time a parallel semantics that models entire CUDA blocks and tracks the cost in terms of work and span.</p><p>The cost semantics tracks the work and span for each warp. 
At each synchronization, we take the maximum span over all warps to account for the fact that all warps in the block must wait for each other at that point. A cost is now a pair (c w , c s ) of the work and span, respectively. A resource metric M maps resource constants to costs reflecting the number of instructions and latency required by the operation. We can take projections M w and M s of such a resource metric, which project out the work and span components, respectively, of each cost. For the purposes of calculating the span, we assume that the span of an operation (the second component of the cost) reflects the time taken to process the instruction plus the latency (in other words, the latency is the span of the operation minus the work). We represent the cost of a block as a warp-indexed family of costs C. We use ∅ to denote the collection ((0, 0) i ∈Warps ). We will use a shorthand for adding a cost onto a collection for a subset of warps: the resulting collection adds the given cost to the entry of each warp in the subset and leaves the entries of all other warps unchanged.</p><p>We overload this notation to add a cost onto a collection for a subset T of threads, adding the cost to the entry of each warp WarpOf (t) for t ∈ T , where WarpOf (t) is the warp containing thread t. We denote the work of a collection by W (C) and the span by S(C). We calculate the work and span of a block by summing the work components and taking the maximum of the span components, respectively, over the warps.</p><p>Note that here it is safe to take the maximum span over the warps because we have also done so at each synchronization point. Figure <ref type="figure">6</ref> gives selected rules for the cost semantics for statements, for which the judgment is σ ; C; s ⇒ T M σ ′ ; C ′ , meaning that, on a set of threads T , with heap σ , the statement s results in final heap σ ′ and, if the cost collection is initially C, the collection extended with the cost of executing s is C ′ . The judgments for operands and expressions are similar; as with the lock-step cost semantics, they return a result family. 
The rule SC:Sync handles taking the maximum at synchronization points, and rules SC:If and SC:While add the cost of divergence only to warps that actually diverge. Otherwise, the rules (including rules omitted for space reasons) resemble those of the lock-step semantics.</p><p>We now show that the analysis of Section 4, when applied to work and span independently, soundly approximates the parallel cost semantics of Figure <ref type="figure">6</ref>. We do this in two stages: first, we show that the costs derived by the parallel semantics are overapproximated by the costs derived by the lock-step cost semantics of Section 3 (extended with a rule treating __syncthreads() as a no-op). Second, we apply the soundness of the analysis with respect to those cost semantics. The lock-step cost semantics were designed to model only single-warp execution, and so it may seem odd to model an entire block using them. However, doing so results in a sound overapproximation: for example, in the kernel shown above, the lock-step cost semantics ignores the synchronization but assumes that the two branches must be executed in sequence anyway because not all threads take the same branch. As we now prove, these two features of the warp-level cost semantics cancel out. Lemma 2 states that if evaluating an expression e with initial cost collection C results in a cost collection C ′ , then e can evaluate using M w and M s , under the lock-step semantics, with costs C w and C s , respectively. The difference in span between C and C ′ is overapproximated by C s and the difference in work is overapproximated by C w times the number of warps. Lemma 3 is the equivalent result for statements. Both proofs are straightforward inductions and are formalized in Coq. </p><p>We now apply Theorem 1 to show that the analysis of Section 4, when run independently using the projections of the resource metric M, soundly approximates the work and span of a block.</p><p>Theorem 3. 
If &#931; &#8866; s and &#931; &#8866; T &#963; and</p><p>Proof. Note that W (&#8709;) = S(&#8709;) = 0, so by Lemma 3, we have C w and C s such that (1) &#963; ; s</p><p>&#9633;</p></div>
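To make Theorem 2 and the composition scheme of this section concrete, here is a small Python sketch (our own; the phase-based cost representation is a hypothetical simplification, not the paper's Coq formalization). It composes per-warp (work, span) pairs as described above: work is summed across warps, spans are maximized, and the maximum is taken phase by phase so that each __syncthreads() barrier contributes the span of the slowest warp. It then returns the greedy-scheduling bound W/P + S.

```python
# Sketch: block-level work/span composition with barrier phases.
# Each phase is a list of per-warp (work, span) pairs; phase boundaries
# correspond to __syncthreads() barriers, at which all warps wait.

def block_time_bound(phases, P):
    """Return the greedy bound W/P + S for a block scheduled on an SM
    that can issue P instructions at a time (Theorem 2)."""
    total_work = 0
    total_span = 0
    for phase in phases:
        total_work += sum(work for work, _ in phase)  # work adds up
        total_span += max(span for _, span in phase)  # slowest warp wins
    return total_work / P + total_span
```

In the two-warp example discussed above, where each warp performs one write of latency 16 on opposite sides of a barrier, the phase-wise maximum correctly charges the span of both writes: block_time_bound([[(1, 16), (0, 0)], [(0, 0), (1, 16)]], P=2) yields 2/2 + 32 = 33, whereas a single maximum over whole-warp spans would have counted only one write's latency.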
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">INFERENCE AND IMPLEMENTATION</head><p>In this section, we discuss the implementation of the logical conditions and potential functions of Section 4 and the techniques used to automate the reasoning in our tool RaCuda. The design of the boolean conditions and potential functions is similar to existing work <ref type="bibr">[Carbonneaux et al. 2017,</ref><ref type="bibr">2015]</ref>. We have implemented the inference algorithms as an extension to the Absynth tool <ref type="bibr">[Carbonneaux et al. 2017;</ref><ref type="bibr">Ngo et al. 2018]</ref>. We begin by outlining our implementation, and then detail the instantiations of the potential annotations and logical conditions.</p><p>Implementation Overview. Absynth is an implementation of AARA for imperative programs. The core analysis is performed on a control-flow-graph (CFG) intermediate representation. Absynth first applies standard abstract interpretation to gather information about the usage and contents of program variables. It then generates templates for the potential annotations for each node in the graph and uses syntax-directed rules similar to those of the quantitative Hoare logic to collect linear constraints on the coefficients in the potential templates throughout the CFG. These constraints are then solved by an LP solver.</p><p>RaCuda uses a modified version of Front-C<ref type="foot">foot_8</ref> to parse CUDA. A series of transformations lowers the code into a representation similar to miniCUDA, and then to IMP, a resource-annotated imperative calculus that serves as a front-end to Absynth. 
Part of this process is adding annotations that express the cost of each operation in the given resource metric.</p><p>We have extended the abstract interpretation pass to gather CUDA-specific information that allows us to bound the values of the functions MemReads(•) and Conflicts(•) (recall that these functions were used to select which rules to apply in the program logic of Figure <ref type="figure">4</ref>, but their definitions were left abstract). We describe the extended abstraction domain in the next subsection. For the most part, we did not modify the representation of potential functions, but we briefly discuss this representation at the end of this section.</p><p>Logical Conditions. The logical conditions of the declarative rules of Section 4 correspond to the abstraction domain we use in the abstract interpretation. The abstract interpretation is designed to gather information that will be used to select the most precise rules (based on memory accesses, bank conflicts and divergent warps) when applying the program logic. The exact implementation of the abstraction domain and analysis is therefore orthogonal to the implementation of the program logic, and is somewhat more standard (see Section 8 for comparisons to related work). Still, for completeness, we briefly describe our approach here.</p><p>The abstraction domain is a pair (C, M). The first component is a set of constraints of the form Σ x ∈Var k x · x + k ≤ 0, where k x , k ∈ N. These form a constraint system on the runtime values of variables, which we can decide using Presburger arithmetic. The second component is a mapping that stores, for each program variable x, whether x may currently be used as a potential-carrying variable (see the discussion in Section 4). 
It also stores two projections of x's abstract value, one notated M tid (x) that tracks its dependence on tid (in practice, this consists of three components tracking the projections of x's dependence on the x, y and z components of threadIdx, which is three-dimensional as described in Section 2) and one notated M const (x) that tracks its constant component. Both projections are stored as polynomial functions of other variables, or ⊤, indicating no information about that component. These projections provide useful information for the CUDA-specific analysis. For example, if M tid (x) = (0, 0, 0), then the value of x is guaranteed to be constant across threads. As another example, if M tid (x) = (1, 1, 1), then x = tid + c, where c does not depend on the thread, and so the array access A[x] has a stride of 1.</p><p>The extension of the analysis to expressions, and its use in updating information at assignments and determining whether conditional expressions might diverge, is straightforward. The use of this information to predict uncoalesced memory accesses and bank conflicts is more interesting. We assume the following definitions of MemReads(•) and Conflicts(•), now generalized to use m as the number of array elements accessed by a global read and B as the number of shared memory banks.</p><p>Theorem 4 formalizes and proves the soundness of a bound on MemReads(x) given abstract information about x.</p><p>Theorem 5 proves a bound on Conflicts(x) given abstract information about x. This bound assumes that x is divergent; for non-divergent operands o, by assumption we have Conflicts(o) = 1. 
The proof relies on Lemma 4, a stand-alone result about modular arithmetic.</p><p>The accesses in R access banks at uniform stride, and so the maximum number of times any such bank is accessed in R is the number of accesses divided by the number of distinct banks accessed, rounded up.</p><p>The set A = {ka, (k + 1)a, . . . , (k + c − 1)a}, where c = n/gcd(a, n), is a residue system modulo n (that is, no two elements of the set are congruent modulo n) because if (i − k) · a ≡ (j − k) · a mod n with 0 &lt; j − i &lt; c, then a(j − i) is a multiple of both a and n that is smaller than ca, which is a contradiction. This means that if m ≤ c, then |{i · a mod n | i ∈ {k, . . . , k + m − 1}}| = m. Now consider the case where m &gt; c and let c + k &lt; i &lt; m + k. Let b = (i − k) mod c. We then have (i − k)a ≡ ba mod n, and so ia is already included in A. Thus, {i · a mod n | i ∈ {k, . . . , k + m − 1}} = A. □</p><p>As an example of how the abstraction information is tracked and used in the resource analysis, we return to the code example in Figure <ref type="figure">5</ref>. Figure <ref type="figure">7</ref> steps through the abstract interpretation of the same code. For the purposes of this example, we have declared two variables temp0 and temp1 to hold intermediate computations. This reflects more closely the intermediate representation on which the abstract interpretation is done. We establish C from parameters provided to the analysis that specify that blockDim.x is 32, which also bounds threadIdx.x. The assignment to i on line 3 then establishes that i is a multiple of threadIdx.x and has a constant component of 32 · blockIdx.x. This information is then used on line 4 to bound MemReads(i) and Conflicts(threadIdx.x). By Theorem 4, we can bound MemReads(i) by 32/m + 1. Note that both t and t ′ in the statement of Theorem 4 are multiples of 32 (and, in practice, m will divide 32), so we can improve the bound to 32/m. By Theorem 5, we can bound Conflicts(threadIdx.x) by 32/min(32, 32/gcd(1, 32)) = 1. When j is declared, it is established to have no thread-dependence. 
Its constant component is initially zero, but the loop invariant sets M const (j) = ⊤. The assignments on lines 6 and 8 propagate the information that i depends on threadIdx.x, as well as some additional information about the constant components. This information is used in the two global loads to bound MemReads(temp0) and MemReads(temp1) by 32/m + 1. In this case, without further information about the value of w, we are unable to make any assumptions about alignment and cannot derive a more precise bound. As above, we can determine Conflicts(i) ≤ 1 for the loads on both lines.</p><p>Potential Functions. Our implementation of potential functions is taken largely from prior work on AARA for imperative programs <ref type="bibr">[Carbonneaux et al. 2017;</ref><ref type="bibr">Ngo et al. 2018</ref>]. We instantiate a potential function Q as a linear combination of a fixed set I of base functions from stores to rational costs, each depending on a portion of the state. A designated base function b 0 is the constant function and tracks constant potential. For each program, we select a set of N base functions, plus the constant function, notated b 0 , b 1 , . . . , b N , that capture the portions of the state relevant to calculating potential. A potential function Q is then a linear combination Q = q 0 b 0 + q 1 b 1 + · · · + q N b N of the selected base functions with non-negative rational coefficients q i .</p><p>In the analysis we use, base functions are generated by a grammar over constants, variables and interval functions, in which x stands for a program variable, k ∈ Q and |[x, y]| = max(0, y − x). The latter function is useful for tracking the potential of a loop counter based on its distance from the loop bound (as we did in Figure <ref type="figure">5</ref>). These base functions allow the computation of intricate polynomial resource bounds; transferring potential between them is accomplished through the use of rewrite functions, described in more detail in prior work <ref type="bibr">[Carbonneaux et al. 2017]</ref>.</p></div>
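The gcd-based reasoning behind Theorem 5 and Lemma 4 can be checked on small cases. The sketch below is our own reading of the bound (the theorem statements themselves are abbreviated above): for a warp of m threads accessing shared memory at uniform stride a over B banks, Lemma 4 says the accesses touch min(m, B/gcd(a, B)) distinct banks, so some bank receives at least ⌈m / min(m, B/gcd(a, B))⌉ accesses. A brute-force count over the access pattern confirms the bound.

```python
from math import gcd

def conflicts_bound(m, a, B=32):
    """Bound on shared-memory bank conflicts for m accesses at uniform
    stride a over B banks, under our gcd-based reading of Theorem 5."""
    distinct_banks = min(m, B // gcd(a, B))  # Lemma 4: size of the residue system
    return -(-m // distinct_banks)           # ceiling division

def measured_conflicts(m, a, B=32):
    """Brute force: maximum number of accesses landing on any single bank."""
    hits = {}
    for i in range(m):
        bank = (i * a) % B
        hits[bank] = hits.get(bank, 0) + 1
    return max(hits.values())
```

For a full warp (m = 32) on 32 banks, stride 1 gives 32/min(32, 32/gcd(1, 32)) = 1 conflict (the case computed for the Figure 7 walkthrough), stride 2 gives 2, and stride 32 degrades to 32 serialized accesses, with the brute-force count agreeing in each case.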
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">EVALUATION</head><p>We evaluated the range and precision of RaCuda's analysis on a set of benchmarks drawn from various sources. In addition, we evaluated how well the cost model we developed in Section 3 approximates the actual cost of executing kernels on a GPU; this is important because our analysis (and its soundness proofs) targets our cost model, so the accuracy of our cost model is as important a factor in the overall performance of the analysis as is the precision of the analysis itself. Table <ref type="table">2</ref> lists the benchmarks we used for our experiments. For each benchmark, the table lists the source (benchmarks were either taken from sample kernels distributed with the CUDA SDK, modified from such kernels by us, or written entirely by us). The table also shows the number of lines of code in each kernel and the arguments to the kernel whose values appear as parameters in the cost results. The kernels used may appear small, but they are representative of CUDA kernels used by many real applications; recall that a CUDA kernel corresponds essentially to a single C function and that an application will likely combine many kernels used for different purposes. We also give the x and y components of the block size we used as a parameter to the analysis for each benchmark (a z component of 1 was always used). Some of the benchmarks merit additional discussion. The matrix multiplication (matMul) benchmark came from the CUDA SDK; we also include two of our own modifications to it: one which deliberately introduces a number of performance bugs (matMulBad), and one (matMulTrans) which transposes one of the input matrices in an (as it happens, misguided) attempt to improve shared memory performance. For all matrix multiplication kernels, we use square matrices of dimension N . 
The CUDA SDK includes several versions of the "reduce" kernel (reduceN), which iteratively improve performance from version to version. We include the first 4 in our benchmark suite; later iterations use advanced features of CUDA which we do not currently support. Kernels reduce2 and reduce3 use complex loop indices that confuse our inference algorithm for some benchmarks, so we performed slight manual refactoring on these examples (kernels reduce2a and reduce3a) so that our algorithm can derive bounds for them. We also include the original versions. Finally, we include the examples from Section 2 (addSubN). We analyzed each benchmark under the four resource metrics defined in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Evaluation of the Cost Model</head><p>Two of the resource metrics above, "conflicts" and "sectors", correspond directly to metrics collected by NVIDIA's NSight Compute profiling tool for CUDA. This allows us to compare the upper bounds predicted by RaCuda with actual results from CUDA executions (which we will do in the next subsection), as well as to evaluate how closely the cost semantics we presented in Section 3 tracks the real values of the corresponding metrics, which we now discuss.</p><p>To perform this comparison, we equipped RaCuda with an "evaluation mode" that simulates execution of the input kernel using rules similar to the cost semantics of Figure <ref type="figure">3</ref> under a given execution metric. The evaluation mode, like the analysis mode, parses the kernel and lowers it into the miniCUDA-like representation. The kernel code in this representation is then interpreted by the simulator. In addition, the evaluation mode takes an input file specifying parameters such as the block and grid size, as well as the arguments passed to the kernel, including arrays and data structures stored in memory.</p><p>We ran each kernel on a range of input sizes. For kernels whose performance depends on the contents of the input, we used worst-case inputs. For the histogram benchmark, whose worst-case "conflicts" value depends heavily on the input (we discuss this effect in more detail below), limitations of the OCaml implementation prevent us from simulating the worst-case input. We therefore leave this benchmark out of the results in this subsection.</p><p>Because of edge effects from an unfilled last warp, the precision of our analysis often depends on N mod 32, where N is the number of threads used.
In order to quantify this effect, where possible we tested inputs of size 32N for some N, inputs of size 32N + 1 for some N (generally the best and worst cases for precision, respectively), and random input sizes drawn uniformly from an appropriate range. The matrix multiplication, reduce and histogram benchmarks require (at least) that the input size be a multiple of 32, so we are not able to report on non-multiple-of-32 input sizes for these benchmarks. We report the average error for each class of input sizes separately, as well as in aggregate. Average error between the simulated value (derived from our tool simulating execution under the cost semantics) and the profiled value (taken from a GPU execution profiled with NSight Compute) is calculated as the mean of (Simulated(i) - Profiled(i)) / Profiled(i) over all tested inputs i, neglecting inputs that would cause a division by zero.</p><p>In almost all cases in which the cost semantics is not exactly precise, it overestimates the actual cost of GPU execution (for some inputs, the cost semantics provides a slight underestimate, but these differences are small enough that they do not appear in the average results below). Because the soundness result (Theorem 1) establishes the soundness of the analysis with respect to the cost semantics, this gives some assurance that the analysis is also a sound upper bound with respect to actual execution values. Table <ref type="table">3</ref> reports the average relative error of the cost model with respect to profiled values. In many cases, the cost model is extremely precise (exact or within 5%). Larger differences in a small number of cells could be due to a number of factors: CUDA's memory model is quite complex, and we model only part of that complexity. In addition, our simulator does not track all values and parameters, sometimes relying on default values or symbolic execution.
Finally, our simulator is essentially an interpreter running the CUDA source code, while the GPU executes code that has been compiled to a special-purpose assembly language. We do not attempt to model performance differences that may be introduced in this compilation process.</p></div>
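The error metric described above can be sketched as a small script. This is a reconstruction from the prose (the display formula was lost in extraction), so the function name and exact form are ours rather than the paper's:

```python
def average_relative_error(simulated, profiled):
    """Mean of (simulated - profiled) / profiled over paired inputs,
    skipping inputs whose profiled value is zero (would divide by zero)."""
    errs = [(s - p) / p for s, p in zip(simulated, profiled) if p != 0]
    return sum(errs) / len(errs)
```

For example, `average_relative_error([105, 200], [100, 200])` gives 0.025, i.e., a 2.5% average overestimate; a positive result indicates the simulated cost exceeds the profiled cost on average.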
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Evaluation of the Analysis</head><p>In this subsection, we compare the execution costs predicted by RaCuda with the actual execution costs obtained by profiling (for the "conflicts" and "sectors" metrics) or the simulated costs from our cost semantics (for the "divwarps" and "steps" metrics). We discuss the "conflicts" and "sectors" metrics first. Tables <ref type="table">4</ref> and <ref type="table">5</ref> contain the results for these two metrics. For each benchmark, we present the total time taken, and the cost bound inferred, by RaCuda's analysis. The timing results show that RaCuda is quite efficient, with analysis times usually under 1 second; analysis times on the order of minutes are seen only for exceptionally complex kernels. Recall that RaCuda produces bounds for a single warp. To obtain bounds for the kernel, we multiplied by the number of warps required to process the entire input (often ⌈N/32⌉ if the input size is N). Several of the kernels perform internal loops with a stride of 32. Precise bounds for such loops would, like the number of warps, be of the form ⌈N/32⌉. However, Absynth can only produce polynomial bounds and so must approximate this bound by (N + 31)/32, which is the tightest possible polynomial bound. We also ran versions of each kernel on a GPU using NSight Compute. Similar to the above error calculation for evaluating the cost model, average error is calculated as the mean of (Predicted(i) - Actual(i)) / Actual(i) over all tested inputs i.</p><p>The analysis for global memory sectors is fairly precise. Note also that, for the reduce kernels, the error is the same as the error of the cost model in Table <ref type="table">3</ref>; this indicates that the analysis precisely predicts the modeled cost and the imprecision is in the cost model rather than the analysis.
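The gap between the precise warp count and its polynomial approximation can be made concrete with a small illustrative sketch (ours, not part of RaCuda):

```python
def warps_exact(n):
    """Warps needed to cover n threads at 32 threads per warp."""
    return -(-n // 32)  # ceiling division: ceil(n / 32)

def warps_poly(n):
    """Tightest polynomial upper bound on the warp count: (n + 31)/32."""
    return (n + 31) / 32
```

The two coincide exactly when n is congruent to 1 modulo 32 (e.g., n = 65 gives 3 in both), and the polynomial overshoots by up to 31/32 of a warp otherwise (e.g., n = 64: 2 exact versus about 2.97), which is one source of the small overestimates reported here.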
Most other imprecisions arise because our abstract domain is not expressive enough to show, e.g., that memory accesses are properly aligned. We note, however, that more efficient versions of the same kernel (e.g., the successive versions of the reduce and addSub kernels) generally appear more efficient under our analysis, and also that our analysis is most precise for better-engineered kernels that follow well-accepted design patterns (e.g., matMul, reduce3a, addSub3). These results indicate that our analysis can be a useful tool for improving CUDA kernels, because it can give useful feedback on whether modifications to a kernel have indeed improved its performance.</p><p>RaCuda is generally able to correctly infer that a kernel has no bank conflicts, but often overestimates the number of bank conflicts when some are present. Again, this means that RaCuda can be used to determine whether improvements need to be made to a kernel. We believe the bank-conflict analysis can be made more precise with a more expressive abstract domain.</p><p>Figure <ref type="figure">8</ref> plots our predicted cost versus the actual worst-case cost for two representative benchmarks. In the right figure, we plot executions of random inputs generated at runtime in addition to the worst-case input. The benchmark used for this figure is histogram, whose shared memory performance displays interesting behavior depending on the input. The benchmark computes a 256-bin histogram of the number of occurrences of each byte (0x0-0xff) in the input array. The bins are stored in shared memory, and so occurrences of bytes that map to the same shared memory bank (e.g., 0x00 and 0x20) in the same warp might result in bank conflicts<ref type="foot">7</ref>. In the worst case, all bytes handled by a warp map to the same memory bank, resulting in an 8-way conflict at each access (32 bins are accessed, but as there are only 256 bins, only 256/32 = 8 map to each memory bank).</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">RELATED WORK</head><p>2011; <ref type="bibr">Radiček et al. 2017]</ref> programs. However, there are very few tools for parallel <ref type="bibr">[Hoffmann and Shao 2015]</ref> and concurrent <ref type="bibr">[Albert et al. 2011;</ref> <ref type="bibr">Das et al. 2018]</ref> execution, and there are no such tools that take into account the aforementioned CUDA-specific performance bottlenecks.</p><p>Most closely related to our work is automatic amortized resource analysis (AARA) for imperative programs and quantitative program logics. <ref type="bibr">Carbonneaux et al. [2014]</ref> introduced the first imperative AARA in the form of a program logic for verifying stack bounds for C programs. The technique has since been automated <ref type="bibr">[Carbonneaux et al. 2017, 2015]</ref> using templates and LP solving, and applied to probabilistic programs <ref type="bibr">[Ngo et al. 2018]</ref>. A main innovation of this work is the development of an AARA for CUDA code: previous work on imperative AARA can analyze neither parallel executions nor CUDA-specific memory-access costs.</p><p>Parallel Cost Semantics. The model we use for reasoning about a CUDA block in terms of its work and span is derived from prior models for reasoning about parallel algorithms. The ideas of work and span (also known as depth) date back to work from the 1970s and 1980s showing bounds on execution time for a particular schedule <ref type="bibr">[Brent 1974]</ref> and later for any greedy schedule <ref type="bibr">[Eager et al. 1989]</ref>. Starting in the 1990s <ref type="bibr">[Blelloch and Greiner 1995, 1996]</ref>, the parallel-algorithms literature has used directed acyclic graphs (DAGs) to analyze the parallel structure of algorithms and calculate their work and span: the work is the total number of nodes in the DAG and the span is the longest path from source to sink.
In the case of CUDA, we do not need the full generality of DAGs and so are able to simplify the notation somewhat, but our notation for the costs of CUDA blocks in Section 5 remains inspired by this prior work. We build in particular on work by <ref type="bibr">Muller and Acar [2016]</ref> that extended the traditional DAG models to account for latency (Muller and Acar were considering primarily costly I/O operations; we use the same techniques for the latency of operations such as memory accesses). Their model adds latencies as edge weights on the graph and redefines the span (but not the work) to include these weights.</p><p>Analysis of CUDA Code. Given its importance in fields such as machine learning and high-performance computing, CUDA has gained a fair amount of attention in the program analysis literature in recent years. There exist a number of static <ref type="bibr">[Li and Gopalakrishnan 2010;</ref> <ref type="bibr">Pereira et al. 2016;</ref> <ref type="bibr">Zheng et al. 2011]</ref> and dynamic <ref type="bibr">[Boyer et al. 2008;</ref> <ref type="bibr">Eizenberg et al. 2017;</ref> <ref type="bibr">Peng et al. 2018;</ref> <ref type="bibr">Wu et al. 2019]</ref> analyses for verifying certain properties of CUDA programs, but much of this work has focused on functional properties, e.g., freedom from data races. <ref type="bibr">Wu et al. [2019]</ref> investigate several classes of bugs, one of which is "non-optimal implementation", including several types of performance problems. They do not give examples of kernels with non-optimal implementations, nor do they specify whether or how their dynamic analysis detects such bugs. PUG <ref type="bibr">[Li and Gopalakrishnan 2010]</ref> and <ref type="bibr">Boyer et al. [2008]</ref> focus on detecting data races, but both demonstrate an extension of their race detectors designed to detect bank conflicts, with somewhat imprecise results.
<ref type="bibr">Kojima and Igarashi [2017]</ref> present a Hoare logic for proving functional properties of CUDA kernels. Their logic does not consider quantitative properties and, unlike our program logic, requires explicit reasoning about the sets of active threads at each program point, which poses problems for designing an efficient automated inference engine.</p><p>GKLEE <ref type="bibr">[Li et al. 2012]</ref> is an analysis for CUDA kernels based on concolic execution, which targets both functional errors and performance errors (including warp divergence, non-coalesced memory accesses and shared memory bank conflicts). GKLEE requires some user annotations in order to perform its analysis. <ref type="bibr">Alur et al. [2017]</ref> and <ref type="bibr">Singhania [2018]</ref> have developed several static analyses for performance properties of CUDA programs, including uncoalesced memory accesses. Their analysis for detecting uncoalesced memory accesses uses abstract interpretation with an abstract domain similar to, but simpler than, ours (in our notation, it tracks only M_tid(x), and considers only the values 0, 1, and -1 for it). Their work does not address shared memory bank conflicts or divergent warps. Moreover, they developed a separate analysis for each type of performance bug. In this work, we present a general analysis that detects and quantifies several different types of performance bugs.</p><p>The two systems described in the previous paragraph only detect performance errors (e.g., they might estimate what percentage of warps in an execution will diverge); they are not able to quantify the impact of these errors on the overall performance of a kernel. The analysis in this paper has the full power of amortized analysis and is able to generate a resource bound, parametric in the relevant costs, that takes into account divergence, uncoalesced memory accesses and bank conflicts.</p><p>Other work has focused on quantifying or mitigating, but not detecting, performance errors.
<ref type="bibr">Bialas and Strzelecki [2016]</ref> use simple, tunable kernels to experimentally quantify the impact of warp divergence on performance across different GPUs. Their findings show that there is a nontrivial cost associated with a divergent warp even if the divergence involves some threads simply being inactive (e.g., threads exiting a loop early, or a conditional with no "else" branch). This finding has shaped our thinking on the cost of divergent warps. <ref type="bibr">Han and Abdelrahman [2011]</ref> present two program transformations that lessen the impact of warp divergence; they quantify the benefit of these optimizations but do not have a way of statically identifying whether a kernel contains a potential for divergent warps and/or could benefit from their transformations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">CONCLUSION</head><p>We have presented a program logic for proving qualitative and quantitative properties of CUDA kernels, and proven it sound with respect to a model of GPU execution. We have used the logic to develop a resource analysis for CUDA as an extension to the Absynth tool, and shown that this analysis provides useful feedback on the performance characteristics of a variety of kernels.</p><p>This work has taken the first step toward automated static analysis tools for analyzing and improving the performance of CUDA kernels. In the future, we plan to extend the logic to handle more advanced features of CUDA, such as dynamic parallelism, by further embracing the connection to DAG-based models for dynamic parallelism (Sections 5 and 8).</p><p>The "steps" metric of Section 7 raises the question of whether it is possible to use our analysis to predict actual execution times, by using appropriate resource metrics to analyze the work and span and combining them as in Section 5. Deriving such metrics is largely a question of careful profiling of specific hardware, which is outside the scope of this paper, but in future work we hope to bring these techniques closer to deriving wall-clock execution time bounds for kernels. Doing so may involve further extending the analysis to handle instruction-level parallelism, which hides latency by beginning to execute instructions in the same warp that do not depend on data being fetched.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Proc. ACM Program. Lang., Vol. 5, No. POPL, Article 25. Publication date: January 2021.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>Warp refers to the set of parallel threads stretched across a loom during weaving; according to the CUDA programming manual [NVIDIA Corporation 2019], "The term warp originates from weaving, the first parallel thread technology."</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_3"><p>The term "sectors" comes from the NVIDIA profiling tools' terminology for this metric.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_4"><p>To aid in reasoning, you can read the shorthands as "σ is compatible with P" and "T is compatible with P."</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_5"><p>The amount of memory accessed by a single read is hardware-dependent and complex; this is outside the scope of this paper.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_7"><p>We have not mentioned __syncthreads() up to this point because it was not particularly relevant for the warp-level analysis, but it is supported by our implementation of miniCUDA and used in many of the benchmarks.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_8"><p>https://github.com/BinaryAnalysisPlatform/FrontC</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_9"><p>Of course, these would also result in data races in the absence of synchronization (which is present in the benchmark as originally written), but such synchronization would also be likely to result in performance impacts when such conflicts occur; for illustration purposes, we disable the synchronization so that this performance impact shows up as a bank conflict.</p></note>
		</body>
		</text>
</TEI>
