Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

Motivated by the increasing importance of generalpurpose Graphic Processing Units (GPGPU) programming, exemplified by NVIDIAâ€™s CUDA framework, as well as the difficulty, especially for novice programmers, of reasoning about performance in GPGPU kernels, we introduce a novel quantitative program logic for CUDA kernels. The logic allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of common but CUDAspecific performance bottlenecks: warp divergences, uncoalesced memory accesses, and bank conflicts. The logic is proved sound with respect to a novel operational cost semantics for CUDA kernels. The semantics, logic, and soundness proofs are formalized in Coq. An inference algorithm based on LP solving automatically synthesizes symbolic resource bounds by generating derivations in the logic. This algorithm is the basis of RaCUDA, an endtoend resourceanalysis tool for kernels, which has been implemented using an existing resourceanalysis tool for imperative programs. An experimental evaluation on a suite of benchmarks shows that the analysis is effective in aiding the detection of performance bugs in CUDA kernels.more » « lessFree, publiclyaccessible full text available March 31, 2025

Deadlocks, in which threads wait on each other in a cyclic fashion and can't make progress, have plagued parallel programs for decades. In recent years, as the parallel programming mechanism known as futures has gained popularity, interest in preventing deadlocks in programs with futures has increased as well. Various static and dynamic algorithms exist to detect and prevent deadlock in programs with futures, generally by constructing some approximation of the dependency graph of the program but, as far as we are aware, all are specialized to a particular programming language. A recent paper introduced graph types, by which one can statically approximate the dependency graphs of a program in a languageindependent fashion. By analyzing the graph type directly instead of the source code, a graphbased program analysis, such as one to detect deadlock, can be made languageindependent. Indeed, the paper that proposed graph types also proposed a deadlock detection algorithm. Unfortunately, the algorithm was based on an unproven conjecture which we show to be false. In this paper, we present, and prove sound, a type system for finding possible deadlocks in programs that operates over graph types and can therefore be applied to many different languages. As a proof of concept, we have implemented the algorithm over a subset of the OCaml language extended with builtin futures.more » « lessFree, publiclyaccessible full text available February 20, 2025

Recent work has proposed a memory property for parallel programs, called disentanglement, and showed that it is pervasive in a variety of programs, written in different languages, ranging from C/C++ to Parallel ML, and showed that it can be exploited to improve the performance of parallel functional programs. All existing work on disentanglement, however, considers the fork/join model for parallelism and does not apply to futures, the more powerful approach to parallelism. This is not surprising: fork/join parallel programs exhibit a reasonably strict dependency structure (e.g., seriesparallel DAGs), which disentanglement exploits. In contrast, with futures, parallel computations become firstclass values of the language, and thus can be created, and passed between functions calls or stored in memory, just like other ordinary values, resulting in complex dependency structures, especially in the presence of mutable state. For example, parallel programs with futures can have deadlocks, which is impossible with forkjoin parallelism.
In this paper, we are interested in the theoretical question of whether disentanglement may be extended beyond fork/join parallelism, and specifically to futures. We consider a functional language with futures, Input/Output (I/O), and mutable state (references) and show that a broad range of programs written in this language are disentangled. We start by formalizing disentanglement for futures and proving that purely functional programs written in this language are disentangled. We then generalize this result in three directions. First, we consider state (effects) and prove that stateful programs are disentangled if they are race free. Second, we show that race freedom is sufficient but not a necessary condition and nondeterministic programs, e.g. those that use atomic readmodifyoperations and some nondeterministic combinators, may also be disentangled. Third, we prove that disentangled taskparallel programs written with futures are free of deadlocks, which arise due to interactions between state and the rich dependencies that can be expressed with futures. Taken together, these results show that disentanglement generalizes to parallel programs with futures and, thus, the benefits of disentanglement may go well beyond forkjoin parallelism.
Free, publiclyaccessible full text available January 5, 2025 
Parallel programs are frequently modeled as
dependency orcost graphs, which can be used to detect various bugs, or simply to visualize the parallel structure of the code. However, such graphs reflect just one particular execution and are typically constructed in aposthoc manner.Graph types , which were introduced recently to mitigate this problem, can be assigned statically to a program by a type system and compactly represent the family of all graphs that could result from the program.Unfortunately, prior work is restricted in its treatment of
futures , an increasingly common and especially dynamic form of parallelism. In short, each instance of a future must be statically paired with a vertex name. Previously, this led to the restriction that futures could not be placed in collections or be used to construct data structures. Doing so is not a niche exercise: such structures form the basis of numerous algorithms that use forms of pipelining to achieve performance not attainable without futures. All but the most limited of these examples are out of reach of prior graph type systems.In this paper, we propose a graph type system that allows for almost arbitrary combinations of futures and recursive data types. We do so by indexing datatypes with a typelevel
vertex structure , a codata structure that supplies unique vertex names to the futures in a data structure. We prove the soundness of the system in a parallel core calculus annotated with vertex structures and associated operations. Although the calculus is annotated, this is merely for convenience in defining the type system. We prove that it is possible to annotate arbitrary recursive types with vertex structures, and show using a prototype inference engine that these annotations can be inferred from OCamllike source code for several complex parallel algorithms.Free, publiclyaccessible full text available January 5, 2025 
Many concurrent programs assign priorities to threads to improve responsiveness. When used in conjunction with synchronization mechanisms such as mutexes and condition variables, however, priorities can lead to priority inversions, in which highpriority threads are delayed by lowpriority ones. Priority inversions in the use of mutexes are easily handled using dynamic techniques such as priority inheritance, but priority inversions in the use of condition variables are not wellstudied and dynamic techniques are not suitable. In this work, we use a combination of static and dynamic techniques to prevent priority inversion in code that uses mutexes and condition variables. A type system ensures that condition variables are used safely, even while dynamic techniques change thread priorities at runtime to eliminate priority inversions in the use of mutexes. We prove the soundness of our system, using a model of priority inversions based on cost models for parallel programs. To show that the type system is practical to implement, we encode it within the type systems of Rust and C++, and show that the restrictions are not overly burdensome by writing sizeable case studies using these encodings, including porting the Memcached object server to use our C++ implementation.more » « less

Many algorithms for analyzing parallel programs, for example to detect deadlocks or data races or to calculate the execution cost, are based on a model variously known as a cost graph, computation graph or dependency graph, which captures the parallel structure of threads in a program. In modern parallel programs, computation graphs are highly dynamic and depend greatly on the program inputs and execution details. As such, most analyses that use these graphs are either dynamic analyses or are specialized static analyses that gather a subset of dependency information for a specific purpose. This paper introduces graph types, which compactly represent all of the graphs that could arise from program execution. Graph types are inferred from a parallel program using a graph type system and inference algorithm, which we present drawing on ideas from HindleyMilner type inference, affine logic and region type systems. We have implemented the inference algorithm over a subset of OCaml, extended with parallelism primitives, and we demonstrate how graph types can be used to accelerate the development of new graphbased static analyses by presenting proofofconcept analyses for deadlock detection and cost analysis.more » « less

null (Ed.)Generalpurpose programming on GPUs (GPGPU) is becoming increasingly in vogue as applications such as machine learning and scientific computing demand high throughput in vectorparallel applications. NVIDIA's CUDA toolkit seeks to make GPGPU programming accessible by allowing programmers to write GPU functions, called kernels, in a small extension of C/C++. However, due to CUDA's complex execution model, the performance characteristics of CUDA kernels are difficult to predict, especially for novice programmers. This paper introduces a novel quantitative program logic for CUDA kernels, which allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of common but CUDAspecific performance bottlenecks. The logic is proved sound with respect to a novel operational cost semantics for CUDA kernels. The semantics, logic and soundness proofs are formalized in Coq. An inference algorithm based on LP solving automatically synthesizes symbolic resource bounds by generating derivations in the logic. This algorithm is the basis of RaCuda, an endtoend resourceanalysis tool for kernels, which has been implemented using an existing resourceanalysis tool for imperative programs. An experimental evaluation on a suite of CUDA benchmarks shows that the analysis is effective in aiding the detection of performance bugs in CUDA kernels.more » « less