skip to main content


Title: Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUs
Domain-specific languages that execute image processing pipelines on GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have limitations: 1) they require intra thread block synchronization, which has a nontrivial cost, 2) they must choose between small tiles that require more overlapped computations or large tiles that increase shared memory access (and lowers occupancy), and 3) their autoscheduling algorithms use simplified GPU models that can result in inefficient global memory accesses. We present a new approach for executing image processing pipelines on GPUs that addresses these limitations as follows. 1) We fuse loops to form overlapped tiles that fit in a single warp, which allows us to use lightweight warp synchronization. 2) We introduce hybrid tiling, which stores overlapped regions in a combination of thread-local registers and shared memory. Thus hybrid tiling either increases occupancy by decreasing shared memory usage or decreases overlapping computations using larger tiles. 3) We present an automatic loop fusion algorithm that considers several factors that affect the performance of GPU kernels. We implement these techniques in PolyMage-GPU, which is a new GPU backend for PolyMage. Our approach produces code that is faster than Halide's manual schedules: 1.65x faster on an NVIDIA GTX 1080Ti and 1.33x faster on an NVIDIA Tesla V100.  more » « less
Award ID(s):
1717636
NSF-PAR ID:
10198225
Author(s) / Creator(s):
;
Date Published:
Journal Name:
International Conference on Parallel Architectures and Compilation Techniques
Page Range / eLocation ID:
317 to 328
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We design and implement a fully concurrent dynamic hash table for GPUs with comparable performance to the state of the art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. By using this strategy, we build a dynamic non-blocking concurrent linked list, the slab list, that supports asynchronous, concurrent updates (insertions and deletions) as well as search queries. We use the slab list to implement a dynamic hash table with chaining (the slab hash). On an NVIDIA Tesla K40c GPU, the slab hash performs updates with up to 512 M updates/s and processes search queries with up to 937 M queries/s. We also design a warp-synchronous dynamic memory allocator, SlabAlloc, that suits the high performance needs of the slab hash. SlabAlloc dynamically allocates memory at a rate of 600 M allocations/s, which is up to 37x faster than alternative methods in similar scenarios. 
    more » « less
  2. We design and implement a fully concurrent dynamic hash table for GPUs with comparable performance to the state of the art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the tradi- tional way of per-thread (or per-warp) work assignment and processing. By using this strategy, we build a dynamic non- blocking concurrent linked list, the slab list, that supports asynchronous, concurrent updates (insertions and deletions) as well as search queries. We use the slab list to implement a dynamic hash table with chaining (the slab hash). On an NVIDIA Tesla K40c GPU, the slab hash performs updates with up to 512 M updates/s and processes search queries with up to 937 M queries/s. We also design a warp-synchronous dynamic memory allocator, SlabAlloc, that suits the high performance needs of the slab hash. SlabAlloc dynamically allocates memory at a rate of 600 M allocations/s, which is up to 37x faster than alternative methods in similar scenarios. 
    more » « less
  3. This paper extends the reach of General Purpose GPU programming by presenting a software architecture that supports efficient fine-grained synchronization over global memory. The key idea is to transform global synchronization into global communication so that conflicts are serialized at the thread block level. With this structure, the threads within each thread block can synchronize using low latency, high-bandwidth local scratchpad memory. To enable this architecture, we implement a scalable and efficient message passing library. Using Nvidia GTX 1080 ti GPUs, we evaluate our new software architecture by using it to solve a set of five irregular problems on a variety of workloads. We find that on average, our solutions improve performance over carefully tuned state-of-the-art solutions by 3.6×. 
    more » « less
  4. Summary

    Graphics processing unit (GPU) computing is a popular approach to simulating complex models and performing massive calculations. GPUs have attracted a great deal of interest because they offer both high performance and energy efficiency. Efficient General‐Purpose computation on Graphics Processing Units requires good parallelism, memory coalescing, regular memory access, small overhead on data exchange between the CPU and the GPU, and few explicit global synchronizations, which are usually gained from optimizing the algorithms. Besides these advantages, the proper use of some novel features provided on NVIDIA GPUs can offer further improvement. In this paper, we modify an existing optimized GPU application to illustrate the potential performance gains of these features and to demonstrate the performance trade offs of different implementation choices. The paper focuses on the challenges of reducing interactions between CPU and GPU and reducing the use of explicit synchronization. We explain how to achieve these objectives using two features of the Kepler architecture, warp shuffle, and dynamic parallelism. We find that a judicious use of these two techniques, eliminating repeated operations and synchronizations, results in significantly better performance. We describe various pitfalls encountered in optimizing our code to use these two features and how these were addressed. In particular, we identify a subtle data race condition when using dynamic parallelism under certain circumstances, and present our solution. Finally, we present a technique to trade off the allocation of various device resources to find the parameters that offer the best performance. Copyright © 2016 John Wiley & Sons, Ltd.

     
    more » « less
  5. Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal GPU. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that UNIX applications long ago switched to assuming the presence of virtual memory. Therefore, checkpointing of UVM has become increasing important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings use NVIDIA GPUs, with a current trend of ten additional NVIDIA-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple computer nodes. The support for UVM is particularly attractive for programs requiring more memory than resides on the GPU, since the alternative to UVM is for the application to directly copy memory between device and host. Furthermore, CRUM supports a fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage. The runtime overhead of using CRUM is 6% on average, and the time for forked checkpointing is seen to be a factor of up to 40 times less than traditional, synchronous checkpointing. 
    more » « less