Title: PILOT: a Runtime System to Manage Multi-tenant GPU Unified Memory Footprint
Concurrent kernel execution on GPUs has proven an effective technique to improve system throughput by maximizing resource utilization. In order to increase programmability and meet the increasing memory requirements of data-intensive applications, current GPUs support Unified Virtual Memory (UVM), which provides a virtual memory abstraction with demand paging. By allowing applications to oversubscribe GPU memory, UVM provides increased opportunities to share GPU resources across applications. However, in the presence of applications with competing memory requirements, GPU sharing can lead to performance degradation due to thrashing. NVIDIA's Multi-Process Service (MPS) offers the capability to space-share bare-metal GPUs, thereby enabling cluster workload managers, such as Slurm, to share a single GPU across MPI ranks with limited control over resource partitioning. However, it is not possible to preempt, schedule, or throttle a running GPU process through MPS. These features would enable new OS-managed scheduling policies to be implemented for GPU kernels to dynamically handle resource contention and offer consistent performance. The contribution of this paper is twofold. We first show how memory oversubscription can impact the performance of concurrent GPU applications. Then, we propose three methods to transparently mitigate memory interference through kernel preemption and scheduling policies. To implement our policies, we develop our own runtime system (PILOT) to serve as an alternative to NVIDIA's MPS. In the presence of memory oversubscription, we noticed a dramatic improvement in the overall throughput when using our scheduling policies and runtime hints.
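The thrashing effect described in the abstract can be illustrated with a toy model. The sketch below is not PILOT's implementation and does not reproduce any of its three policies; it assumes a plain LRU-managed page pool standing in for GPU memory, two hypothetical applications that each sweep cyclically over a 60-page working set, and a 100-page capacity. All names (`count_faults`, `sweep`, `interleave`, `CAPACITY`) are made up for illustration. Each working set fits alone, but the combined footprint does not, so interleaving the two access streams faults on every access, while running the applications one after the other only incurs cold misses.

```python
from collections import OrderedDict


def count_faults(accesses, capacity):
    """Count page faults for an access stream under LRU eviction."""
    resident = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)        # hit: refresh recency
        else:
            faults += 1                       # miss: page must be migrated in
            if len(resident) >= capacity:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = None
    return faults


def sweep(app, pages, passes):
    """Access stream of one app touching its working set in order, `passes` times."""
    return [(app, p) for _ in range(passes) for p in range(pages)]


def interleave(a, b):
    """Alternate accesses from two equally long streams (toy concurrent execution)."""
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out


CAPACITY = 100            # assumed number of pages that fit in device memory
a = sweep("A", 60, 10)    # hypothetical app A: 60-page working set, 10 passes
b = sweep("B", 60, 10)    # hypothetical app B: same shape

concurrent_faults = count_faults(interleave(a, b), CAPACITY)
serialized_faults = count_faults(a + b, CAPACITY)

print("concurrent:", concurrent_faults)  # 1200: every access faults
print("serialized:", serialized_faults)  # 120: cold misses only
```

Serializing the two applications is only the simplest possible scheduling decision; it is used here to show why a runtime that can preempt or reorder kernels has room to reduce page migration when the combined footprint exceeds device memory.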
Award ID(s):
1812727 1741683
PAR ID:
10324577
Journal Name:
2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC)
Page Range / eLocation ID:
442 - 447
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead. 
  2. The recent introduction of Unified Virtual Memory (UVM) in GPUs offers a new programming model that allows GPUs and CPUs to share the same virtual memory space, which shifts complex memory management from programmers to the GPU driver and hardware and enables kernel execution even when memory is oversubscribed. Meanwhile, UVM may also incur considerable performance overhead due to tracking and data migration, along with special handling of page faults and page table walks. As UVM is attracting significant attention from the research community to develop innovative solutions to these problems, in this paper we propose a comprehensive UVM benchmark suite named UVMBench to facilitate future research on this important topic. The proposed UVMBench consists of 32 representative benchmarks from a wide range of application domains. The suite also features unified programming implementation and diverse memory access patterns across benchmarks, thus allowing thorough evaluation and comparison with the current state of the art. A set of experiments has been conducted on real GPUs to verify and analyze the benchmark suite's behavior under various scenarios.
  3. Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal GPU. The older CUDA programming style is akin to older large-memory UNIX applications, which used to load and unload memory segments directly. Newer CUDA programs have started taking advantage of UVM for the same reason of superior programmability that led UNIX applications, long ago, to assume the presence of virtual memory. Therefore, checkpointing of UVM has become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings use NVIDIA GPUs, with a current trend of ten additional NVIDIA-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple computer nodes. The support for UVM is particularly attractive for programs requiring more memory than resides on the GPU, since the alternative to UVM is for the application to directly copy memory between device and host. Furthermore, CRUM supports fast, forked checkpointing, which mostly overlaps the CUDA computation with the writing of the checkpoint image to stable storage. The runtime overhead of using CRUM is 6% on average, and the time for forked checkpointing is seen to be a factor of up to 40 times less than traditional, synchronous checkpointing.
  4. An increasing number of applications benefit from heterogeneous hardware accelerators. Such accelerators often require the application to manually manage memory buffers on devices and transfer data between host and device buffers. A programming model that unifies the virtual address space across the host and devices is appealing because it enables automatic memory transfers and simplifies application-level programming. However, the automatic memory transfers can sometimes be redundant, which decreases performance. NVIDIA’s UVM (unified virtual memory) driver provides a unified virtual address space for CPU-GPU programming. This paper identifies redundant memory transfers (RMTs) as a common performance issue with UVM. To address this issue, this paper proposes a data discard directive, and evaluates two implementations of that directive, UvmDiscard and UvmDiscardLazy. This directive exploits application-level knowledge to avoid RMTs. The implementations were integrated with NVIDIA’s open-source UVM driver to demonstrate their usefulness on real-world CUDA UVM applications. For example, the use of the discard directive increases training throughput by 61.2% on a large deep learning application that oversubscribes GPU memory. 
  5. The abstraction of a shared memory space over separate CPU and GPU memory domains has eased the burden of portability for many HPC codebases. However, users pay for the ease of use provided by system-managed memory with a moderate-to-high performance overhead. NVIDIA Unified Virtual Memory (UVM) is currently the primary real-world implementation of such an abstraction and offers a functionally equivalent testbed for in-depth performance study for both UVM and future Linux Heterogeneous Memory Management (HMM) compatible systems. The continued advocacy for UVM and HMM motivates improvement of the underlying system. We focus on UVM-based systems and investigate the root causes of UVM overhead, a non-trivial task due to complex interactions of multiple hardware and software constituents and the desired cost granularity.

    In our prior work, we delved deeply into UVM system architecture and showed internal behaviors of page fault servicing in batches. We provided quantitative evaluation of batch handling for various applications under different scenarios, including prefetching and oversubscription. We revealed that the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. Host OS components have significant overhead present across implementations, warranting close attention.

    This extension furthers our prior study in three aspects: fine-grain cost analysis and breakdown, extension to multiple GPUs, and investigation of platforms with different GPU-GPU interconnects. We take a top-down approach to quantitative batch analysis and uncover how constituent component costs accumulate and overlap, governed by synchronous and asynchronous operations. Our multi-GPU analysis shows reduced cost of GPU-GPU batch workloads compared to CPU-GPU workloads. We further demonstrate that while specialized interconnects such as NVLink can improve batch cost, their benefits are limited by host OS software overhead and GPU oversubscription. This study serves as a proxy for future shared memory systems, such as those that interface with HMM, and for the development of interconnects.

     