As AI inference becomes mainstream, research has begun to focus on reducing the energy consumption of inference servers. Inference kernels commonly underutilize a GPU's compute resources and waste power in idling components. To improve utilization and energy efficiency, multiple models can be co-located to share the GPU. However, typical GPU spatial partitioning techniques often incur significant overheads when reconfiguring spatial partitions, so energy is wasted either on repartitioning itself or on running with non-optimal partition configurations. In this paper, we present ECLIP, a framework that enables low-overhead, energy-efficient, kernel-wise resource partitioning between co-located inference kernels. ECLIP minimizes repartitioning overheads by pre-allocating pools of CU-masked streams and determines CU assignments for groups of kernels through its resource allocation optimizer. Overall, ECLIP achieves an average of 13% improvement in throughput and 25% improvement in energy efficiency.
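On AMD GPUs, CU-masked streams of the kind ECLIP pools can be created through HIP's CU-mask extension. The sketch below is an illustrative assumption about how such a pool might be pre-allocated, not ECLIP's actual code; the CU count, mask layout, and pool sizes are made up for the example.

    // Sketch: pre-allocate a pool of CU-masked HIP streams, each restricted
    // to a different number of compute units (CUs). Assumes a 64-CU device.
    #include <hip/hip_runtime.h>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        const int kNumCUs = 64;                              // assumed CU count
        const std::vector<int> poolSizes = {8, 16, 32, 64};  // CUs per pooled stream
        std::vector<hipStream_t> streamPool;

        for (int cus : poolSizes) {
            // Build a bitmask enabling the first `cus` compute units.
            std::vector<uint32_t> cuMask((kNumCUs + 31) / 32, 0);
            for (int i = 0; i < cus; ++i)
                cuMask[i / 32] |= (1u << (i % 32));

            hipStream_t s;
            // Kernels launched on `s` are restricted to the CUs set in the mask.
            if (hipExtStreamCreateWithCUMask(&s,
                    static_cast<uint32_t>(cuMask.size()), cuMask.data()) != hipSuccess) {
                fprintf(stderr, "failed to create CU-masked stream (%d CUs)\n", cus);
                continue;
            }
            streamPool.push_back(s);
        }

        // Inference kernels would then be dispatched onto the pooled stream whose
        // CU share matches the optimizer's assignment, avoiding the cost of
        // creating or reconfiguring CU-masked streams at run time.
        for (auto s : streamPool) hipStreamDestroy(s);
        return 0;
    }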
Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems
As GPU-using tasks become more common in embedded, safety-critical systems, efficiency demands necessitate sharing a single GPU among multiple tasks. Unfortunately, existing ways to schedule multiple tasks onto a GPU often result in either a loss of the ability to meet deadlines or a loss of efficiency. In this work, we develop a system-level spatial compute partitioning mechanism for NVIDIA GPUs and demonstrate that it can be used to execute tasks efficiently without compromising timing predictability. Our tool, called nvtaskset, supports composable systems by not requiring task, driver, or hardware modifications. In our evaluation, we demonstrate sub-1-μs overheads, stronger partition enforcement, and finer-granularity partitioning when using our mechanism instead of NVIDIA's Multi-Process Service (MPS) or Multi-Instance GPU (MIG) features.
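As an aside, one common way to observe partition enforcement on NVIDIA GPUs is to have each thread block report the streaming multiprocessor (SM) it executed on via the %smid special register. The sketch below is generic CUDA, not part of nvtaskset; the kernel and buffer names are illustrative.

    // Sketch: each block records the ID of the SM it ran on, read from the
    // %smid special register via inline PTX. If the launching process was
    // confined to a subset of SMs, only those SM IDs should appear.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void recordSmIds(unsigned int* smIds) {
        unsigned int smid;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
        if (threadIdx.x == 0) smIds[blockIdx.x] = smid;
    }

    int main() {
        const int numBlocks = 256;
        unsigned int *d_ids, h_ids[numBlocks];
        cudaMalloc(&d_ids, numBlocks * sizeof(unsigned int));

        recordSmIds<<<numBlocks, 128>>>(d_ids);
        cudaMemcpy(h_ids, d_ids, sizeof(h_ids), cudaMemcpyDeviceToHost);

        for (int i = 0; i < numBlocks; ++i)
            printf("block %d ran on SM %u\n", i, h_ids[i]);

        cudaFree(d_ids);
        return 0;
    }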
- Award ID(s):
- 2333120
- PAR ID:
- 10652973
- Editor(s):
- Mancuso, Renato
- Publisher / Repository:
- Schloss Dagstuhl – Leibniz-Zentrum für Informatik
- Date Published:
- Volume:
- 335
- ISSN:
- 1868-8969
- Page Range / eLocation ID:
- 21:1-21:25
- Subject(s) / Keyword(s):
- Real-time systems; composable systems; graphics processing units; CUDA; Computer systems organization → Heterogeneous (hybrid) systems; Computer systems organization → Real-time systems; Software and its engineering → Scheduling; Software and its engineering → Concurrency control; Computing methodologies → Graphics processors; Computing methodologies → Concurrent computing methodologies
- Format(s):
- Medium: X; Size: 25 pages, 1249849 bytes; Other: application/pdf
- Size(s):
- 25 pages, 1249849 bytes
- Sponsoring Org:
- National Science Foundation
More Like this
- Massive multi-user multiple-input multiple-output (MU-MIMO) enables significant gains in spectral efficiency and link reliability compared to conventional, small-scale MIMO technology. In addition, linear precoding using zero forcing or Wiener filter (WF) precoding is sufficient to achieve excellent error rate performance in the massive MU-MIMO downlink. However, these methods typically require centralized processing at the base-station (BS), which causes (i) excessively high interconnect and chip input/output data rates, and (ii) high implementation complexity. We propose two feed-forward architectures and corresponding decentralized WF precoders that parallelize precoding across multiple computing fabrics, effectively mitigating the limitations of centralized approaches. To demonstrate the efficacy of our decentralized precoders, we provide implementation results on a multi-GPU system, which show that our solutions achieve throughputs in the Gbit/s regime while achieving (near-)optimal error-rate performance in the massive MU-MIMO downlink.
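For context, a commonly used textbook form of the Wiener filter (MMSE) downlink precoder is shown below; notation and the exact regularization and normalization constants vary between papers, so this is a generic reference rather than the authors' specific decentralized formulation.

    \[
      \mathbf{P}^{\mathrm{WF}}
        = \frac{1}{\beta}\,\mathbf{H}^{\mathsf{H}}
          \left(\mathbf{H}\mathbf{H}^{\mathsf{H}} + \frac{U N_0}{E_s}\,\mathbf{I}_U\right)^{-1},
      \qquad
      \mathbf{x} = \mathbf{P}^{\mathrm{WF}}\,\mathbf{s},
    \]

where H is the U × B downlink channel matrix (U users, B BS antennas), N0 the noise variance, Es the per-user symbol energy, β a power-normalization constant, s the user symbol vector, and x the precoded transmit vector. The decentralized architectures in the paper avoid forming and inverting this system on a single centralized fabric.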
- In a CPU-GPU heterogeneous computing system, the different types of processors face load-balancing problems during computation, and matching multiple tasks to the appropriate processor cores is a further open problem. In this paper, we propose task scheduling strategies for high-performance CPU-GPU heterogeneous computing platforms to address these problems. For the single-task model, we propose a load-aware scheduling strategy: it measures the computing power of the CPU and the GPU on the specified tasks and allocates work to each processor according to the measured ratio, storing tasks in a double-ended queue to reduce scheduling overhead (see the sketch below). For the multi-task model, we propose a scheduling strategy based on a genetic algorithm, which aims to improve the overall operating efficiency of the system by binding each type of task to the most suitable heterogeneous processing core. Our experimental results show that these scheduling strategies improve both parallel-computing efficiency and overall system performance.
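A rough illustration of the load-aware split for the single-task model follows. This is a sketch under assumptions, not the paper's implementation: processOnCpu and processOnGpu are hypothetical stand-ins for the real task kernels, and the workload and probe sizes are arbitrary.

    // Sketch: probe CPU and GPU throughput on a small sample, then divide the
    // remaining work in proportion to the measured rates.
    #include <chrono>
    #include <cstdio>

    static void processOnCpu(int n) { for (volatile int i = 0; i < n * 1000; ++i) {} }  // placeholder CPU work
    static void processOnGpu(int n) { for (volatile int i = 0; i < n * 100;  ++i) {} }  // placeholder GPU work

    static double timeIt(void (*fn)(int), int n) {
        auto t0 = std::chrono::steady_clock::now();
        fn(n);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        const int total = 1000000, probe = 10000;

        // Probe both processors on a small slice of the workload.
        double cpuRate = probe / timeIt(processOnCpu, probe);
        double gpuRate = probe / timeIt(processOnGpu, probe);

        // Split the remaining work in proportion to measured throughput.
        int remaining = total - 2 * probe;
        int cpuShare = static_cast<int>(remaining * cpuRate / (cpuRate + gpuRate));
        int gpuShare = remaining - cpuShare;

        printf("CPU share: %d tasks, GPU share: %d tasks\n", cpuShare, gpuShare);
        // A double-ended queue would let the shares be drawn from opposite ends,
        // so an idle processor can keep pulling work without rescheduling.
        return 0;
    }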
- Edge cloud data centers (Edge) are deployed to provide responsive services to end-users. Edge can host more powerful CPUs and DNN accelerators such as GPUs, and may be used for offloading tasks from end-user devices that require more significant compute capabilities. But Edge resources may also be limited and must be shared across multiple applications that process requests concurrently from several clients. However, multiplexing GPUs across applications is challenging. With edge cloud servers needing to process a lot of streaming data and the advent of multi-GPU systems, getting that data from the network to the GPU can be a bottleneck, limiting the amount of work the GPU cluster can do. The lack of prompt notification of job completion from the GPU can also result in poor GPU utilization. We build on our recent work on controlled spatial sharing of a single GPU to support multi-GPU systems, and propose a framework that addresses these challenges. Unlike the state-of-the-art uncontrolled spatial sharing currently available with systems such as CUDA-MPS, our controlled spatial sharing approach uses each GPU in the cluster efficiently by removing interference between applications, resulting in much better and more predictable inference latency. We also use each cluster GPU's DMA engines to offload data transfers to the GPU complex, thereby preventing the CPU from being the bottleneck. Finally, our framework uses the CUDA event library to give timely, low-overhead GPU notifications. Our evaluations show we can achieve low DNN inference latency and improve DNN inference throughput by at least a factor of 2.
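A minimal sketch of the kind of low-overhead completion notification CUDA events provide follows. This is generic CUDA event usage, not the paper's framework; dummyInference and the buffer sizes are made-up placeholders.

    // Sketch: record a lightweight CUDA event after a kernel and poll it from
    // the host, instead of blocking on cudaStreamSynchronize.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void dummyInference(float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = out[i] * 2.0f;   // stand-in for real DNN work
    }

    int main() {
        const int n = 1 << 20;
        float* d_out;
        cudaMalloc(&d_out, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cudaEvent_t done;
        // cudaEventDisableTiming keeps the event lightweight (no timestamp).
        cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

        dummyInference<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n);
        cudaEventRecord(done, stream);       // marks completion of the kernel

        // Poll for completion; a serving framework could dispatch the next
        // request as soon as the GPU frees up, keeping utilization high.
        while (cudaEventQuery(done) == cudaErrorNotReady) { /* do other work */ }

        printf("inference batch complete\n");
        cudaEventDestroy(done);
        cudaStreamDestroy(stream);
        cudaFree(d_out);
        return 0;
    }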