Compute heterogeneity is gaining prominence in modern datacenters due to the addition of accelerators like GPUs and FPGAs. We observe that datacenter schedulers are agnostic to these emerging accelerators, especially their resource utilization footprints, and are thus not well equipped to dynamically provision them based on application needs. In particular, state-of-the-art datacenter schedulers fail to provide fine-grained resource guarantees for latency-sensitive tasks that are GPU-bound. For GPUs, this results in resource fragmentation and interference, leading to poor utilization of allocated GPU resources. Furthermore, GPUs exhibit highly linear energy efficiency with respect to utilization, so proactive management of these resources is essential to keep operational costs low while ensuring end-to-end Quality of Service (QoS) for user-facing queries. To address the GPU orchestration problem, we build Knots, a GPU-aware resource orchestration layer, and integrate it with the Kubernetes container orchestrator to build Kube-Knots. Kube-Knots can dynamically harvest spare compute cycles through dynamic container orchestration, enabling co-location of latency-critical and batch workloads while improving overall resource utilization. We design and evaluate two GPU-based scheduling techniques to schedule datacenter-scale workloads through Kube-Knots on a ten-node GPU cluster. Our proposed Correlation Based Prediction (CBP) and Peak Prediction (PP) schemes together improve both average and 99th-percentile cluster-wide GPU utilization by up to 80% for HPC workloads. In addition, CBP+PP improves the average job completion time (JCT) of deep learning workloads by up to 36% compared to state-of-the-art schedulers. This leads to 33% cluster-wide energy savings on average across three different workloads compared to state-of-the-art GPU-agnostic schedulers. Further, the proposed PP scheduler guarantees end-to-end QoS for latency-critical queries, reducing QoS violations by up to 53% compared to state-of-the-art GPU schedulers.
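The abstract names CBP and PP but does not reproduce their formulation. As a rough, hypothetical illustration of peak-prediction-driven packing, the sketch below estimates a container's near-term peak GPU utilization from a high percentile of its recent samples and packs it onto the busiest GPU that still fits; all names, thresholds, and the percentile choice are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of peak-prediction-based GPU packing in the spirit of a
# PP-style scheduler; the paper's actual CBP/PP models are more involved.
from typing import Dict, List, Optional
import numpy as np

def predicted_peak(samples: List[float], pct: float = 99.0) -> float:
    """Estimate near-term peak utilization (%) from a sliding window of samples."""
    return float(np.percentile(samples, pct)) if samples else 100.0  # be conservative

def place_container(history: List[float],
                    gpu_load: Dict[str, float],
                    capacity: float = 100.0) -> Optional[str]:
    """Pick the GPU whose predicted residual capacity fits the container's peak."""
    peak = predicted_peak(history)
    # Prefer the most-loaded GPU that still fits: packing keeps other GPUs idle,
    # which helps energy proportionality.
    candidates = [(load, gpu) for gpu, load in gpu_load.items() if load + peak <= capacity]
    if not candidates:
        return None  # defer the container instead of risking interference
    _, best_gpu = max(candidates)
    return best_gpu

# Example: a batch container whose recent utilization peaked around 35%
gpus = {"gpu-0": 60.0, "gpu-1": 10.0}
print(place_container([20, 25, 35, 30, 22], gpus))  # -> "gpu-0"
```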
Energy-efficient GPU SM allocation
GPU sharing between workloads is an effective approach to increase GPU utilization and reduce idle power waste. To minimize resource contention under GPU sharing, current architectures allow users to allocate core GPU compute resources exclusively to workloads. However, identifying the most efficient GPU compute resource allocation for colocated workloads is challenging, as it requires balancing potential performance degradation and power savings. This paper presents a framework for finding the most energy-efficient compute allocation for colocated workload pairs under NVIDIA MPS using lightweight prediction models. Experimental results, using a range of training, inference, and general CUDA workloads, demonstrate that our solution outperforms the equal-sharing strategy by 35%, on average, and is within 1.5% of the offline optimal strategy.
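As a rough illustration of the idea (not the paper's framework), the sketch below sweeps candidate SM splits for a colocated pair and picks the split with the lowest predicted energy. The predictor callbacks are placeholders standing in for the paper's lightweight models; per-client compute limits are applied through NVIDIA MPS's documented CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable.

```python
# Hypothetical sketch: choose an energy-efficient SM split for two workloads
# colocated under NVIDIA MPS, given externally supplied runtime/power predictors.
import os
import subprocess
from typing import Callable, List, Tuple

def pick_split(predict_runtime: Callable[[str, int], float],
               predict_power: Callable[[int, int], float],
               candidates=range(10, 100, 10)) -> Tuple[int, int]:
    """Return (pct_a, pct_b) minimizing predicted energy for the colocated pair."""
    best, best_energy = (50, 50), float("inf")
    for pct_a in candidates:
        pct_b = 100 - pct_a
        # Pair runs until the slower workload finishes; energy ~= power * runtime.
        runtime = max(predict_runtime("A", pct_a), predict_runtime("B", pct_b))
        energy = predict_power(pct_a, pct_b) * runtime
        if energy < best_energy:
            best, best_energy = (pct_a, pct_b), energy
    return best

def launch(cmd: List[str], thread_pct: int) -> subprocess.Popen:
    """Launch one MPS client with a per-client active-thread percentage."""
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(thread_pct))
    return subprocess.Popen(cmd, env=env)
```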
- PAR ID: 10668941
- Publisher / Repository: ACM
- Date Published:
- Journal Name: ACM SIGMETRICS Performance Evaluation Review
- Volume: 53
- Issue: 2
- ISSN: 0163-5999
- Page Range / eLocation ID: 33 to 38
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Deep neural networks (DNNs) are increasingly used for real-time inference, which requires low latency, yet they demand significant computational power as they continue to grow in complexity. Edge clouds promise lower latency due to their proximity to end users and their powerful accelerators, such as GPUs, which provide the computational power DNNs need. But it is also important to ensure that edge-cloud resources are utilized well. For this, multiplexing several DNN models through spatial sharing of the GPU can substantially improve edge-cloud resource usage. Typical GPU runtime environments involve significant interactions with the CPU, e.g., to transfer data to the GPU and to synchronize the CPU and GPU on inference-task completions; these interactions introduce overheads. We present a DNN inference framework with a set of software primitives that reduce the overhead of DNN inference, increase GPU utilization, and improve performance, with lower latency and higher throughput. Our first primitive uses the GPU DMA engine effectively, reducing the CPU cycles spent transferring data to the GPU. A second primitive uses asynchronous 'events' for faster task-completion notification. GPU runtimes typically preclude fine-grained user control over GPU resources, causing long GPU downtimes when adjusting resources. Our third primitive supports overlapping of model loading and execution, allowing GPU resource re-allocation with very little GPU idle time. Our other primitives increase inference throughput by improving scheduling and processing more requests. Overall, our primitives decrease inference latency by more than 35% and increase DNN throughput by 2-3×. (A minimal sketch of the asynchronous-transfer and event primitives appears after this list.)
- As AI inference becomes mainstream, research has begun to focus on improving the energy consumption of inference servers. Inference kernels commonly underutilize a GPU's compute resources and waste power through idling components. To improve utilization and energy efficiency, multiple models can be co-located and share the GPU. However, typical GPU spatial-partitioning techniques often incur significant overheads when reconfiguring spatial partitions, which can waste additional energy through repartitioning overheads or non-optimal partition configurations. In this paper, we present ECLIP, a framework that enables low-overhead, energy-efficient, kernel-wise resource partitioning between co-located inference kernels. ECLIP minimizes repartitioning overheads by pre-allocating pools of CU-masked streams and assigns optimal CU allocations to groups of kernels through our resource allocation optimizer. Overall, ECLIP achieves an average 13% improvement in throughput and 25% improvement in energy efficiency.
- Dynamic spectrum access (DSA) is regarded as one of the key enabling technologies for future communication networks. In this paper, we introduce a power allocation strategy for distributed DSA networks using a powerful machine learning tool, namely deep reinforcement learning. The introduced power allocation strategy enables DSA users to conduct power allocation in a distributed fashion without relying on channel state information or cooperation among DSA users. Furthermore, to capture the temporal correlation of the underlying DSA network environments, reservoir computing, a special class of recurrent neural network, is employed to realize the introduced deep reinforcement learning scheme. The combination of reservoir computing and deep reinforcement learning significantly improves the efficiency of the introduced resource allocation scheme. Simulation evaluations demonstrate the effectiveness of the introduced power allocation strategy.
- Embedded and autonomous systems are increasingly integrating AI/ML features, often enabled by a hardware accelerator such as a GPU. As these workloads become increasingly demanding, but size, weight, power, and cost constraints remain unyielding, ways to increase GPU capacity are an urgent need. In this work, we provide a means to spatially partition the computing units of NVIDIA GPUs transparently, allowing oft-idled capacity to be reclaimed via safe and efficient GPU sharing. Our approach works on any NVIDIA GPU since 2013 and can be applied via our easy-to-use, user-space library titled libsmctrl. We back the design of our system with deep investigations into the hardware scheduling pipeline of NVIDIA GPUs. We provide guidelines for the use of our system and demonstrate it via an object-detection case study using YOLOv2.
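For the libsmctrl record above, the sketch below only illustrates the bookkeeping behind this kind of spatial partitioning, namely handing co-runners disjoint bitmasks over a GPU's TPCs/SMs. It deliberately does not use libsmctrl's actual API; mask polarity (whether a set bit enables or disables a unit) and granularity are hardware- and library-specific, so treat both as assumptions.

```python
# Illustrative only (not libsmctrl's API): build two disjoint compute-unit masks
# so each co-located workload is confined to its own slice of the GPU.
def split_masks(num_tpcs: int, share_a: float) -> tuple:
    """Split num_tpcs units into two disjoint bitmasks; share_a is A's fraction."""
    n_a = max(1, min(num_tpcs - 1, round(num_tpcs * share_a)))
    mask_a = (1 << n_a) - 1                    # lowest n_a units for workload A
    mask_b = ((1 << num_tpcs) - 1) & ~mask_a   # remaining units for workload B
    return mask_a, mask_b

a, b = split_masks(num_tpcs=24, share_a=0.25)
print(f"A: {a:#08x}  B: {b:#08x}  disjoint: {a & b == 0}")
```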
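For the edge-cloud DNN inference record above, the following is a hypothetical PyTorch sketch of two of the described primitives, assuming pinned host buffers with non-blocking copies to keep the DMA engine busy and a CUDA event as a cheap, asynchronous completion signal. It is not the paper's framework; model and tensor shapes are placeholders.

```python
# Hypothetical sketch: asynchronous H2D transfer plus event-based completion
# notification, so the CPU polls instead of blocking on the GPU.
import torch

def async_infer(model: torch.nn.Module, batch: torch.Tensor):
    stream = torch.cuda.Stream()
    done = torch.cuda.Event()
    host_input = batch.pin_memory()            # pinned memory enables true async DMA
    with torch.cuda.stream(stream):
        dev_input = host_input.to("cuda", non_blocking=True)
        with torch.no_grad():
            out = model(dev_input)
        done.record(stream)                    # completion notification
    return out, done                           # caller polls done.query() later

# Usage sketch: launch work, then poll instead of blocking the CPU.
# out, done = async_infer(model.cuda().eval(), torch.randn(8, 3, 224, 224))
# while not done.query():
#     do_other_cpu_work()
```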