Title: Scheduling Challenges for Variable Capacity Resources
Datacenter scheduling research often treats resources as a constant quantity, but external factors beyond an operator's control increasingly shape capacity dynamically. Based on emerging examples, we define a new, open research challenge: the variable capacity resource scheduling problem. The objective is effective resource utilization despite sudden, perhaps large, changes in the available resources. We define the problem and the key dimensions of resource capacity variation, and give specific examples that arise from the natural world (carbon content, power price, datacenter cooling, and more). Key dimensions of resource capacity variation include dynamic range, frequency, and structure. With these dimensions, an empirical trace can be characterized, abstracting it from the many possible real-world generators of variation. Resource capacity variation can arise from many causes, including weather, market prices, renewable energy, carbon emission targets, and internal dynamic power management constraints. We give examples of three different sources of variable capacity. Finally, we show that variable resource capacity presents new scheduling challenges: variation can cause significant performance degradation in existing schedulers, with up to 60% goodput reduction. Initial results also show that intelligent scheduling techniques can be helpful. These insights show the promise and opportunity for future scheduling studies on resource volatility.
Award ID(s):
1901466
PAR ID:
10253544
Author(s) / Creator(s):
;
Editor(s):
Cirne, Walfredo; Rodrigo, Gonzalo P.; Klusáček, Dalibor
Date Published:
Journal Name:
Workshop on Job Scheduling for Parallel Processing (JSSPP)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  2. Traditional datacenter design and optimization for TCO and PUE are based on static views of power grids as well as computational loads. Power grids exhibit increasingly variable price and carbon emissions, becoming more so as government initiatives drive further decarbonization. The resulting opportunities require dynamic, temporal metrics (e.g., not simple averages), flexible systems, and intelligent adaptive control. Two research areas represent new opportunities to reduce both carbon and cost in this world of variable power, carbon, and price. First, the design and optimization of flexible datacenters. Second, cloud resource, power, and application management for variable-capacity datacenters. For each, we describe the challenges and potential benefits.
  3. We investigate the virtual-network-function (VNF) placement and scheduling problem in optical datacenter networks, considering the installation/de-installation latency of VNFs and the rapid variation of low-latency demands. The proposed scheme achieves low blocking probability, latency, and spectrum resource consumption.
  4. When scheduling multi-mode real-time systems on multi-core platforms, a key question is how to dynamically adjust shared resources, such as cache and memory bandwidth, when resource demands change, without jeopardizing schedulability during mode changes. This paper presents Omni, a first end-to-end solution to this problem. Omni consists of a novel multi-mode resource allocation algorithm and a resource-aware schedulability test that supports general mode-change semantics as well as dynamic cache and bandwidth resource allocation. Omni's resource allocation leverages the platform's concurrency and the diversity of the tasks' demands to minimize overload during mode transitions; it does so by intelligently co-distributing tasks and resources across cores. Omni's schedulability test ensures predictable mode transitions, and it takes into account mode-change effects on the resource demands on different cores, so as to best match their dynamic needs using the available resources. We have implemented a prototype of Omni, and we have evaluated it using randomly generated multi-mode systems with several real-world benchmarks as the workload. Our results show that Omni has low overhead, and that it is substantially more effective in improving schedulability than the state of the art.
  5. Compute heterogeneity is increasingly gaining prominence in modern datacenters due to the addition of accelerators like GPUs and FPGAs. We observe that datacenter schedulers are agnostic of these emerging accelerators, especially their resource utilization footprints, and thus not well equipped to dynamically provision them based on application needs. We observe that state-of-the-art datacenter schedulers fail to provide fine-grained resource guarantees for latency-sensitive tasks that are GPU-bound. Specifically for GPUs, this results in resource fragmentation and interference, leading to poor utilization of allocated GPU resources. Furthermore, GPUs exhibit highly linear energy efficiency with respect to utilization, and hence proactive management of these resources is essential to keep operational costs low while ensuring end-to-end Quality of Service (QoS) for user-facing queries. Towards addressing the GPU orchestration problem, we build Knots, a GPU-aware resource orchestration layer, and integrate it with the Kubernetes container orchestrator to build Kube-Knots. Kube-Knots can dynamically harvest spare compute cycles through dynamic container orchestration, enabling co-location of latency-critical and batch workloads while improving overall resource utilization. We design and evaluate two GPU-based scheduling techniques to schedule datacenter-scale workloads through Kube-Knots on a ten-node GPU cluster. Our proposed Correlation Based Prediction (CBP) and Peak Prediction (PP) schemes together improve both average and 99th percentile cluster-wide GPU utilization by up to 80% in the case of HPC workloads. In addition, CBP+PP improves the average job completion times (JCT) of deep learning workloads by up to 36% when compared to state-of-the-art schedulers. This leads to 33% cluster-wide energy savings on average for three different workloads compared to state-of-the-art GPU-agnostic schedulers. Further, the proposed PP scheduler guarantees end-to-end QoS for latency-critical queries by reducing QoS violations by up to 53% when compared to state-of-the-art GPU schedulers.