- Award ID(s): 1740263
- PAR ID: 10111055
- Date Published:
- Journal Name: 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing (UCC)
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Apache Mesos, a cluster-wide resource manager, is widely deployed at massive scale in several clouds and data centers. Mesos aims to provide high cluster utilization via fine-grained resource co-scheduling, and resource fairness among multiple users through Dominant Resource Fairness (DRF) based allocation. DRF takes into account the different resource types (CPU, memory, disk I/O) requested by each application and determines the share of each cluster resource that can be allocated to the applications. Mesos has adopted a two-level scheduling policy: (1) DRF to allocate resources to competing frameworks and (2) task-level scheduling by each framework for the resources allocated during the previous step. We have conducted experiments in a local Mesos cluster used with frameworks such as Apache Aurora, Marathon, and our own framework Scylla, to study resource fairness and cluster utilization. Experimental results show how informed decisions about the second-level scheduling policy of frameworks, and about attributes such as offer holding period, offer refusal cycle, and task arrival rate, can reduce unfair resource distribution. A Bin-Packing scheduling policy on Scylla with Marathon can reduce unfair allocation from 38% to 3%. By reducing unused free resources in offers, we bring down the unfairness from 90% to 28%. We also show the effect of task arrival rate, which reduces the unfairness from 23% to 7%.
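To make the DRF idea in the abstract above concrete, here is a minimal Python sketch of a greedy DRF loop. It is not the Mesos allocator and is not taken from the paper: the function name `drf_allocate`, the dictionary layout, and the example cluster (9 CPUs, 18 GB of memory, two users) are assumptions chosen for illustration. A user's dominant share is taken to be the largest fraction of any single resource that the user currently holds.

```python
def drf_allocate(capacity, demands):
    """Greedy DRF sketch (illustrative only, not the Mesos implementation).

    capacity: {resource: total amount in the cluster}
    demands:  {user: {resource: amount needed per task}}
    Returns   {user: number of tasks launched}.
    """
    used = {r: 0.0 for r in capacity}
    tasks = {u: 0 for u in demands}
    active = set(demands)  # users whose next task may still fit

    def dominant_share(user):
        # Largest fraction of any one resource currently held by this user.
        return max(tasks[user] * demands[user][r] / capacity[r] for r in capacity)

    while active:
        # Serve the user with the smallest dominant share (ties broken by name).
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
        else:
            # Assumption of this sketch: a user whose next task no longer
            # fits is dropped from further consideration.
            active.discard(user)
    return tasks


# Hypothetical cluster and per-task demands, for illustration only.
cluster = {"cpu": 9, "mem": 18}
per_task = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
print(drf_allocate(cluster, per_task))
```

In this example user A is memory-dominant and user B is CPU-dominant, so the loop alternates between them until the CPUs run out, leaving both users with equal dominant shares.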
-
Fair allocation has been studied intensively in both economics and computer science. Many existing mechanisms that consider fairness of resource allocation focus on a single resource. With the advance of cloud computing that centralizes multiple types of resources under one shared platform, multi‐resource allocation has come into the spotlight. In fact, fair/efficient multi‐resource allocation has become a fundamental problem in any shared computer system. The widely used solution is to partition resources into bundles that contain fixed amounts of different resources, so that multiple resources are abstracted as a single resource. However, this abstraction cannot satisfy different demands from heterogeneous users, especially on ensuring fairness among users competing for resources with different capacity limits. A promising approach to this problem is dominant resource fairness (DRF), which tries to equalize each user's dominant share (share of a user's most highly demanded resource, that is, the largest fraction of any resource that the user has required for a task), but this method may still suffer from significant loss of efficiency (i.e., some resources are underused). This article develops a new allocation mechanism based on DRF aiming to balance fairness and efficiency. We consider fairness not only in terms of a user's dominant resource, but also in another resource dimension which is secondarily desired by this user. We call this allocation mechanism 2‐dominant resource fairness (2‐DF). Then, we design a non‐trivial on‐line algorithm to find a 2‐DF allocation and extend this concept to k‐dominant resource fairness (k‐DF).
-
Traditional systems for allocating finite cluster resources among competing jobs have either aimed at providing fairness, relied on users to specify their resource requirements, or have estimated these requirements via surrogate metrics (e.g. CPU utilization). These approaches do not account for a job’s real-world performance (e.g. P95 latency). Existing performance-aware systems use offline profiled data and/or are designed for specific allocation objectives. In this work, we argue that resource allocation systems should directly account for real-world performance and the varied allocation objectives of users. In this pursuit, we build Cilantro. At the core of Cilantro is an online learning mechanism which forms feedback loops with the jobs to estimate the resource-to-performance mappings and load shifts. This relieves users from the onerous task of job profiling and collects reliable real-time feedback. This is then used to achieve a variety of user-specified scheduling objectives. Cilantro handles the uncertainty in the learned models by adapting the underlying policy to work with confidence bounds. We demonstrate this in two settings. First, in a multi-tenant 1000 CPU cluster with 20 independent jobs, three of Cilantro’s policies outperform 9 other baselines on three different performance-aware scheduling objectives, improving user utilities by up to 1.2x to 3.7x. Second, in a microservices setting, where 160 CPUs must be distributed between 19 inter-dependent microservices, Cilantro outperforms 3 other baselines, reducing the end-to-end P99 latency to 0.57x that of the next best baseline.
-
We first consider the static problem of allocating resources to (i.e., scheduling) multiple distributed application frameworks, possibly with different priorities and server preferences, in a private cloud with heterogeneous servers. Several fair scheduling mechanisms have been proposed for this purpose. We extend prior results on max-min fair (MMF) and proportional fair (PF) scheduling to this constrained multiresource and multiserver case for generic fair scheduling criteria. The task efficiencies (a metric related to proportional fairness) of max-min fair allocations found by progressive filling are compared by illustrative examples. In the second part of this paper, we consider the online problem (with framework churn) by implementing variants of these schedulers in Apache Mesos using progressive filling to dynamically approximate max-min fair allocations. We evaluate the implemented schedulers in terms of overall execution time of realistic distributed Spark workloads. Our experiments show that resource efficiency is improved and execution times are reduced when the scheduler is “server specific” or when it leverages characterized required resources of the workloads (when known).
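As a rough illustration of progressive filling, the following Python sketch computes a max-min fair allocation in the simplest possible setting: a single divisible resource and users with fixed demands. This is a deliberate simplification and an assumption of the sketch; the abstract above concerns the constrained multiresource, multiserver case, which this code does not cover. The function name `max_min_fair` and the example numbers are hypothetical.

```python
def max_min_fair(capacity, demands, eps=1e-12):
    """Progressive filling for ONE divisible resource (simplified sketch).

    capacity: total amount of the resource
    demands:  {user: amount requested}
    Returns   {user: amount allocated}.
    """
    alloc = {u: 0.0 for u in demands}
    unsatisfied = set(demands)
    remaining = float(capacity)

    while unsatisfied and remaining > eps:
        # Raise every unsatisfied user's allocation by the same increment,
        # capped at that user's remaining demand.
        share = remaining / len(unsatisfied)
        for u in sorted(unsatisfied):
            give = min(share, demands[u] - alloc[u])
            alloc[u] += give
            remaining -= give
        # Users whose demand is met stop receiving further increments.
        unsatisfied = {u for u in unsatisfied if demands[u] - alloc[u] > eps}
    return alloc


# Hypothetical demands, for illustration only.
print(max_min_fair(10, {"a": 2, "b": 4, "c": 8}))
```

In this example users "a" and "b" have small enough demands to be fully satisfied, and whatever capacity is left goes to "c", the only user still short of its request.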
-
We consider an LTE downlink scheduling system where a base station allocates resource blocks (RBs) to users running delay-sensitive applications. We aim to find a scheduling policy that minimizes the queuing delay experienced by the users. We formulate this problem as a Markov Decision Process (MDP) that integrates the channel quality indicator (CQI) of each user in each RB, and queue status of each user. To solve this complex problem involving high dimensional state and action spaces, we propose a Deep Reinforcement Learning based scheduling framework that utilizes the Deep Deterministic Policy Gradient (DDPG) algorithm to minimize the queuing delay experienced by the users. Our extensive experiments demonstrate that our approach outperforms state-of-the-art benchmarks in terms of average throughput, queuing delay, and fairness, achieving up to 55% lower queuing delay than the best benchmark.