Dynamic workflow management systems address the problem of distributing a local application by packaging individual computations and their dependencies on-the-fly into tasks executable on remote workers. Such independent task execution allows workers to be launched opportunistically to maximize the available pool of resources at any given time, either through opportunistic systems (e.g., HTCondor, AWS Spot Instances) or conventional systems (e.g., SLURM, SGE) with backfilling enabled, in contrast to monolithic or message-passing applications that require a fixed block of non-preemptible workers. However, the dynamic nature of task generation poses a significant resource-management challenge: tasks must be allocated resources before execution, yet their actual consumption is only observable at runtime. This can cause substantial resource waste per task because (1) users lack direct knowledge of the relationship between tasks and resources, and thus cannot correctly specify in advance the amount of resources a task needs, and (2) workflows and tasks may exhibit stochastic behaviors at runtime, which further complicates resource management. In this paper, we (1) argue for the need for an adaptive resource allocator capable of sizing task allocations at runtime and adjusting to random fluctuations and abrupt changes in a dynamic workflow without requiring any prior knowledge, and (2) introduce Greedy Bucketing and Exhaustive Bucketing: two robust, online, general-purpose, and prior-free allocation algorithms capable of producing quality estimates of a task's resource consumption as the workflow runs. Our results show that a resource allocator equipped with either algorithm consistently outperforms 5 alternative allocation algorithms on 7 diverse workflows and incurs at most 1.6 ms of overhead per allocation in the steady state.
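To make the bucketing idea concrete, the sketch below shows how an online, prior-free bucketing allocator might operate. It is an illustration under stated assumptions (equal-frequency bucket construction, memory as the only resource, doubling on exhausting all buckets), not the paper's exact Greedy or Exhaustive Bucketing procedure; all names are hypothetical.

```python
# Illustrative sketch of a bucketing-style allocator; NOT the paper's
# exact Greedy/Exhaustive Bucketing algorithm (details are assumptions).
from bisect import bisect_right

class BucketingAllocator:
    def __init__(self, num_buckets=4):
        self.num_buckets = num_buckets
        self.observed = []     # peak memory (MB) of completed tasks
        self.bucket_caps = []  # sorted allocation caps, one per bucket

    def record(self, peak_mb):
        """Feed back a completed task's measured peak consumption."""
        self.observed.append(peak_mb)
        s = sorted(self.observed)
        # Split the sorted observations into equal-sized groups and use
        # each group's maximum as that bucket's allocation cap.
        k = max(1, len(s) // self.num_buckets)
        self.bucket_caps = sorted({s[min(i + k - 1, len(s) - 1)]
                                   for i in range(0, len(s), k)})

    def allocate(self, last_failed_mb=None):
        """Return the smallest bucket cap (fresh task), or the smallest
        cap above the allocation that just failed (retried task)."""
        if not self.bucket_caps:
            return 1024  # cold-start default before any observations
        if last_failed_mb is None:
            return self.bucket_caps[0]
        i = bisect_right(self.bucket_caps, last_failed_mb)
        if i < len(self.bucket_caps):
            return self.bucket_caps[i]
        return 2 * last_failed_mb  # all buckets exhausted: grow the cap
```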
Not All Tasks Are Created Equal: Adaptive Resource Allocation for Heterogeneous Tasks in Dynamic Workflows
Users running dynamic workflows in distributed systems usually lack the expertise to correctly size the allocation of resources (cores, memory, disk) to each task, because the correlation between tasks and their resource consumption is obscure yet important and difficult to uncover. Thus, users typically pay little attention to allocation sizing and either apply an error-prone upper bound of resource allocation to all tasks or delegate this responsibility to the underlying distributed systems, resulting in substantial waste from allocated yet unused resources. In this paper, we first show that tasks performing different work may have significantly different resource consumption. We then show that exploiting the heterogeneity of tasks is a desirable way to reveal and predict the relationship between tasks and their resource consumption, reduce waste from resource misallocation, increase tasks' consumption efficiency, and incentivize users' cooperation. We have developed two info-aware allocation strategies capitalizing on this characteristic and show their effectiveness through simulations on two modern applications with dynamic workflows and five synthetic datasets of resource consumption. Our results show that info-aware strategies can cut up to 98.7% of the total waste incurred by a best-effort strategy and increase each task's resource consumption efficiency by up to 93.9% on average.
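The info-aware idea can be sketched as follows: size each task from the observed history of its own category rather than from one global upper bound. This is a minimal sketch under assumed names and an assumed headroom factor, not the paper's implementation of its two strategies.

```python
# Minimal sketch of category-aware ("info-aware") sizing; names and the
# headroom factor are illustrative assumptions.
from collections import defaultdict

class InfoAwareAllocator:
    def __init__(self, default_mb=4096):
        self.history = defaultdict(list)  # task category -> observed peaks (MB)
        self.default_mb = default_mb

    def record(self, category, peak_mb):
        """Feed back a completed task's peak consumption for its category."""
        self.history[category].append(peak_mb)

    def allocate(self, category, headroom=1.1):
        peaks = self.history[category]
        if not peaks:
            return self.default_mb      # no info yet: fall back to a bound
        return headroom * max(peaks)    # category-specific estimate
```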
- Award ID(s): 1931348
- PAR ID: 10356914
- Date Published:
- Journal Name: WORKS Workshop on Workflows at Supercomputing
- Page Range / eLocation ID: 17 to 24
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Apache Mesos, a two-level resource scheduler, provides resource sharing across multiple users in a multi-tenant clustered environment. Computational resources (i.e., CPU, memory, disk, etc.) are distributed according to the Dominant Resource Fairness (DRF) policy. Mesos frameworks (users) receive resources based on their current usage and are responsible for scheduling their tasks within the allocation. We have observed that multiple frameworks can cause fairness imbalance in a multi-user environment; for example, a greedy framework consuming more than its fair share of resources can deny resource fairness to others. The user with the least dominant share is considered first by the DRF module for resource allocation. However, the default DRF implementation in Apache Mesos' master allocation module does not consider the overall resource demands of the tasks in the queue for each user/framework. This lack of awareness can lead to poor performance, as users without any pending tasks may receive more resource offers while users with a queue of pending tasks can starve due to their high dominant shares. In a multi-tenant environment, cluster managers must understand the characteristics of frameworks and workloads to define fairness based not only on resource share but also on resource demand and queue wait time. We have developed a policy-driven queue manager, Tromino, for an Apache Mesos cluster, where tasks for individual frameworks can be scheduled based on each framework's overall resource demands and current resource consumption. Tromino's awareness of dominant share and demand, and its scheduling based on these attributes, can reduce (1) the impact of unfairness due to a framework-specific configuration, and (2) unfair waiting time due to higher resource demand in a pending task queue. In the best case, Tromino can significantly reduce the average waiting time of a framework by using the proposed Demand-DRF aware policy.
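For reference, the dominant-share computation at the heart of DRF (Ghodsi et al.) can be sketched as below: a user's dominant share is the largest fraction of any single resource's cluster capacity that the user holds, and DRF offers resources next to the user with the smallest dominant share. The example values are illustrative.

```python
# Sketch of DRF's core selection rule (not Mesos' actual allocator code).
def dominant_share(allocated, capacity):
    """Dominant share = max over resource types of allocated/capacity."""
    return max(allocated[r] / capacity[r] for r in capacity)

def next_user_drf(allocations, capacity):
    """DRF offers resources to the user with the lowest dominant share."""
    return min(allocations,
               key=lambda u: dominant_share(allocations[u], capacity))

# Example: with 9 CPUs and 18 GB total, user A holding (3 CPU, 1 GB) has
# dominant share 3/9 ~= 0.33; user B holding (1 CPU, 4 GB) has 4/18 ~= 0.22,
# so B is offered resources next.
capacity = {"cpu": 9, "mem_gb": 18}
allocations = {"A": {"cpu": 3, "mem_gb": 1}, "B": {"cpu": 1, "mem_gb": 4}}
assert next_user_drf(allocations, capacity) == "B"
```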
Federated scheduling is a generalization of partitioned scheduling for parallel tasks on multiprocessors and has been shown to be a competitive scheduling approach. However, federated scheduling may waste resources due to its dedicated allocation of processors to parallel tasks. In this work we introduce a novel algorithm for scheduling parallel tasks that require more than one processor to meet their deadlines (i.e., heavy tasks). The proposed algorithm computes a deterministic schedule for each heavy task based on its internal graph structure. It efficiently exploits the processors allocated to each task and thus reduces the number of processors the task requires. Experimental evaluation shows that our new federated scheduling algorithm significantly outperforms other state-of-the-art federated-based scheduling approaches, including semi-federated scheduling and reservation-based federated scheduling, which were developed to tackle resource waste in federated scheduling, as well as a stretching algorithm that also uses the tasks' graph structures.
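For context, classical federated scheduling dedicates to each heavy task a core count derived from its total work, critical-path length, and deadline. A minimal sketch of that baseline computation, which graph-structure-aware algorithms like the one above try to improve upon, follows; it shows the standard bound, not the paper's new algorithm.

```python
# Classical federated scheduling core count for a heavy task:
#   m_i = ceil((C_i - L_i) / (D_i - L_i)),
# where C_i is total work, L_i the critical-path length (span), and
# D_i the deadline. Sketch of the baseline, not the paper's algorithm.
import math

def federated_cores(work, span, deadline):
    if work <= deadline:
        return 1  # light task: a single core meets the deadline
    if span >= deadline:
        raise ValueError("infeasible: critical path leaves no slack")
    return math.ceil((work - span) / (deadline - span))

# Example: work C=100, span L=20, deadline D=40 -> ceil(80/20) = 4 cores.
print(federated_cores(100, 20, 40))  # 4
```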
This study investigates the problem of decentralized dynamic resource allocation optimization for ad-hoc network communication with the support of reconfigurable intelligent surfaces (RIS), leveraging a reinforcement learning framework. In the present context of cellular networks, device-to-device (D2D) communication stands out as a promising technique to enhance spectrum efficiency. Simultaneously, RIS have gained considerable attention for their ability to enhance the quality of dynamic wireless networks by maximizing spectrum efficiency without increasing power consumption. However, prevalent centralized D2D transmission schemes require global information, leading to significant signaling overhead, while existing distributed schemes, though they avoid the need for global information, often demand frequent information exchange among D2D users and fall short of achieving global optimization. This paper introduces a framework comprising an outer loop and an inner loop. In the outer loop, decentralized dynamic resource allocation optimization is developed for self-organizing network communication aided by RIS, accomplished through a multi-player multi-armed bandit approach that yields strategies for RIS and resource block selection; notably, these strategies operate without requiring signal interaction during execution. Meanwhile, in the inner loop, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is adopted for cooperative learning with neural networks (NNs) to obtain optimal transmit power control and RIS phase shift control for multiple users, given the RIS and resource block selection policy from the outer loop. Through the use of optimization theory, distributed optimal resource allocation can be attained as the outer and inner reinforcement learning algorithms converge over time. Finally, a series of numerical simulations validates the effectiveness of the proposed scheme.
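As a simplified illustration of the outer loop's bandit component: the paper uses a multi-player formulation, but the single-player UCB1 sketch below shows the underlying selection principle (explore every resource block once, then favor blocks with high estimated reward plus an optimism bonus). All names and the reward stand-in are assumptions.

```python
# Single-player UCB1 over resource-block "arms"; illustrative only — the
# paper's scheme is multi-player and operates without signaling.
import math
import random

class UCB1:
    def __init__(self, num_blocks):
        self.counts = [0] * num_blocks
        self.means = [0.0] * num_blocks
        self.t = 0

    def select(self):
        self.t += 1
        for b, c in enumerate(self.counts):
            if c == 0:
                return b  # try every block once before exploiting
        return max(range(len(self.counts)),
                   key=lambda b: self.means[b] +
                   math.sqrt(2 * math.log(self.t) / self.counts[b]))

    def update(self, block, reward):
        """Incrementally update the block's mean observed reward."""
        self.counts[block] += 1
        self.means[block] += (reward - self.means[block]) / self.counts[block]

# Usage: pick a block, observe e.g. achieved spectral efficiency, update.
bandit = UCB1(num_blocks=4)
for _ in range(100):
    b = bandit.select()
    bandit.update(b, reward=random.random())  # stand-in for a measured rate
```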
Traditional systems for allocating finite cluster resources among competing jobs have either aimed at providing fairness, relied on users to specify their resource requirements, or estimated these requirements via surrogate metrics (e.g., CPU utilization). These approaches do not account for a job's real-world performance (e.g., P95 latency). Existing performance-aware systems use offline profiled data and/or are designed for specific allocation objectives. In this work, we argue that resource allocation systems should directly account for real-world performance and the varied allocation objectives of users. In this pursuit, we build Cilantro. At the core of Cilantro is an online learning mechanism which forms feedback loops with the jobs to estimate the resource-to-performance mappings and load shifts. This relieves users of the onerous task of job profiling and collects reliable real-time feedback, which is then used to achieve a variety of user-specified scheduling objectives. Cilantro handles the uncertainty in the learned models by adapting the underlying policy to work with confidence bounds. We demonstrate this in two settings. First, in a multi-tenant 1000-CPU cluster with 20 independent jobs, three of Cilantro's policies outperform 9 other baselines on three different performance-aware scheduling objectives, improving user utilities by up to 1.2–3.7x. Second, in a microservices setting, where 160 CPUs must be distributed among 19 inter-dependent microservices, Cilantro outperforms 3 other baselines, reducing the end-to-end P99 latency to 0.57x that of the next best baseline.
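A minimal sketch of the confidence-bound idea described above, under loud assumptions (the utility estimates, their uncertainty widths, and the diminishing-returns decay are all hypothetical stand-ins for Cilantro's learned feedback, not its API): hand out CPUs one at a time to the job whose upper confidence bound on marginal utility is currently highest.

```python
# Illustrative confidence-bound allocation; NOT Cilantro's implementation.
def allocate_by_ucb(jobs, total_cpus):
    """`jobs` maps name -> (estimated marginal utility per extra CPU,
    uncertainty half-width). In practice both would come from online
    feedback loops; here they are assumed inputs."""
    alloc = {j: 0 for j in jobs}
    for _ in range(total_cpus):
        # Optimism under uncertainty: rank jobs by estimate + half-width.
        best = max(jobs, key=lambda j: jobs[j][0] + jobs[j][1])
        alloc[best] += 1
        est, width = jobs[best]
        # Assumed diminishing returns: shrink the estimate as a job grows,
        # and shrink the uncertainty as if more feedback had arrived.
        jobs[best] = (est * 0.9, width * 0.95)
    return alloc

print(allocate_by_ucb({"web": (1.0, 0.5), "batch": (0.8, 0.9)}, total_cpus=8))
```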