NSF PAR Search | NSF Public Access Repository

Cilantro: Performance-Aware Resource Allocation for General Objectives via Online Feedback

Bhardwaj, Romil; Kandasamy, Kirthevasan; Biswal, Asim; Guo, Wenshuo; Hindman, Benjamin; Gonzalez, Joseph; Jordan, Michael I; Stoica, Ion (July 2023, USENIX Association)

Traditional systems for allocating finite cluster resources among competing jobs have either aimed at providing fairness, relied on users to specify their resource requirements, or estimated these requirements via surrogate metrics (e.g., CPU utilization). These approaches do not account for a job's real-world performance (e.g., P95 latency). Existing performance-aware systems use offline profiled data and/or are designed for specific allocation objectives. In this work, we argue that resource allocation systems should directly account for real-world performance and the varied allocation objectives of users. In this pursuit, we build Cilantro. At the core of Cilantro is an online learning mechanism which forms feedback loops with the jobs to estimate the resource-to-performance mappings and load shifts. This relieves users of the onerous task of job profiling and collects reliable real-time feedback, which is then used to achieve a variety of user-specified scheduling objectives. Cilantro handles the uncertainty in the learned models by adapting the underlying policy to work with confidence bounds. We demonstrate this in two settings. First, in a multi-tenant 1000 CPU cluster with 20 independent jobs, three of Cilantro's policies outperform 9 other baselines on three different performance-aware scheduling objectives, improving user utilities by up to 1.2-3.7x. Second, in a microservices setting, where 160 CPUs must be distributed between 19 inter-dependent microservices, Cilantro outperforms 3 other baselines, reducing the end-to-end P99 latency to 0.57x that of the next-best baseline.
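To make the idea of a policy that works with confidence bounds concrete, here is a minimal Python sketch. It is not Cilantro's estimator or API: the class `FeedbackModel`, the function `smallest_safe_allocation`, the use of a Hoeffding-style interval, and the assumption that utilities lie in [0, 1] are all illustrative choices made for this example.

```python
import math
from collections import defaultdict


class FeedbackModel:
    """Hypothetical per-job model: observed utility in [0, 1] per CPU allocation."""

    def __init__(self, delta=0.05):
        self.delta = delta                 # allowed failure probability of each interval
        self.samples = defaultdict(list)   # cpus -> list of observed utilities

    def record(self, cpus, utility):
        """Store one piece of online feedback reported by the job."""
        self.samples[cpus].append(utility)

    def bounds(self, cpus):
        """Return (lower, upper) confidence bounds on the mean utility at `cpus`."""
        obs = self.samples.get(cpus, [])
        if not obs:
            return 0.0, 1.0                # nothing observed yet: maximal uncertainty
        mean = sum(obs) / len(obs)
        # Hoeffding-style half-width for bounded observations (an assumption here).
        width = math.sqrt(math.log(2 / self.delta) / (2 * len(obs)))
        return max(0.0, mean - width), min(1.0, mean + width)


def smallest_safe_allocation(model, candidates, target):
    """Pick the fewest CPUs whose LOWER bound already meets the target utility."""
    for cpus in sorted(candidates):
        lower, _ = model.bounds(cpus)
        if lower >= target:
            return cpus
    return None                            # not enough evidence for any candidate
```

Deciding on the lower bound rather than the mean keeps the policy conservative while feedback is scarce: with only a handful of observations the interval is wide and no allocation is declared safe, and the interval narrows as the job reports more measurements.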
Full Text Available
ESCHER: expressive scheduling with ephemeral resources

https://doi.org/10.1145/3542929.3563498

Bhardwaj, Romil; Tumanov, Alexey; Wang, Stephanie; Liaw, Richard; Moritz, Philipp; Nishihara, Robert; Stoica, Ion (November 2022, SoCC '22: Proceedings of the 13th Symposium on Cloud Computing)

As distributed applications become increasingly complex, so do their scheduling requirements. This development calls for cluster schedulers that are not only general, but also evolvable. Unfortunately, most existing cluster schedulers are not evolvable: when confronted with new requirements, they need major rewrites to support them. Examples include gang-scheduling support in Kubernetes [6, 39] or task-affinity in Spark [39]. Some cluster schedulers [14, 30] expose physical resources to applications to address this. While these approaches are evolvable, they push the burden of implementing scheduling mechanisms, in addition to the policies, entirely onto the application. ESCHER is a cluster scheduler design that achieves both evolvability and application-level simplicity. ESCHER uses an abstraction exposed by several recent frameworks (which we call ephemeral resources) that lets the application express scheduling constraints as resource requirements. These requirements are then satisfied by a simple mechanism matching resource demands to available resources. We implement ESCHER on Kubernetes and Ray, and show that this abstraction can be used to express common policies offered by monolithic schedulers while allowing applications to easily create new custom policies hitherto unsupported.
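The following Python sketch illustrates how a scheduling constraint can be written as a resource requirement. It is a toy model, not the ESCHER implementation and not the Kubernetes or Ray API: the functions `place` and `add_ephemeral`, the first-fit matching rule, and the resource name `near-task-1` are hypothetical names introduced for this example.

```python
def place(nodes, demand):
    """First-fit matcher: return the first node whose resources cover `demand`.

    `nodes` maps node name -> {resource name: available amount}.
    The matched amounts are deducted from the chosen node.
    """
    for name, available in nodes.items():
        if all(available.get(res, 0) >= qty for res, qty in demand.items()):
            for res, qty in demand.items():
                available[res] -= qty
            return name
    return None  # no node can satisfy the demand


def add_ephemeral(nodes, node, label, amount=1):
    """Application-side step: attach a made-up resource to one node at runtime."""
    nodes[node][label] = nodes[node].get(label, 0) + amount


def demo():
    nodes = {"node-a": {"CPU": 2}, "node-b": {"CPU": 8}}
    # Task 1 needs 4 CPUs, so only node-b fits.
    first = place(nodes, {"CPU": 4})
    # Affinity policy: mark the node that hosts task 1 with an ephemeral resource.
    add_ephemeral(nodes, first, "near-task-1")
    # Task 2 asks for that resource, so the matcher co-locates it with task 1,
    # even though node-a comes first and has enough CPU.
    second = place(nodes, {"CPU": 1, "near-task-1": 1})
    return first, second
```

In this sketch the matcher knows nothing about affinity. The policy lives entirely in which labels the application creates and requests, which is the separation of policy from mechanism that the abstract describes.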
Full Text Available
SkyPilot: An Intercloud Broker for Sky Computing

Yang, Zongheng; Wu, Zhanghao; Luo, Michael; Chiang, Wei-Lin; Bhardwaj, Romil; Kwon, Woosuk; Zhuang, Siyuan; Luan, Frank Sifei; Mittal, Gautam; Shenker, Scott; et al. (April 2023, USENIX Association)

Full Text Available
RubberBand: cloud-based hyperparameter tuning

https://doi.org/10.1145/3447786.3456245

Misra, Ujval; Liaw, Richard; Dunlap, Lisa; Bhardwaj, Romil; Kandasamy, Kirthevasan; Gonzalez, Joseph E.; Stoica, Ion; Tumanov, Alexey (April 2021, EuroSys '21: Proceedings of the Sixteenth European Conference on Computer Systems)

Hyperparameter tuning is essential to achieving state-of-the-art accuracy in machine learning (ML), but requires substantial compute resources to perform. Existing systems primarily focus on effectively allocating resources for a hyperparameter tuning job under fixed resource constraints. We show that the available parallelism in such jobs changes dynamically over the course of execution and, therefore, presents an opportunity to leverage the elasticity of the cloud. In particular, we address the problem of minimizing the financial cost of executing a hyperparameter tuning job, subject to a time constraint. We present RubberBand, the first framework for cost-efficient, elastic execution of hyperparameter tuning jobs in the cloud. RubberBand utilizes performance instrumentation and cloud pricing to model job completion time and cost prior to runtime, and to generate a cost-efficient, elastic resource allocation plan. RubberBand is able to efficiently execute this plan and realize a cost reduction of up to 2x in comparison to static allocation baselines.
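The optimization problem stated above (minimize cost subject to a time constraint) can be sketched on a toy job in Python. This is not RubberBand's planner or cost model: the stage list, the assumption that each worker runs one trial at a time, the per-worker-per-time-unit price, and the brute-force search in `cheapest_plan` are all assumptions chosen for illustration.

```python
import math
from itertools import product


def stage_time(n_trials, duration, workers):
    """Time for one stage when each worker runs one trial at a time."""
    return math.ceil(n_trials / workers) * duration


def plan_metrics(stages, workers_per_stage, price=1.0):
    """Predict (total time, total cost) of a plan before running it.

    `stages` is a list of (number of trials, duration per trial).
    Every provisioned worker is billed for the whole stage, idle or not.
    """
    total_time = 0.0
    total_cost = 0.0
    for (n_trials, duration), workers in zip(stages, workers_per_stage):
        t = stage_time(n_trials, duration, workers)
        total_time += t
        total_cost += workers * t * price
    return total_time, total_cost


def cheapest_plan(stages, options, deadline, elastic=True):
    """Brute-force the cheapest plan that finishes within `deadline`.

    elastic=True lets each stage pick its own worker count;
    elastic=False forces one worker count for the whole job (static baseline).
    """
    if elastic:
        candidates = product(options, repeat=len(stages))
    else:
        candidates = ((w,) * len(stages) for w in options)
    best = None
    for plan in candidates:
        t, cost = plan_metrics(stages, plan)
        if t <= deadline and (best is None or cost < best[1]):
            best = (plan, cost)
    return best
```

For a job whose parallelism shrinks over time, such as `[(8, 1), (4, 2), (2, 4), (1, 8)]`, calling `cheapest_plan` with `elastic=True` and with `elastic=False` under the same deadline shows why shrinking the cluster as trials are eliminated can lower cost: a static allocation keeps paying for workers that later stages cannot use. The numbers produced by this toy model say nothing about the savings reported for the real system.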
Full Text Available