FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms.

Qiu, H; Mao, W; Patke, A; Cui, S; Wang, C; Franke, H; Kalbarczyk, Z; Başar, T; Iyer, R

Citation Details

The emergence of ML in various cloud system management tasks (e.g., workload autoscaling and job scheduling) has become a core driver of ML-centric cloud platforms. However, there are still numerous algorithmic and systems challenges that prevent ML-centric cloud platforms from being production-ready. In this paper, we focus on the challenges of model performance variability and costly model retraining, introduced by dynamic workload patterns and heterogeneous applications and infrastructures in cloud environments. To address these challenges, we present FLASH, an extensible framework for fast model adaptation in ML-based system management tasks. We show how FLASH leverages existing ML agents and their training data to learn to generalize across applications/environments with meta-learning. FLASH can be easily integrated with an existing ML-based system management agent with a unified API. We demonstrate the use of FLASH by implementing three existing ML agents that manage (1) resource configurations, (2) autoscaling, and (3) server power. Our experiments show that FLASH enables fast adaptation to new, previously unseen applications/environments (e.g., 5.5× faster than transfer learning in the autoscaling task), indicating significant potential for adopting ML-centric cloud platforms in production. more »

Award ID(s):: 2029049

PAR ID:: 10546469

Author(s) / Creator(s):: Qiu, H; Mao, W; Patke, A; Cui, S; Wang, C; Franke, H; Kalbarczyk, Z; Başar, T; Iyer, R

Corporate Creator(s):: MLSys

Editor(s):: Gibbons, PhillipB; Pekhimenko, Gennady; De_Sa, Christopher

Publisher / Repository:: MLSys

Date Published:: 2024-09-01

Edition / Version:: 1

Volume:: 1

Issue:: 1

Page Range / eLocation ID:: nd

Subject(s) / Keyword(s):: ML-centric cloud system management

Format(s):: Medium: X Size: 1648 kb Other: pdf

Size(s):: 1648 kb

Location:: Santa Clara, CA

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Proceeding:
The DOI is not currently available.

More Like this