Gibbons, Phillip B; Pekhimenko, Gennady; De Sa, Christopher (Ed.)
The use of ML in various cloud system management tasks (e.g., workload autoscaling and job scheduling)
has become a core driver of ML-centric cloud platforms. However, there are still numerous algorithmic and systems
challenges that prevent ML-centric cloud platforms from being production-ready. In this paper, we focus on the
challenges of model performance variability and costly model retraining, introduced by dynamic workload patterns
and heterogeneous applications and infrastructures in cloud environments. To address these challenges, we present
FLASH, an extensible framework for fast model adaptation in ML-based system management tasks. We show how
FLASH leverages existing ML agents and their training data to learn to generalize across applications/environments
with meta-learning. FLASH can be easily integrated with an existing ML-based system management agent through
a unified API. We demonstrate the use of FLASH by implementing three existing ML agents that manage (1)
resource configurations, (2) autoscaling, and (3) server power. Our experiments show that FLASH enables fast
adaptation to new, previously unseen applications/environments (e.g., 5.5× faster than transfer learning in the
autoscaling task), indicating significant potential for adopting ML-centric cloud platforms in production.