On the Promise and Challenges of Foundation Models for Learning-based Cloud Systems Management

Abstract

Foundation models (FMs) are machine learning models that are trained broadly on large-scale data and can be adapted to a range of downstream tasks via fine-tuning, few-shot learning, or even zero-shot learning. Despite the success of FMs in the language and vision domains, we have yet to see an attempt to develop FMs for cloud systems management (also known as cloud intelligence or AIOps). In this work, we explore the opportunities of developing FMs for cloud systems management. We propose an initial FM design, the FLASH framework, based on meta-learning and demonstrate its use in two tasks: resource configuration search and workload autoscaling. Preliminary results show that FLASH achieves 52.3-90.5% less performance degradation with no adaptation and provides 5.5x faster adaptation. We conclude by discussing the unique risks and challenges of developing FMs for cloud systems management.

Publication
Workshop on ML for Systems at NeurIPS 2023