The emergence of ML in various cloud system management tasks (e.g., workload autoscaling and job scheduling) has become a core driver of ML-centric cloud platforms. However, there are still numerous algorithmic and systems challenges that prevent …
Foundation models (FMs) are machine learning models that are trained broadly on large-scale data and can be adapted to a set of downstream tasks via fine-tuning, few-shot learning, or even zero-shot learning. Despite the successes of FMs in the …
Energy efficiency is pressing in today's cloud datacenters. Various power management strategies, such as oversubscription, power capping, and dynamic voltage and frequency scaling, have been proposed and are in use by datacenter operators to better …
Workload autoscaling is widely used in public and private cloud systems to maintain stable service performance and save resources. However, it remains challenging to set the optimal resource limits and dynamically scale each workload at runtime. …
Multi-agent reinforcement learning (MARL) has primarily focused on solving a single task in isolation, while in practice the environment is often evolving, leaving many related tasks to be solved. In this paper, we investigate the benefits of …
Serverless Function-as-a-Service (FaaS) offers improved programmability for customers, yet it is not server-"less" and comes at the cost of more complex infrastructure management (e.g., resource provisioning and scheduling) for cloud providers. To …
Reinforcement learning (RL) has gained increasing popularity for resource management in cloud services such as serverless computing. As self-interested users compete for shared resources in a cluster, the multi-tenancy nature of serverless platforms …
Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and improve …