Reinforcement Learning

FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms

The emergence of ML in various cloud system management tasks (e.g., workload autoscaling and job scheduling) has become a core driver of ML-centric cloud platforms. However, there are still numerous algorithmic and systems challenges that prevent …

On the Promise and Challenges of Foundation Models for Learning-based Cloud Systems Management

Foundation models (FMs) are machine learning models that are trained broadly on large-scale data and can be adapted to a set of downstream tasks via fine-tuning, few-shot learning, or even zero-shot learning. Despite the successes of FMs in the …

PARM: Adaptive Resource Allocation for Datacenter Power Capping

Energy efficiency is pressing in today's cloud datacenters. Various power management strategies, such as oversubscription, power capping, and dynamic voltage and frequency scaling, have been proposed and are in use by datacenter operators to better …

AWARE: Automate Workload Autoscaling with Reinforcement Learning in Production Cloud Systems

Workload autoscaling is widely used in public and private cloud systems to maintain stable service performance and save resources. However, it remains challenging to set the optimal resource limits and dynamically scale each workload at runtime. …

Multi-Agent Meta-Reinforcement Learning: Sharper Convergence Rates with Task Similarity

Multi-agent reinforcement learning (MARL) has primarily focused on solving a single task in isolation, while in practice the environment is often evolving, leaving many related tasks to be solved. In this paper, we investigate the benefits of …

SIMPPO: A Scalable and Incremental Online Learning Framework for Serverless Resource Management

Serverless Function-as-a-Service (FaaS) offers improved programmability for customers, yet it is not server-"less" and comes at the cost of more complex infrastructure management (e.g., resource provisioning and scheduling) for cloud providers. To …

A Mean-Field Game Approach to Cloud Resource Management with Function Approximation

Reinforcement learning (RL) has gained increasing popularity for resource management in cloud services such as serverless computing. As self-interested users compete for shared resources in a cluster, the multi-tenancy nature of serverless platforms …

Reinforcement Learning for Resource Management in Multi-tenant Serverless Platforms

Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and improve …