PARM: Adaptive Resource Allocation for Datacenter Power Capping

Abstract

Energy efficiency is a pressing concern in today’s cloud datacenters. Various power management strategies, such as oversubscription, power capping, and dynamic voltage and frequency scaling, have been proposed and are in use by datacenter operators to control power consumption at any management unit (e.g., node-level or rack-level) without exceeding power budgets. In addition, by gaining more control over different management units within a datacenter (or across datacenters), operators can shift energy consumption either spatially or temporally to reduce their carbon footprint based on the spatio-temporal patterns of carbon intensity. The drive for automation has led to the exploration of learning-based resource management approaches. In this work, we first systematically investigate the impact of power capping on both latency-critical datacenter workloads and learning-based resource management solutions (i.e., reinforcement learning or RL). We show that even a 20% reduction in the power limit (power capping) leads to an 18% degradation in resource management effectiveness (as measured by an RL reward function), which in turn causes 50% higher application latency. We then propose PARM, an adaptive resource allocation framework that provides a graceful, performance-preserving transition under power capping for latency-critical workloads. Evaluation results show that PARM achieves a 10.2-99.3% improvement in service-level objective (SLO) preservation under power capping while improving utilization by 3.1-5.8%.

Publication
Workshop on ML for Systems at NeurIPS 2023