SmartOClock: Workload- and Risk-Aware Overclocking in the Cloud

Abstract

Operating server components beyond their voltage and power design limits (i.e., overclocking) enables improving performance and lowering cost for cloud workloads. However, overclocking can significantly degrade component lifetime, increase power consumption, and cause power capping events, eventually diminishing the performance benefits. In this paper, we characterize the impact of overclocking on cloud workloads by studying their profiles from production deployments. Based on the characterization insights, we propose SmartOClock, the first distributed overclocking management platform specifically designed for cloud environments. SmartOClock is a workload-aware scheme that relies on power predictions to heterogeneously distribute the power budgets across its servers based on their needs and then enforce budget compliance locally, per-server, in a decentralized manner. SmartOClock reduces the tail latency by 9%, application cost by 30% and total energy consumption by 10% for latencysensitive microservices on a 36-server deployment. Simulation analysis using production traces show that SmartOClock reduces the number of power capping events by up to 95% while increasing the overclocking success rate by up to 62%. We also describe lessons from building a first-of-its-kind overclockable cluster at a cloud provider for production experiments.

Publication
Proceedings of the 51st International Symposium on Computer Architecture (ISCA 2024)