Machine Learning

Power-aware Deep Learning Model Serving with µ-Serve

With the increasing popularity of large deep learning model-serving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while maintaining satisfied throughput or model-serving latency requirements. Model …

When Green Computing Meets Performance and Resilience SLOs

This paper addresses the urgent need to transition to global net-zero carbon emissions by 2050 while retaining the ability to meet joint performance and resilience objectives. The focus is on the computing infrastructures, such as hyperscale cloud …

FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms

The emergence of ML in various cloud system management tasks (e.g., workload autoscaling and job scheduling) has become a core driver of ML-centric cloud platforms. However, there are still numerous algorithmic and systems challenges that prevent …

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the …

Delay Sensitivity-driven Congestion Mitigation for HPC Systems

Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network leading to congestion. Consequently, application runtime variability and suboptimal system utilization are …

FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices

Modern user-facing, latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing compute-resources across microservices is still …

AutoMice: A Testbed Framework for Self-Driving Systems

AutoMice is designed to ease the transition from testbed validation to deployment in production by using two abstraction layers on both the input and output of a self-driving system.