With the increasing popularity of large deep learning modelserving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while maintaining satisfied throughput or model-serving latency requirements. Model …
Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, …
The emergence of ML in various cloud system management tasks (e.g., workload autoscaling and job scheduling) has become a core driver of ML-centric cloud platforms. However, there are still numerous algorithmic and systems challenges that prevent …
This paper addresses the urgent need to transition to global net-zero carbon emissions by 2050 while retaining the ability to meet joint performance and resilience objectives. The focus is on the computing infrastructures, such as hyperscale cloud …
Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the …
Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to …
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network leading to congestion. Consequently, application runtime variability and suboptimal system utilization are …
Modern user-facing, latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing compute-resources across microservices is still …
AutoMice is designed to ease the transition from testbed validation to deployment in production by using two abstraction layers on both the input and output of a self-driving system.