Scheduling

TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques often are inadequate for LLM inference due to the fine-grained, millisecond-scale execution …

QLM: Queue Management for Large Language Model Serving

Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, …

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the …

QLM: Queue Management for Large Language Model Serving

Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to …