Scheduling

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the …
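The technique named in the title, using a lightweight proxy model to predict each request's output length and scheduling on that prediction, can be illustrated with a short sketch. This is a hypothetical illustration, not the paper's implementation: `predict_output_length` is a stand-in for the proxy model (in practice a small fine-tuned predictor), and the scheduler simply serves requests shortest-predicted-job-first to reduce head-of-line blocking behind long generations.

```python
# Minimal sketch of proxy-model-based shortest-job-first scheduling.
# All names here are illustrative assumptions, not the paper's code.
import heapq
from dataclasses import dataclass, field
from typing import List

@dataclass(order=True)
class Request:
    predicted_length: int           # proxy model's output-length estimate (sort key)
    prompt: str = field(compare=False)

def predict_output_length(prompt: str) -> int:
    """Hypothetical proxy predictor: a real system would query a small
    trained model; here a trivial heuristic stands in for it."""
    return max(16, len(prompt.split()) * 4)

def schedule(prompts: List[str]) -> List[Request]:
    """Order pending requests shortest-predicted-job-first."""
    queue: List[Request] = []
    for p in prompts:
        heapq.heappush(queue, Request(predict_output_length(p), p))
    return [heapq.heappop(queue) for _ in range(len(queue))]

if __name__ == "__main__":
    for r in schedule(["Summarize this article", "Hi", "Write a long essay on scheduling"]):
        print(r.predicted_length, r.prompt)
```

The design point is that even a rough length estimate lets the scheduler approximate shortest-job-first, which is impossible under plain FCFS when generation lengths are unknown in advance.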

QLM: Queue Management for Large Language Model Serving

The emergence of large language models (LLMs) has introduced excessive computational demands and unique execution patterns (i.e., nondeterministic execution time due to autoregressive patterns) for cloud providers. Consequently, existing LLM serving …
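A simplified sketch can hint at what SLO-aware queue management looks like. This is an assumption-laden illustration, not QLM's system: QLM's actual planner operates over request groups and virtual queues, whereas the sketch below reduces the idea to earliest-deadline-first ordering over per-request latency SLOs; all names (`SLOQueue`, `submit`, `next_batch`) are hypothetical.

```python
# Illustrative SLO-aware queue reordering (a simplification, not QLM's planner).
# Requests closest to violating their latency SLOs are dequeued first.
import heapq
import time
from dataclasses import dataclass, field
from typing import List

@dataclass(order=True)
class QueuedRequest:
    deadline: float                 # arrival time + SLO; earliest served first
    request_id: str = field(compare=False)

class SLOQueue:
    def __init__(self) -> None:
        self._heap: List[QueuedRequest] = []

    def submit(self, request_id: str, slo_seconds: float) -> None:
        """Enqueue with a deadline derived from the request's latency SLO."""
        heapq.heappush(self._heap,
                       QueuedRequest(time.monotonic() + slo_seconds, request_id))

    def next_batch(self, batch_size: int) -> List[str]:
        """Pop the requests closest to missing their SLOs."""
        batch: List[str] = []
        while self._heap and len(batch) < batch_size:
            batch.append(heapq.heappop(self._heap).request_id)
        return batch

q = SLOQueue()
q.submit("chat-1", slo_seconds=1.0)    # tight, interactive SLO
q.submit("batch-7", slo_seconds=60.0)  # loose, throughput-oriented SLO
q.submit("chat-2", slo_seconds=0.5)
print(q.next_batch(2))  # -> ['chat-2', 'chat-1']
```

The usage at the bottom shows the intended behavior: interactive requests with tight SLOs jump ahead of throughput-oriented ones, even though they arrived later.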