LLM

Power-aware Deep Learning Model Serving with µ-Serve

With the increasing popularity of large deep learning model-serving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while satisfying throughput and model-serving latency requirements. Model …

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the …

QLM: Queue Management for Large Language Model Serving

The emergence of large language models (LLMs) has introduced substantial computational demands and unique execution patterns (i.e., nondeterministic execution times due to autoregressive generation) for cloud providers. Consequently, existing LLM serving …