Long Context

Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

As large language models (LLMs) handle increasingly longer contexts, serving inference requests with contexts of millions of tokens presents unique challenges. We show that existing work for long-context inference is largely based on techniques from long …