Beyond Prediction: Tail-Aware Scheduling for LLM Inference

Abstract

LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or length rankings and primarily report mean-centric metrics (e.g., TTFT/TBT). We show that these prediction-driven policies can be fragile under distribution shift, bursty arrivals, and GPU memory pressure, and that, even with perfect decode-length knowledge, they offer limited control over the tail latency (P90–P99) that dominates user experience. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft, gamma-parameterized priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling with cache-aware preemption to account for memory-coupled decode dynamics that vary across workload mixes. Evaluated on Azure production traces, our method achieves up to 35–50% lower P99 TTLT than SRPT with perfect length prediction and 34–47% lower TTFT across diverse workloads, including reasoning-heavy and chat-heavy tasks, demonstrating a robust alternative for tail-latency optimization in online LLM serving.
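The abstract describes "soft, gamma-parameterized priority boosting" only at a high level. As a rough illustration of the general idea (a prediction-free priority score that favors requests with little attained service while softly boosting long-waiting requests), the sketch below is a hypothetical interpretation: the class name, the scoring formula, and the role of the gamma exponent are assumptions for illustration, not the paper's actual algorithm.

```python
import time
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    arrival: float          # arrival timestamp (seconds, monotonic clock)
    decoded_tokens: int = 0  # tokens generated so far (attained service)


class SoftBoostScheduler:
    """Minimal sketch of prediction-free, soft priority boosting.

    Instead of predicting decode length, each request's score combines its
    attained service (tokens decoded so far) with a waiting-time boost
    controlled by a tunable exponent `gamma`. Hypothetical example only.
    """

    def __init__(self, gamma: float = 0.5):
        self.gamma = gamma
        self.requests: list[Request] = []

    def score(self, req: Request, now: float) -> float:
        # Lower score = scheduled sooner. Short attained service is favored
        # (least-attained-service style, no length prediction needed), while
        # the gamma-weighted waiting time softly boosts long-waiting requests
        # so none of them starves in the tail.
        wait = max(now - req.arrival, 1e-6)
        return req.decoded_tokens / (wait ** self.gamma)

    def next_batch(self, budget: int) -> list[Request]:
        # Pick the `budget` highest-priority (lowest-score) requests.
        now = time.monotonic()
        ranked = sorted(self.requests, key=lambda r: self.score(r, now))
        return ranked[:budget]
```

With gamma near 0 the policy behaves like pure least-attained-service; larger gamma values weight waiting time more heavily and tighten the tail at some cost to the mean, which is one plausible way a single parameter could trade off mean-centric and tail-centric latency.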

Publication
Proceedings of the 43rd International Conference on Machine Learning (ICML '26)