Haoran Qiu | Microsoft AzRS
LLM Inference
From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models
Serving large generative models such as LLMs and multimodal transformers requires balancing user-facing SLOs (e.g., time-to-first-token, time-between-tokens) with provider goals of efficiency and cost reduction. Existing solutions rely on static …