Machine Learning

Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

As large language models (LLMs) handle increasingly longer contexts, serving inference requests with contexts of millions of tokens presents unique challenges. We show that existing work on long-context inference is largely based on techniques from long …

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics (e.g., TTFT/TBT). We show …
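For readers unfamiliar with the baseline such schedulers approximate, the sketch below shows SRPT-style ordering by predicted remaining decode tokens. It is an illustrative toy, not this paper's scheduler: the Request fields, the SRPTQueue API, and the external predictor supplying predicted_total_tokens are all assumptions.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    remaining: int                 # predicted tokens still to decode (sort key)
    rid: str = field(compare=False)
    decoded: int = field(default=0, compare=False)

class SRPTQueue:
    """Toy SRPT approximation driven by predicted decode lengths."""

    def __init__(self) -> None:
        self._heap: list[Request] = []

    def submit(self, rid: str, predicted_total_tokens: int) -> None:
        # Initial remaining-work estimate comes from a length predictor.
        heapq.heappush(self._heap, Request(predicted_total_tokens, rid))

    def next_request(self) -> Request | None:
        # Serve the request with the shortest predicted remaining work.
        return heapq.heappop(self._heap) if self._heap else None

    def requeue(self, req: Request, tokens_decoded: int) -> None:
        # After a decode step, shrink the estimate and reinsert.
        req.decoded += tokens_decoded
        req.remaining = max(req.remaining - tokens_decoded, 0)
        heapq.heappush(self._heap, req)
```

A serving loop would pop a request, run one decode iteration, and requeue it with the tokens actually produced, dropping finished requests instead of reinserting them; mispredicted lengths are exactly what makes this approximation fragile at the tail.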

Fast or Slow? Human-Inspired Self-Evolving Framework for Resilient AI Systems

This paper proposes a disruptive shift toward human-like self-evolving loops as a foundation for resilient AI systems. At the core of our proposal is the PURER loop (Perceive, Update, Reason, Execute, Reflection), a cognitive-inspired framework that …
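A PURER-style loop can be pictured as five pluggable stages. The minimal Python rendering below assumes each stage is an injected callable and a dict carries the evolving state; the stage names come from the abstract, while every interface here is hypothetical.

```python
from typing import Any, Callable

def purer_loop(
    perceive: Callable[[], Any],           # gather signals from the environment
    update: Callable[[Any, dict], dict],   # fold observations into state/memory
    reason: Callable[[dict], Any],         # choose an action (fast or slow path)
    execute: Callable[[Any], Any],         # act on the environment
    reflect: Callable[[dict, Any], dict],  # critique the outcome, revise state
    steps: int = 10,
) -> dict:
    state: dict = {}
    for _ in range(steps):
        observation = perceive()
        state = update(observation, state)
        action = reason(state)
        outcome = execute(action)
        state = reflect(state, outcome)    # self-evolution: outcome feeds back
    return state
```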

ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving

Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text. However, efficiently serving LMMs in production environments poses significant challenges due to their complex architectures …
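One way to picture the modality- and stage-aware disaggregation named in the title is a routing table mapping each (modality, stage) pair to its own resource pool, so encoders and LLM prefill/decode can scale independently. The pool names and route function below are illustrative assumptions, not ModServe's API.

```python
# Hypothetical (modality, stage) -> resource-pool routing table.
POOLS: dict[tuple[str, str], str] = {
    ("image", "encode"):  "vision-encoder-pool",
    ("audio", "encode"):  "audio-encoder-pool",
    ("text",  "prefill"): "llm-prefill-pool",
    ("text",  "decode"):  "llm-decode-pool",
}

def route(modality: str, stage: str) -> str:
    """Pick the pool that serves this modality at this pipeline stage."""
    try:
        return POOLS[(modality, stage)]
    except KeyError:
        raise ValueError(f"no pool configured for ({modality}, {stage})")
```

The point of the split is that each pool can be sized and scaled to its own bottleneck (encoder compute vs. prefill compute vs. decode memory bandwidth) rather than provisioning one monolithic replica for the worst case.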

Towards Efficient Large Multimodal Model Serving

Recent advances in generative AI have led to large multi-modal models (LMMs) capable of simultaneously processing inputs of various modalities such as text, images, video, and audio. While these models demonstrate impressive capabilities, efficiently …

TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques are often inadequate for LLM inference due to the fine-grained, millisecond-scale execution …

INDIGO: Page Migration for Hardware Memory Disaggregation Across a Network

Hardware memory disaggregation (HMD) is an emerging technology that enables access to remote memory, thereby creating expansive memory pools and reducing memory underutilization in datacenters. However, a significant challenge arises when accessing …

Power-aware Deep Learning Model Serving with µ-Serve

With the increasing popularity of large deep learning model-serving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while satisfying its throughput or model-serving latency requirements. Model …

QLM: Queue Management for Large Language Model Serving

Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, …

FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms

The emergence of ML in various cloud system management tasks (e.g., workload autoscaling and job scheduling) has become a core driver of ML-centric cloud platforms. However, there are still numerous algorithmic and systems challenges that prevent …