Machine Learning

QLM: Queue Management for Large Language Model Serving

Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to …

Delay Sensitivity-driven Congestion Mitigation for HPC Systems

Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network, leading to congestion. Consequently, application runtime variability and suboptimal system utilization are …

FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices

Modern user-facing, latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing compute resources across microservices is still …

AutoMice: A Testbed Framework for Self-Driving Systems

AutoMice is designed to ease the transition from testbed validation to deployment in production by using two abstraction layers on both the input and output of a self-driving system.