
4 posts tagged with "blog posts"

everyday blog posts


Predicted-Latency Based Scheduling for LLMs

· 28 min read
Kaushik Mitra
Software Engineer, Google
Benjamin Braun
Software Engineer, Google
Abdullah Gharaibeh
Senior Staff Software Engineer, Google
Clayton Coleman
Distinguished Engineer, Google

Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.
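
The idea can be sketched with a toy estimator (illustrative only; the post's actual model, features, and training loop are not described here): each server gets an online-updated estimate of per-token latency, and each request is dispatched to the server with the lowest predicted completion time, accounting for work already queued there.

```python
class ServerLatencyEstimator:
    """Tracks an exponential moving average of observed per-token latency
    for one server (a toy stand-in for the trained model in the post)."""

    def __init__(self, alpha=0.2, initial_ms_per_token=1.0):
        self.alpha = alpha
        self.ms_per_token = initial_ms_per_token

    def observe(self, prompt_tokens, latency_ms):
        # Online update from live traffic: move the estimate toward the
        # latest observed per-token latency.
        self.ms_per_token += self.alpha * (latency_ms / prompt_tokens - self.ms_per_token)

    def predict(self, prompt_tokens, queued_tokens=0):
        # Predicted latency includes tokens already queued on the server.
        return (prompt_tokens + queued_tokens) * self.ms_per_token


def pick_server(estimators, prompt_tokens, queued):
    """Route to the server with the lowest predicted latency."""
    return min(estimators, key=lambda s: estimators[s].predict(prompt_tokens, queued[s]))
```

Because the estimator is trained from observed latencies rather than hand-tuned weights, a server that slows down (contention, longer contexts) automatically starts receiving less traffic.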

Native KV Cache Offloading to Any Filesystem with llm-d

· 11 min read
Kfir Toledo
Research Staff Member, IBM
Danny Harnik
Senior Technical Staff Member, IBM
Effi Ofer
Research Staff Member, IBM
Or Ozeri
Research Staff Member, IBM
Guy Margalit
Senior Technical Staff Member, IBM Storage CTO Office

llm-d is a distributed inference platform spanning multiple vLLM instances. KV cache hits are critical to achieving high inference throughput, yet in a distributed environment cache hits do not occur across nodes, because the KV cache is local to each vLLM instance. That local cache is also limited in size, further limiting KV data reuse. This blog presents a new way to offload KV cache to storage, tackling both challenges: KV cache sharing and KV cache scale. llm-d's filesystem (FS) backend is a KV cache storage connector for vLLM that offloads KV blocks to shared storage, built on vLLM's native Offloading Connector. While the llm-d FS backend can speed up individual requests (improving TTFT), its main goal is to preserve stable throughput and low latency at scale as concurrency and context lengths grow. It does this by significantly enlarging the cache space and enabling KV reuse across multiple replicas and nodes in llm-d.

While there are a number of existing solutions for KV cache offload to storage (e.g. LMCache or Dynamo KVBM), the new connector offers simplicity, runs with llm-d and vLLM as its only dependencies, and outperforms state-of-the-art shared storage connectors.
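
To make the mechanism concrete, here is a minimal sketch of a filesystem-backed KV block store (a toy illustration, not the actual llm-d FS backend API): blocks are keyed by a chained hash over the token prefix, so a key identifies both a block and everything before it, and any replica that mounts the same filesystem can look up how much of an incoming prompt is already cached.

```python
import hashlib
import pathlib
import pickle

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)


def block_keys(token_ids):
    """Chain-hash each full block over the whole prefix, so a key identifies
    the block AND all tokens before it, as prefix caching requires."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        h.update(str(token_ids[i:i + BLOCK_SIZE]).encode("utf-8"))
        keys.append(h.copy().hexdigest())
    return keys


class FSKVStore:
    """Toy shared-filesystem KV block store: one file per block key."""

    def __init__(self, root):
        self.root = pathlib.Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key, kv_block):
        (self.root / key).write_bytes(pickle.dumps(kv_block))

    def longest_cached_prefix(self, token_ids):
        """Return how many leading tokens have their KV blocks on disk."""
        n = 0
        for key in block_keys(token_ids):
            if not (self.root / key).exists():
                break
            n += BLOCK_SIZE
        return n
```

Because the keys depend only on the token prefix, two different replicas computing the same prompt derive the same keys and hit the same files on the shared mount, which is what enables cross-node KV reuse.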

KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d

· 21 min read
Maroon Ayoub
Research Scientist & Architect, IBM
Danny Harnik
Senior Technical Staff Member, IBM
Tyler Smith
Member of Technical Staff, Red Hat
Kellen Swain
Software Engineer, Google
Xining Wang
Senior Technical Expert, Alibaba Cloud
Hang Yin
Senior R&D Engineer, Alibaba Cloud
Kay Yan
Principal Software Engineer, DaoCloud

The llm-d project provides a series of “well-lit paths”: tested, benchmarked solutions for deploying large language models in production. Our first path, Intelligent Inference Scheduling, established a baseline for AI-aware routing by balancing both cluster load and prefix-cache affinities. The default configuration for that path uses an approximate method for the latter, predicting cache locality based on request traffic.

This blog illuminates a more advanced and powerful path: precise prefix-cache aware scheduling.

We take a deep dive into the next generation of this feature, which moves beyond prediction and gives the scheduler direct introspection into distributed vLLM caches. This precision is key to maximizing cache hit rates, unlocking a new level of performance and cost-efficiency in your distributed deployments.

Blog key takeaways
  • KV-cache hit rates directly impact your bottom line: With 10x cost differences between cached and uncached tokens, cache efficiency isn't just a performance optimization — it's a fundamental cost and performance driver
  • This isn't theoretical: Real production workloads like conversational AI and agentic workflows naturally create the prefix-heavy patterns where this approach excels
  • vLLM's prefix caching breaks in distributed deployments: Standard load balancers scatter related requests across pods, destroying cache locality and forcing expensive re-computation
  • Precise prefix-cache aware scheduling delivers order-of-magnitude gains: Our benchmarks show 57x faster response times and double the throughput on identical hardware
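
A simplified sketch of how precise prefix-cache aware scoring can work (hypothetical weights and function signature; the real scheduler combines more signals): each pod reports the KV block keys it currently holds, and the scheduler scores pods by the fraction of the request's block-key prefix they have cached, minus a load penalty.

```python
def score_pods(request_block_keys, pod_cached_blocks, pod_load,
               prefix_weight=2.0, load_weight=1.0):
    """Pick the pod with the best weighted score of prefix match vs. load.

    request_block_keys: ordered block keys for the incoming request.
    pod_cached_blocks:  pod name -> set of block keys that pod holds.
    pod_load:           pod name -> load in [0, 1].
    """
    scores = {}
    for pod, cached in pod_cached_blocks.items():
        matched = 0
        for key in request_block_keys:
            if key not in cached:
                break  # prefix match must be contiguous from the start
            matched += 1
        prefix_score = matched / max(len(request_block_keys), 1)
        scores[pod] = prefix_weight * prefix_score - load_weight * pod_load[pod]
    return max(scores, key=scores.get)
```

With direct cache introspection the `matched` count reflects what is actually resident on each pod, rather than a guess inferred from past routing decisions, which is the gap this post's approach closes.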

Intelligent Inference Scheduling with llm-d

· 10 min read
Nili Guy
R&D Manager, AI Infrastructure, IBM
Vita Bortnikov
IBM Fellow, IBM
Etai Lev Ran
Cloud Architect, IBM
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google

The llm-d project lays out clear, “well-lit” paths for anyone to adopt the leading inference optimizations within their existing deployment framework: Kubernetes. These are tested approaches designed to make complex deployments easier and more efficient. In this post, we explore the first of these paths: intelligent inference scheduling. Unlike basic round-robin load balancing, this method takes the unique demands of LLMs into account, leading to better performance across the board: higher throughput, lower latency, and efficient use of resources.

Why Intelligent Scheduling Is Needed for LLM Inference

Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same time. In contrast, LLM requests can vary wildly in token count and compute needs, making simple load-spread strategies prone to bottlenecks and imbalanced traffic.
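
The imbalance is easy to see in a toy simulation (illustrative only, not llm-d's scheduler): with heavy-tailed request costs, round-robin can pile every expensive request onto the same server, while even a simple least-load policy spreads them out.

```python
def simulate(requests, n_servers, policy):
    """Assign each request's token cost to a server and return the maximum
    per-server load, a rough proxy for tail latency under imbalance."""
    load = [0] * n_servers
    for i, cost in enumerate(requests):
        if policy == "round_robin":
            s = i % n_servers
        else:  # "least_load": pick the server with the least outstanding work
            s = min(range(n_servers), key=load.__getitem__)
        load[s] += cost
    return max(load)
```

For example, with a repeating pattern of one 4000-token request followed by three 50-token requests across 4 servers, round-robin sends every 4000-token request to the same server, while least-load keeps the maximum per-server load close to the average.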

[Figure: intelligent inference scheduling diagram]