10 posts tagged with "blog posts"

everyday blog posts


Networking for Distributed Inference in llm-d

June 23, 2026 · 18 min read

Pravein Govindan Kannan

Staff Research Scientist, IBM

Liran Schour

Senior Research Scientist, IBM Research

Aleksander Slominski

Senior Research Scientist, IBM Research

Raj Joshi

Senior Machine Learning Engineer, Red Hat

Nicolò Lucchesi

Senior Machine Learning Engineer, Red Hat

Carlos Costa

Distinguished Engineer, IBM

Moein Khazraee

Senior Architect, NVIDIA

Omri Kahalon

Senior Manager, NVIDIA

Networking: The Critical Path in P/D Disaggregation

llm-d's prefill-decode disaggregation unlocks significant efficiency gains by separating compute-heavy prefill from memory-bandwidth-heavy decode onto dedicated GPU pools. But it introduces a hard dependency on the network: the KV Cache must be transferred from prefill to decode before the first token can be generated. This transfer time lands directly on the Time to First Token (TTFT) — making networking a first-order concern for end-to-end inference latency.
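To get a feel for why the transfer matters, here is a back-of-the-envelope sketch of how KV cache size translates into wire time. The model shape, prompt length, and link rates are hypothetical values chosen for illustration, and the estimate assumes uniform full-attention layers and an ideal link with no protocol overhead.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Size of the KV cache for one request, assuming uniform full attention."""
    # Factor 2: each layer stores one key and one value vector per KV head per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens


def transfer_seconds(num_bytes, link_gbit_per_s):
    """Idealized wire time: payload bits divided by link rate, no overhead."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


# Hypothetical model shape and prompt length, chosen only for illustration.
size = kv_cache_bytes(num_tokens=8192, num_layers=32, num_kv_heads=8, head_dim=128)
for gbit in (100, 400):
    ms = transfer_seconds(size, gbit) * 1e3
    print(f"{size / 2**30:.1f} GiB over {gbit} Gbit/s: {ms:.1f} ms")
```

Under these assumptions a single 8K-token prompt carries 1 GiB of KV state, which takes roughly 86 ms on a 100 Gbit/s link and roughly 21 ms on a 400 Gbit/s link. All of that time is added to TTFT before decode can start.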

This post dives into llm-d's networking stack — how it works today and how it's evolving in collaboration with NVIDIA.

Serving Hybrid Models at Scale in llm-d

June 13, 2026 · 14 min read

Kfir Toledo

Research Staff Member, IBM

Or Ozeri

Research Staff Member, IBM

Danny Harnik

Senior Technical Staff Member, IBM

Itay Etelis

Research Staff Member, IBM

Rachel Brill

Senior Technical Staff Member, IBM

Maroon Ayoub

Senior Principal Machine Learning Engineer, Red Hat

For most of the transformer era, the KV cache rested on a quiet assumption: one model, one uniform cache. Every layer attended the same way, every block was the same size, and everything built on top of the cache (allocators, offload connectors, schedulers) could treat it as a single pool.

Hybrid models broke this assumption. Recent frontier and open-weight models increasingly mix attention types within a single model (full attention next to sliding-window, linear, or Mamba layers), making the cache heterogeneous: different layers now hold different amounts of state, in different shapes, with different reuse rules. A cache block that used to be allocated as one uniform unit now consists of several distinct parts.

To serve a hybrid model efficiently, an AI inference platform has to handle that heterogeneity in at least three aspects of the stack:

GPU Memory Allocation: How the cache is laid out and allocated on the GPU. vLLM solved this with its Hybrid Memory Allocator (HMA), rebuilt around a unified allocator (see Hybrid Models as First-Class Citizens in vLLM).
KV Offloading: Extending the KV cache to CPU and storage. An offloading connector that is not HMA-aware turns the HMA off, giving up its GPU memory improvements and potential data movement savings.
KV-Aware Routing: Sending each request to the right model-server replica. A router that ignores the hybrid memory structure can misreport whether a node holds the required KV data, because it relies on information from only some of the layers.

vLLM's HMA solved hybrid GPU memory allocation when handling a single vLLM instance. This post shows how llm-d extends that to tiered KV cache management - including KV offloading to CPU and storage, and KV-aware request routing - significantly improving throughput and latency at scale for hybrid models.

Heterogeneous inference serving across three GPU vendors with llm-d

June 9, 2026 · 10 min read

Pravein Govindan Kannan

Staff Research Scientist, IBM

Praveen Jayachandran

Senior Technical Staff Member, IBM

Jaikrishnan Hari

Research Partnerships & BD Executive, IBM

Varun Raste

Solution Architect, IBM

Prasad Mukhedkar

Associate Principal AI Architect, Red Hat

Vinod Pathangay

Chief Architect, Field CTO Organization, Red Hat

Jayanth Babu Reddy

Principal Architect, NxtGen Cloud Technologies

Abhisyant Anasapurapu

VP, NxtGen Cloud Technologies

Most production inference clusters today are single-vendor because that is often the simplest way to configure and operate a cluster.

That is starting to change. Procurement cycles bring new generations alongside older ones, supply planning spans multiple accelerator options, and cost/performance profiles differ by workload. Real production fleets are accumulating heterogeneity whether or not the architecture planned for it.

This is an opportunity to unlock real value: different accelerator classes can be matched to workload requirements, stranded capacity gets reclaimed, and operators gain more flexibility in capacity planning. The case is stronger still for sovereign and on-premise deployments, where data residency, regulatory alignment, and the long-term economics of high-volume inference make local fleet optimization especially important.

Making that work in practice is a non-trivial systems problem. Each accelerator stack brings its own optimized drivers, firmware, container images, runtime settings, and attention kernels. A coherent serving layer needs to preserve those platform-specific optimizations while still giving operators one control plane for routing, observability, and policy.

BLIS: Evolving llm-d at Simulation Speed

June 5, 2026 · 16 min read

Mert Toslali

Research Scientist, IBM

Dipanwita Guhathakurta

Software Engineer, IBM

Srinivasan Parthasarathy

Principal Research Scientist, IBM

Jing Chen

Software Engineer, IBM

Nick Masluk

Research Scientist, IBM

Vishakha Ramani

Research Scientist, IBM

Michael Kalantar

Software Engineer, IBM

Asser Tantawi

Research Scientist, IBM

Fabio Oliveira

Senior Research Manager, IBM

Carlos Costa

Distinguished Engineer, IBM

Deploying llm-d is not just a question of choosing a model server and adding GPUs. In a production inference deployment, operators have to choose routing policies, admission behavior, batching settings, KV-cache reuse strategies, prefill/decode placement, and autoscaling rules under concrete TTFT, ITL, throughput, and cost constraints.

These choices are coupled. A routing change that improves cache locality can concentrate load. A prefill/decode threshold that helps one workload can hurt another. An admission policy that protects critical traffic can reduce total served volume. A change in any one policy can shift TTFT, inter-token latency, throughput, SLO compliance, and accelerator cost in ways that are difficult to predict analytically.

The only reliable way to confirm those tradeoffs is to measure them in a GPU-backed llm-d cluster. But using cluster runs as the first step in every policy or capacity-planning experiment is too slow and expensive. BLIS provides a faster inner loop: a calibrated discrete-event simulator for distributed inference systems like llm-d. Developers can evaluate candidate policies and deployment configurations locally, then reserve cluster validation for the candidates most likely to matter.

Blog key takeaways

BLIS is a discrete-event simulator: It models admission, routing, scheduling, KV cache, batching, and prefill/decode placement without loading model weights or occupying GPUs.
Calibrated fidelity: Median 7–9% error on end-to-end and inter-token latency across 36 validation experiments spanning 8B–141B parameter models, H100/A100/L40S GPUs, and diverse workloads. Approximately 200× faster than equivalent cluster runs.
Admission control case study: An AI-native policy-search loop using BLIS discovered a probabilistic admission controller that reduced critical-tier TTFT p90 by up to 97% and end-to-end latency by up to 50%, validated on a real llm-d cluster.
Capacity planning: BLIS evaluates hundreds of deployment configurations in minutes, producing ranked Pareto-optimal candidates before any GPU time is spent.
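For readers new to discrete-event simulation, the core idea fits in a few lines: keep a time-ordered heap of events and jump from one event to the next instead of running real work. The sketch below is not BLIS; it is a minimal single-server FIFO queue with a fixed, made-up service time, and the function and variable names are illustrative.

```python
import heapq
from collections import deque


def simulate_fifo(arrivals, service_time):
    """Simulate one FIFO server; return each request's latency (finish - arrival)."""
    # Event tuples sort by time first, so the heap always yields the next event.
    events = [(t, rid, "arrive") for rid, t in enumerate(arrivals)]
    heapq.heapify(events)
    waiting = deque()
    busy = False
    latency = [0.0] * len(arrivals)

    while events:
        now, rid, kind = heapq.heappop(events)
        if kind == "arrive":
            waiting.append(rid)
        else:  # "finish": record latency and free the server
            latency[rid] = now - arrivals[rid]
            busy = False
        if not busy and waiting:
            nxt = waiting.popleft()
            busy = True
            heapq.heappush(events, (now + service_time, nxt, "finish"))
    return latency


latencies = simulate_fifo(arrivals=[0.0, 0.1, 0.2], service_time=1.0)
print([round(x, 2) for x in latencies])
```

Three requests arriving 100 ms apart queue behind each other, and the simulated latencies grow accordingly, without any request actually being served. A simulator for a real inference stack replaces the fixed service time with calibrated models of batching, KV cache behavior, and routing.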

No Kubernetes? No Problem: llm-d Now Runs Anywhere

May 26, 2026 · 17 min read

Ezra Silvera

Senior Technical Staff Member, IBM

llm-d was born Kubernetes-native. Its workers are Deployments, its endpoints live in an InferencePool, and its guides assume a cluster is one kubectl away. That made sense: Kubernetes is where most production inference runs, and building on it gave llm-d a head start on networking, lifecycle, and scale.

But the thing that makes llm-d llm-d - KV-cache-aware scoring, prefix-cache affinity, prefill/decode disaggregation, flow control - was never fundamentally about Kubernetes. It is routing intelligence. It reasons about the state of a fleet of model servers and decides where each request should go. Nothing about that logic needs an API server. The dependency on Kubernetes was incidental, inherited from how endpoints happened to be discovered, not essential to what the router actually does.

This post is about pulling those two things apart. We introduce the EndpointDiscovery abstraction in the llm-d router that separates what endpoints exist from how to route across them, and the first plugin built on it - file discovery - which lets the full routing stack run as a plain process or container with no Kubernetes anywhere in sight: on an HPC cluster, inside a Ray job, on a bare-metal rack, or on your laptop.


Figure 1: The big picture - one routing stack under every platform. llm-d discovers endpoints through its EndpointDiscovery module (Kube Discovery against an InferencePool, File Discovery against everything else) and serves requests the same way on Kubernetes, Slurm, Ray, or bare metal (inference, HPC, and RL rollout workloads: veRL, SkyRL, prime-rl). The rest of this post explains how.
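The separation described above can be pictured with a small sketch. This is not the llm-d router's actual code or API: the class names mirror the post's terminology, but the method name, the JSON file layout, and the placeholder routing policy are all assumptions made for illustration.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class EndpointDiscovery(ABC):
    """Answers only 'which endpoints exist?'; routing logic lives elsewhere."""

    @abstractmethod
    def list_endpoints(self) -> list[str]:
        ...


class FileDiscovery(EndpointDiscovery):
    """Reads endpoints from a JSON file (hypothetical format: {"endpoints": [...]})."""

    def __init__(self, path: str):
        self.path = path

    def list_endpoints(self) -> list[str]:
        with open(self.path) as f:
            return list(json.load(f)["endpoints"])


class StaticDiscovery(EndpointDiscovery):
    """Stand-in for any other source, such as a cluster API."""

    def __init__(self, endpoints: list[str]):
        self.endpoints = endpoints

    def list_endpoints(self) -> list[str]:
        return list(self.endpoints)


def route_request(discovery: EndpointDiscovery, request_id: int) -> str:
    # Placeholder policy; a real router would score endpoints instead.
    endpoints = discovery.list_endpoints()
    return endpoints[request_id % len(endpoints)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "endpoints.json")
    with open(path, "w") as f:
        json.dump({"endpoints": ["10.0.0.1:8000", "10.0.0.2:8000"]}, f)
    print(route_request(FileDiscovery(path), request_id=3))
```

The point of the design is that `route_request` never learns where the endpoint list came from, so the same routing code runs whether the list is read from a file or supplied by a cluster.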

Production-Grade LLM Inference at Scale with KServe, llm-d, and vLLM

April 21, 2026 · 5 min read

Yuan Tang

Senior Principal Software Engineer, Red Hat

Scott Cabrinha

Staff Site Reliability Engineer, Tesla

Robert Shaw

Director of Engineering, Red Hat

Sai Krishna

Staff Software Engineer, Tesla

The Problem with "Simple" LLM Deployments

Everyone is racing to run Large Language Models (LLMs): in the cloud, on-prem, and even on edge devices. The real challenge, however, isn't the first deployment; it's scaling, managing, and maintaining hundreds of LLMs efficiently. We initially approached this challenge with a straightforward vLLM deployment wrapped in a Kubernetes StatefulSet.

Predicted-Latency Based Scheduling for LLMs

March 13, 2026 · 28 min read

Kaushik Mitra

Software Engineer, Google

Benjamin Braun

Software Engineer, Google

Abdullah Gharaibeh

Senior Staff Software Engineer, Google

Clayton Coleman

Distinguished Engineer, Google

Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.
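As a rough illustration of the idea, the sketch below trains a tiny linear latency predictor online, one observation at a time, and routes each request to the server with the lowest predicted latency. It is not the system described in the post: the features, the linear model, the learning rate, and the synthetic training data are all assumptions chosen to keep the example short.

```python
class OnlineLatencyModel:
    """Linear latency predictor updated one observation at a time (SGD)."""

    def __init__(self, num_features, lr=0.1):
        self.w = [0.0] * num_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, observed_s):
        # Gradient step on squared error for a single sample.
        err = self.predict(x) - observed_s
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def features(prompt_tokens, queued_tokens):
    # Scale to thousands of tokens so a fixed learning rate stays stable.
    return [prompt_tokens / 1000, queued_tokens / 1000]


def route_by_predicted_latency(model, prompt_tokens, queued_by_server):
    """Pick the server with the lowest predicted latency for this request."""
    return min(
        queued_by_server,
        key=lambda s: model.predict(features(prompt_tokens, queued_by_server[s])),
    )


model = OnlineLatencyModel(num_features=2)
# Synthetic observations: latency grows with prompt length and queue depth.
for _ in range(200):
    for prompt, queued in [(500, 0), (500, 2000), (1500, 0), (1500, 2000)]:
        observed = 0.2 * prompt / 1000 + 0.1 * queued / 1000
        model.update(features(prompt, queued), observed)

print(route_by_predicted_latency(model, 800, {"server-a": 1800, "server-b": 200}))
```

Once the model has learned that queue depth adds latency, the request goes to the less loaded server. No hand-tuned weight between "prompt length" and "queue depth" is needed, because the relative importance of each signal is learned from observed latencies.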

Native KV Cache Offloading to Any Filesystem with llm-d

February 10, 2026 · 11 min read

Kfir Toledo

Research Staff Member, IBM

Danny Harnik

Senior Technical Staff Member, IBM

Effi Ofer

Research Staff Member, IBM

Or Ozeri

Research Staff Member, IBM

Guy Margalit

Senior Technical Staff Member, IBM Storage CTO Office

llm-d is a distributed inference platform spanning multiple vLLM instances. KV cache hits are critical to achieving high inference throughput. Yet in a distributed environment, cache hits do not occur across different nodes, because the KV cache is local to each vLLM instance. In addition, this local cache is limited in size, further limiting KV data reuse. This blog presents a new way to offload KV cache to storage that tackles both challenges: KV cache sharing and KV cache scale. llm-d's filesystem (FS) backend is a KV cache storage connector for vLLM that offloads KV blocks to shared storage based on vLLM's native Offloading Connector. While the llm-d FS backend can speed up serving of single requests (improve TTFT), its main goal is to preserve stable throughput and low latency at scale, as concurrency and context lengths grow. This is accomplished by significantly enlarging the cache space and enabling KV reuse across multiple replicas and nodes in llm-d.

While there are a number of existing solutions for KV cache offload to storage (e.g. LMCache or Dynamo KVBM), the new connector offers simplicity, can run with llm-d and vLLM as its only dependencies, and exhibits improved performance over state-of-the-art shared storage connectors.

KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d

September 24, 2025 · 21 min read

Maroon Ayoub

Research Scientist & Architect, IBM

Danny Harnik

Senior Technical Staff Member, IBM

Tyler Smith

Member of Technical Staff, Red Hat

Kellen Swain

Software Engineer, Google

Xining Wang

Senior Technical Expert, Alibaba Cloud

Hang Yin

Senior R&D Engineer, Alibaba Cloud

Kay Yan

Principal Software Engineer, DaoCloud

The llm-d project provides a series of “well-lit paths” - tested, benchmarked solutions for deploying large language models in production. Our first path, Intelligent Inference Scheduling, established a baseline for AI-aware routing by balancing both cluster load and prefix-cache affinities. The default configuration for that path uses an approximate method for the latter, predicting cache locality based on request traffic.

This blog illuminates a more advanced and powerful path: precise prefix-cache aware scheduling.

We take a deep dive into the next generation of this feature, which moves beyond prediction and gives the scheduler direct introspection into distributed vLLM caches. This precision is key to maximizing cache hit rates and achieving a new level of performance and cost-efficiency in your distributed deployments.

Blog key takeaways

KV-cache hit rates directly impact your bottom line: With 10x cost differences between cached and uncached tokens, cache efficiency isn't just a performance optimization — it's a fundamental cost and performance driver
This isn't theoretical: Real production workloads like conversational AI and agentic workflows naturally create the prefix-heavy patterns where this approach excels
vLLM's prefix caching breaks in distributed deployments: Standard load balancers scatter related requests across pods, destroying cache locality and forcing expensive re-computation
Precise prefix-cache aware scheduling delivers order-of-magnitude gains: Our benchmarks show 57x faster response times and double the throughput on identical hardware
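The scheduling idea behind these takeaways can be sketched in a few lines: split a prompt into fixed-size token blocks, chain-hash them so each hash identifies a whole prefix, and send the request to the pod that already holds the longest matching run of blocks. This is a simplified illustration, not llm-d's implementation: the block size, the hashing scheme, the `cache_index` structure, and the pod names are assumptions, and the sketch ignores load balancing entirely.

```python
import hashlib

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny for readability


def block_hashes(tokens):
    """Hash each full block together with the previous hash, so a hash names a whole prefix."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK_TOKENS
    for i in range(0, full, BLOCK_TOKENS):
        block = repr(tokens[i:i + BLOCK_TOKENS]).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes


def cached_prefix_blocks(request_hashes, pod_blocks):
    """Count how many leading blocks of the request the pod already holds."""
    n = 0
    for h in request_hashes:
        if h not in pod_blocks:
            break
        n += 1
    return n


def pick_pod(tokens, cache_index):
    """Choose the pod with the longest cached prefix (ties go to the first pod)."""
    req = block_hashes(tokens)
    return max(cache_index, key=lambda pod: cached_prefix_blocks(req, cache_index[pod]))


system_prompt = list(range(8))
cache_index = {"pod-a": set(), "pod-b": set(block_hashes(system_prompt))}
print(pick_pod(system_prompt + [100, 101, 102, 103], cache_index))
```

A round-robin balancer would send this request to either pod with equal probability. Consulting the cache index sends it to the pod that can skip prefill for the shared system prompt.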

Intelligent Inference Scheduling with llm-d

September 3, 2025 · 10 min read

Nili Guy

R&D Manager, AI Infrastructure, IBM

Vita Bortnikov

IBM Fellow, IBM

Etai Lev Ran

Cloud Architect, IBM

Robert Shaw

Director of Engineering, Red Hat

Clayton Coleman

Distinguished Engineer, Google

The llm-d project lays out clear, “well-lit” paths for anyone to adopt the leading inference optimizations within their existing deployment framework - Kubernetes. These are tested approaches designed to make complex deployments easier and more efficient. In this post, we explore the first of these paths: intelligent inference scheduling. Unlike basic round-robin load balancing, this method takes the unique demands of LLMs into account, leading to better performance across the board: higher throughput, lower latency, and efficient use of resources.

Why Intelligent Inference Is Needed for LLM Inference

Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same time. In contrast, LLM requests can vary wildly in token count and compute needs, making simple load-spread strategies prone to bottlenecks and imbalanced traffic.

