Skip to main content

llm-d v0.8: From Platform to Control Plane

ยท 12 min read

If v0.7 was about making features deployable โ€” standalone mode, Kustomize, documentation from scratch โ€” then v0.8 is the payoff. Flow Control, Batch Gateway, and multi-modal serving graduate from experimental to production. The routing layer reaches beyond Kubernetes into RL training loops. And the image pipeline simplifies from a growing matrix of custom builds to upstream vLLM images with llm-d as the control plane on top. Three axes define this release: graduating capabilities that were introduced in v0.7 to production readiness, extending llm-d into non-Kubernetes environments for reinforcement learning and Slurm-based research, and aligning the project's identity around what it actually is โ€” an inference control plane, not a fork of any engine underneath it.

The scope of what llm-d orchestrates has broadened considerably. A single deployment can now serve interactive chat, batch processing, multi-modal requests, and agentic multi-step workflows โ€” with routing intelligence that understands the differences between them. Forty-eight new contributors joined the project since v0.6, more than doubling v0.7's twenty-three, reflecting broadening adoption across the industry.

Delivering on Promises: Feature Graduationโ€‹

Flow Control moves from experimental to production. The centralized request queuing and admission control system introduced in v0.7 now ships with production guides for multi-tenant deployments. Flow Control classifies incoming requests by a FlowKey combining Fairness ID and Priority, maintains separate in-memory queues per flow, and dispatches based on priority bands, tenant fairness cycling, and request ordering. The result is no-regret scheduling: under saturation, the router holds requests rather than committing them to already-overloaded backend queues. The defaults still mimic first-come-first-served behavior, so existing deployments see no change until operators opt into multi-tenant policy.

Batch Gateway graduates to production with enterprise storage backend hardening. The OpenAI-compatible /v1/batches and /v1/files API handles up to 50,000 requests per batch job, with pluggable storage across PostgreSQL or Redis/Valkey for metadata, Redis/Valkey for queue and event streams, and S3 or filesystem for input/output files. Batch traffic coexists with interactive serving on shared infrastructure through integration with Flow Control backpressure, so operators no longer need to maintain separate clusters for offline workloads.

Multi-modal serving โ€” image and text models like LLaVA and Qwen-VL โ€” gains production guides with end-to-end testing, including disaggregated prefill/decode paths for multi-modal inputs. The inference simulator now supports multi-modal payloads, making it possible to validate routing behavior without provisioning GPU hardware.

Upstream, Not Fork: Aligning with vLLMโ€‹

One of the most consequential changes in v0.8 is also one of the simplest to describe: llm-d no longer builds its own vLLM images for most platforms.

The default GPU image is now vllm/vllm-openai:v0.23.0 โ€” the upstream release, unmodified. AMD ROCm moves to vllm/vllm-openai-rocm. Intel XPU moves to the prebuilt upstream vLLM XPU image. The custom llm-d-cuda-gb200, llm-d-xpu, and llm-d-rocm images are removed. The llm-d-cuda image remains available for advanced builds that require custom wheels (DeepGEMM, DeepEP), but it is no longer the default path.

This is not just a packaging simplification โ€” it is a statement about what llm-d is. The project is a control plane that wraps inference engines, not a distribution that replaces them. When vLLM ships a release, llm-d users get it directly. When operators need to pin a specific vLLM commit for a hotfix, they build that image themselves and point llm-d at it. The control plane โ€” the router, the endpoint picker, the KV cache management layer, the autoscaler โ€” is where llm-d's value lives.

The component renames in this release reflect the same thinking. llm-d-inference-scheduler becomes llm-d-router-endpoint-picker. llm-d-routing-sidecar becomes llm-d-router-disagg-sidecar. The names now describe function rather than aspiration: these are components of the llm-d Router, not a standalone scheduling system.

HPU (Intel Gaudi) support is removed in this release. The upstream vLLM project does not currently maintain Gaudi images, and maintaining a custom build path for a single accelerator ran counter to the upstream-first direction. If upstream vLLM adds Gaudi support, llm-d will pick it up automatically.

Beyond Kubernetes: RL, Slurm, and File Discoveryโ€‹

Every previous llm-d release assumed Kubernetes. v0.8 breaks that assumption.

The FileDiscovery plugin enables llm-d's routing layer to discover inference endpoints through filesystem-based service discovery rather than the Kubernetes API. Operators write endpoint addresses to a file; the router watches it. This unlocks llm-d on Slurm clusters, bare-metal research labs, and any environment where Kubernetes is not present or not wanted.

This matters most for reinforcement learning workflows. RL training loops require high-throughput inference for rollout generation โ€” the model being trained serves thousands of inference requests per training step. These workloads typically run on Slurm, not Kubernetes, and they need the inference scheduler to be embedded in the training loop rather than running as a separate service.

This is where the "From Platform to Control Plane" thesis lands concretely. llm-d's routing intelligence โ€” prefix cache locality, saturation awareness, flow control โ€” is useful anywhere inference happens, not just inside a Kubernetes cluster.

Intelligent Routing for the Agentic Eraโ€‹

Workloads are getting more complex. A single user session might involve a chain of inference calls โ€” retrieval, reasoning, tool use, synthesis โ€” each with different latency requirements, context dependencies, and failure modes. v0.8 adds routing intelligence designed for this reality.

Agentic workload routing enables the router to understand multi-step inference patterns. Combined with Responses API support, llm-d can now serve OpenAI-compatible agentic endpoints where the model orchestrates tool calls and multi-turn reasoning. The router tracks session affinity and prefix cache state across the steps of an agentic workflow, keeping related requests on the same backend where prior KV cache is resident.

DP-Aware scheduling graduates from its initial experimental form with WideEP support for DeepSeek-class MoE models. In expert-parallel deployments, different experts run on different GPU groups; the router needs to understand this topology to place requests on the correct data-parallel replica. The WideEP guide has been migrated to the main example path with nightly CI coverage on GKE and CKS.

Predicted latency scheduling receives a complete rewrite with a new production-ready guide. The latency predictor provides model-specific latency estimates that the endpoint picker uses for routing decisions, moving beyond queue-depth heuristics to predictions grounded in actual model serving characteristics. Nightly CI now validates that the predictor is serving real predictions rather than falling back to heuristics โ€” asserting that the optimization path operators configured is actually running.

SGLang support lands in the precise prefix cache routing path. The prefix cache scorer, previously vLLM-only, now works with SGLang backends through an updated well-lit path guide with optimal configuration and benchmark numbers. This makes llm-d's most impactful routing optimization โ€” prefix-cache-aware request placement โ€” engine-agnostic.

Multi-Tier KV Cache and Hybrid Model Supportโ€‹

KV cache management continues to be one of llm-d's deepest areas of investment. v0.8 extends the multi-tier architecture introduced in previous releases with two significant additions.

HMA-aware KV offloading adds support for hybrid model architectures that mix attention mechanisms โ€” full attention, sliding-window, Mamba/SSM, and linear attention โ€” in the same model. These architectures, exemplified by Jamba and emerging hybrid designs, allocate KV cache differently across layer types. The KV cache management layer now understands the Hybrid Memory Allocator (HMA) in vLLM, enabling offloading from GPU HBM to CPU DRAM to persistent storage while respecting the heterogeneous memory layout. For a deep dive into HMA-aware offloading, see the companion blog post on serving hybrid models at scale.

The Mooncake connector adds Alibaba's KV cache transfer engine as a NIXL backend for both P/D disaggregation transfers and tiered offloading. Combined with the existing UCX and UCCL backends, this gives operators three transfer engines to choose from based on their network fabric and deployment constraints. For comprehensive benchmarks comparing these backends across RoCE, InfiniBand, and TCP, see the networking for distributed inference blog post.

Storage events enable reactive tier management โ€” the KV cache layer can respond to storage-level signals (capacity pressure, eviction notifications) rather than relying solely on proactive policies. This is a foundation for more sophisticated tiered caching strategies in future releases.

New Well-Lit Paths and Ecosystem Reachโ€‹

Each release expands the set of tested, documented deployment patterns โ€” what we call well-lit paths. v0.8 adds several that reflect the project's broadening ecosystem.

TensorRT-LLM receives a full recipe and optimized baseline guide. This is the first well-lit path using a non-vLLM inference engine, demonstrating that llm-d's control plane works independently of the model server underneath. The recipe uses trtllm-serve with llm-d's router handling prefix-cache-aware scheduling on top.

DeepSeek-V4 on GB200 NVL72 gets a dedicated deployment guide, running a frontier MoE model on NVIDIA's latest hardware with WideEP expert parallelism. The guide includes API server scaling (bumped from 1 to 4 replicas) and NVSHMEM configuration for the GB200's NVLink domain.

Envoy AI Gateway installation instructions land as a new guide, offering an alternative gateway layer alongside Istio and kgateway/agentgateway. This reflects the project's philosophy of providing pluggable infrastructure components rather than mandating a specific stack.

The Workload Variant Autoscaler (WVA) CRD migrates to the llm-d.ai API group, with improved observability and a replicas rebalancing guide. Infrastructure dependencies update to Istio 1.29.4, kgateway/agentgateway v2.3.3, and Gateway API Inference Extension v1.5.0, with the GAIE Helm charts now published from the llm-d-router OCI registry.

P/D disaggregation guides expand with GKE RDMA/DRA/DRANET configurations for high-performance KV cache transfer, and preflight networking checks that validate RDMA bindings and NIC configuration before vLLM starts serving traffic.

What This Means for Youโ€‹

Platform teams running llm-d in production can now graduate Flow Control and Batch Gateway from experimental flags to production configuration. The shift to upstream vLLM images simplifies the image pipeline โ€” no more tracking llm-d-specific builds for each accelerator platform. The WVA CRD migration to llm-d.ai is a breaking change that requires updating existing autoscaling configurations, but it aligns the resource model with the project's long-term API surface.

ML researchers working on reinforcement learning can now use llm-d's routing intelligence without Kubernetes, bringing prefix-cache-aware scheduling into Slurm-based training loops, where it directly reduces the wall-clock time of RL rollout generation. If your RL pipeline spends time on inference and you want cache-aware routing without re-architecting onto Kubernetes, this is built for you.

Agentic application architects building multi-step inference workflows get native routing support for session affinity, Responses API compatibility, and prefix cache awareness across the steps of an agent chain. Combined with SGLang support in the prefix cache scorer, the agentic routing path is no longer tied to a single inference engine.

What Is Next?โ€‹

The v0.9 milestone targets early August and continues the trajectory this release established.

Agentic inference patterns are a primary focus. v0.8 introduced agentic routing and Responses API support; v0.9 aims to make the agentic story end-to-end โ€” from request classification through multi-step orchestration to session-level observability. The goal is that operators deploying agentic workloads on llm-d get routing intelligence that understands the structure of agent chains, not just individual requests.

RL workflow maturity The initial integration demonstrated feasibility; v0.9 targets usable end-to-end RL workflows with deeper scheduler integration, more sophisticated cache management for rollout patterns, and broader framework support.

Scaling and hardening turns attention to the control plane itself. As llm-d deployments grow โ€” more pods, more models, higher QPS โ€” the router and endpoint picker need to scale horizontally without becoming bottlenecks. v0.9 addresses this directly with EPP horizontal scaling and smarter autoscaling policies.

Continued accelerator and runtime expansion includes broader SGLang CI coverage, AMD ecosystem growth, and ongoing upstream alignment. The direction is clear: llm-d supports every runtime and accelerator that upstream vLLM and SGLang support, with the control plane adapting to each platform's characteristics rather than requiring platform-specific builds.

Community and Contributionโ€‹

Forty-eight new contributors joined the project since v0.6, more than doubling the contributor base added in v0.7. The project's velocity continues to accelerate, with the v0.8 cycle delivering feature graduation, ecosystem expansion, and a clearer architectural identity simultaneously.

To get started with llm-d v0.8:

Follow @llm_d on Twitter/X, llm-d on Bluesky, or llm-d on LinkedIn for updates. Check llm-d.ai/community/events for upcoming community events and llm-d on YouTube for talks and demos. Come build with us.