
4 posts tagged with "Announcements"

Announcements that aren't news releases


llm-d 0.5: Sustaining Performance at Scale

· 13 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

In our previous release (v0.4), we focused on improving the end-to-end latency of production inference, introducing speculative decoding and extending prefill/decode disaggregation across a broader set of accelerator architectures. That work established llm-d’s ability to deliver state-of-the-art latency along the critical serving path. Sustaining low latency increasingly depends on how KV-cache pressure is handled once GPU memory is saturated, whether cached state can be reused across replicas instead of being repeatedly rebuilt, and how requests are routed when workloads mix adapters, models, and availability requirements.

With v0.5, llm-d expands its focus from peak performance to the operational rigor required to sustain performance at scale. This release prioritizes reproducibility, resilience, and cost efficiency, with concrete improvements across the following areas:

  1. Developer experience and reproducibility: We have simplified the benchmarking workflow with dedicated, in-guide benchmark support, allowing users to validate each “well-lit path” with a single command.
  2. Hierarchical KV Offloading: A new storage architecture decouples cache capacity from GPU memory through native CPU and filesystem tiers.
  3. Advanced Scheduling: Cache-aware routing now supports LoRA adapters and active-active high availability.
  4. Resilient Networking: A new transport backend (UCCL) improves stability in congested networks.
  5. Autoscaling Updates: We have introduced scale-to-zero capabilities for cost-efficient intermittent workloads.
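To illustrate the idea behind the hierarchical KV offloading in (2), here is a minimal, self-contained sketch of a tiered cache in Python. All names here are hypothetical stand-ins, not llm-d’s actual storage API: ordinary dicts model the GPU and CPU tiers, and pickle files on disk model the filesystem tier.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class TieredKVCache:
    """Illustrative three-tier KV cache: a small "GPU" tier spills to a
    larger "CPU" tier, which in turn spills to a filesystem tier, so
    cache capacity is no longer bounded by GPU memory alone."""

    def __init__(self, gpu_capacity=2, cpu_capacity=4, spill_dir=None):
        self.gpu = OrderedDict()   # hottest entries (stands in for GPU memory)
        self.cpu = OrderedDict()   # warm entries (stands in for host memory)
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity
        self.spill_dir = spill_dir or tempfile.mkdtemp()

    def _path(self, key):
        return os.path.join(self.spill_dir, f"{key}.kv")

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:
            k, v = self.gpu.popitem(last=False)   # evict LRU entry to CPU tier
            self.cpu[k] = v
        while len(self.cpu) > self.cpu_capacity:
            k, v = self.cpu.popitem(last=False)   # spill LRU entry to filesystem
            with open(self._path(k), "wb") as f:
                pickle.dump(v, f)

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.cpu:                       # promote CPU hit back to GPU
            value = self.cpu.pop(key)
            self.put(key, value)
            return value
        path = self._path(key)
        if os.path.exists(path):                  # reload from filesystem tier
            with open(path, "rb") as f:
                value = pickle.load(f)
            os.remove(path)
            self.put(key, value)
            return value
        return None                               # full miss: prefill must recompute
```

A hit in any tier avoids rebuilding the KV state from scratch; only a miss in all three falls back to recomputing prefill.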

llm-d 0.4: Achieve SOTA Performance Across Accelerators

· 10 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

llm-d’s mission is to provide the fastest time to SOTA inference performance across any accelerator and cloud. In our 0.3 release we enabled wide expert parallelism for large mixture-of-experts models to provide extremely high output token throughput - a key enabler for reinforcement learning - and we added preliminary support for multiple non-GPU accelerator families.

This release brings the complement to expert-parallel throughput: improving end-to-end request latency in production serving. We reduce DeepSeek per-token latency by up to 50% with speculative decoding and vLLM optimizations for latency-critical workloads. We add dynamic disaggregated serving support to Google TPU and Intel XPU to further reduce time-to-first-token latency when traffic is unpredictable, while our new well-lit path for prefix cache offloading helps you leverage CPU memory and high-performance remote storage to increase hit rates and reduce tail latency. For users with multiple model deployments, our workload autoscaler preview takes real-time server capacity and traffic into account to reduce the time a model deployment spends queuing requests, lessening the operational toil of running multiple models over constrained accelerator capacity.

These OSS inference stack optimizations, surfaced through our well-lit paths, ensure you reach SOTA latency on frontier OSS models in real-world scenarios.
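As background on the speculative decoding mentioned above, the draft-and-verify loop can be sketched in a few lines. This is a toy greedy version with hypothetical callables, not vLLM’s implementation; in particular, a real system verifies all k draft tokens in a single batched target forward pass, whereas the loop below checks them one at a time purely for clarity.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=16):
    """Toy greedy speculative decoding.

    `draft_next` (a cheap model) proposes k tokens at a time;
    `target_next` (the expensive model) verifies them. Both are
    stand-in callables mapping a token sequence to the next token.
    Every accepted draft token is a target decode step saved.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. The target model verifies; keep the longest agreeing prefix.
        #    (A real implementation scores all k positions in one batched
        #    forward pass; the per-token loop here is illustrative only.)
        accepted = []
        for t in draft:
            expected = target_next(out + accepted)
            if t == expected:
                accepted.append(t)         # draft token accepted
            else:
                accepted.append(expected)  # target's token replaces the mismatch
                break
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new]
```

The output is identical to greedy decoding with the target model alone; the win is wall-clock latency, since agreeing draft tokens are confirmed in bulk rather than generated one at a time.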

llm-d 0.3: Wider Well-Lit Paths for Scalable Inference

· 10 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

In our 0.2 release, we introduced the first well-lit paths, tested blueprints for scaling inference on Kubernetes. With our 0.3 release, we double down on that mission: providing a fast path to deploying high-performance, hardware-agnostic, easy-to-operationalize inference at scale.

This release delivers:

  • Expanded hardware support, now including Google TPU and Intel accelerators
  • TCP and RDMA over RoCE validated for disaggregation
  • A predicted latency based balancing preview that improves P90 latency by up to 3x in long-prefill workloads
  • Wide expert parallel (EP) scaling to 2.2k tokens per second per H200 GPU
  • The GA release of the Inference Gateway (IGW v1.0)

Taken together, these results redefine the operating envelope for inference. llm-d enables clusters to run hotter before scaling out, extract more value from each GPU, and still meet strict latency objectives. The result is a control plane built not just for speed, but for predictable, cost-efficient scale.
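To give a flavor of what predicted-latency-based balancing does, the scorer below routes each request to the replica with the lowest estimated latency, derived from queue depth and prefix-cache state. The names and the crude latency model are illustrative assumptions only, not llm-d’s actual predictor.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queued_tokens: int      # tokens waiting in this replica's queue
    tokens_per_sec: float   # observed decode throughput
    prefix_hit: bool        # would this request hit the replica's prefix cache?

def predicted_latency(r, prompt_tokens, prefill_tps=5000.0):
    """Crude latency estimate in seconds: time to drain the queue, plus
    prefill time (skipped entirely on a prefix-cache hit). A production
    predictor would be far richer; this captures only the key trade-off."""
    queue_s = r.queued_tokens / r.tokens_per_sec
    prefill_s = 0.0 if r.prefix_hit else prompt_tokens / prefill_tps
    return queue_s + prefill_s

def pick_replica(replicas, prompt_tokens):
    # Route to whichever replica is predicted to finish soonest.
    return min(replicas, key=lambda r: predicted_latency(r, prompt_tokens))
```

For long-prefill workloads the prefill term dominates, which is why a predictor that weighs cache hits against queue depth can improve tail latency so sharply compared with round-robin routing.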

llm-d 0.2: Our first well-lit paths (mind the tree roots!)

· 11 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

Our 0.2 release delivers progress along our three well-lit paths to accelerate deploying large-scale inference on Kubernetes - better load balancing, lower latency with disaggregation, and native vLLM support for very large Mixture-of-Experts models like DeepSeek-R1.

We’ve also enhanced our deployment and benchmarking tooling, incorporating lessons from real-world infrastructure deployments and addressing key antipatterns. This release gives llm-d users, contributors, researchers, and operators clearer guides for efficient use in tested, reproducible scenarios.