llm-d 0.3: Wider Well-Lit Paths for Scalable Inference
In our 0.2 release, we introduced the first well-lit paths: tested blueprints for scaling inference on Kubernetes. With our 0.3 release, we double down on that mission: providing a fast path to deploying inference that is high-performance, hardware-agnostic, easy to operationalize, and built for scale.
This release delivers:
- Expanded hardware support, now including Google TPU and Intel hardware
- TCP and RDMA over RoCE validated for disaggregation
- A preview of predicted-latency-based balancing that improves P90 latency by up to 3x on long-prefill workloads (see the sketch after this list)
- Wide expert parallelism (EP) scaling to 2.2k tokens per second per H200 GPU
- The GA release of the Inference Gateway (IGW) v1.0
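
To make the idea behind predicted-latency-based balancing concrete, here is a minimal sketch of the general technique, not llm-d's actual scheduler: score each replica by an estimated time-to-last-token for the incoming request and route to the lowest score. The `Replica`, `predict_latency`, and `pick_replica` names and the simple throughput model are illustrative assumptions.

```python
# Illustrative sketch only -- NOT llm-d's implementation.
# Idea: estimate how long a request would take on each replica
# (queue drain + prefill + decode) and route to the cheapest one.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queued_tokens: int            # tokens already waiting in this replica's queue
    prefill_tokens_per_s: float   # observed prefill throughput
    decode_tokens_per_s: float    # observed decode throughput

def predict_latency(r: Replica, prompt_tokens: int, max_new_tokens: int) -> float:
    """Rough time-to-last-token estimate for one request on replica r."""
    queue_s = r.queued_tokens / r.prefill_tokens_per_s
    prefill_s = prompt_tokens / r.prefill_tokens_per_s
    decode_s = max_new_tokens / r.decode_tokens_per_s
    return queue_s + prefill_s + decode_s

def pick_replica(replicas, prompt_tokens, max_new_tokens):
    # Unlike round-robin or least-connections, this accounts for how
    # expensive a long prefill will be on an already-loaded replica.
    return min(replicas, key=lambda r: predict_latency(r, prompt_tokens, max_new_tokens))

if __name__ == "__main__":
    fleet = [
        Replica("pod-a", queued_tokens=8000, prefill_tokens_per_s=20000, decode_tokens_per_s=1500),
        Replica("pod-b", queued_tokens=500,  prefill_tokens_per_s=20000, decode_tokens_per_s=1500),
    ]
    # A long-prefill request lands on the replica with the shorter predicted wait.
    print(pick_replica(fleet, prompt_tokens=12000, max_new_tokens=256).name)  # -> pod-b
```

The point of the sketch is the scoring signal: by predicting per-request latency instead of counting connections, long-prefill requests stop piling onto already-busy replicas, which is where the P90 improvement comes from.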
Taken together, these results redefine the operating envelope for inference. llm-d lets clusters run hotter before scaling out, extracting more value from each GPU while still meeting strict latency objectives. The result is a control plane built not just for speed, but for predictable, cost-efficient scale.