
4 posts tagged with "Announcements"

Announcements that aren't news releases


llm-d 0.5: Sustaining Performance at Scale

· 13 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

In our previous release (v0.4), we focused on improving the end-to-end latency of production inference, introducing speculative decoding and extending prefill/decode disaggregation across a broader set of accelerator architectures. That work established llm-d’s ability to deliver state-of-the-art latency along the critical serving path. Sustaining low latency increasingly depends on how KV-cache pressure is handled once GPU memory is saturated, whether cached state can be reused across replicas instead of being repeatedly rebuilt, and how requests are routed when workloads mix adapters, models, and availability requirements.

With v0.5, llm-d expands its focus from peak performance to the operational rigor required to sustain performance at scale. This release prioritizes reproducibility, resilience, and cost efficiency, with concrete improvements across the following areas:

  1. Developer experience and reproducibility: We have simplified the benchmarking workflow with dedicated, in-guide benchmark support, allowing users to validate each “well-lit path” with a single command.
  2. Hierarchical KV Offloading: A new storage architecture decouples cache capacity from GPU memory through native CPU and filesystem tiers.
  3. Advanced Scheduling: Cache-aware routing now supports LoRA adapters and active-active high availability.
  4. Resilient Networking: A new transport backend (UCCL) improves stability in congested networks.
  5. Autoscaling Updates: We have introduced scale-to-zero capabilities for cost-efficient intermittent workloads.
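To illustrate the idea behind the hierarchical KV offloading in (2), here is a minimal, self-contained sketch of a tiered cache in Python. All names here are hypothetical stand-ins, not llm-d’s actual storage API: ordinary dicts model the GPU and CPU tiers, and pickle files on disk model the filesystem tier.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class TieredKVCache:
    """Illustrative three-tier KV cache: a small "GPU" tier spills to a
    larger "CPU" tier, which in turn spills to a filesystem tier, so
    cache capacity is no longer bounded by GPU memory alone."""

    def __init__(self, gpu_capacity=2, cpu_capacity=4, spill_dir=None):
        self.gpu = OrderedDict()   # hottest entries (stands in for GPU memory)
        self.cpu = OrderedDict()   # warm entries (stands in for host memory)
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity
        self.spill_dir = spill_dir or tempfile.mkdtemp()

    def _path(self, key):
        return os.path.join(self.spill_dir, f"{key}.kv")

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:
            k, v = self.gpu.popitem(last=False)   # evict LRU entry to CPU tier
            self.cpu[k] = v
        while len(self.cpu) > self.cpu_capacity:
            k, v = self.cpu.popitem(last=False)   # spill LRU entry to filesystem
            with open(self._path(k), "wb") as f:
                pickle.dump(v, f)

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.cpu:                       # promote CPU hit back to GPU
            value = self.cpu.pop(key)
            self.put(key, value)
            return value
        path = self._path(key)
        if os.path.exists(path):                  # reload from filesystem tier
            with open(path, "rb") as f:
                value = pickle.load(f)
            os.remove(path)
            self.put(key, value)
            return value
        return None                               # full miss: prefill must recompute
```

A hit in any tier avoids rebuilding the KV state from scratch; only a miss in all three falls back to recomputing prefill.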

llm-d 0.4: Achieve SOTA Performance Across Accelerators

· 10 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

llm-d’s mission is to provide the fastest time to SOTA inference performance across any accelerator and cloud. In our 0.3 release we enabled wide expert parallelism for large mixture-of-experts models to provide extremely high output token throughput - a key enabler for reinforcement learning - and we added preliminary support for multiple non-GPU accelerator families.

This release brings the complement to expert-parallel throughput: improving end-to-end request latency in production serving. We reduce DeepSeek per-token latency by up to 50% with speculative decoding and vLLM optimizations for latency-critical workloads. We add dynamic disaggregated serving support to Google TPU and Intel XPU to further reduce time-to-first-token latency when traffic is unpredictable, while our new well-lit path for prefix cache offloading helps you leverage CPU memory and high-performance remote storage to increase hit rates and reduce tail latency. For users with multiple model deployments, our workload autoscaler preview takes real-time server capacity and traffic into account to reduce the time a model deployment spends queuing requests, lessening the operational toil of running multiple models over constrained accelerator capacity.

These OSS inference stack optimizations, surfaced through our well-lit paths, ensure you reach SOTA latency on frontier OSS models in real-world scenarios.
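As background on the speculative decoding mentioned above, the draft-and-verify loop can be sketched in a few lines. This is a toy greedy version with hypothetical callables, not vLLM’s implementation; in particular, a real system verifies all k draft tokens in a single batched target forward pass, whereas the loop below checks them one at a time purely for clarity.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=16):
    """Toy greedy speculative decoding.

    `draft_next` (a cheap model) proposes k tokens at a time;
    `target_next` (the expensive model) verifies them. Both are
    stand-in callables mapping a token sequence to the next token.
    Every accepted draft token is a target decode step saved.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. The target model verifies; keep the longest agreeing prefix.
        #    (A real implementation scores all k positions in one batched
        #    forward pass; the per-token loop here is illustrative only.)
        accepted = []
        for t in draft:
            expected = target_next(out + accepted)
            if t == expected:
                accepted.append(t)         # draft token accepted
            else:
                accepted.append(expected)  # target's token replaces the mismatch
                break
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new]
```

The output is identical to greedy decoding with the target model alone; the win is wall-clock latency, since agreeing draft tokens are confirmed in bulk rather than generated one at a time.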

llm-d 0.3: Wider Well-Lit Paths for Scalable Inference

· 10 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

In our 0.2 release, we introduced the first well-lit paths, tested blueprints for scaling inference on Kubernetes. With our 0.3 release, we double down on that mission: providing a fast path to deploying high-performance, hardware-agnostic, easy-to-operationalize inference at scale.

This release delivers:

  • Expanded hardware support, now including Google TPU and Intel accelerators
  • TCP and RDMA over RoCE validated for disaggregation
  • A predicted latency based balancing preview that improves P90 latency by up to 3x in long-prefill workloads
  • Wide expert parallel (EP) scaling to 2.2k tokens per second per H200 GPU
  • The GA release of the Inference Gateway (IGW v1.0)

Taken together, these results redefine the operating envelope for inference. llm-d enables clusters to run hotter before scaling out, extract more value from each GPU, and still meet strict latency objectives. The result is a control plane built not just for speed, but for predictable, cost-efficient scale.
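To give a flavor of what predicted-latency-based balancing does, the scorer below routes each request to the replica with the lowest estimated latency, derived from queue depth and prefix-cache state. The names and the crude latency model are illustrative assumptions only, not llm-d’s actual predictor.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queued_tokens: int      # tokens waiting in this replica's queue
    tokens_per_sec: float   # observed decode throughput
    prefix_hit: bool        # would this request hit the replica's prefix cache?

def predicted_latency(r, prompt_tokens, prefill_tps=5000.0):
    """Crude latency estimate in seconds: time to drain the queue, plus
    prefill time (skipped entirely on a prefix-cache hit). A production
    predictor would be far richer; this captures only the key trade-off."""
    queue_s = r.queued_tokens / r.tokens_per_sec
    prefill_s = 0.0 if r.prefix_hit else prompt_tokens / prefill_tps
    return queue_s + prefill_s

def pick_replica(replicas, prompt_tokens):
    # Route to whichever replica is predicted to finish soonest.
    return min(replicas, key=lambda r: predicted_latency(r, prompt_tokens))
```

For long-prefill workloads the prefill term dominates, which is why a predictor that weighs cache hits against queue depth can improve tail latency so sharply compared with round-robin routing.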

llm-d 0.2: Our first well-lit paths (mind the tree roots!)

· 11 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

Our 0.2 release delivers progress along our three well-lit paths to accelerate deploying large-scale inference on Kubernetes - better load balancing, lower latency with disaggregation, and native vLLM support for very large Mixture-of-Experts models like DeepSeek-R1.

We’ve also enhanced our deployment and benchmarking tooling, incorporating lessons from real-world infrastructure deployments and addressing key antipatterns. This release gives llm-d users, contributors, researchers, and operators clearer guides for efficient use in tested, reproducible scenarios.