The fastest path to state-of-the-art LLM inference on any accelerator
llm-d is an open-source inference serving stack for Kubernetes. It runs your model server of choice (vLLM, SGLang, and more) across your cluster, turning single-node engines into production-grade distributed inference on the infrastructure you already run. Get state-of-the-art performance for leading open models on NVIDIA, AMD, and custom accelerators.
Built for agentic pipelines, LLMs, multimodal models, and high-throughput serving. Completely engine- and hardware-agnostic.
Well-Lit Paths
In addition to the software components, llm-d provides Well-Lit Paths: tested, benchmarked deployment recipes for common production patterns. These paths are starting points. Adapt them to your models, hardware, and traffic patterns to support agentic, multimodal, and batch workloads.
Each path includes:
- Deployable Helm charts and Kustomize manifests
- Key configuration knobs for performance tuning
- Sample workloads and benchmarks against baseline setups
- Monitoring and observability configuration