The fastest path to state-of-the-art LLM inference on any accelerator

llm-d is an open-source inference serving stack for Kubernetes. It runs your model server of choice (vLLM, SGLang, and more) across your cluster, turning single-node engines into production-grade distributed inference on the infrastructure you already run. Get state-of-the-art performance for leading open models on NVIDIA, AMD, and custom accelerators.

Built for agentic pipelines, LLMs, multimodal models, and high-throughput serving. Completely engine- and hardware-agnostic.


llm-d is a CNCF Sandbox project, founded by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA.

Well-Lit Paths

In addition to the software components, llm-d provides Well-Lit Paths: tested, benchmarked deployment recipes for common production patterns. These paths are starting points. Adapt them to your models, hardware, and traffic patterns to support agentic, multimodal, and batch workloads.

Each path includes:

Deployable Helm charts and Kustomize manifests
Key configuration knobs for performance tuning
Sample workloads and benchmarks against baseline setups
Monitoring and observability configuration

