llm-d: a high-performance and scalable distributed LLM inference framework

llm-d is a well-lit path for anyone to serve large language models at scale, with the fastest time-to-value and competitive performance per dollar for most models across a diverse set of hardware accelerators.