Artifacts

This page lists the llm-d release artifacts and dependencies:

  1. GAIE CRDs — the Kubernetes Custom Resource Definitions used by llm-d
  2. llm-d Router — the Helm chart and container images for the routing layer
  3. Model Servers and Extensions — the inference engine images and extensions for advanced functionality
  4. Well-Lit Path Guides — deployment manifests and benchmark scripts for key user stories
  5. Gateway Recipes — optional recipes for installing Gateways and integrating them with llm-d
important

llm-d follows a modular deployment pattern, enabling gradual feature adoption. Users seeking a single CRD-driven deployment should consider KServe's LLMInferenceService.

1. GAIE CRDs

llm-d uses the APIs defined in the Gateway API Inference Extension (GAIE) project:

| CRD | Purpose |
| --- | --- |
| InferencePool | Defines a pool of inference endpoints (model servers) and configures the EPP and proxy for LLM-aware routing. |
| InferenceObjective | Defines performance goals (priority, latency) for specific model workloads within a pool. |
| InferenceModelRewrite | Specifies rules for rewriting model names in request bodies, enabling traffic splitting and canary rollouts. |

Manifests are published at kubernetes-sigs/gateway-api-inference-extension/config/crd and can be installed with:

```bash
export GAIE_VERSION=v1.5.0
kubectl apply -k "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=${GAIE_VERSION}"
```
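
To confirm the installation, list the registered CRDs (exact names and API group vary across GAIE versions):

```bash
# The inference CRDs should appear in the cluster's CRD list.
kubectl get crds | grep -i inference
```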

2. llm-d Router

llm-d Router is deployed via Helm. We publish charts for both Standalone and Gateway Mode:

| Chart | Version | OCI Registry | Description |
| --- | --- | --- | --- |
| Standalone Mode | v1.5.0 | oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone | Deploys an InferencePool and EPP with a standalone Envoy proxy as a sidecar in the EPP pod |
| Gateway Mode | v1.5.0 | oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool | Deploys an InferencePool and EPP for use with an existing Kubernetes Gateway (e.g. Istio, AgentGateway, GKE) |

The charts are currently published by the Gateway API Inference Extension (GAIE) project (see the standalone mode source and gateway mode source). Each well-lit path guide provides values files on top of the chart defaults to enable the functionality implemented in EPP.
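
As a minimal sketch, a Gateway Mode install with a values file from one of the guides might look like the following (the release name and values path are placeholders):

```bash
# Placeholder release name and values file; substitute the guide you follow.
helm install llm-d-pool \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --version v1.5.0 \
  -f guides/<path>/values.yaml
```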

note

In a future release, the Helm Charts will be published from the llm-d project rather than from GAIE.

Images

llm-d releases the core EPP image as well as additional sidecar images for advanced functionality:

| Image | Description | Version |
| --- | --- | --- |
| ghcr.io/llm-d/llm-d-inference-scheduler | Core EPP image | v0.8.0 |
| ghcr.io/llm-d/llm-d-uds-tokenizer | Optional sidecar for EPP, enabling tokenization for precise cache-aware routing | v0.8.0 |
| ghcr.io/llm-d/llm-d-routing-sidecar | Optional sidecar for model servers, enabling KV cache transfer for P/D | v0.8.0 |
| registry.k8s.io/gateway-api-inference-extension/latency-training-server | Optional sidecar for EPP, for predicted-latency model training | v1.5.0 |
| registry.k8s.io/gateway-api-inference-extension/latency-prediction-server | Optional sidecar for EPP, for predicted-latency scheduling | v1.5.0 |
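
For example, the core EPP image can be pulled directly from the registry:

```bash
# Pull the core EPP image at its released tag.
docker pull ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0
```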
note

In a future release, the latency server images will be released from the llm-d/llm-d-latency-predictor repo.

3. Model Servers and Extensions

The llm-d stack supports vLLM and SGLang.

important

llm-d validates each released guide against specific versions of each model server, but llm-d Router relies only on the OpenAI-compatible HTTP API and standard inference engine metrics, so any recent release of either engine should work.
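
As an illustration, a completion request against any compatible engine looks like the following (the service name, port, and model are placeholders):

```bash
# Placeholder service, port, and model; any OpenAI-compatible server applies.
curl -s http://vllm-service:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "example/model", "prompt": "Hello", "max_tokens": 16}'
```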

Upstream Images

We recommend using the upstream images for most guides:

| Engine | Image | Tag |
| --- | --- | --- |
| vLLM | vllm/vllm-openai | v0.19.1 |
| vLLM TPU | vllm/vllm-tpu | v0.18.0 |
| SGLang | lmsysorg/sglang | v0.5.10.post1 |

Custom Images

In addition to the upstream images, llm-d also builds and releases vLLM images with features not yet merged upstream, such as:

  • EFA support for AWS HPC networking
  • GKE IB networking patches
  • DeepEP patches for GB200 support
  • RIXL support on AMD ROCm

| Image | Tag | Accelerator | Base OS | Architectures |
| --- | --- | --- | --- | --- |
| ghcr.io/llm-d/llm-d-cuda | v0.7.0 | NVIDIA GPU | RHEL UBI9 | amd64, arm64 |
| ghcr.io/llm-d/llm-d-cuda-gb200 | v0.7.0 | NVIDIA GPU | RHEL UBI9 | amd64, arm64 |
| ghcr.io/llm-d/llm-d-aws | v0.7.0 | NVIDIA GPU + EFA | RHEL UBI9 | amd64, arm64 |
| ghcr.io/llm-d/llm-d-rocm | v0.7.0 | AMD ROCm | RHEL UBI9 | amd64 |
| ghcr.io/llm-d/llm-d-xpu | v0.7.0 | Intel XPU | Ubuntu 24.04 | amd64 |
| ghcr.io/llm-d/llm-d-hpu | v0.7.0 | Intel Gaudi HPU | Ubuntu 22.04 | amd64 |
| ghcr.io/llm-d/llm-d-cpu | v0.7.0 | CPU | RHEL UBI9 | amd64 |

FS Offloading Extension

llmd-fs-connector adds filesystem offloading to vLLM's OffloadingConnector. It is released from llm-d-kv-cache as a Python wheel and hosted on the PyPI-compatible index at https://llm-d.github.io/llm-d-kv-cache/simple/builds.
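
A minimal install sketch (the wheel name is assumed to match the connector name above; confirm it against the index listing):

```bash
# Assumed package name; verify against the index listing before use.
pip install llmd-fs-connector \
  --extra-index-url https://llm-d.github.io/llm-d-kv-cache/simple/builds
```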

4. Well-Lit Path Guides

Well-Lit Paths are tested, benchmarked deployment recipes that showcase llm-d's key user stories. Each guide lives under guides/<path>/ and contains:

  • EPP Configurations — Helm values files with the EPP configuration needed to use the llm-d Router charts.
  • Model Server Manifests — Kustomize manifests for the model servers, with the labels and flags required by llm-d Router.
important

For some guides, we provide cloud-provider-specific settings. This is especially important for guides requiring IB or RoCE networking, whose configuration is not yet standardized across providers. Users can adapt the examples to other platforms as needed.

See the full list of guides for more details.
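
For instance, applying a guide's model server manifests typically reduces to a single Kustomize invocation (the path is a placeholder; each guide's README gives the exact steps):

```bash
# Placeholder guide path; consult the guide's README for the exact commands.
kubectl apply -k guides/<path>/
```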

5. Gateway Recipes

llm-d Router supports optional integration with Kubernetes Gateways. These are the versions we test against for the v0.7.0 release:

| Dependency | Tested Versions | Notes |
| --- | --- | --- |
| Gateway API CRDs | v1.5.x | Kubernetes SIG (required if using a Gateway) |
| Istio | 1.29.x | Default gateway provider |
| AgentGateway | v1.0.x | Preferred for new deployments |
| kgateway | v2.2.x | Deprecated — will be removed in the next release |

Install instructions live under guides/recipes/gateway/.
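
As an illustration only, a minimal Gateway using Istio's gateway class looks like the following; the recipes carry the provider-specific settings:

```bash
# Minimal illustrative Gateway; adjust gatewayClassName and listeners
# to your provider per the recipes above.
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-d-gateway
spec:
  gatewayClassName: istio
  listeners:
  - name: http
    port: 80
    protocol: HTTP
EOF
```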

Source Repositories

Core Libraries

| Repository | Language | Description |
| --- | --- | --- |
| llm-d/llm-d | | Main repo: docs, Dockerfiles, guides, CI |
| llm-d/llm-d-inference-scheduler | Go | EPP routing engine and P/D sidecar |
| llm-d/llm-d-latency-predictor | Python | XGBoost training and prediction server |
| llm-d/llm-d-kv-cache | Go, Python, C++ | KV-cache block locality indexer, FS offloading |
| llm-d/llm-d-workload-variant-autoscaler | Go | SLO-aware workload autoscaler |
| llm-d-incubation/llm-d-async | Go | Asynchronous request processor for latency-insensitive traffic |
| llm-d-incubation/batch-gateway | Go | OpenAI-compatible API for submitting, tracking, and managing batch inference jobs |

Supporting Libraries

| Repository | Language | Description |
| --- | --- | --- |
| llm-d/llm-d-benchmark | Python | Benchmarking framework |
| llm-d/llm-d-inference-sim | Go | GPU-free vLLM simulator |