Artifacts

This page lists the llm-d release artifacts and dependencies:

  1. GAIE CRDs — the Kubernetes Custom Resource Definitions used by llm-d
  2. llm-d Router — the Helm chart and container images for the routing layer
  3. Model Servers and Extensions — the inference engine images and extensions for advanced functionality
  4. Well-Lit Path Guides — deployment manifests and benchmark scripts for key user stories
  5. Gateway Recipes — optional recipes for installing Gateways and integrating them with llm-d
important

llm-d follows a modular deployment pattern, enabling gradual feature adoption. Users seeking a single CRD-driven deployment should consider KServe's LLMInferenceService.

1. GAIE CRDs

llm-d uses the APIs defined in the Gateway API Inference Extension (GAIE) project:

| CRD | Purpose |
| --- | --- |
| InferencePool | Defines a pool of inference endpoints (model servers) and configures the EPP and proxy for LLM-aware routing. |
| InferenceObjective | Defines performance goals (priority, latency) for specific model workloads within a pool. |
| InferenceModelRewrite | Specifies rules for rewriting model names in request bodies, enabling traffic splitting and canary rollouts. |

Manifests are published at kubernetes-sigs/gateway-api-inference-extension/config/crd and can be installed with:

```bash
export GAIE_VERSION=v1.5.0
kubectl apply -k "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=${GAIE_VERSION}"
```
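
To confirm the installation, list the registered CRDs (exact names and API group vary across GAIE versions):

```bash
# The inference CRDs should appear in the cluster's CRD list.
kubectl get crds | grep -i inference
```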

2. llm-d Router

llm-d Router is deployed via Helm. We publish charts for both Standalone and Gateway Mode:

| Chart | Version | OCI Registry | Description |
| --- | --- | --- | --- |
| Standalone Mode | v1.5.0 | oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone | Deploys an InferencePool and EPP with a standalone Envoy proxy as a sidecar in the EPP pod |
| Gateway Mode | v1.5.0 | oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool | Deploys an InferencePool and EPP for use with an existing Kubernetes Gateway (e.g. Istio, AgentGateway, GKE) |

The charts are currently published by the Gateway API Inference Extension (GAIE) project (see the standalone mode source and gateway mode source). Each well-lit path guide provides values files on top of the chart defaults to enable the functionality implemented in EPP.
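
As a minimal sketch, a Gateway Mode install with a values file from one of the guides might look like the following (the release name and values path are placeholders):

```bash
# Placeholder release name and values file; substitute the guide you follow.
helm install llm-d-pool \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --version v1.5.0 \
  -f guides/<path>/values.yaml
```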

note

In a future release, the Helm Charts will be published from the llm-d project rather than from GAIE.

Images

llm-d releases the core EPP image as well as additional sidecar images for advanced functionality:

| Image | Description | Version |
| --- | --- | --- |
| ghcr.io/llm-d/llm-d-inference-scheduler | Core EPP image | v0.8.0 |
| ghcr.io/llm-d/llm-d-uds-tokenizer | Optional sidecar for EPP, enabling tokenization for precise cache-aware routing | v0.8.0 |
| ghcr.io/llm-d/llm-d-routing-sidecar | Optional sidecar for model servers, enabling KV cache transfer for P/D | v0.8.0 |
| registry.k8s.io/gateway-api-inference-extension/latency-training-server | Optional sidecar for EPP, for predicted-latency model training | v1.5.0 |
| registry.k8s.io/gateway-api-inference-extension/latency-prediction-server | Optional sidecar for EPP, for predicted-latency scheduling | v1.5.0 |
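
For example, the core EPP image can be pulled directly from the registry:

```bash
# Pull the core EPP image at its released tag.
docker pull ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0
```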
note

In a future release, the latency server images will be released from the llm-d/llm-d-latency-predictor repo.

3. Model Servers and Extensions

The llm-d stack supports vLLM and SGLang.

important

llm-d validates each released guide against specific versions of each model server, but llm-d Router relies only on the OpenAI-compatible HTTP API and standard inference engine metrics, so any recent release of either engine should work.
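
As an illustration, a completion request against any compatible engine looks like the following (the service name, port, and model are placeholders):

```bash
# Placeholder service, port, and model; any OpenAI-compatible server applies.
curl -s http://vllm-service:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "example/model", "prompt": "Hello", "max_tokens": 16}'
```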

Upstream Images

We recommend using the upstream images for most guides:

| Engine | Image | Tag |
| --- | --- | --- |
| vLLM | vllm/vllm-openai | v0.19.1 |
| vLLM TPU | vllm/vllm-tpu | v0.18.0 |
| SGLang | lmsysorg/sglang | v0.5.10.post1 |

Custom Images

In addition to the upstream images, llm-d also builds and releases vLLM images with features not yet merged upstream, such as:

  • EFA support for AWS HPC networking
  • GKE IB networking patches
  • DeepEP patches for GB200 support
  • RIXL support on AMD ROCm

| Image | Tag | Accelerator | Base OS | Architectures |
| --- | --- | --- | --- | --- |
| ghcr.io/llm-d/llm-d-cuda | v0.7.0 | NVIDIA GPU | RHEL UBI9 | amd64, arm64 |
| ghcr.io/llm-d/llm-d-cuda-gb200 | v0.7.0 | NVIDIA GPU | RHEL UBI9 | amd64, arm64 |
| ghcr.io/llm-d/llm-d-aws | v0.7.0 | NVIDIA GPU + EFA | RHEL UBI9 | amd64, arm64 |
| ghcr.io/llm-d/llm-d-rocm | v0.7.0 | AMD ROCm | RHEL UBI9 | amd64 |
| ghcr.io/llm-d/llm-d-xpu | v0.7.0 | Intel XPU | Ubuntu 24.04 | amd64 |
| ghcr.io/llm-d/llm-d-hpu | v0.7.0 | Intel Gaudi HPU | Ubuntu 22.04 | amd64 |
| ghcr.io/llm-d/llm-d-cpu | v0.7.0 | CPU | RHEL UBI9 | amd64 |

FS Offloading Extension

llmd-fs-connector adds filesystem offloading to vLLM's OffloadingConnector. It is released from llm-d-kv-cache as a Python wheel and hosted on the PyPI-compatible index at https://llm-d.github.io/llm-d-kv-cache/simple/builds.
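
A minimal install sketch (the wheel name is assumed to match the connector name above; confirm it against the index listing):

```bash
# Assumed package name; verify against the index listing before use.
pip install llmd-fs-connector \
  --extra-index-url https://llm-d.github.io/llm-d-kv-cache/simple/builds
```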

4. Well-Lit Path Guides

Well-Lit Paths are tested, benchmarked deployment recipes that showcase llm-d's key user stories. Each guide lives under guides/<path>/ and contains:

  • EPP Configurations — Helm values files with the EPP configuration needed to use the llm-d Router charts.
  • Model Server Manifests — Kustomize manifests for the model servers, with the labels and flags required by llm-d Router.
important

For some guides, we provide cloud-provider-specific settings. This is especially important for guides requiring IB or RoCE networking, whose configuration is not yet standardized across providers. Users can adapt the examples to other platforms as needed.

See the full list of guides for more details.
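
For instance, applying a guide's model server manifests typically reduces to a single Kustomize invocation (the path is a placeholder; each guide's README gives the exact steps):

```bash
# Placeholder guide path; consult the guide's README for the exact commands.
kubectl apply -k guides/<path>/
```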

5. Gateway Recipes

llm-d Router supports optional integration with Kubernetes Gateways. These are the versions we test against for the v0.7.0 release:

| Dependency | Tested Versions | Notes |
| --- | --- | --- |
| Gateway API CRDs | v1.5.x | Kubernetes SIG (required if using a Gateway) |
| Istio | 1.29.x | Default gateway provider |
| AgentGateway | v1.0.x | Preferred for new deployments |
| kgateway | v2.2.x | Deprecated — will be removed in the next release |

Install instructions live under guides/recipes/gateway/.
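
As an illustration only, a minimal Gateway using Istio's gateway class looks like the following; the recipes carry the provider-specific settings:

```bash
# Minimal illustrative Gateway; adjust gatewayClassName and listeners
# to your provider per the recipes above.
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-d-gateway
spec:
  gatewayClassName: istio
  listeners:
  - name: http
    port: 80
    protocol: HTTP
EOF
```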

Source Repositories

Core Libraries

| Repository | Language | Description |
| --- | --- | --- |
| llm-d/llm-d | | Main repo: docs, Dockerfiles, guides, CI |
| llm-d/llm-d-inference-scheduler | Go | EPP routing engine and P/D sidecar |
| llm-d/llm-d-latency-predictor | Python | XGBoost training and prediction server |
| llm-d/llm-d-kv-cache | Go, Python, C++ | KV-cache block locality indexer, FS offloading |
| llm-d/llm-d-workload-variant-autoscaler | Go | SLO-aware workload autoscaler |
| llm-d-incubation/llm-d-async | Go | Asynchronous request processor for latency-insensitive traffic |
| llm-d-incubation/batch-gateway | Go | OpenAI-compatible API for submitting, tracking, and managing batch inference jobs |

Supporting Libraries

| Repository | Language | Description |
| --- | --- | --- |
| llm-d/llm-d-benchmark | Python | Benchmarking framework |
| llm-d/llm-d-inference-sim | Go | GPU-free vLLM simulator |