Precise Prefix Cache Routing

Overview

This guide routes requests on precise per-pod KV-cache state rather than request-traffic heuristics. Each vLLM pod publishes KV-cache events over ZMQ; the router subscribes, builds an index keyed by block hash, and scores candidate pods by the fraction of an incoming request's prefix that is already resident.

Two scorers make up the routing decision alongside the load-aware stack:

Default Configuration

| Parameter | Value |
| --- | --- |
| Model | Qwen/Qwen3-32B |
| Replicas | 8 (reduce for smaller fleets; see notes below) |
| Tensor Parallelism | 2 |
| GPUs per replica | 2 |
| Total GPUs | 16 |
| vLLM --block-size | 64 (must match scorer tokenProcessorConfig.blockSize) |

Supported Hardware Backends

| Backend | Directory | Default model | Notes |
| --- | --- | --- | --- |
| NVIDIA GPU | modelserver/gpu/vllm/ | Qwen/Qwen3-32B | Default configuration |
| AMD GPU | modelserver/amd/vllm/ | Qwen/Qwen3-32B | AMD GPU |
| Intel XPU | modelserver/xpu/vllm/ | Qwen/Qwen3-0.6B | CI-sized; update router modelName for real use |
| Intel Gaudi (HPU) | modelserver/hpu/vllm/ | Qwen/Qwen3-8B | --block-size=128; update scorer blockSize to match |
| Google TPU v6e | modelserver/tpu-v6/vllm/ | Llama-3.1-70B-Instruct | GKE TPU |
| Google TPU v7 | modelserver/tpu-v7/vllm/ | Qwen3-Coder-480B-FP8 | GKE TPU |
| CPU | modelserver/cpu/vllm/ | Llama-3.2-3B-Instruct | CI-sized |

note

Some hardware variants use reduced configurations (fewer replicas, smaller models) to enable CI testing for compatibility and regression checks.

note

For precise prefix cache scoring to match reality, the tokenizer modelName and the scorer's indexerConfig.tokenizersPoolConfig.modelName in router/precise-prefix-cache-routing.values.yaml must both match the model the overlay deploys. The HPU overlay, and any overlay that changes --block-size, also require updating tokenProcessorConfig.blockSize on the router side.
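
The block-size requirement can be checked mechanically before deploying. The sketch below is illustrative only: block_size_from_args and block_sizes_match are hypothetical helpers, not part of the repository, and they assume the vLLM arguments are available as a single string with the flag written as --block-size=N or --block-size N.

```shell
# Hypothetical helper: extract the value of --block-size from a vLLM
# argument string. Prints nothing if the flag is absent.
block_size_from_args() {
  printf '%s\n' "$1" | sed -n 's/.*--block-size[= ]\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed only if the vLLM block size ($1, argument
# string) equals the scorer's tokenProcessorConfig.blockSize ($2).
block_sizes_match() {
  [ "$(block_size_from_args "$1")" = "$2" ]
}

# Example: the HPU overlay uses 128, so a scorer blockSize of 64 is a mismatch.
if ! block_sizes_match "vllm serve Qwen/Qwen3-8B --block-size=128" 64; then
  echo "block size mismatch: update tokenProcessorConfig.blockSize" >&2
fi
```

A mismatch does not stop the router from starting; per the note above, it means the scorer's view of the cache no longer matches what the pods actually hold.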

note

The gpu/vllm/ overlay defaults to 8 replicas to match the canonical 16×H100 benchmark. For smaller fleets (or quick smoke tests), reduce replicas in the deployment patch (modelserver/gpu/vllm/patch-vllm.yaml) before applying.

note

The router runs in active-active HA by default — two replicas behind one Service, each subscribing to every vLLM pod via pod-discovery so both indexes converge. Scale to a single replica with --set inferenceExtension.replicas=1 if HA isn't needed (small fleets, smoke tests).

Prerequisites

  • Have the proper client tools installed on your local system to use this guide.

  • Check out the llm-d repo:

    export branch="main" # branch, tag, or commit hash
    git clone https://github.com/llm-d/llm-d.git && cd llm-d && git checkout ${branch}
  • Set the following environment variables:

    export GAIE_VERSION=v1.5.0
    export GUIDE_NAME="precise-prefix-cache-routing"
    export NAMESPACE=llm-d-precise
  • Install the Gateway API Inference Extension CRDs:

    kubectl apply -k "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=${GAIE_VERSION}"
  • Create a target namespace for the installation

    kubectl create namespace ${NAMESPACE}
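
Later steps fail in confusing ways when one of these variables is empty (for example, a secret created with an empty HF_TOKEN). A minimal preflight sketch follows; missing_vars is a hypothetical helper rather than something the guide ships, and the list of variables to check is an assumption based on the commands in this guide.

```shell
# Hypothetical helper: print the names of the given environment variables
# that are unset or empty, separated by spaces.
missing_vars() {
  missing=""
  for name in "$@"; do
    # Indirect lookup that works in POSIX sh as well as bash.
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  printf '%s\n' "${missing# }"
}

# Usage: warn about anything the following steps rely on.
unset_names=$(missing_vars GAIE_VERSION GUIDE_NAME NAMESPACE HF_TOKEN)
if [ -n "$unset_names" ]; then
  echo "unset or empty: $unset_names" >&2
fi
```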

Installation Instructions

1. Prepare HF Token

Create the llm-d-hf-token secret in the namespace. The UDS tokenizer sidecar reads HF_TOKEN to reach gated tokenizers — Qwen/Qwen3-32B is public but the secret makes swapping in a gated model a no-op. See helpers/hf-token.md for the full helper.

kubectl -n ${NAMESPACE} create secret generic llm-d-hf-token --from-literal=HF_TOKEN="${HF_TOKEN}"

2. Deploy the llm-d Router

Standalone Mode

This deploys the llm-d Router in the simple Standalone Mode:

export REPO_ROOT=$(realpath $(git rev-parse --show-toplevel))
helm install ${GUIDE_NAME} \
oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
-f ${REPO_ROOT}/guides/recipes/router/base.values.yaml \
-f ${REPO_ROOT}/guides/${GUIDE_NAME}/router/${GUIDE_NAME}.values.yaml \
--post-renderer ${REPO_ROOT}/guides/${GUIDE_NAME}/router/patches/uds-tokenizer/post-renderer.sh \
-n ${NAMESPACE} --version ${GAIE_VERSION}

Helm v4

Helm v4's --post-renderer only accepts a registered plugin name, not a path. Install once, then swap the flag value:

helm plugin install guides/${GUIDE_NAME}/router/patches/uds-tokenizer
# in the helm install above, replace the --post-renderer line with:
# --post-renderer uds-tokenizer
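
If the same install script has to work with both Helm major versions, the flag value can be chosen from the version string. This is a sketch under assumptions: post_renderer_value is a hypothetical helper, and it assumes the version string has the form printed by `helm version --short` (for example v3.15.2+g1a500d5).

```shell
# Hypothetical helper: pick the --post-renderer value for a Helm version.
#   $1: version string, e.g. "v3.15.2+g1a500d5"
#   $2: path to post-renderer.sh (used by Helm v3 and earlier)
#   $3: registered plugin name (used by Helm v4 and later)
post_renderer_value() {
  major=${1#v}         # drop the leading "v"
  major=${major%%.*}   # keep only the major version number
  if [ "$major" -ge 4 ]; then
    printf '%s\n' "$3"
  else
    printf '%s\n' "$2"
  fi
}

# Usage (commented out because it needs helm on PATH):
# POST_RENDERER=$(post_renderer_value "$(helm version --short)" \
#   "${REPO_ROOT}/guides/${GUIDE_NAME}/router/patches/uds-tokenizer/post-renderer.sh" \
#   uds-tokenizer)
```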

The release name ${GUIDE_NAME} is mandatory for standard deployments — the inference pool selector matches a guide label that pairs with this release.

Why a helm post-renderer is required (chart limitation)

The standalone chart's sidecar.* slot is occupied by its Envoy proxy, and overriding it would lose HTTP serving, so the UDS tokenizer container is appended via a helm post-render hook instead. The post-renderer runs kustomize build on the chart's rendered manifests with a strategic merge patch that adds three things: the tokenizer-uds container (image ghcr.io/llm-d/llm-d-uds-tokenizer:vllm-v0.19.1), two emptyDir volumes (tokenizers, tokenizer-uds), and a /tmp/tokenizer volumeMount on the existing epp container so the tokenizer plugin can reach the UDS socket. Removal of this workaround is being tracked upstream: once the chart supports multiple sidecars natively, the post-renderer goes away.
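
For readers unfamiliar with the pattern, a post-renderer of this kind can be sketched as follows. This is not the repository's post-renderer.sh, which may differ; stage_rendered_manifests, the file names all.yaml and patch.yaml, and the directory layout are all assumptions made for illustration. Helm pipes the rendered manifests to the post-renderer on stdin and expects the final manifests on stdout.

```shell
# Hypothetical helper: stage Helm's rendered manifests for kustomize.
#   $1: an existing working directory
#   $2: path to a strategic merge patch file
# Reads the rendered manifests from stdin.
stage_rendered_manifests() {
  workdir=$1
  patch=$2
  cat > "$workdir/all.yaml"            # manifests piped in by Helm
  cp "$patch" "$workdir/patch.yaml"
  # Minimal kustomization that applies the patch to the rendered output.
  cat > "$workdir/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - all.yaml
patches:
  - path: patch.yaml
EOF
}

# Usage inside a post-renderer script (commented out; needs kubectl):
# workdir=$(mktemp -d)
# stage_rendered_manifests "$workdir" "$(dirname "$0")/patch.yaml"
# kubectl kustomize "$workdir"   # patched manifests go to stdout for Helm
```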

Gateway Mode

To use a Kubernetes Gateway managed proxy instead of the standalone Envoy sidecar, do not apply the standalone chart above. Instead:

  1. Deploy a Kubernetes Gateway. See the gateway guides for step-by-step deployment of a Gateway named llm-d-inference-gateway.

  2. Deploy the llm-d Router and HTTPRoute via the inferencepool chart with experimentalHttpRoute.enabled=true. The same UDS post-renderer applies:

export REPO_ROOT=$(realpath $(git rev-parse --show-toplevel))
export PROVIDER_NAME=istio # options: none, gke, agentgateway, istio
helm install ${GUIDE_NAME} \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
-f ${REPO_ROOT}/guides/recipes/router/base.values.yaml \
-f ${REPO_ROOT}/guides/recipes/router/features/httproute-flags.yaml \
-f ${REPO_ROOT}/guides/${GUIDE_NAME}/router/${GUIDE_NAME}.values.yaml \
--set provider.name=${PROVIDER_NAME} \
--post-renderer ${REPO_ROOT}/guides/${GUIDE_NAME}/router/patches/uds-tokenizer/post-renderer.sh \
-n ${NAMESPACE} --version ${GAIE_VERSION}

3. Deploy the Model Server

Apply the Kustomize overlay for your backend (defaulting to NVIDIA GPU / vLLM):

export INFRA_PROVIDER=base # base | gke
kubectl apply -n ${NAMESPACE} -k ${REPO_ROOT}/guides/${GUIDE_NAME}/modelserver/gpu/vllm/${INFRA_PROVIDER}/

4. (Optional) Enable Monitoring

note

GKE provides automatic application monitoring out of the box. The llm-d Monitoring stack is not required for GKE, but it is available if you prefer to use it.

  • Install the Monitoring stack.

  • Deploy the monitoring resources for this guide:

    kubectl apply -n ${NAMESPACE} -k ${REPO_ROOT}/guides/recipes/modelserver/components/monitoring
  • Enable Prometheus scrape for the router by layering -f ${REPO_ROOT}/guides/recipes/router/features/monitoring.values.yaml onto the helm command in step 2.

Verification

1. Get the IP of the Proxy

Standalone Mode

export IP=$(kubectl get service ${GUIDE_NAME}-epp -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')

Gateway Mode

export IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')

2. Send Test Requests

Open a temporary interactive shell inside the cluster:

kubectl run curl-debug --rm -it \
--image=cfmanteiga/alpine-bash-curl-jq \
--env="IP=$IP" \
--env="NAMESPACE=$NAMESPACE" \
-- /bin/bash

Send a completion request:

curl -X POST http://${IP}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3-32B",
"prompt": "How are you today?"
}' | jq

Benchmarking

The benchmark launches a pod (llmdbench-harness-launcher) that uses inference-perf with a shared-prefix synthetic workload. Each experiment is saved under the specified output folder, e.g. ./results/<experiment ID>/inference-perf_<experiment ID>_precise-guide-<model name>. See the benchmark instructions doc for details.

1. Prepare the Benchmarking Suite

  • Download the benchmark script:

    curl -L -O https://raw.githubusercontent.com/llm-d/llm-d-benchmark/main/existing_stack/run_only.sh
    chmod u+x run_only.sh
  • Create HuggingFace token

2. Download the Workload Template

curl -LJO "https://raw.githubusercontent.com/llm-d/llm-d/main/guides/precise-prefix-cache-routing/benchmark-templates/guide.yaml"

3. Execute Benchmark

export IP=$(kubectl get service ${GUIDE_NAME}-epp -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')
envsubst < guide.yaml > config.yaml
./run_only.sh -c config.yaml -o ./results

Cleanup

helm uninstall ${GUIDE_NAME} -n ${NAMESPACE}
kubectl delete -n ${NAMESPACE} -k ${REPO_ROOT}/guides/${GUIDE_NAME}/modelserver/gpu/vllm/${INFRA_PROVIDER}/

How It Works

  1. vLLM pods publish KV-cache events — each pod runs vllm serve ... --kv-events-config '{...,"publisher":"zmq","endpoint":"$(KV_EVENTS_ENDPOINT)","topic":"kv@$(POD_IP):$(POD_PORT)@<model>"}' with KV_EVENTS_ENDPOINT=tcp://*:5556, binding its own ZMQ socket. On every KV block allocation/eviction, vLLM emits a ZMQ message.
  2. Router subscribes per pod — pod-discovery (kvEventsConfig.discoverPods: true) wires the data-layer endpoint-notification-source into the scorer's ExtractEndpoint, so each router replica installs a ZMQ subscriber per vLLM pod independently. All replicas converge to the same index.
  3. Scoring — the precise-prefix-cache-scorer returns the fraction of the request's prefix blocks that are resident on each candidate pod. The max-score-picker routes to the highest-scoring pod.
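
The scoring and picking steps can be illustrated with a toy model. This is a sketch, not the scorer's implementation: prefix_score and pick_pod are hypothetical names, block hashes are represented as short strings, and the resident set per pod stands in for the index the router builds from KV-cache events. It also assumes that the first missing block ends the reusable prefix, and it ignores the load-aware scorers that take part in the real decision.

```shell
# Toy score: percentage (0-100) of the request's leading blocks that a pod holds.
#   $1: space-separated block hashes of the request, in prefix order
#   $2: space-separated block hashes resident on the pod
prefix_score() {
  total=0
  hits=0
  matching=1
  for h in $1; do
    total=$((total + 1))
    if [ "$matching" -eq 1 ]; then
      case " $2 " in
        *" $h "*) hits=$((hits + 1)) ;;
        *) matching=0 ;;   # assumption: a miss ends the reusable prefix
      esac
    fi
  done
  [ "$total" -gt 0 ] || { echo 0; return; }
  echo $((hits * 100 / total))
}

# Toy picker: print the pod with the highest score.
#   $1: request block hashes; remaining args: "pod-name:resident hashes"
pick_pod() {
  request=$1
  shift
  best_pod=""
  best_score=-1
  for entry in "$@"; do
    pod=${entry%%:*}
    resident=${entry#*:}
    score=$(prefix_score "$request" "$resident")
    if [ "$score" -gt "$best_score" ]; then
      best_score=$score
      best_pod=$pod
    fi
  done
  echo "$best_pod"
}

# Example: pod-b holds the longest resident prefix of the request.
pick_pod "a b c d" "pod-a:a b" "pod-b:a b c" "pod-c:b c d"
```

In this example pod-c scores 0 even though it holds three of the four blocks, because the first block of the request is missing, so none of the prefix can be reused under the stated assumption.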

The tokenizer plugin and the scorer's internal tokenizersPoolConfig both point at /tmp/tokenizer/tokenizer-uds.socket — a UDS tokenizer sidecar (ghcr.io/llm-d/llm-d-uds-tokenizer) owns tokenizer model downloads and caching, keeping tokenization out of the EPP main container.

Benchmarking Report

The benchmark runs on 16× H100 GPUs, distributed across 8 model servers (2 H100s per server with TP=2).

Comparing llm-d Scheduling to a Simple Kubernetes Service

Graphs below compare the precise path to a stock Kubernetes Service that round-robins requests across the same 8 vLLM pods (no EPP, no scoring).

[Figures: Throughput vs QPS; Latency vs QPS; TTFT p90 vs QPS]

Summary across the full ladder (rates 3 → 60):

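| Metric | k8s service (RR) | llm-d Precise | Δ% vs k8s |
| --- | --- | --- | --- |
| Output tokens/sec | 5,722 | 12,598 | +120.2% |
| Requests/sec | 35.87 | 36.01 | +0.4% |
| TTFT mean (s) | 58.10 | 0.247 | −99.57% |
| TTFT p90 (s) | 107.43 | 0.262 | −99.76% |
| ITL mean (ms) | 44.0 | 47.0 | +6.8% |
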
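
The Δ% column is the relative change from the k8s baseline, (llm-d − k8s) / k8s × 100. A small sketch for reproducing it from the two value columns follows; pct_delta is a hypothetical helper, and it prints an ASCII minus sign where the table uses a typographic one.

```shell
# Hypothetical helper: relative change of a new value against a baseline.
#   $1: baseline (k8s service), $2: new value (llm-d), $3: decimal places
pct_delta() {
  awk -v a="$1" -v b="$2" -v d="$3" \
    'BEGIN { printf("%+." d "f%%\n", (b - a) / a * 100) }'
}

pct_delta 5722 12598 1    # output tokens/sec
pct_delta 58.10 0.247 2   # TTFT mean (s)
```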
Per-rate breakdown across the full ladder

Output tokens/sec — higher is better; TTFT in seconds — lower is better.

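| Rate | k8s Output | llm-d Output | k8s TTFT mean | llm-d TTFT mean | k8s TTFT p90 | llm-d TTFT p90 |
| --- | --- | --- | --- | --- | --- | --- |
| 3 | 1,797 | 1,707 | 0.415 | 0.155 | 0.522 | 0.187 |
| 10 | 4,215 | 4,904 | 0.630 | 0.150 | 1.014 | 0.199 |
| 15 | 5,381 | 6,887 | 0.881 | 0.155 | 1.593 | 0.225 |
| 20 | 6,205 | 11,224 | 18.103 | 0.206 | 35.344 | 0.320 |
| 22 | 5,517 | 11,980 | 20.171 | 0.152 | 39.436 | 0.191 |
| 25 | 5,965 | 12,548 | 21.842 | 0.158 | 42.813 | 0.200 |
| 30 | 5,702 | 13,507 | 24.597 | 0.155 | 46.036 | 0.193 |
| 35 | 5,890 | 13,803 | 24.162 | 0.157 | 45.190 | 0.202 |
| 40 | 6,336 | 15,593 | 68.673 | 0.494 | 126.238 | 0.272 |
| 43 | 6,588 | 15,612 | 72.429 | 0.422 | 130.275 | 0.265 |
| 46 | 6,459 | 15,462 | 70.084 | 0.257 | 129.810 | 0.273 |
| 49 | 6,265 | 15,607 | 70.659 | 0.200 | 133.718 | 0.267 |
| 52 | 6,303 | 15,728 | 74.326 | 0.208 | 134.981 | 0.279 |
| 55 | 6,290 | 15,612 | 72.564 | 0.199 | 134.034 | 0.272 |
| 57 | 6,089 | 15,667 | 72.329 | 0.211 | 135.023 | 0.293 |
| 60 | 6,551 | 15,733 | 75.586 | 0.214 | 138.663 | 0.300 |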