
Precise Prefix Cache Aware Routing

CI: Nightly Precise Prefix Cache E2E on OpenShift, CKS, and GKE.

Overview

This guide configures routing based on precise per-pod KV-cache state rather than request-traffic heuristics. Each vLLM pod publishes KV-cache events over ZMQ; the scheduler subscribes, builds an index keyed by block hash, and scores candidate pods by the fraction of an incoming request's prefix that is already resident.

Two scorers make up the routing decision alongside the load-aware stack.

Default Configuration

| Parameter | Value |
| --- | --- |
| Model | Qwen/Qwen3-32B |
| Replicas | 8 (reduce for smaller fleets; see notes below) |
| Tensor Parallelism | 2 |
| GPUs per replica | 2 |
| Total GPUs | 16 |
| vLLM --block-size | 64 (must match scorer tokenProcessorConfig.blockSize) |
| Scheduler image | ghcr.io/llm-d/llm-d-inference-scheduler:v0.8.0-rc.1 |

Supported Hardware Backends

| Backend | Directory | Default model | Notes |
| --- | --- | --- | --- |
| NVIDIA GPU | modelserver/gpu/vllm/ | Qwen/Qwen3-32B | Default configuration |
| AMD GPU | modelserver/amd/vllm/ | Qwen/Qwen3-32B | AMD GPU |
| Intel XPU | modelserver/xpu/vllm/ | Qwen/Qwen3-0.6B | CI-sized; update scheduler modelName for real use |
| Intel Gaudi (HPU) | modelserver/hpu/vllm/ | Qwen/Qwen3-8B | --block-size=128; update scorer blockSize to match |
| Google TPU v6e | modelserver/tpu-v6/vllm/ | Llama-3.1-70B-Instruct | GKE TPU |
| Google TPU v7 | modelserver/tpu-v7/vllm/ | Qwen3-Coder-480B-FP8 | GKE TPU |
| CPU | modelserver/cpu/vllm/ | Llama-3.2-3B-Instruct | CI-sized |
note

Some hardware variants use reduced configurations (fewer replicas, smaller models) to enable CI testing for compatibility and regression checks.

note

For precise prefix cache scoring to match reality, the tokenizer modelName and the scorer's indexerConfig.tokenizersPoolConfig.modelName in scheduler/precise-prefix-cache-aware.values.yaml must match the model the overlay deploys. HPU and any other backend that tunes --block-size also require updating tokenProcessorConfig.blockSize on the scheduler side.
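
As a sketch (field names as referenced above; the full schema lives in scheduler/precise-prefix-cache-aware.values.yaml), the values that must stay in sync look roughly like:

    # illustrative fragment, not the complete values file
    tokenProcessorConfig:
      blockSize: 64                   # must equal vLLM --block-size (128 on HPU)
    indexerConfig:
      tokenizersPoolConfig:
        modelName: Qwen/Qwen3-32B     # must match the model the overlay deploys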

note

The gpu/vllm/ overlay defaults to 8 replicas to match the canonical 16×H100 benchmark. For smaller fleets (or quick smoke tests), reduce replicas in the deployment patch (modelserver/gpu/vllm/patch-vllm.yaml) before applying.
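
For instance, to drop to two replicas for a quick smoke test (assuming the patch file sets spec.replicas and yq is installed; editing the file by hand works just as well):

    # edit the deployment patch in place before applying the overlay
    yq -i '.spec.replicas = 2' guides/precise-prefix-cache-aware/modelserver/gpu/vllm/patch-vllm.yaml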

Prerequisites

  • Install the Gateway API Inference Extension CRDs.

  • Have the proper client tools installed on your local system. This guide requires Helm v4 (the post-renderer plugin uses the v4 plugin manifest format) and a standalone kustomize binary (v5+) on $PATH, in addition to kubectl. A quick version check follows this list.

  • Check out the llm-d repo:

    export branch="main" # branch, tag, or commit hash
    git clone https://github.com/llm-d/llm-d.git && cd llm-d && git checkout ${branch}
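
A quick sanity check of the client tools (exact output varies by install):

    helm version --short     # expect a v4.x version string
    kustomize version        # expect v5 or newer
    kubectl version --client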

Installation Instructions

1. Prepare a Target Namespace

export NAMESPACE=llm-d-precise
kubectl create namespace ${NAMESPACE}

Create the llm-d-hf-token secret in the namespace. The UDS tokenizer sidecar reads HF_TOKEN to reach gated tokenizers — Qwen/Qwen3-32B is public but the secret makes swapping in a gated model a no-op. See helpers/hf-token.md for the full helper.

kubectl -n ${NAMESPACE} create secret generic llm-d-hf-token --from-literal=HF_TOKEN="${HF_TOKEN}"
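
To confirm the secret landed (describe shows only key names and sizes, not the token itself):

    kubectl -n ${NAMESPACE} describe secret llm-d-hf-token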

2. Deploy the Inference Scheduler

Standalone Mode

This deploys the inference scheduler with an Envoy sidecar — no Kubernetes Gateway required.

helm plugin install guides/precise-prefix-cache-aware/scheduler/patches/uds-tokenizer # once
helm install precise-prefix-cache-aware \
oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
-f guides/recipes/scheduler/base.values.yaml \
-f guides/precise-prefix-cache-aware/scheduler/precise-prefix-cache-aware.values.yaml \
--post-renderer uds-tokenizer \
-n ${NAMESPACE} --version v1.4.0

The release name precise-prefix-cache-aware is required for the stock modelserver overlays: the vLLM patches hardcode the endpoint as KV_EVENTS_ENDPOINT=tcp://<release>-epp.<ns>.svc.cluster.local:5556. If you choose a custom release name, you must manually update the KV_EVENTS_ENDPOINT environment variable in your modelserver overlay to match <your-release-name>-epp.
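
As a sketch of what to change (placement is illustrative; the actual file is your backend's patch, e.g. modelserver/gpu/vllm/patch-vllm.yaml), the container env fragment would look like:

    # hypothetical fragment of the vLLM container spec in the overlay patch
    env:
      - name: KV_EVENTS_ENDPOINT
        value: tcp://my-release-epp.llm-d-precise.svc.cluster.local:5556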

Why a helm post-renderer is required (chart limitation)

The standalone chart's sidecar.* slot is occupied by its Envoy proxy (overriding it would lose HTTP serving), so the UDS tokenizer container is appended via a helm post-render hook instead. The post-renderer runs kustomize build on the chart's rendered manifests with a strategic merge patch that adds the tokenizer-uds container (image ghcr.io/llm-d/llm-d-uds-tokenizer:v0.7.1), two emptyDir volumes (tokenizers, tokenizer-uds), and a /tmp/tokenizer volumeMount on the existing epp container so the tokenizer plugin can reach the UDS socket. Removal of this workaround is being tracked upstream; once the chart supports multiple sidecars natively, the post-renderer can be dropped.
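
For intuition, a minimal kustomize-based post-renderer looks roughly like this (a sketch, not the actual uds-tokenizer plugin; the deployment name in the patch is an assumption):

    #!/usr/bin/env bash
    # Helm pipes the fully rendered manifests to the post-renderer on stdin
    # and expects the transformed manifests back on stdout.
    set -euo pipefail
    tmp=$(mktemp -d)
    cat > "${tmp}/all.yaml"                  # capture rendered chart manifests
    cat > "${tmp}/kustomization.yaml" <<'EOF'
    resources:
      - all.yaml
    patches:
      - patch: |-
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: precise-prefix-cache-aware-epp   # assumed EPP deployment name
          spec:
            template:
              spec:
                containers:
                  - name: tokenizer-uds
                    image: ghcr.io/llm-d/llm-d-uds-tokenizer:v0.7.1
    EOF
    # The real patch also adds the tokenizers/tokenizer-uds emptyDir volumes
    # and the /tmp/tokenizer volumeMount on the epp container.
    kustomize build "${tmp}"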

Gateway Mode

To use a Kubernetes Gateway managed proxy instead of the standalone Envoy sidecar, do not apply the standalone chart above. Instead:

  1. Deploy a Kubernetes Gateway. See the gateway guides for step-by-step deployment of a Gateway named llm-d-inference-gateway.

  2. Deploy the Inference Scheduler and HTTPRoute via the inferencepool chart with experimentalHttpRoute.enabled=true. Same UDS post-renderer applies:

    export PROVIDER_NAME=istio # options: none, gke, agentgateway, istio
    helm install precise-prefix-cache-aware \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
    -f guides/recipes/scheduler/base.values.yaml \
    -f guides/precise-prefix-cache-aware/scheduler/precise-prefix-cache-aware.values.yaml \
    --set provider.name=${PROVIDER_NAME} \
    --set experimentalHttpRoute.enabled=true \
    --set experimentalHttpRoute.inferenceGatewayName=llm-d-inference-gateway \
    --post-renderer uds-tokenizer \
    -n ${NAMESPACE} --version v1.4.0
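
Once the Gateway and the chart are applied, a quick sanity check that the route and pool objects exist (output depends on your provider):

    kubectl -n ${NAMESPACE} get httproute,inferencepool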

3. Deploy the Model Server

Apply the Kustomize overlay for your backend (defaulting to NVIDIA GPU / vLLM):

kubectl apply -n ${NAMESPACE} -k guides/precise-prefix-cache-aware/modelserver/gpu/vllm/
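
Image pulls and weight loading can take several minutes; watch the pods come up with:

    kubectl -n ${NAMESPACE} get pods -w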

4. (Optional) Enable Monitoring

note

GKE provides automatic application monitoring out of the box. The llm-d Monitoring stack is not required for GKE, but it is available if you prefer to use it.

  • Install the Monitoring stack.

  • Deploy the monitoring resources for this guide:

    kubectl apply -n ${NAMESPACE} -k guides/recipes/modelserver/components/monitoring
  • Enable Prometheus scrape for the scheduler by layering -f guides/recipes/scheduler/features/monitoring.values.yaml onto the helm command in step 2.
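
For example, the standalone install from step 2 with scrape enabled is the same command plus one values file (shown as upgrade --install so it can also be applied to an existing release):

    helm upgrade --install precise-prefix-cache-aware \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
    -f guides/recipes/scheduler/base.values.yaml \
    -f guides/precise-prefix-cache-aware/scheduler/precise-prefix-cache-aware.values.yaml \
    -f guides/recipes/scheduler/features/monitoring.values.yaml \
    --post-renderer uds-tokenizer \
    -n ${NAMESPACE} --version v1.4.0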

5. (Optional) Enable Active-Active High Availability

The default single-replica install uses central ZMQ — vLLM publishers connect into the scheduler service. To run multiple scheduler replicas simultaneously (each with its own Envoy gateway sidecar) behind a single load-balancing Service, see active-active.md.

Verification

1. Get the IP of the Proxy

Standalone Mode

export IP=$(kubectl get service precise-prefix-cache-aware-epp -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')

Gateway Mode

export IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')

2. Send Test Requests

Open a temporary interactive shell inside the cluster:

kubectl run curl-debug --rm -it \
--image=cfmanteiga/alpine-bash-curl-jq \
--env="IP=$IP" \
--env="NAMESPACE=$NAMESPACE" \
-- /bin/bash

Send a completion request:

curl -X POST http://${IP}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3-32B",
"prompt": "How are you today?"
}' | jq
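
To exercise the prefix-cache path, re-send the same (or a shared-prefix) prompt; the scorer should now route it to a pod that already holds those prefix blocks. Pod attribution is easiest to confirm in the scheduler logs.

curl -s -X POST http://${IP}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3-32B",
"prompt": "How are you today?"
}' | jq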

Benchmarking

The benchmark launches a pod (llmdbench-harness-launcher) that uses inference-perf with a shared-prefix synthetic workload. Each experiment is saved under the specified output folder, e.g. ./results/<experiment ID>/inference-perf_<experiment ID>_shared_prefix_precise-guide-<model name>. See the benchmark instructions doc for details.

1. Prepare the Benchmarking Suite

  • Download the benchmark script:

    curl -L -O https://raw.githubusercontent.com/llm-d/llm-d-benchmark/main/existing_stack/run_only.sh
    chmod u+x run_only.sh
  • Create a HuggingFace token (see helpers/hf-token.md).

2. Download the Workload Template

curl -LJO "https://raw.githubusercontent.com/llm-d/llm-d/main/guides/precise-prefix-cache-aware/benchmark-templates/guide.yaml"

3. Execute Benchmark

export IP=$(kubectl get service precise-prefix-cache-aware-epp -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')
envsubst < guide.yaml > config.yaml
./run_only.sh -c config.yaml -o ./results

Cleanup

helm uninstall precise-prefix-cache-aware -n ${NAMESPACE}
kubectl delete -n ${NAMESPACE} -k guides/precise-prefix-cache-aware/modelserver/gpu/vllm/

How It Works

  1. vLLM pods publish KV-cache events — each pod runs vllm serve ... --kv-events-config '{...,"publisher":"zmq","endpoint":"$(KV_EVENTS_ENDPOINT)","topic":"kv@$(POD_IP):$(POD_PORT)@<model>"}'. On every KV block allocation/eviction, vLLM emits a ZMQ message.
  2. Scheduler subscribes — in central mode the scheduler's scorer binds tcp://*:5556 and all vLLM publishers connect in. A single kv@-prefixed topic filter passes all events through.
  3. Index is keyed by block hash — the scorer hashes tokens using blockSize=64 + hashSeed="42" (must match vLLM's PYTHONHASHSEED=42 env var) to produce the same block IDs vLLM emits. Incoming requests are tokenized via the UDS tokenizer sidecar, hashed with the same parameters, and looked up in the index.
  4. Scoring — the precise-prefix-cache-scorer returns the fraction of the request's prefix blocks that are resident on each candidate pod. The max-score-picker routes to the highest-scoring pod.
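
A concrete (illustrative) example: with blockSize=64, a 512-token prompt spans 8 prefix blocks. If pod A already holds 6 of those blocks and pod B holds 2, the precise-prefix-cache-scorer returns 0.75 for A and 0.25 for B, and the max-score-picker routes the request to A.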

The tokenizer plugin and the scorer's internal tokenizersPoolConfig both point at /tmp/tokenizer/tokenizer-uds.socket — a UDS tokenizer sidecar (ghcr.io/llm-d/llm-d-uds-tokenizer) owns tokenizer model downloads and caching, keeping tokenization out of the EPP main container.

Benchmarking Report

The benchmark runs on 16× H100 GPUs, distributed across 8 model servers (2 H100s per server with TP=2).

Report for rate=60:

metrics:
  latency:
    request_latency:
      mean: 63.34
      p50: 60.84
      p90: 75.70
      p99: 77.97
      units: s
    time_to_first_token:
      mean: 0.192
      p50: 0.178
      p90: 0.260
      p99: 0.564
      units: s
    time_per_output_token:
      mean: 0.063
      p50: 0.061
      p90: 0.075
      p99: 0.078
      units: s/token
  requests:
    failures: 0
    input_length: {mean: 7584}
    output_length: {mean: 937}
    total: 1500
  throughput:
    requests_per_sec: 14.87
    output_tokens_per_sec: 13932.0
    total_tokens_per_sec: 126727.5
  time:
    duration: 24.92

Comparing LLM-d Scheduling to a Simple Kubernetes Service

Graphs below are from inference-perf --analyze comparing the precise path to a stock Kubernetes service routing directly to the vLLM pods.

Figures: Latency vs QPS and Throughput vs QPS.

At the rate=60 stage:

  • Throughput: Requests/sec +159.5%; Output tokens/sec +159.8%
  • Latency: TTFT (mean) -99.5%; E2E request latency (mean) -39.9%
  • Per-token speed: Inter-token latency (mean) -10.4% (faster)
| Metric | k8s (Mean) | llm-d precise (Mean) | Δ (llm-d − k8s) | Δ% vs k8s |
| --- | --- | --- | --- | --- |
| Requests/sec | 5.7306 | 14.8719 | +9.1413 | +159.5% |
| Input tokens/sec | 43,417.86 | 112,795.47 | +69,377.61 | +159.8% |
| Output tokens/sec | 5,362.16 | 13,931.99 | +8,569.83 | +159.8% |
| Total tokens/sec | 48,780.02 | 126,727.46 | +77,947.44 | +159.8% |
| Request latency (s) | 105.4133 | 63.3376 | -42.0757 | -39.9% |
| TTFT (s) | 34.9145 | 0.1916 | -34.7229 | -99.5% |
| Inter-token latency (ms) | 70.42 | 63.07 | -7.35 | -10.4% |