Offloading Prefix Cache to CPU Memory

Overview

This guide provides recipes for offloading the prefix cache to CPU RAM using the vLLM native offloading connector, the LMCache connector, or the tpu-inference KVCache connector. Offloading the prefix cache to CPU RAM can increase overall throughput and relieve memory pressure on HBM for large-context models and frequent multi-turn user sessions.

Default Configuration

GPU

| Parameter | Value |
| --- | --- |
| Model | Qwen/Qwen3-32B |
| GPUs per replica (TP) | 4 |
| GPU Accelerator | NVIDIA H100 |
| CPU Cache Offload Size | 100 GB |

TPU

| Parameter | Value |
| --- | --- |
| Model | Qwen/Qwen3-32B |
| TPUs per replica (TP) | 8 |
| TPU Accelerator | TPU7x |
| HBM Staging Buffer Size | 1000 Blocks (~34 GB) |
| CPU Cache Offload Size | 25000 Chunks (~780 GB) |

Supported Hardware Backends

This guide supports both GPU and TPU. GPU defaults to NVIDIA H100 and TPU defaults to TPU7x. The Kustomize overlays are available in modelserver/gpu/vllm/ and modelserver/tpu-v7/vllm/.


Prerequisites

  • Have the proper client tools installed on your local system to use this guide.

  • Check out the llm-d repo:

    export branch="main" # branch, tag, or commit hash
    git clone https://github.com/llm-d/llm-d.git && cd llm-d && git checkout ${branch}
  • Set the following environment variables:

    export GAIE_VERSION=v1.5.0
    export GUIDE_NAME="tiered-prefix-cache-cpu"
    export NAMESPACE=llm-d-${GUIDE_NAME}
  • Install the Gateway API Inference Extension CRDs:

    kubectl apply -k "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=${GAIE_VERSION}"
  • Create a target namespace for the installation:

    kubectl create namespace ${NAMESPACE}
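
Later steps depend on the variables exported above, so a quick preflight check can catch a typo or a fresh shell early. The following is a minimal sketch, not part of the guide's tooling: require_vars and require_tools are hypothetical helper names, and the tool list is simply the set of commands this guide invokes (kubectl, helm, git).

```shell
# Hypothetical helper: report every required variable that is not exported.
require_vars() {
  missing=0
  for name in "$@"; do
    # printenv prints nothing for a variable that is unset or not exported.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: report every command that is not on PATH.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

require_vars GAIE_VERSION GUIDE_NAME NAMESPACE || echo "export the variables above first"
require_tools kubectl helm git || echo "install the missing client tools first"
```

Both helpers return a non-zero status when anything is missing, so they can also gate a script with `require_vars ... || exit 1`.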

Installation Instructions

1. Deploy the llm-d Router

Standalone Mode

This deploys the llm-d Router together with an Envoy sidecar. This is the default mode for standalone deployments:

helm install ${GUIDE_NAME} \
oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
-f guides/recipes/router/base.values.yaml \
-f guides/tiered-prefix-cache/cpu/router/${GUIDE_NAME}.values.yaml \
-n ${NAMESPACE} --version ${GAIE_VERSION}

Gateway Mode

To use a Kubernetes Gateway managed proxy rather than the standalone version, follow these steps instead of applying the previous Helm chart:

  1. Deploy a Kubernetes Gateway by following one of the gateway guides.
  2. Deploy the llm-d Router and an HTTPRoute connecting to the Gateway:
export PROVIDER_NAME=gke # options: none, gke, agentgateway, istio
helm install ${GUIDE_NAME} \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
-f guides/recipes/router/base.values.yaml \
-f guides/tiered-prefix-cache/cpu/router/${GUIDE_NAME}.values.yaml \
--set provider.name=${PROVIDER_NAME} \
--set experimentalHttpRoute.enabled=true \
--set experimentalHttpRoute.inferenceGatewayName=llm-d-inference-gateway \
-n ${NAMESPACE} --version ${GAIE_VERSION}

2. Deploy the Model Server

Apply the Kustomize overlay that matches your accelerator and offloading connector:

For NVIDIA GPU:

export CONNECTOR=offloading-connector # offloading-connector | lmcache-connector
export INFRA_PROVIDER=base # base | gke
kubectl apply -n ${NAMESPACE} -k guides/tiered-prefix-cache/cpu/modelserver/gpu/vllm/${CONNECTOR}/${INFRA_PROVIDER}/
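
Because the overlay path is assembled from two variables, a misspelled value only surfaces as a "no such directory" error from kubectl. The sketch below validates both selectors before building the path. gpu_overlay_path is a hypothetical helper name, and the accepted values are taken only from the comments in the commands above; it does not check that every combination exists in the repository.

```shell
# Hypothetical helper: validate CONNECTOR and INFRA_PROVIDER, then print the
# GPU overlay directory used by the kubectl apply command in this step.
gpu_overlay_path() {
  connector=$1
  provider=$2
  case "$connector" in
    offloading-connector|lmcache-connector) ;;
    *) echo "unknown CONNECTOR: $connector" >&2; return 1 ;;
  esac
  case "$provider" in
    base|gke) ;;
    *) echo "unknown INFRA_PROVIDER: $provider" >&2; return 1 ;;
  esac
  echo "guides/tiered-prefix-cache/cpu/modelserver/gpu/vllm/${connector}/${provider}/"
}

# Falls back to the defaults shown above when the variables are not set.
gpu_overlay_path "${CONNECTOR:-offloading-connector}" "${INFRA_PROVIDER:-base}"
```

The printed path can be passed to `kubectl kustomize` first to render the manifests locally for review, and then to `kubectl apply -n ${NAMESPACE} -k` as shown above.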

For Google TPU v7:

kubectl apply -n ${NAMESPACE} -k guides/tiered-prefix-cache/cpu/modelserver/tpu-v7/vllm/tpu-offloading-connector/

Note: To enable tiered prefix caching, we customize the llm-d EPP configuration. We configure two prefix cache scorers: one for the GPU/TPU cache and another for the CPU cache. LRU capacity for the CPU cache must be manually configured (lruCapacityPerServer) because vLLM currently does not emit CPU block metrics.
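
Since the CPU capacity has to be set by hand, a back-of-the-envelope estimate can give a starting value. The sketch below is illustrative arithmetic only, and every input is an assumption: the Qwen3-32B KV-cache geometry (64 layers, 8 KV heads, head dimension 128, 2-byte KV dtype) should be checked against the model config, block_tokens is a hypothetical scorer block size, and the calculation assumes lruCapacityPerServer is counted in blocks of that size. Whether the 100 GB budget applies per replica or per worker is also not stated in this guide.

```shell
# All inputs are assumptions for illustration; replace them with values
# from your own model config and EPP configuration.
layers=64        # assumed number of transformer layers
kv_heads=8       # assumed number of KV heads (grouped-query attention)
head_dim=128     # assumed head dimension
dtype_bytes=2    # assumed 16-bit KV cache
block_tokens=64  # hypothetical scorer block size, in tokens
cpu_cache_bytes=$(( 100 * 1000 * 1000 * 1000 ))  # 100 GB default from this guide

# One K and one V vector per KV head, per layer, per token.
bytes_per_token=$(( 2 * layers * kv_heads * head_dim * dtype_bytes ))
cpu_cache_tokens=$(( cpu_cache_bytes / bytes_per_token ))
lru_capacity=$(( cpu_cache_tokens / block_tokens ))

echo "bytes per token:      ${bytes_per_token}"
echo "tokens in CPU cache:  ${cpu_cache_tokens}"
echo "lruCapacityPerServer: ${lru_capacity}"
```

Under these assumptions the estimate comes to 5960 blocks. Treat it as an order-of-magnitude starting point and adjust it to the block size and cache budget actually configured.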


3. (Optional) Enable monitoring

kubectl apply -n ${NAMESPACE} -k guides/recipes/modelserver/components/monitoring

Verification

1. Get the IP of the Proxy

Standalone Mode

export IP=$(kubectl get service ${GUIDE_NAME}-epp -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')

Gateway Mode

export IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')

2. Send Test Requests

Open a temporary interactive shell inside the cluster:

kubectl run curl-debug --rm -it \
--image=cfmanteiga/alpine-bash-curl-jq \
--env="IP=$IP" \
--env="NAMESPACE=$NAMESPACE" \
-- /bin/bash

Send a completion request:

curl -X POST http://${IP}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3-32B",
"prompt": "How are you today?"
}' | jq
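
A single short prompt confirms that the endpoint responds, but it does not exercise the prefix cache. One rough way to do that is to send the same long prompt twice and compare latencies. The sketch below is a minimal example under stated assumptions: make_payload and timed_request are hypothetical helper names, the filler sentence and repeat count are arbitrary, and max_tokens is set only to keep responses short.

```shell
# Hypothetical helper: build a completion request whose prompt starts with a
# long repeated prefix. The question must not contain quotes or backslashes,
# because no JSON escaping is performed.
make_payload() {
  repeats=$1
  question=$2
  prefix=""
  i=0
  while [ "$i" -lt "$repeats" ]; do
    prefix="${prefix}The quick brown fox jumps over the lazy dog. "
    i=$(( i + 1 ))
  done
  printf '{"model": "Qwen/Qwen3-32B", "prompt": "%s%s", "max_tokens": 16}' \
    "$prefix" "$question"
}

# Hypothetical helper: send the payload and print the total request time in
# seconds, discarding the response body. Expects IP to be set as above.
timed_request() {
  curl -s -o /dev/null -w '%{time_total}\n' \
    -X POST "http://${IP}/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "$1"
}

payload=$(make_payload 200 "How are you today?")
```

Running `timed_request "$payload"` twice inside the debug shell should usually show a lower time on the second call if the shared prefix is served from cache. A warm request like this is most likely to hit the HBM tier. Exercising the CPU tier requires enough distinct traffic in between to evict the prefix from HBM.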

Cleanup

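To clean up the deployed components, run the following with CONNECTOR and INFRA_PROVIDER still set to the values used during installation (for a TPU deployment, delete the tpu-v7 overlay path applied in step 2 instead):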

helm uninstall ${GUIDE_NAME} -n ${NAMESPACE}
kubectl delete -n ${NAMESPACE} -k guides/tiered-prefix-cache/cpu/modelserver/gpu/vllm/${CONNECTOR}/${INFRA_PROVIDER}
kubectl delete namespace ${NAMESPACE}

Benchmarking

For instructions on setting up standard workloads and running performance analyses against this guide, refer to the benchmark instructions doc.

The scorer weights default to 2:2:1:1 (Queue Scorer : KV Cache Utilization Scorer : GPU/TPU Prefix Cache Scorer : CPU Prefix Cache Scorer). This is a conservative default that weights load signals above prefix-cache affinity.
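
To make the effect of the 2:2:1:1 weights concrete, the sketch below combines per-scorer scores into one value per endpoint. It assumes that the scheduler ranks endpoints by a weighted sum of scores in the range 0 to 1. The helper name weighted_score and all input scores are hypothetical and chosen only for illustration.

```shell
# Hypothetical helper: weighted sum of four scores in [0, 1], in the order
# queue, KV cache utilization, GPU/TPU prefix cache, CPU prefix cache.
weighted_score() {
  awk -v q="$1" -v kv="$2" -v hbm="$3" -v cpu="$4" \
    'BEGIN { printf "%.2f\n", 2*q + 2*kv + 1*hbm + 1*cpu }'
}

weighted_score 0.9 0.8 0.0 0.0   # lightly loaded endpoint, no cached prefix
weighted_score 0.5 0.5 0.0 1.0   # busier endpoint, full prefix hit in CPU cache
weighted_score 0.5 0.5 1.0 1.0   # busier endpoint, prefix hit in both tiers
```

With these inputs the three endpoints score 3.40, 3.00, and 4.00. A CPU-only prefix hit does not outweigh a clearly less loaded endpoint, while a hit in both tiers does. Raising the prefix scorer weights would shift that balance toward cache affinity.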

Note: The following benchmark results are from a previous release and do not match the deployment of the current release. A follow-up benchmark will be conducted and the results will be updated accordingly. See https://github.com/llm-d/llm-d/issues/680.

GPU

High Cache Scenario (HBM < KVCache < HBM + CPU RAM)

| Medium Configuration | Mean TTFT (s) | P90 TTFT (s) | Mean E2E Latency (s) | P90 E2E Latency (s) | Overall Throughput (tokens/s) |
| --- | --- | --- | --- | --- | --- |
| Baseline vLLM | 9.0 | 20.9 | 37.8 | 49.7 | 38,534.8 |
| vLLM + CPU offloading 100 GB | 6.7 (-25.6%) | 20.2 (-3.3%) | 30.9 (-18.3%) | 44.2 (-11.1%) | 46,751.0 (+21.3%) |
| vLLM + LMCache CPU offloading 100 GB | 6.5 (-27.8%) | 18.8 (-10.0%) | 30.8 (-18.5%) | 43.0 (-13.5%) | 46,910.6 (+21.7%) |

Low Cache Scenario (KVCache < HBM)

| Medium Configuration | Mean TTFT (s) | P90 TTFT (s) | Mean E2E Latency (s) | P90 E2E Latency (s) | Overall Throughput (tokens/s) |
| --- | --- | --- | --- | --- | --- |
| Baseline vLLM | 0.12 | 0.09 | 18.4 | 19.6 | 23,389.6 |
| vLLM + CPU offloading 100 GB | 0.13 | 0.11 | 18.6 | 20.6 | 23,032.6 |
| vLLM + LMCache CPU offloading 100 GB | 0.15 | 0.10 | 18.9 | 19.6 | 22,772.5 |

TPU

High Cache Scenario (HBM < KVCache < HBM + CPU RAM)

| Medium Configuration | Mean TTFT (s) | P90 TTFT (s) | Mean E2E Latency (s) | P90 E2E Latency (s) | Overall Throughput (tokens/s) |
| --- | --- | --- | --- | --- | --- |
| Baseline vLLM | 0.98 | 2.1 | 22.1 | 26.2 | 67,262.3 |
| vLLM + CPU offloading 25000 Chunks | 0.56 (-49%) | 0.5 (-75.7%) | 20.3 (-8.1%) | 23.6 (-9.9%) | 73,178.1 (+8.9%) |

Low Cache Scenario (KVCache < HBM)

| Medium Configuration | Mean TTFT (s) | P90 TTFT (s) | Mean E2E Latency (s) | P90 E2E Latency (s) | Overall Throughput (tokens/s) |
| --- | --- | --- | --- | --- | --- |
| Baseline vLLM | 0.24 | 0.23 | 16.9 | 19.9 | 25,715.9 |
| vLLM + CPU offloading 25000 Chunks | 0.26 | 0.24 | 17.4 | 20.2 | 23,032.6 |