Skip to main content

Offloading Prefix Cache to Shared Storage

Overview

This guide explains how to offload the vLLM prefix cache (KV cache) to shared storage using the native llm-d FS connector or the LMCache connector. This allows prefix cache reuse across multiple vLLM instances and across nodes that mount the same shared path.

Default Configuration

ParameterValue
ModelQwen/Qwen3-32B
GPUs per replica (TP)4
GPU AcceleratorNVIDIA H100
CPU Cache Offload Size100 GB

Supported Connectors

ConnectorDirectory
llm-d FS Connectormodelserver/gpu/vllm/llm-d-fs-connector/
LMCache Connectormodelserver/gpu/vllm/lmcache-connector/

About llm-d FS Connector

The llm-d FS connector integrates with vLLM's native OffloadingConnector and stores KV blocks on shared storage that exposes a POSIX-compatible file API (for example IBM Storage Scale, CephFS, GCP Lustre, AWS Lustre).

This enables prefix cache reuse across multiple vLLM instances and across nodes that mount the same shared path.

Key advantages:

  • Fully asynchronous I/O - Uses vLLM's native offloading pipeline, enabling non-blocking KV cache reads and writes.
  • File system agnostic - Works with any storage backend that supports standard POSIX file operations.
  • KV sharing across instances and nodes - Multiple vLLM servers can reuse cached prefixes by accessing the same shared storage path.
  • High throughput via parallelism - I/O operations are parallelized across multiple threads to increase bandwidth and reduce tail latency.
  • Minimal GPU compute interference - Uses GPU DMA for data transfers, reducing interference with GPU compute kernels during load and store operations.

Note: The storage connector does not handle cleanup or eviction of data on the shared storage. Storage capacity management must be handled by the underlying storage system or by an external controller. A simple reference implementation of a PVC-based evictor is available in the kv-cache repository (PVC Evictor), which can be used to automatically clean up old KV-cache files when storage thresholds are exceeded.

For advanced configuration options and implementation details, see the llm-d FS backend documentation.

About LMCache Connector

LMCache is an extension for LLM serving engines that enhances performance by reducing "Time to First Token" (TTFT) and increasing throughput, particularly for long-context scenarios. It provides integration to various storage backends. For more information, visit the LMCache website.


Prerequisites

  • Have the proper client tools installed on your local system to use this guide.

  • Checkout llm-d repo:

    export branch="main" # branch, tag, or commit hash
    git clone https://github.com/llm-d/llm-d.git && cd llm-d && git checkout ${branch}
  • Set the following environment variables:

    export GAIE_VERSION=v1.5.0
    export GUIDE_NAME="tiered-prefix-cache-storage"
    export NAMESPACE=llm-d-${GUIDE_NAME}
  • Install the Gateway API Inference Extension CRDs:

    kubectl apply -k "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=${GAIE_VERSION}"
  • Create a target namespace for the installation:

    kubectl create namespace ${NAMESPACE}

Installation Instructions

1. Prepare a PVC (ReadWriteMany)

Set your storage class depending on your environment:

export STORAGE_CLASS="" # leave empty to use the cluster default StorageClass; or set "lustre" / "efs-sc"

To provision a managed GCP Lustre instance on GKE and configure the corresponding StorageClass, follow the GCP Lustre guide.

To provision AWS EFS and configure the corresponding StorageClass, follow the EFS guide.

Create a PVC using the selected storage class:

envsubst < guides/tiered-prefix-cache/storage/manifests/pvc.yaml | kubectl apply -n ${NAMESPACE} -f -

2. Deploy the llm-d Router

Standalone Mode

This deploys the llm-d Router with an Envoy sidecar side-by-side:

helm install ${GUIDE_NAME} \
oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
-f guides/recipes/router/base.values.yaml \
-f guides/tiered-prefix-cache/storage/router/${GUIDE_NAME}.values.yaml \
-n ${NAMESPACE} --version ${GAIE_VERSION}

Gateway Mode

To use a Kubernetes Gateway managed proxy instead of standalone:

  1. Deploy a Kubernetes Gateway by following one of the gateway guides.
  2. Deploy HTTPRoute and the llm-d Router:
export PROVIDER_NAME=gke # options: none, gke, agentgateway, istio
helm install llm-d-infpool \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
-f guides/recipes/router/base.values.yaml \
--set provider.name=${PROVIDER_NAME} \
--set experimentalHttpRoute.enabled=true \
--set experimentalHttpRoute.inferenceGatewayName=llm-d-inference-gateway \
-n ${NAMESPACE} --version ${GAIE_VERSION}

3. Deploy the Model Server

Apply the Kustomize overlay corresponding to your desired connector backend:

export CONNECTOR=llm-d-fs-connector # llm-d-fs-connector | lmcache-connector
export INFRA_PROVIDER=base # base | gke
kubectl apply -n ${NAMESPACE} -k guides/tiered-prefix-cache/storage/modelserver/gpu/vllm/${CONNECTOR}/${INFRA_PROVIDER}/

4. (Optional) Enable monitoring

kubectl apply -n ${NAMESPACE} -k guides/recipes/modelserver/components/monitoring

Verification

1. Check the PVC

kubectl get pvc -n ${NAMESPACE}

Output should show the PVC as Bound:

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
<pvc-name> Bound pvc-3c793698-XXXXXXX 18000Gi RWX <storage-class> <unset> 6d

2. Get the IP of the Proxy

Standalone Mode

export IP=$(kubectl get service ${GUIDE_NAME}-epp -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')
Gateway Mode
export IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')

3. Send Test Requests

Open a temporary interactive shell inside the cluster:

kubectl run curl-debug --rm -it \
--image=cfmanteiga/alpine-bash-curl-jq \
--env="IP=$IP" \
--env="NAMESPACE=$NAMESPACE" \
-- /bin/bash

Send a completion request:

curl -X POST http://${IP}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3-32B",
"prompt": "How are you today?"
}' | jq

4. Verify KV cache is offloaded to storage

Send a long prompt (one that crosses several block_size boundaries) to trigger offload

# Run this inside the interactive shell created in step 3.
# Long prompt (~3K tokens)
PROMPT=$(printf 'Story: '; for i in $(seq 1 800); do printf 'alice met bob and they walked together. '; done)
jq -n --arg prompt "$PROMPT" '{"model":"Qwen/Qwen3-32B", "prompt":$prompt, "max_tokens":3, "temperature":0}' | \
curl -s http://${IP}/v1/completions \
-H 'Content-Type: application/json' \
-d @- | jq
# Check the shared PVC for written blocks
POD=$(kubectl get pod -n ${NAMESPACE} -l llm-d.ai/role=decode -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ${NAMESPACE} ${POD} -- du -sh /mnt/files-storage/kv-cache
kubectl exec -n ${NAMESPACE} ${POD} -- find /mnt/files-storage/kv-cache -maxdepth 5

Expected output: du -sh shows hundreds of MB to several GB, and find lists a path like /mnt/files-storage/kv-cache/<model>/<block-config>/<tp-config>/... (fs connector) or /mnt/files-storage/kv-cache/<model>-xxx.pt (lmcache connector).

If you have monitoring set up, you can also confirm via vLLM's offload metrics in the metrics explorer, vllm:kv_offload_total_bytes for fs connector or lmcache:local_storage_usage for lmcache connector.


Benchmarking

The following benchmark results demonstrate the performance improvements of offloading the KV cache to Lustre using the LMCache connector. Two scenarios with varying context lengths are provided to illustrate how the performance gains from Lustre scale up as the computational load and KV cache size increase, particularly when exceeding the capacity of local HBM and CPU RAM.

Benchmark Setup

note

The following benchmark results were from a previous release and does not match the deployment of the current release. A follow up benchmark will be conducted and the results will be updated accordingly. See https://github.com/llm-d/llm-d/issues/680.

  • Hardware:

    • A total of 16 H100 GPUs, each with 80GB of HBM, were used.
    • The GPUs were distributed across 4 a3-highgpu-4g instances, with 4 GPUs per instance.
    • Lustre PVC with storage capacity of 18000GiB
  • vLLM Configuration:

    • gpu_memory_utilization was set to 0.65 to reduce the pressure on the benchmark tool. In production configuration this is typically set to a higher value such as 0.9.
    • Baseline has CPU offloading enabled.
    • Lustre offloading was enabled using Lustre PVC as local backend disk.

LMCache Connector Configuration:

  • For LMCache setup, LMCACHE_MAX_LOCAL_CPU_SIZE set to 20GB, which provides approximately 20*16(number of GPUs)=320GB of CPU RAM cache.
  • Lustre storage capacity available for KV cache offloading was set through LMCACHE_MAX_LOCAL_DISK_SIZE:"1120Gi". As we have 16 GPUs sharing the Lustre disk, 1120*16= 17920Gi <=le; 18000Gi (i.e. available Lustre capacity) This value can be less than or equal to the available disk size.

The benchmark was conducted using the inference-perf tool with the following hardware, memory, and workload configurations:

  • Workload:

    • The two different workloads were tested with a constant concurrency of 20 requests with different system_prompt_lengths of 50K and 70K.
    • Inference Perf configuration
      • type: concurrent
      • num_requests: 2700
      • concurrency_level: 20
    • System prompt length: 50K
      • num_groups: 50
      • system_prompt_len: 50000
      • question_len: 256
      • output_len: 1024
      • num_prompts_per_group: 50
    • System prompt length: 70K
      • num_groups: 50
      • system_prompt_len: 70000
      • question_len: 256
      • output_len: 1024
      • num_prompts_per_group: 50
  • Memory Calculation:

    • The KVCache size for the meta-llama/Llama-3.3-70B-Instruct model is approximately 320KB per token.
    • With gpu_memory_utilization at 0.65, there are 10768 GPU blocks available per engine.
    • The available HBM for KVCache per engine is approximately 55 GB (10768 blocks * 5.12 MB/block).
    • The total available HBM for the KVCache across the entire system was 220 GB (4 engines * 55 GB/engine).
    • Total CPU RAM cache available across the system was 320 GB.
    • Lustre capacity available for KV cache offloading: LMCACHE_MAX_LOCAL_DISK_SIZE="1120Gi" for each GPU.
Key Findings

In both scenarios, the total KV cache size significantly exceeds the combined capacity of local HBM and CPU RAM. The results demonstrate that as context length and memory demands increase, the performance benefits of offloading to Lustre become even more pronounced.

  • 50K system prompt length (KVCache size 994 GiB): While CPU RAM provides 320GB for KV cache offloading, adding Lustre significantly enhances performance compared to relying on CPU offloading alone.
  • 70K system prompt length (KVCache size 1.3 TiB): As the KV cache footprint grows to 1.3 TiB, the memory pressure intensifies. In this heavier scenario, Lustre delivers even greater performance gains, demonstrating its ability to seamlessly scale with demanding long-context use cases.
50K system prompt length (KVCache size 994 GiB) — KV Cache > (HBM + CPU RAM)
KVCache > HBM + CPU RAMMean TTFT (second)P90 TTFT (second)Mean E2E Latency (second)P90 E2E Latency (second)Input Throughput (token per second)Output Throughput (token per second)Overall Throughput (token per second)
Baseline vLLM + CPU offloading25.3837.7456.2169.691860735418962
vLLM + CPU offloading + Lustre20.12 (-21%)34.02 (-9.9%)45.83 (-18%)58.73 (-16%)22827 (+23%)435 (+23%)23262 (+23%)
70K system prompt length (KVCache size 1.3TiB GiB) — KV Cache >> (HBM + CPU RAM)
KVCache >> HBM + CPU RAMMean TTFT (second)P90 TTFT (second)Mean E2E Latency (second)P90 E2E Latency (second)Input Throughput (token per second)Output Throughput (token per second)Overall Throughput (token per second)
Baseline vLLM + CPU offloading58.0274.7587.99105.4616598226.6516825
vLLM + CPU offloading + Lustre45 (-22%)64.79 (-13%)68.28 (-22%)87.47 (-17%)21364 (+28.71%)291 (+28.39%)21656 (+28.71%)

LLM-D FS Connector Configuration:

  • The offloading connector supports multi-tiered offloading (e.g., offloading to both CPU and storage backends) through a Multi-connector setup, which combines an Offloading connector with a CPU backend and an Offloading connector with an LLM-D FS backend.
  • For benchmark we allocated 64.42*4 ~= 356GB GB of CPU RAM for offloading by setting cpu_bytes_to_use=64424509440.
  • The offloading connector can utilize the entire capacity of the attached Lustre disk for KV cache offloading, for this case the Lustre PVC = 18000GiB.

The benchmark was conducted using the inference-perf tool with the following hardware, memory, and workload configurations:

  • Workload:

    • The two different workloads were tested with a constant concurrency of 20 requests with different system_prompt_lengths of 30K on Qwen/Qwen3-32B and 50K on a larger model meta-llama/Llama-3.3-70B-Instruct.
    • Inference Perf configuration
      • type: concurrent
      • num_requests: 2700
      • concurrency_level: 20
    • [Qwen3-32B]System prompt length: 30K
      • num_groups: 50
      • system_prompt_len: 50000
      • question_len: 256
      • output_len: 1024
      • num_prompts_per_group: 50
    • [Llama-3.3-70B-Instruct]System prompt length: 50K
      • num_groups: 50
      • system_prompt_len: 50000
      • question_len: 256
      • output_len: 1024
      • num_prompts_per_group: 50
  • Memory Calculation:

  • Model: Qwen3/Qwen3-32B

    • The KVCache size for the Qwen/Qwen3-32B model is approximately 256KB per token.
    • With gpu_memory_utilization at 0.65, there are 8474 GPU blocks available per engine.
    • The available HBM for KVCache per engine is approximately 34.7 GB (8474 blocks * 4.096 MB/block).
    • The total available HBM for the KVCache across the entire system was 277 GB (8 engines * 34 GB/engine).
    • Total CPU RAM cache available across the system was 64*8 ~= 512GB.
    • Lustre capacity available for KV cache offloading: 18000GiB for total system.
  • Model: meta-llama/Llama-3.3-70B-Instruct

    • The KVCache size for the meta-llama/Llama-3.3-70B-Instruct model is approximately 320KB per token.
    • With gpu_memory_utilization at 0.65, there are 10768 GPU blocks available per engine.
    • The available HBM for KVCache per engine is approximately 55 GB (10768 blocks * 5.12 MB/block).
    • The total available HBM for the KVCache across the entire system was 220 GB (4 engines * 55 GB/engine).
    • Total CPU RAM cache available across the system was 64*4 ~= 256GB.
    • Lustre capacity available for KV cache offloading: 18000GiB for total system.
Key Findings

In this scenario, the total KV cache size significantly exceeds the combined capacity of local HBM and CPU RAM. The results demonstrate that as context length and memory demands increase, the performance benefits of offloading to Lustre become even more pronounced.

  • 30K system prompt length (KVCache size 653 GiB): While CPU RAM provides 256GB for KV cache offloading, adding Lustre significantly enhances performance compared to relying on CPU offloading alone.

  • 50K system prompt length (KVCache size 994 GiB): The performance difference with Lustre storage offloading is more prominent for larger models like llama-3.3-70B-Instruct .

30K system prompt length (KVCache size 653 GiB) — KV Cache > (HBM + CPU RAM)
KVCache > HBM + CPU RAMMean TTFT (second)P90 TTFT (second)Mean E2E Latency (second)P90 E2E Latency (second)Input Throughput (token per second)Output Throughput (token per second)Overall Throughput (token per second)ITL (second)
Baseline vLLM + CPU offloading2.245.1422.2126.627148836279840.021
vLLM + CPU offloading + Lustre1.38 (-38.4%)2.82 (-45.1%)20.45 (-7.9%)22.77 (-14.4%)28832 (+6.2%)828 (-1.0%)29661 (+6.0%)0.02 (-4.8%)

50K system prompt length (KVCache size 994 GiB) — KV Cache > (HBM + CPU RAM)
KVCache > HBM + CPU RAMMean TTFT (second)P90 TTFT (second)Mean E2E Latency (second)P90 E2E Latency (second)Input Throughput (token per second)Output Throughput (token per second)Overall Throughput (token per second)ITL (second)
Baseline vLLM + CPU offloading27.1141.7157.0672.2818333350186820.029
vLLM + CPU offloading + Lustre15.25 (-43.7%)24.71 (-40.8%)38.55 (-32.4%)48.01 (-33.6%)27091 (+47.8%)517 (+47.7%)27609 (+47.8%)0.022 (-24.1%)

Cleanup

To clean and remove applied deployments:

helm uninstall ${GUIDE_NAME} -n ${NAMESPACE}
kubectl delete -f guides/tiered-prefix-cache/storage/manifests/pvc.yaml -n ${NAMESPACE}
kubectl delete -n ${NAMESPACE} -k guides/tiered-prefix-cache/storage/modelserver/gpu/vllm/${CONNECTOR}
kubectl delete namespace ${NAMESPACE}