Optimized Baseline

Overview

This guide deploys the recommended out-of-the-box configuration for most vLLM and SGLang deployments. It reduces tail latency and increases throughput through load-aware and prefix-cache-aware request balancing.

The optimized baseline applies two main routing criteria by default:

Prefix-cache-aware routing: the prefix cache scorer scores candidate endpoints by estimating how much of the prompt prefix each model server can reuse from its cache.
Load-aware routing: uses both the KV-cache utilization scorer and the queue size scorer.
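
One way to picture how these criteria combine is a weighted sum per endpoint, with the highest total winning. The sketch below is illustrative only: the function name `pick_endpoint`, the weights, the queue cap, and the input columns are assumptions made for this example, not the router's actual implementation or defaults.

```shell
# Illustrative only: weights, queue cap, and column layout are assumptions.
# Input columns: <endpoint> <prefix-hit ratio 0..1> <kv-cache utilization 0..1> <queue depth>
pick_endpoint() {
  awk -v w_prefix=2 -v w_kv=1 -v w_queue=1 -v max_queue=10 '
    {
      q = $4 + 0
      if (q > max_queue + 0) q = max_queue + 0   # cap the queue depth
      # More cache reuse, lower KV-cache utilization, and a shorter queue all raise the score.
      score = w_prefix * $2 + w_kv * (1 - $3) + w_queue * (1 - q / max_queue)
      if (NR == 1 || score > best) { best = score; name = $1 }
    }
    END { print name }
  '
}

# pod-c wins: same cache reuse as pod-a, but far less load.
printf '%s\n' 'pod-a 0.9 0.80 4' 'pod-b 0.1 0.20 0' 'pod-c 0.9 0.30 1' | pick_endpoint
```

With these made-up numbers the function prints `pod-c`. The actual scorer weights live in the router values files referenced later in this guide.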

Default Configuration

Parameter	Value
Model	Qwen/Qwen3-32B
Replicas	8
Tensor Parallelism	2
GPUs per replica	2
Total GPUs	16
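
The GPU total in the table follows from the replica count and the tensor-parallel degree. A minimal sketch of that arithmetic, assuming each replica uses exactly one GPU per tensor-parallel rank (the variable names are chosen for this example only):

```shell
# Values from the table above; variable names are illustrative.
REPLICAS=8
TENSOR_PARALLELISM=2
GPUS_PER_REPLICA=$TENSOR_PARALLELISM          # assumption: one GPU per TP rank
TOTAL_GPUS=$(( REPLICAS * GPUS_PER_REPLICA ))
echo "Total GPUs required: ${TOTAL_GPUS}"     # prints: Total GPUs required: 16
```

Changing the replica count or tensor parallelism in an overlay changes the cluster capacity you need accordingly.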

Supported Hardware Backends

This guide includes configurations for the following accelerators:

Backend	Directory	Notes
NVIDIA GPU	`modelserver/gpu/vllm/$(INFRA_PROVIDER)/`	Default configuration (`INFRA_PROVIDER` options: `base`, `gke`)
NVIDIA GPU (SGLang)	`modelserver/gpu/sglang/$(INFRA_PROVIDER)/`	SGLang inference server (`INFRA_PROVIDER` options: `base`, `gke`)
AMD GPU	`modelserver/amd/vllm/`	AMD GPU
AMD GPU (SGLang)	`modelserver/amd/sglang/`	AMD GPU
Intel XPU	`modelserver/xpu/vllm/`	Intel Data Center GPU Max 1550+
Intel Gaudi (HPU)	`modelserver/hpu/vllm/`	Gaudi 1/2/3 with DRA support
Google TPU v6e	`modelserver/tpu-v6/vllm/`	GKE TPU
Google TPU v7	`modelserver/tpu-v7/vllm/`	GKE TPU
CPU	`modelserver/cpu/vllm/`	Intel/AMD, 64 cores + 64GB RAM per replica

Note: Some hardware variants use reduced configurations (fewer replicas, smaller models) to enable CI testing for compatibility and regression checks. These configurations are maintained by their respective hardware vendors and are not guaranteed to be production-ready examples. Users deploying on non-default hardware should review and adjust the configurations for their environment.

Prerequisites

Install the required client tools on your local system before following this guide.
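
A quick way to confirm the tools are on your PATH is sketched below. The helper name `check_tools` is hypothetical, and the tool list is simply the set of commands invoked later in this guide; adjust it to your setup.

```shell
# Report any command that is not on PATH; returns non-zero if something is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Commands used by the steps in this guide.
check_tools git kubectl helm curl envsubst \
  || echo "Install the missing tools before continuing." >&2
```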

Check out the llm-d repository:

  export branch="main" # branch, tag, or commit hash
  git clone https://github.com/llm-d/llm-d.git && cd llm-d && git checkout ${branch}

Set the following environment variables:

  export GAIE_VERSION=v1.5.0
  export GUIDE_NAME="optimized-baseline"
  export NAMESPACE=llm-d-optimized-baseline

Install the Gateway API Inference Extension CRDs:

  kubectl apply -k "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=${GAIE_VERSION}"

Create a target namespace for the installation:

  kubectl create namespace ${NAMESPACE}

Installation Instructions

1. Deploy the llm-d Router

Standalone Mode

This deploys the llm-d Router in Standalone Mode:

# Run from the root of the llm-d repo
helm install ${GUIDE_NAME} \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
    -f guides/recipes/router/base.values.yaml \
    -f guides/${GUIDE_NAME}/router/${GUIDE_NAME}.values.yaml \
    -n ${NAMESPACE} --version ${GAIE_VERSION}

Gateway Mode

To use a Kubernetes Gateway managed proxy rather than the standalone version, follow these steps instead of applying the previous Helm chart:

Deploy a Kubernetes Gateway named llm-d-inference-gateway by following one of the gateway guides.
Deploy the llm-d router and an HTTPRoute that connects it to the Gateway as follows:

export PROVIDER_NAME=gke # options: none, gke, agentgateway, istio
helm install ${GUIDE_NAME} \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool  \
    -f guides/recipes/router/base.values.yaml \
    -f guides/${GUIDE_NAME}/router/${GUIDE_NAME}.values.yaml \
    --set provider.name=${PROVIDER_NAME} \
    --set experimentalHttpRoute.enabled=true \
    --set experimentalHttpRoute.inferenceGatewayName=llm-d-inference-gateway \
    -n ${NAMESPACE} --version ${GAIE_VERSION}

2. Deploy the Model Server

Apply the Kustomize overlays for your specific backend (defaulting to NVIDIA GPU / vLLM):

export INFRA_PROVIDER=base # base | gke
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/gpu/vllm/${INFRA_PROVIDER}/

Other Accelerators

# AMD GPU
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/amd/vllm/

# Intel XPU
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/xpu/vllm/

# Intel Gaudi (HPU)
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/hpu/vllm/

# Google TPU v6e
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/tpu-v6/vllm/

# Google TPU v7
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/tpu-v7/vllm/

# CPU
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/cpu/vllm/
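
After applying an overlay, it can help to block until the pods are Ready before moving on to verification. The sketch below wraps `kubectl wait`; the helper name `wait_for_pods` and the 30-minute default timeout are assumptions for this example, since model download and load times vary by environment.

```shell
# Sketch: block until every pod in the namespace reports Ready.
# The 30m default timeout is an assumption; tune it for your model and cluster.
wait_for_pods() {
  ns="$1"
  timeout="${2:-30m}"
  kubectl wait --for=condition=Ready pod --all -n "$ns" --timeout="$timeout"
}

# Usage (requires cluster access):
#   wait_for_pods "${NAMESPACE}"
```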

3. (Optional) Enable monitoring

Note: GKE provides automatic application monitoring out of the box. The llm-d Monitoring stack is not required for GKE, but it is available if you prefer to use it.

Install the Monitoring stack.
Deploy the monitoring resources for this guide.

kubectl apply -n ${NAMESPACE} -k guides/recipes/modelserver/components/monitoring

Verification

1. Get the IP of the Proxy

Standalone Mode

export IP=$(kubectl get service ${GUIDE_NAME}-epp -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')

Gateway Mode

export IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')
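
Either command can leave `IP` empty, for example when the Gateway has not been assigned an address yet. A small guard such as the following (the helper name `require_ip` is hypothetical) catches that before any requests are sent:

```shell
# Fail early if the address lookup returned nothing.
require_ip() {
  if [ -z "${1:-}" ]; then
    echo "IP is empty: the Service or Gateway may not have an address yet." >&2
    return 1
  fi
  echo "Using endpoint http://${1}"
}

# Usage:
#   require_ip "${IP}"
```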

2. Send Test Requests

Open a temporary interactive shell inside the cluster:

kubectl run curl-debug --rm -it \
    --image=cfmanteiga/alpine-bash-curl-jq \
    --env="IP=$IP" \
    --env="NAMESPACE=$NAMESPACE" \
    -- /bin/bash

Send a completion request:

curl -X POST http://${IP}/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "Qwen/Qwen3-32B",
        "prompt": "How are you today?"
    }' | jq

Benchmarking

The benchmark launches a pod (llmdbench-harness-launcher) that, in this case, runs inference-perf with a synthetic shared-prefix workload named shared_prefix_synthetic. The workload runs several stages at different rates. Results are saved to the local folder given by the -o flag of run_only.sh, with each experiment stored in its own subfolder, e.g., ./results/<experiment ID>/inference-perf_<experiment ID>_optimized-baseline_<model name>.

For more details, refer to the benchmark instructions doc.

1. Prepare the Benchmarking Suite

Download the benchmark script:

curl -L -O https://raw.githubusercontent.com/llm-d/llm-d-benchmark/main/existing_stack/run_only.sh
chmod u+x run_only.sh

Create a Hugging Face token.

2. Download the Workload Template

curl -LJO "https://raw.githubusercontent.com/llm-d/llm-d/main/guides/${GUIDE_NAME}/benchmark-templates/guide.yaml"
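
The template is rendered with `envsubst` in the next step, and `envsubst` replaces any variable missing from the environment with an empty string. The sketch below lists variables that a template references but that are not exported with a non-empty value. The helper name `unset_template_vars` is hypothetical, and the pattern match is approximate: it treats every `$NAME` or `${NAME}` in the file as a variable reference.

```shell
# Print each $VAR or ${VAR} referenced in a template that is not exported
# with a non-empty value. Returns non-zero if any are missing.
unset_template_vars() {
  status=0
  for name in $(grep -oE '\$\{?[A-Za-z_][A-Za-z0-9_]*' "$1" | tr -d '${' | sort -u); do
    # envsubst only sees exported variables, so check the environment.
    if ! env | grep -q "^${name}=."; then
      echo "unset: $name" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage, once IP has been exported in the next step:
#   unset_template_vars guide.yaml && envsubst < guide.yaml > config.yaml
```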

3. Execute Benchmark

export IP=$(kubectl get service ${GUIDE_NAME}-epp  -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')

In Gateway Mode, use the Gateway address instead:

export IP=$(kubectl get gateway llm-d-inference-gateway  -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')

envsubst < guide.yaml > config.yaml
./run_only.sh -c config.yaml -o ./results
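
Once a run has finished, the per-experiment folders can be listed with `find`. This is a minimal sketch that assumes the results layout described earlier in this section (`<output>/<experiment ID>/inference-perf_...`); the helper name `list_results` is hypothetical.

```shell
# List per-experiment result directories under an output folder.
# Assumes the layout <output>/<experiment ID>/inference-perf_*.
list_results() {
  find "${1:-./results}" -mindepth 2 -maxdepth 2 -type d -name 'inference-perf_*' | sort
}

# Usage:
#   list_results ./results
```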

Cleanup

To remove the deployed components:

helm uninstall ${GUIDE_NAME} -n ${NAMESPACE}
kubectl delete -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/gpu/vllm/${INFRA_PROVIDER}/
kubectl delete namespace ${NAMESPACE}

Benchmarking Report

The benchmark runs on 16 × H100 GPUs, distributed across 8 model servers (2 H100s per server with TP=2).

Comparing llm-d Routing to a Simple Kubernetes Service

The results below compare optimized-baseline routing to a stock Kubernetes Service that round-robins requests across the same 8 vLLM pods (no EPP, no scoring).

Summary across the full ladder (rates 3 → 60):

Metric	k8s service (RR)	llm-d Optimized	Δ% vs k8s
Output tokens/sec	5,722	13,163	+130.0%
Requests/sec	35.87	36.38	+1.4%
TTFT mean (s)	58.10	0.156	−99.73%
TTFT p90 (s)	107.43	0.206	−99.81%
ITL mean (ms)	44.0	47.0	+6.8%
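
The Δ% column is the relative change from the k8s service value to the llm-d value. A small helper can reproduce it from the table; the name `pct_change` is chosen for this example, and it prints two decimals where the table rounds some rows to one.

```shell
# Percentage change from a baseline value to a new value, signed, two decimals.
pct_change() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%+.2f\n", (cur - base) / base * 100 }'
}

pct_change 5722 13163    # output tokens/sec: +130.04
pct_change 58.10 0.156   # TTFT mean (s):     -99.73
pct_change 44.0 47.0     # ITL mean (ms):     +6.82
```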

Per-rate breakdown across the full ladder:

Output tokens/sec — higher is better; TTFT in seconds — lower is better.

Rate	k8s Output	llm-d Output	k8s TTFT mean	llm-d TTFT mean	k8s TTFT p90	llm-d TTFT p90
3	1,797	1,777	0.415	0.133	0.522	0.162
10	4,215	5,066	0.630	0.125	1.014	0.172
15	5,381	7,053	0.881	0.122	1.593	0.187
20	6,205	11,688	18.103	0.174	35.344	0.283
22	5,517	12,436	20.171	0.116	39.436	0.148
25	5,965	12,501	21.842	0.116	42.813	0.146
30	5,702	13,862	24.597	0.117	46.036	0.148
35	5,890	14,026	24.162	0.117	45.190	0.150
40	6,336	16,041	68.673	0.153	126.238	0.216
43	6,588	16,339	72.429	0.254	130.275	0.218
46	6,459	16,665	70.084	0.154	129.810	0.220
49	6,265	16,126	70.659	0.151	133.718	0.209
52	6,303	16,474	74.326	0.152	134.981	0.219
55	6,290	16,854	72.564	0.153	134.034	0.215
57	6,089	16,641	72.329	0.153	135.023	0.217
60	6,551	17,064	75.586	0.154	138.663	0.217
