
Optimized Baseline


Overview

This guide deploys the recommended out-of-the-box configuration for most vLLM and SGLang deployments. It reduces tail latency and increases throughput through load-aware and prefix-cache-aware balancing.

The optimized-baseline defaults to two main routing criteria: load-aware balancing and prefix-cache-aware balancing.

Default Configuration

| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-32B |
| Replicas | 8 |
| Tensor Parallelism | 2 |
| GPUs per replica | 2 |
| Total GPUs | 16 |
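
The totals in this table follow from the replica count and the tensor-parallel degree. The sketch below reproduces the arithmetic so the GPU budget can be re-derived when resizing the deployment. It assumes each replica uses exactly one GPU per tensor-parallel rank, which matches the values above but may not hold for every backend; the variable names are illustrative and not read by the guide's manifests.

```shell
# Values from the Default Configuration table.
REPLICAS=8
TENSOR_PARALLELISM=2

# Assumption: one GPU per tensor-parallel rank in each replica.
GPUS_PER_REPLICA=${TENSOR_PARALLELISM}
TOTAL_GPUS=$((REPLICAS * GPUS_PER_REPLICA))

echo "GPUs per replica: ${GPUS_PER_REPLICA}"
echo "Total GPUs:       ${TOTAL_GPUS}"
```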

Supported Hardware Backends

This guide includes configurations for the following accelerators:

| Backend | Directory | Notes |
|---|---|---|
| NVIDIA GPU | modelserver/gpu/vllm/$(INFRA_PROVIDER)/ | Default configuration (INFRA_PROVIDER options: base, gke) |
| NVIDIA GPU (SGLang) | modelserver/gpu/sglang/$(INFRA_PROVIDER)/ | SGLang inference server (INFRA_PROVIDER options: base, gke) |
| AMD GPU | modelserver/amd/vllm/ | AMD GPU |
| AMD GPU (SGLang) | modelserver/amd/sglang | AMD GPU |
| Intel XPU | modelserver/xpu/vllm/ | Intel Data Center GPU Max 1550+ |
| Intel Gaudi (HPU) | modelserver/hpu/vllm/ | Gaudi 1/2/3 with DRA support |
| Google TPU v6e | modelserver/tpu-v6/vllm/ | GKE TPU |
| Google TPU v7 | modelserver/tpu-v7/vllm/ | GKE TPU |
| CPU | modelserver/cpu/vllm/ | Intel/AMD, 64 cores + 64GB RAM per replica |

Note: Some hardware variants use reduced configurations (fewer replicas, smaller models) to enable CI testing for compatibility and regression checks. These configurations are maintained by their respective hardware vendors and are not guaranteed as production-ready examples. Users deploying on non-default hardware should review and adjust the configurations for their environment.

Prerequisites

  • Have the proper client tools installed on your local system to use this guide.

  • Check out the llm-d repo:

    export branch="main" # branch, tag, or commit hash
    git clone https://github.com/llm-d/llm-d.git && cd llm-d && git checkout ${branch}
  • Set the following environment variables:

    export GAIE_VERSION=v1.5.0
    export GUIDE_NAME="optimized-baseline"
    export NAMESPACE=llm-d-optimized-baseline
  • Install the Gateway API Inference Extension CRDs:

    kubectl apply -k "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=${GAIE_VERSION}"
  • Create a target namespace for the installation:

    kubectl create namespace ${NAMESPACE}
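
Later steps fail in confusing ways if one of the variables above is unset, so a quick preflight check can help. The following is a minimal sketch, not part of the guide's tooling: `require_vars` is a hypothetical helper name, and the tool list is simply the set of commands this guide invokes.

```shell
# Same values as in the prerequisites above.
export GAIE_VERSION=v1.5.0
export GUIDE_NAME="optimized-baseline"
export NAMESPACE=llm-d-optimized-baseline

# Hypothetical helper: report every unset or empty variable, return non-zero if any.
require_vars() {
  rv_missing=0
  for rv_name in "$@"; do
    eval "rv_value=\${${rv_name}:-}"
    if [ -z "${rv_value}" ]; then
      echo "missing: ${rv_name}" >&2
      rv_missing=1
    fi
  done
  return "${rv_missing}"
}

require_vars GAIE_VERSION GUIDE_NAME NAMESPACE && echo "environment OK"

# Warn (without failing) about client tools used in this guide that are not on PATH.
for tool in git kubectl helm curl envsubst; do
  command -v "${tool}" >/dev/null 2>&1 || echo "warning: ${tool} not found on PATH" >&2
done
```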

Installation Instructions

1. Deploy the llm-d Router

Standalone Mode

This deploys the llm-d Router in Standalone Mode:

# Run from the root of the llm-d repo
helm install ${GUIDE_NAME} \
oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
-f guides/recipes/router/base.values.yaml \
-f guides/${GUIDE_NAME}/router/${GUIDE_NAME}.values.yaml \
-n ${NAMESPACE} --version ${GAIE_VERSION}

Gateway Mode

To use a Kubernetes Gateway managed proxy rather than the standalone version, follow these steps instead of applying the previous Helm chart:

  1. Deploy a Kubernetes Gateway named llm-d-inference-gateway by following one of the gateway guides.
  2. Deploy the llm-d router and an HTTPRoute that connects it to the Gateway as follows:
export PROVIDER_NAME=gke # options: none, gke, agentgateway, istio
helm install ${GUIDE_NAME} \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
-f guides/recipes/router/base.values.yaml \
-f guides/${GUIDE_NAME}/router/${GUIDE_NAME}.values.yaml \
--set provider.name=${PROVIDER_NAME} \
--set experimentalHttpRoute.enabled=true \
--set experimentalHttpRoute.inferenceGatewayName=llm-d-inference-gateway \
-n ${NAMESPACE} --version ${GAIE_VERSION}
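
A typo in PROVIDER_NAME is easy to miss until the chart is already installed. As a hedged sketch, the guard below checks the value against the options listed in the comment above (none, gke, agentgateway, istio) before running helm. `valid_provider` is a hypothetical helper, not something the chart provides.

```shell
# Hypothetical helper: accept only the provider names listed in this guide.
valid_provider() {
  case "$1" in
    none|gke|agentgateway|istio)
      return 0 ;;
    *)
      echo "unsupported PROVIDER_NAME: $1 (expected none, gke, agentgateway, or istio)" >&2
      return 1 ;;
  esac
}

PROVIDER_NAME="${PROVIDER_NAME:-gke}"
valid_provider "${PROVIDER_NAME}" && echo "provider OK: ${PROVIDER_NAME}"
```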

2. Deploy the Model Server

Apply the Kustomize overlays for your specific backend (defaulting to NVIDIA GPU / vLLM):

export INFRA_PROVIDER=base # base | gke
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/gpu/vllm/${INFRA_PROVIDER}/

Other Accelerators

# AMD GPU
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/amd/vllm/

# Intel XPU
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/xpu/vllm/

# Intel Gaudi (HPU)
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/hpu/vllm/

# Google TPU v6e
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/tpu-v6/vllm/

# Google TPU v7
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/tpu-v7/vllm/

# CPU
kubectl apply -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/cpu/vllm/
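
The overlay paths above follow the layout in the Supported Hardware Backends table, where only the NVIDIA GPU overlays take an INFRA_PROVIDER suffix. The sketch below maps an accelerator, an engine, and an optional provider to the matching directory. `overlay_path` is a hypothetical helper written for illustration, and it only knows the combinations listed in this guide.

```shell
GUIDE_NAME="${GUIDE_NAME:-optimized-baseline}"

# Hypothetical helper: overlay_path <accelerator> [engine] [infra_provider]
overlay_path() {
  op_accel="$1"
  op_engine="${2:-vllm}"
  op_infra="${3:-base}"
  case "${op_accel}/${op_engine}" in
    gpu/vllm|gpu/sglang)
      # Only the NVIDIA GPU overlays are split by INFRA_PROVIDER (base or gke).
      case "${op_infra}" in
        base|gke) ;;
        *) echo "unknown INFRA_PROVIDER: ${op_infra}" >&2; return 1 ;;
      esac
      echo "guides/${GUIDE_NAME}/modelserver/${op_accel}/${op_engine}/${op_infra}/" ;;
    amd/vllm|amd/sglang|xpu/vllm|hpu/vllm|tpu-v6/vllm|tpu-v7/vllm|cpu/vllm)
      echo "guides/${GUIDE_NAME}/modelserver/${op_accel}/${op_engine}/" ;;
    *)
      echo "no overlay listed for ${op_accel}/${op_engine}" >&2
      return 1 ;;
  esac
}

overlay_path gpu vllm gke
overlay_path tpu-v6
```

The printed path can then be passed to `kubectl apply -n ${NAMESPACE} -k`, exactly as in the commands above.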

3. (Optional) Enable monitoring

Note: GKE provides automatic application monitoring out of the box. The llm-d Monitoring stack is not required for GKE, but it is available if you prefer to use it.

kubectl apply -n ${NAMESPACE} -k guides/recipes/modelserver/components/monitoring
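
Model servers can take a while to pull images and load weights, so it may help to wait until the pods report Ready before moving on to verification. This is a minimal sketch under stated assumptions: the 900s timeout is an arbitrary illustrative value, `run` and `wait_for_pods` are hypothetical helpers, and the script only prints the command unless DRY_RUN=0 is set.

```shell
NAMESPACE="${NAMESPACE:-llm-d-optimized-baseline}"
TIMEOUT="${TIMEOUT:-900s}"   # assumption: adjust for your model size and cluster

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Block until every pod in the namespace is Ready, or the timeout expires.
wait_for_pods() {
  run kubectl wait --for=condition=Ready pod --all -n "${NAMESPACE}" --timeout="${TIMEOUT}"
}

wait_for_pods
```

Run it with `DRY_RUN=0` to wait against the cluster for real.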

Verification

1. Get the IP of the Proxy

Standalone Mode

export IP=$(kubectl get service ${GUIDE_NAME}-epp -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')

Gateway Mode

export IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')

2. Send Test Requests

Open a temporary interactive shell inside the cluster:

kubectl run curl-debug --rm -it \
--image=cfmanteiga/alpine-bash-curl-jq \
--env="IP=$IP" \
--env="NAMESPACE=$NAMESPACE" \
-- /bin/bash

Send a completion request:

curl -X POST http://${IP}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3-32B",
"prompt": "How are you today?"
}' | jq

Benchmarking

The benchmark launches a pod (llmdbench-harness-launcher) that, in this case, uses inference-perf with a shared-prefix synthetic workload named shared_prefix_synthetic. This workload runs several stages at different request rates. Results are saved to the local folder given by the -o flag of run_only.sh. Each experiment is saved under that folder, e.g., ./results/<experiment ID>/inference-perf_<experiment ID>_optimized-baseline_<model name>.

For more details, refer to the benchmark instructions doc.

1. Prepare the Benchmarking Suite

  • Download the benchmark script:

    curl -L -O https://raw.githubusercontent.com/llm-d/llm-d-benchmark/main/existing_stack/run_only.sh
    chmod u+x run_only.sh
  • Create a Hugging Face token.

2. Download the Workload Template

curl -LJO "https://raw.githubusercontent.com/llm-d/llm-d/main/guides/${GUIDE_NAME}/benchmark-templates/guide.yaml"

3. Execute Benchmark

Set IP for your deployment mode, then render the workload template and run the benchmark.

Standalone Mode

export IP=$(kubectl get service ${GUIDE_NAME}-epp -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')

Gateway Mode

export IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')

Both modes

envsubst < guide.yaml > config.yaml
./run_only.sh -c config.yaml -o ./results

Cleanup

To remove the deployed components:

helm uninstall ${GUIDE_NAME} -n ${NAMESPACE}
kubectl delete -n ${NAMESPACE} -k guides/${GUIDE_NAME}/modelserver/gpu/vllm/${INFRA_PROVIDER}
kubectl delete namespace ${NAMESPACE}

Benchmarking Report

The benchmark runs on 16 × H100 GPUs, distributed across 8 model servers (2 H100s per server with TP=2).

Comparing llm-d Routing to a Simple Kubernetes Service

The graphs below compare optimized-baseline routing to a stock Kubernetes Service that round-robins requests across the same 8 vLLM pods (no EPP, no scoring).

[Figures: Throughput vs QPS; Latency vs QPS; TTFT p90 vs QPS]

Summary across the full ladder (rates 3 → 60):

| Metric | k8s service (RR) | llm-d Optimized | Δ% vs k8s |
|---|---|---|---|
| Output tokens/sec | 5,722 | 13,163 | +130.0% |
| Requests/sec | 35.87 | 36.38 | +1.4% |
| TTFT mean (s) | 58.10 | 0.156 | −99.73% |
| TTFT p90 (s) | 107.43 | 0.206 | −99.81% |
| ITL mean (ms) | 44.0 | 47.0 | +6.8% |
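
The Δ% column is the relative change of the llm-d value against the k8s baseline, (llm-d − k8s) / k8s × 100. The sketch below recomputes it from the table values with awk. `pct_change` is a hypothetical helper, and its third argument (number of decimal places) is an illustrative convenience.

```shell
# Hypothetical helper: pct_change <baseline> <new> [decimals]
pct_change() {
  awk -v old="$1" -v cur="$2" -v digits="${3:-1}" 'BEGIN {
    fmt = "%+." digits "f%%\n"
    printf fmt, (cur - old) / old * 100
  }'
}

pct_change 5722 13163      # output tokens/sec -> +130.0%
pct_change 44.0 47.0       # ITL mean (ms)     -> +6.8%
pct_change 58.10 0.156 2   # TTFT mean (s)     -> -99.73%
```
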
Per-rate breakdown across the full ladder:

Output tokens/sec — higher is better; TTFT in seconds — lower is better.

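| Rate | k8s Output | llm-d Output | k8s TTFT mean | llm-d TTFT mean | k8s TTFT p90 | llm-d TTFT p90 |
|---|---|---|---|---|---|---|
| 3 | 1,797 | 1,777 | 0.415 | 0.133 | 0.522 | 0.162 |
| 10 | 4,215 | 5,066 | 0.630 | 0.125 | 1.014 | 0.172 |
| 15 | 5,381 | 7,053 | 0.881 | 0.122 | 1.593 | 0.187 |
| 20 | 6,205 | 11,688 | 18.103 | 0.174 | 35.344 | 0.283 |
| 22 | 5,517 | 12,436 | 20.171 | 0.116 | 39.436 | 0.148 |
| 25 | 5,965 | 12,501 | 21.842 | 0.116 | 42.813 | 0.146 |
| 30 | 5,702 | 13,862 | 24.597 | 0.117 | 46.036 | 0.148 |
| 35 | 5,890 | 14,026 | 24.162 | 0.117 | 45.190 | 0.150 |
| 40 | 6,336 | 16,041 | 68.673 | 0.153 | 126.238 | 0.216 |
| 43 | 6,588 | 16,339 | 72.429 | 0.254 | 130.275 | 0.218 |
| 46 | 6,459 | 16,665 | 70.084 | 0.154 | 129.810 | 0.220 |
| 49 | 6,265 | 16,126 | 70.659 | 0.151 | 133.718 | 0.209 |
| 52 | 6,303 | 16,474 | 74.326 | 0.152 | 134.981 | 0.219 |
| 55 | 6,290 | 16,854 | 72.564 | 0.153 | 134.034 | 0.215 |
| 57 | 6,089 | 16,641 | 72.329 | 0.153 | 135.023 | 0.217 |
| 60 | 6,551 | 17,064 | 75.586 | 0.154 | 138.663 | 0.217 |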
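
As a sanity check on the per-rate numbers, the unweighted mean of the output-throughput columns over the 16 rates comes out to the same figures as the Output tokens/sec row of the summary (5,722 and 13,163). This is an observation about the published numbers, not a statement of how the report computed them; the TTFT summary rows do not follow the same simple average. The sketch below recomputes the means with awk, with the data transcribed from the table (thousands separators removed).

```shell
# Columns: rate, k8s output tokens/sec, llm-d output tokens/sec.
PER_RATE_OUTPUT='3 1797 1777
10 4215 5066
15 5381 7053
20 6205 11688
22 5517 12436
25 5965 12501
30 5702 13862
35 5890 14026
40 6336 16041
43 6588 16339
46 6459 16665
49 6265 16126
52 6303 16474
55 6290 16854
57 6089 16641
60 6551 17064'

# Unweighted mean of each output column across all rates.
MEANS=$(printf '%s\n' "${PER_RATE_OUTPUT}" | awk '
  { k8s += $2; llmd += $3 }
  END { printf "k8s=%.0f llm-d=%.0f", k8s / NR, llmd / NR }')

echo "${MEANS}"   # k8s=5722 llm-d=13163
```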