
Optimized Baseline


Overview

This guide deploys the recommended out-of-the-box scheduling configuration for most vLLM and SGLang deployments, reducing tail latency and increasing throughput through load-aware and prefix-cache-aware balancing.

The optimized baseline defaults to two main routing criteria: load-aware scoring (favoring less-loaded replicas) and prefix-cache-aware scoring (favoring replicas likely to already hold the prompt's prefix in KV cache).
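As an illustrative sketch only, the two criteria map to scorer plugins in the inference scheduler's (EPP) configuration. The plugin names and weights below are assumptions for illustration, not the contents of the shipped optimized-baseline.values.yaml:

```yaml
# illustrative EndpointPickerConfig-style sketch -- plugin names and weights
# are assumptions; consult optimized-baseline.values.yaml for the real config
plugins:
- type: queue-scorer          # load-aware: prefers replicas with shorter queues
- type: kv-cache-scorer       # load-aware: prefers replicas with free KV-cache
- type: prefix-cache-scorer   # prefix-cache-aware: prefers replicas with the prefix cached
- type: max-score-picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: queue-scorer
    weight: 1
  - pluginRef: kv-cache-scorer
    weight: 1
  - pluginRef: prefix-cache-scorer
    weight: 2
  - pluginRef: max-score-picker
```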

Default Configuration

Parameter            Value
Model                Qwen/Qwen3-32B
Replicas             8
Tensor Parallelism   2
GPUs per replica     2
Total GPUs           16

Supported Hardware Backends

This guide includes configurations for the following accelerators:

Backend              Directory                  Notes
NVIDIA GPU           modelserver/gpu/vllm/      Default configuration
NVIDIA GPU (SGLang)  modelserver/gpu/sglang/    SGLang inference server
AMD GPU              modelserver/amd/vllm/      AMD GPU
Intel XPU            modelserver/xpu/vllm/      Intel Data Center GPU Max 1550+
Intel Gaudi (HPU)    modelserver/hpu/vllm/      Gaudi 1/2/3 with DRA support
Google TPU v6e       modelserver/tpu-v6/vllm/   GKE TPU
Google TPU v7        modelserver/tpu-v7/vllm/   GKE TPU
CPU                  modelserver/cpu/vllm/      Intel/AMD, 64 cores + 64GB RAM per replica
Note

Some hardware variants use reduced configurations (fewer replicas, smaller models) to enable CI testing for compatibility and regression checks. These configurations are maintained by their respective hardware vendors and are not guaranteed as production-ready examples. Users deploying on non-default hardware should review and adjust the configurations for their environment.

Prerequisites

Installation Instructions

1. Prepare a Target Namespace

  • Create a target namespace for the installation.

    export NAMESPACE=llm-d-optimized-baseline
    kubectl create namespace ${NAMESPACE}

2. Deploy the Standalone Inference Scheduler

This deploys the inference scheduler with an Envoy sidecar.

helm install optimized-baseline \
oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
-f guides/recipes/scheduler/base.values.yaml \
-f guides/optimized-baseline/scheduler/optimized-baseline.values.yaml \
-n ${NAMESPACE} --version v1.4.0

3. Deploy the Model Server

Apply the Kustomize overlay for your specific backend (the default below is NVIDIA GPU / vLLM):

kubectl apply -n ${NAMESPACE} -k guides/optimized-baseline/modelserver/gpu/vllm/
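To deploy on one of the other backends from the Supported Hardware Backends table, point kustomize at the matching overlay directory. A small sketch of composing the path (the AMD choice here is just an example; NAMESPACE comes from step 1):

```shell
# pick a backend/engine pair from the Supported Hardware Backends table
backend=amd          # gpu, amd, xpu, hpu, tpu-v6, tpu-v7, or cpu
engine=vllm          # sglang is available for the gpu backend
overlay="guides/optimized-baseline/modelserver/${backend}/${engine}/"
echo "kubectl apply -n ${NAMESPACE:-llm-d-optimized-baseline} -k ${overlay}"
```

Run the echoed command (or drop the echo) once you have confirmed the overlay directory exists in your checkout.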

4. Enable Monitoring (Optional)

Note

GKE provides automatic application monitoring out of the box. The llm-d Monitoring stack is not required for GKE, but it is available if you prefer to use it.

kubectl apply -n ${NAMESPACE} -k guides/recipes/modelserver/components/monitoring

Verification

1. Port-Forward to the Scheduler Service

Expose the standalone inference scheduler service to your local environment:

kubectl port-forward -n ${NAMESPACE} svc/optimized-baseline-epp 8000:8081

2. Send Test Requests

In a separate terminal, verify model availability and inference:

List Available Models:

curl -s http://localhost:8000/v1/models | jq
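If you only want the model IDs, you can filter the response with jq. An offline sketch against a canned response in the OpenAI-style /v1/models shape (the live response carries more fields):

```shell
# canned /v1/models-style payload for illustration; the real server
# response includes additional fields per model entry
response='{"object":"list","data":[{"id":"Qwen/Qwen3-32B","object":"model"}]}'
echo "$response" | jq -r '.data[].id'
# prints: Qwen/Qwen3-32B
```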

Send a Completion Request:

curl -X POST http://localhost:8000/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3-32B",
"prompt": "How are you today?"
}' | jq
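To pull just the generated text out of a completion response, filter on .choices[0].text. An offline sketch with a canned OpenAI-style completions payload (the live response also carries id and usage fields):

```shell
# canned completions-style payload for illustration only
response='{"choices":[{"index":0,"text":"I am doing well, thank you."}]}'
echo "$response" | jq -r '.choices[0].text'
# prints: I am doing well, thank you.
```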

Benchmarking

The benchmark launches a pod (llmdbench-harness-launcher) that, in this case, runs inference-perf with a shared-prefix synthetic workload named shared_prefix_synthetic. The workload runs several stages at different request rates. Results are saved to a local folder via the -o flag of run_only.sh; each experiment is saved under the specified output folder, e.g., ./results/<experiment ID>/inference-perf_<experiment ID>_shared_prefix_synthetic_optimized-baseline_<model name>.

For more details, refer to the benchmark instructions doc.

1. Prepare the Benchmarking Suite

  • Download the benchmark script:

    curl -L -O https://raw.githubusercontent.com/llm-d/llm-d-benchmark/main/existing_stack/run_only.sh
    chmod u+x run_only.sh
  • Create a HuggingFace token for the benchmark harness.

2. Download the Workload Template

curl -LJO "https://raw.githubusercontent.com/llm-d/llm-d/main/guides/optimized-baseline/benchmark-templates/shared_prefix.yaml"

3. Execute Benchmark

export GATEWAY_SVC=optimized-baseline-epp
export PORT=8081
envsubst < shared_prefix.yaml > config.yaml
./run_only.sh -c config.yaml -o ./results

Cleanup

To remove the deployed components:

helm uninstall optimized-baseline -n ${NAMESPACE}
kubectl delete -n ${NAMESPACE} -k guides/optimized-baseline/modelserver/gpu/vllm/

Benchmarking Report

The benchmark ran on 16 H100 GPUs distributed across 8 model servers (2 H100s per server, TP=2).

There is a report for each stage.

Report for rate=60 from the above example:
metrics:
  latency:
    inter_token_latency:
      max: 0.3976375609636307
      mean: 0.06765722222528071
      min: 1.3881013728678226e-05
      p0p1: 1.722399512073025e-05
      p1: 0.00027551683422643626
      p5: 0.02622559448063839
      p10: 0.033432915166486055
      p25: 0.04734217074292246
      p50: 0.07592084849602543
      p75: 0.08339276927290484
      p90: 0.0940622523019556
      p95: 0.09673563879623544
      p99: 0.13096482709748672
      p99p9: 0.18361429275909982
      units: s/token
    normalized_time_per_output_token:
      max: 24.031401686001725
      mean: 0.15119099450472326
      min: 0.029169302775326988
      p0p1: 0.030635711364870543
      p1: 0.03316916608329783
      p5: 0.03686109928604165
      p10: 0.0422473103951594
      p25: 0.06722495797558614
      p50: 0.07227312453111687
      p75: 0.0776502936300094
      p90: 0.08589849215923934
      p95: 0.15161141803650466
      p99: 2.2160512474802
      p99p9: 3.599132445602329
      units: s/token
    request_latency:
      max: 85.97330250998493
      mean: 67.864936218041
      min: 29.08179486700101
      p0p1: 30.597063626140066
      p1: 32.82888973700406
      p5: 36.53580686951754
      p10: 41.68587793367915
      p25: 66.56756829548976
      p50: 71.62742416901165
      p75: 75.53078864999407
      p90: 82.8551616292796
      p95: 85.17766979286971
      p99: 85.8529812369059
      p99p9: 85.96677305092867
      units: s
    time_per_output_token:
      max: 0.08567342651402578
      mean: 0.06765722222528071
      min: 0.028917132598988246
      p0p1: 0.030438513501739303
      p1: 0.03267320581834996
      p5: 0.03637065519659664
      p10: 0.04149165656909463
      p25: 0.06637948430397955
      p50: 0.07139790143899155
      p75: 0.07530937768449075
      p90: 0.08259890788880875
      p95: 0.08494466238816095
      p99: 0.0856393391511339
      p99p9: 0.08567179985522212
      units: s/token
    time_to_first_token:
      max: 0.2749739610007964
      mean: 0.1203408618576747
      min: 0.04670933203306049
      p0p1: 0.05085431289958069
      p1: 0.0542934795509791
      p5: 0.06336988278490026
      p10: 0.07046441090060399
      p25: 0.08575929325888865
      p50: 0.1132554289943073
      p75: 0.1517725815065205
      p90: 0.18095784459728748
      p95: 0.19695026772387791
      p99: 0.22566659807867837
      p99p9: 0.25035182150500235
      units: s
  requests:
    failures: 0
    input_length:
      max: 7668.0
      mean: 7576.364
      min: 7487.0
      p0p1: 7490.992
      p1: 7512.0
      p5: 7531.0
      p10: 7541.9
      p25: 7556.0
      p50: 7577.0
      p75: 7594.0
      p90: 7611.0
      p95: 7624.0
      p99: 7646.0
      p99p9: 7665.006
      units: count
    output_length:
      max: 1999.0
      mean: 941.86
      min: 3.0
      p0p1: 20.0
      p1: 32.99
      p5: 500.2
      p10: 949.9
      p25: 992.0
      p50: 997.0
      p75: 1000.0
      p90: 1000.0
      p95: 1000.0
      p99: 1000.0
      p99p9: 1500.495
      units: count
    total: 1500
  throughput:
    output_tokens_per_sec: 13574.368209884744
    requests_per_sec: 14.41229929064271
    total_tokens_per_sec: 122767.19371273571
  time:
    duration: 24.984177332022227
scenario:
  load:
    args:
      api:
        headers: null
        streaming: true
        type: completion
      circuit_breakers: null
      data:
        input_distribution: null
        output_distribution: null
        path: null
        shared_prefix:
          enable_multi_turn_chat: false
          num_groups: 150
          num_prompts_per_group: 5
          output_len: 1000
          question_len: 1200
          system_prompt_len: 6000
        trace: null
        type: shared_prefix
      load:
        circuit_breakers: []
        interval: 1.0
        num_workers: 224
        request_timeout: null
        stages:
        - concurrency_level: null
          duration: 50
          num_requests: null
          rate: 15.0
        - concurrency_level: null
          duration: 20
          num_requests: null
          rate: 3.0
        - concurrency_level: null
          duration: 20
          num_requests: null
          rate: 10.0
        - concurrency_level: null
          duration: 20
          num_requests: null
          rate: 15.0
        - concurrency_level: null
          duration: 38
          num_requests: null
          rate: 20.0
        - concurrency_level: null
          duration: 34
          num_requests: null
          rate: 22.0
        - concurrency_level: null
          duration: 30
          num_requests: null
          rate: 25.0
        - concurrency_level: null
          duration: 25
          num_requests: null
          rate: 30.0
        - concurrency_level: null
          duration: 21
          num_requests: null
          rate: 35.0
        - concurrency_level: null
          duration: 38
          num_requests: null
          rate: 40.0
        - concurrency_level: null
          duration: 36
          num_requests: null
          rate: 43.0
        - concurrency_level: null
          duration: 33
          num_requests: null
          rate: 46.0
        - concurrency_level: null
          duration: 30
          num_requests: null
          rate: 49.0
        - concurrency_level: null
          duration: 29
          num_requests: null
          rate: 52.0
        - concurrency_level: null
          duration: 27
          num_requests: null
          rate: 55.0
        - concurrency_level: null
          duration: 26
          num_requests: null
          rate: 57.0
        - concurrency_level: null
          duration: 25
          num_requests: null
          rate: 60.0
        sweep: null
        trace: null
        type: poisson
        worker_max_concurrency: 100
        worker_max_tcp_connections: 2500
      metrics: null
      report:
        prometheus:
          per_stage: false
          summary: true
        request_lifecycle:
          per_request: true
          per_stage: true
          summary: true
      server:
        api_key: null
        base_url: http://infra-optimized-baseline-inference-gateway-istio.dpikus-intel-inf.svc.cluster.local:80
        ignore_eos: true
        model_name: Qwen/Qwen3-32B
        type: vllm
      storage:
        google_cloud_storage: null
        local_storage:
          path: /requests/inference-perf_1769435052_Shared_prefix_inf-scheduling-guide-Qwen3-32B
          report_file_prefix: null
        simple_storage_service: null
      tokenizer:
        pretrained_model_name_or_path: Qwen/Qwen3-32B
        token: null
        trust_remote_code: null
    metadata:
      stage: 2
    name: inference-perf
  model:
    name: unknown
version: "0.1"

Comparing llm-d scheduling to a plain Kubernetes Service

The following graphs illustrate the relationship between latency, throughput, and QPS, as generated by inference-perf --analyze. For benchmarking, we compared our results against a standard Kubernetes (k8s) Service endpoint that routes traffic directly to the vLLM pods.

[Graphs: Throughput vs QPS; Throughput vs Latency]

The following data captures the performance of the last stage, run at a fixed request rate of 60, compared against the plain k8s Service.

  • Throughput: Requests/sec +151.5%; Total tokens/sec +151.7%
  • Latency: TTFT (mean) -99.66%; E2E request latency (mean) -35.6%
  • Per-token speed: Inter-token latency (mean) -3.9%
Metric                    k8s (Mean)   llm-d (Mean)   Δ (llm-d - k8s)   Δ% vs k8s
Requests/sec              5.7306       14.4123        +8.6817           +151.5%
Input tokens/sec          43,417.86    109,192.83     +65,774.97        +151.5%
Output tokens/sec         5,362.16     13,574.37      +8,212.21         +153.2%
Total tokens/sec          48,780.02    122,767.19     +73,987.17        +151.7%
Request latency (s)       105.4133     67.8649        -37.5484          -35.6%
TTFT (s)                  34.9145      0.1203         -34.7942          -99.66%
Inter-token latency (ms)  70.42        67.66          -2.76             -3.9%
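The Δ% column is plain relative change, (llm-d - k8s) / k8s * 100. A quick awk check of the requests/sec row:

```shell
# recompute the requests/sec delta from the table's mean values
awk 'BEGIN {
  k8s = 5.7306; llmd = 14.4123
  printf "requests/sec: %+.1f%%\n", (llmd - k8s) / k8s * 100
}'
# prints: requests/sec: +151.5%
```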
Content Source

This content is automatically synced from guides/optimized-baseline/README.md on the main branch of the llm-d/llm-d repository.
