# Optimized Baseline

## Overview
This guide deploys the recommended out-of-the-box scheduling configuration for most vLLM and SGLang deployments, reducing tail latency and increasing throughput through load-aware and prefix-cache-aware balancing.

The optimized baseline defaults to two main routing criteria:

- Prefix-cache aware: uses the prefix cache scorer, which scores candidate endpoints by estimating prompt prefix cache reuse on each model server.
- Load aware: uses both the kv-cache utilization and the queue size scorers.
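The exact scorers and weights used by this guide are defined in the scheduler values files applied in step 2 of the installation below. If you want to review or tune them before deploying, you can inspect those files from a checkout of the llm-d repo (see Prerequisites), for example:

```bash
# Review the scheduler configuration this guide installs.
# Run from the root of your llm-d checkout (see Prerequisites).
cat guides/recipes/scheduler/base.values.yaml
cat guides/optimized-baseline/scheduler/optimized-baseline.values.yaml
```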
## Default Configuration
| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-32B |
| Replicas | 8 |
| Tensor Parallelism | 2 |
| GPUs per replica | 2 |
| Total GPUs | 16 |
## Supported Hardware Backends
This guide includes configurations for the following accelerators:
| Backend | Directory | Notes |
|---|---|---|
| NVIDIA GPU | modelserver/gpu/vllm/ | Default configuration |
| NVIDIA GPU (SGLang) | modelserver/gpu/sglang/ | SGLang inference server |
| AMD GPU | modelserver/amd/vllm/ | AMD GPU |
| Intel XPU | modelserver/xpu/vllm/ | Intel Data Center GPU Max 1550+ |
| Intel Gaudi (HPU) | modelserver/hpu/vllm/ | Gaudi 1/2/3 with DRA support |
| Google TPU v6e | modelserver/tpu-v6/vllm/ | GKE TPU |
| Google TPU v7 | modelserver/tpu-v7/vllm/ | GKE TPU |
| CPU | modelserver/cpu/vllm/ | Intel/AMD, 64 cores + 64GB RAM per replica |
Some hardware variants use reduced configurations (fewer replicas, smaller models) to enable CI testing for compatibility and regression checks. These configurations are maintained by their respective hardware vendors and are not guaranteed to be production-ready examples. Users deploying on non-default hardware should review and adjust the configurations for their environment.
## Prerequisites
- Install the Gateway API Inference Extension CRDs.
- Have the proper client tools installed on your local system to use this guide (a quick sanity check is sketched after this list).
- Check out the llm-d repo:

  ```bash
  export branch="main" # branch, tag, or commit hash
  git clone https://github.com/llm-d/llm-d.git && cd llm-d && git checkout ${branch}
  ```
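The exact tool set depends on your workflow, but the commands in this guide invoke kubectl, helm, git, curl, jq, and envsubst. A minimal sketch to confirm they are on your PATH (the list is inferred from the commands used below, not an official requirements list):

```bash
# Quick sanity check for the client tools invoked by this guide.
# Adjust the list for your environment; tool versions are not checked.
for tool in kubectl helm git curl jq envsubst; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```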
## Installation Instructions

### 1. Prepare a Target Namespace

Create a target namespace for the installation:

```bash
export NAMESPACE=llm-d-optimized-baseline
kubectl create namespace ${NAMESPACE}
```
### 2. Deploy the Standalone Inference Scheduler

This deploys the inference scheduler with an Envoy sidecar:

```bash
helm install optimized-baseline \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
  -f guides/recipes/scheduler/base.values.yaml \
  -f guides/optimized-baseline/scheduler/optimized-baseline.values.yaml \
  -n ${NAMESPACE} --version v1.4.0
```
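Before continuing, it can help to confirm the release installed cleanly and the scheduler pod is starting. A minimal check (exact pod names and labels depend on the chart, so treat the output as illustrative):

```bash
# Confirm the Helm release and check for the scheduler pod.
helm status optimized-baseline -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE}
```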
### 3. Deploy the Model Server

Apply the Kustomize overlays for your specific backend (defaulting to NVIDIA GPU / vLLM):

```bash
kubectl apply -n ${NAMESPACE} -k guides/optimized-baseline/modelserver/gpu/vllm/
```
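Model server pods can take a while to become ready, since pulling the image and loading Qwen/Qwen3-32B are not instantaneous. A simple way to watch them come up (press Ctrl+C to stop watching):

```bash
# Watch the model server pods until all replicas report Ready.
kubectl get pods -n ${NAMESPACE} -w
```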
### 4. Enable monitoring (optional)

GKE provides automatic application monitoring out of the box. The llm-d Monitoring stack is not required for GKE, but it is available if you prefer to use it.

- Install the Monitoring stack.
- Deploy the monitoring resources for this guide:

```bash
kubectl apply -n ${NAMESPACE} -k guides/recipes/modelserver/components/monitoring
```
## Verification

### 1. Port-Forward to the Scheduler Service

Expose the standalone inference scheduler service to your local environment:

```bash
kubectl port-forward -n ${NAMESPACE} svc/optimized-baseline-epp 8000:8081
```
### 2. Send Test Requests

In a separate terminal, verify model availability and inference.

List Available Models:

```bash
curl -s http://localhost:8000/v1/models | jq
```

Send a Completion Request:

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "prompt": "How are you today?"
  }' | jq
```
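The endpoint is OpenAI-compatible, so standard request options such as `max_tokens` and `stream` also work here. As an illustration (not something specific to this guide), a streaming request with a bounded output length could look like:

```bash
# Streaming completion with a capped number of output tokens.
curl -N -X POST http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "prompt": "How are you today?",
    "max_tokens": 64,
    "stream": true
  }'
```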
## Benchmarking

The benchmark launches a pod (`llmdbench-harness-launcher`) that, in this case, uses inference-perf with a shared-prefix synthetic workload named `shared_prefix_synthetic`. The workload runs several stages at different request rates. Results are saved to a local folder specified with the `-o` flag of `run_only.sh`; each experiment is stored under that output folder, e.g., `./results/<experiment ID>/inference-perf_<experiment ID>_shared_prefix_synthetic_optimized-baseline_<model name>`.

For more details, refer to the benchmark instructions doc.
### 1. Prepare the Benchmarking Suite

Download the benchmark script:

```bash
curl -L -O https://raw.githubusercontent.com/llm-d/llm-d-benchmark/main/existing_stack/run_only.sh
chmod u+x run_only.sh
```
### 2. Download the Workload Template

```bash
curl -LJO "https://raw.githubusercontent.com/llm-d/llm-d/main/guides/optimized-baseline/benchmark-templates/shared_prefix.yaml"
```
### 3. Execute Benchmark

```bash
export GATEWAY_SVC=optimized-baseline-epp
export PORT=8081
envsubst < shared_prefix.yaml > config.yaml
./run_only.sh -c config.yaml -o ./results
```
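Once the run finishes, the results are written under the folder passed with `-o`, following the naming pattern described at the top of this section. To see what was produced:

```bash
# List the per-experiment result folders produced by run_only.sh.
ls -R ./results/
```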
## Cleanup

To remove the deployed components:

```bash
helm uninstall optimized-baseline -n ${NAMESPACE}
kubectl delete -n ${NAMESPACE} -k guides/optimized-baseline/modelserver/gpu/vllm/
```
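If the namespace was created only for this guide, it can be removed as well. Note that this deletes everything remaining in the namespace, so only do it if nothing else runs there:

```bash
# Optional: remove the namespace created during installation.
kubectl delete namespace ${NAMESPACE}
```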
## Benchmarking Report

The benchmark runs on 16 H100 GPUs, distributed across 8 model servers (2 H100s per server with TP=2).

There is a report for each stage. The report below is for the rate=60 stage from the example above:
```yaml
metrics:
latency:
inter_token_latency:
max: 0.3976375609636307
mean: 0.06765722222528071
min: 1.3881013728678226e-05
p0p1: 1.722399512073025e-05
p1: 0.00027551683422643626
p5: 0.02622559448063839
p10: 0.033432915166486055
p25: 0.04734217074292246
p50: 0.07592084849602543
p75: 0.08339276927290484
p90: 0.0940622523019556
p95: 0.09673563879623544
p99: 0.13096482709748672
p99p9: 0.18361429275909982
units: s/token
normalized_time_per_output_token:
max: 24.031401686001725
mean: 0.15119099450472326
min: 0.029169302775326988
p0p1: 0.030635711364870543
p1: 0.03316916608329783
p5: 0.03686109928604165
p10: 0.0422473103951594
p25: 0.06722495797558614
p50: 0.07227312453111687
p75: 0.0776502936300094
p90: 0.08589849215923934
p95: 0.15161141803650466
p99: 2.2160512474802
p99p9: 3.599132445602329
units: s/token
request_latency:
max: 85.97330250998493
mean: 67.864936218041
min: 29.08179486700101
p0p1: 30.597063626140066
p1: 32.82888973700406
p5: 36.53580686951754
p10: 41.68587793367915
p25: 66.56756829548976
p50: 71.62742416901165
p75: 75.53078864999407
p90: 82.8551616292796
p95: 85.17766979286971
p99: 85.8529812369059
p99p9: 85.96677305092867
units: s
time_per_output_token:
max: 0.08567342651402578
mean: 0.06765722222528071
min: 0.028917132598988246
p0p1: 0.030438513501739303
p1: 0.03267320581834996
p5: 0.03637065519659664
p10: 0.04149165656909463
p25: 0.06637948430397955
p50: 0.07139790143899155
p75: 0.07530937768449075
p90: 0.08259890788880875
p95: 0.08494466238816095
p99: 0.0856393391511339
p99p9: 0.08567179985522212
units: s/token
time_to_first_token:
max: 0.2749739610007964
mean: 0.1203408618576747
min: 0.04670933203306049
p0p1: 0.05085431289958069
p1: 0.0542934795509791
p5: 0.06336988278490026
p10: 0.07046441090060399
p25: 0.08575929325888865
p50: 0.1132554289943073
p75: 0.1517725815065205
p90: 0.18095784459728748
p95: 0.19695026772387791
p99: 0.22566659807867837
p99p9: 0.25035182150500235
units: s
requests:
failures: 0
input_length:
max: 7668.0
mean: 7576.364
min: 7487.0
p0p1: 7490.992
p1: 7512.0
p5: 7531.0
p10: 7541.9
p25: 7556.0
p50: 7577.0
p75: 7594.0
p90: 7611.0
p95: 7624.0
p99: 7646.0
p99p9: 7665.006
units: count
output_length:
max: 1999.0
mean: 941.86
min: 3.0
p0p1: 20.0
p1: 32.99
p5: 500.2
p10: 949.9
p25: 992.0
p50: 997.0
p75: 1000.0
p90: 1000.0
p95: 1000.0
p99: 1000.0
p99p9: 1500.495
units: count
total: 1500
throughput:
output_tokens_per_sec: 13574.368209884744
requests_per_sec: 14.41229929064271
total_tokens_per_sec: 122767.19371273571
time:
duration: 24.984177332022227
scenario:
load:
args:
api:
headers: null
streaming: true
type: completion
circuit_breakers: null
data:
input_distribution: null
output_distribution: null
path: null
shared_prefix:
enable_multi_turn_chat: false
num_groups: 150
num_prompts_per_group: 5
output_len: 1000
question_len: 1200
system_prompt_len: 6000
trace: null
type: shared_prefix
load:
circuit_breakers: []
interval: 1.0
num_workers: 224
request_timeout: null
stages:
- concurrency_level: null
duration: 50
num_requests: null
rate: 15.0
- concurrency_level: null
duration: 20
num_requests: null
rate: 3.0
- concurrency_level: null
duration: 20
num_requests: null
rate: 10.0
- concurrency_level: null
duration: 20
num_requests: null
rate: 15.0
- concurrency_level: null
duration: 38
num_requests: null
rate: 20.0
- concurrency_level: null
duration: 34
num_requests: null
rate: 22.0
- concurrency_level: null
duration: 30
num_requests: null
rate: 25.0
- concurrency_level: null
duration: 25
num_requests: null
rate: 30.0
- concurrency_level: null
duration: 21
num_requests: null
rate: 35.0
- concurrency_level: null
duration: 38
num_requests: null
rate: 40.0
- concurrency_level: null
duration: 36
num_requests: null
rate: 43.0
- concurrency_level: null
duration: 33
num_requests: null
rate: 46.0
- concurrency_level: null
duration: 30
num_requests: null
rate: 49.0
- concurrency_level: null
duration: 29
num_requests: null
rate: 52.0
- concurrency_level: null
duration: 27
num_requests: null
rate: 55.0
- concurrency_level: null
duration: 26
num_requests: null
rate: 57.0
- concurrency_level: null
duration: 25
num_requests: null
rate: 60.0
sweep: null
trace: null
type: poisson
worker_max_concurrency: 100
worker_max_tcp_connections: 2500
metrics: null
report:
prometheus:
per_stage: false
summary: true
request_lifecycle:
per_request: true
per_stage: true
summary: true
server:
api_key: null
base_url: http://infra-optimized-baseline-inference-gateway-istio.dpikus-intel-inf.svc.cluster.local:80
ignore_eos: true
model_name: Qwen/Qwen3-32B
type: vllm
storage:
google_cloud_storage: null
local_storage:
path: /requests/inference-perf_1769435052_Shared_prefix_inf-scheduling-guide-Qwen3-32B
report_file_prefix: null
simple_storage_service: null
tokenizer:
pretrained_model_name_or_path: Qwen/Qwen3-32B
token: null
trust_remote_code: null
metadata:
stage: 2
name: inference-perf
model:
name: unknown
version: "0.1"
```
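Reports of this shape are plain YAML, so individual metrics can be pulled out with any YAML tooling. For example, with `yq` installed (the file name below is a placeholder for whichever report you are inspecting):

```bash
# Extract a couple of headline metrics from a saved report.
# "report.yaml" is a placeholder path; point it at your own report file.
yq '.metrics.latency.time_to_first_token.mean' report.yaml
yq '.metrics.throughput.output_tokens_per_sec' report.yaml
```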
## Comparing llm-d scheduling to a simple Kubernetes service

The following graphs illustrate the relationship between latency, throughput, and QPS, as generated by `inference-perf --analyze`. For benchmarking, we compared our results against a standard Kubernetes (k8s) Service endpoint that routes traffic directly to the vLLM pods.

The following data captures the performance of the last stage, run at a fixed request rate of 60, compared against the k8s Service:

- Throughput: requests/sec +151.5%; total tokens/sec +151.7%
- Latency: TTFT (mean) -99.66%; end-to-end request latency (mean) -35.6%
- Per-token speed: inter-token latency (mean) -3.9%
| Metric | k8s (Mean) | llm-d (Mean) | Δ (llm-d - k8s) | Δ% vs k8s |
|---|---|---|---|---|
| Requests/sec | 5.7306 | 14.4123 | +8.6817 | +151.5% |
| Input tokens/sec | 43,417.86 | 109,192.83 | +65,774.97 | +151.5% |
| Output tokens/sec | 5,362.16 | 13,574.37 | +8,212.21 | +153.2% |
| Total tokens/sec | 48,780.02 | 122,767.19 | +73,987.17 | +151.7% |
| Request latency (s) | 105.4133 | 67.8649 | -37.5484 | -35.6% |
| TTFT (s) | 34.9145 | 0.1203 | -34.7942 | -99.66% |
| Inter-token latency (ms) | 70.42 | 67.66 | -2.76 | -3.9% |
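The Δ% column is the relative change versus the k8s baseline, i.e. (llm-d - k8s) / k8s × 100. For example, reproducing the requests/sec row:

```bash
# Reproduce the delta-% column for requests/sec: (llm-d - k8s) / k8s * 100
awk 'BEGIN { k8s = 5.7306; llmd = 14.4123; printf "%+.1f%%\n", (llmd - k8s) / k8s * 100 }'
# prints +151.5%
```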