
llm-d-benchmark

This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.

tip

Many users still rely on our previous (now deprecated) library. To ease the transition, that library remains available under the v0.5.2 version tag.

Main Goal

Provide a single source of automation for repeatable and reproducible experiments and performance evaluation on llm-d:

  • Declarative lifecycle: All infrastructure, workloads, and experiments render into reviewable YAML before provisioning.
  • End-to-end automation: A single llmdbenchmark CLI covers standup, benchmarking, result collection, and teardown.
  • Reproducibility: A deterministic config merge chain (defaults.yaml → scenario → CLI overrides) captures the exact configuration in each workspace. Any result traces back to its inputs.
  • Structured experiments: Built-in Design of Experiments (DoE) support automates parameter sweeps across both infrastructure and workload configurations.
  • Multiple harnesses: Swap between inference-perf, guidellm, vllm-benchmark, and others with a CLI flag (-l).
  • Post-deployment validation: Per-scenario smoketests verify that deployed pod configurations match what the scenario defines -- resources, parallelism, env vars, probes, routing, and vLLM flags.

Prerequisites

Please refer to the official llm-d prerequisites for the most up-to-date requirements. For the client setup, the provided install.sh will install the necessary tools.

Administrative Requirements

Deploying the llm-d stack requires cluster-admin privileges, since it configures cluster-scoped resources. However, the scripts can be executed by namespace-level admin users, as long as the Kubernetes infrastructure components are already configured and the target namespace exists.

Getting Started

Install

The install script supports both uv and the standard python3 -m venv for virtual environment creation. When run interactively, it will prompt you to choose; in non-interactive mode (e.g. curl pipe), it auto-selects uv if your system Python is missing or older than 3.11. You can also pass --uv or --no-uv to skip the prompt.

Quick install (one-liner):

curl -sSL https://raw.githubusercontent.com/llm-d/llm-d-benchmark/main/install.sh | bash
cd llm-d-benchmark
source .venv/bin/activate
llmdbenchmark --version

Or clone manually:

git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./install.sh # or: --uv / --no-uv
source .venv/bin/activate
llmdbenchmark --version

Install a specific branch:

LLMDBENCH_BRANCH=main \
curl -sSL https://raw.githubusercontent.com/llm-d/llm-d-benchmark/main/install.sh | bash

The install script auto-detects if the repo is present -- if not, it clones it first. It creates a virtualenv, validates system tools (kubectl, helm, Python 3.11+), and installs the llmdbenchmark package. See Installation for manual install and flags.

tip

The last line of output from llmdbenchmark standup shows the workspace path where all rendered configs, manifests, and results are stored.

Pick your path: with or without Accelerators

Two supported entry points depending on what you have access to:

🖥️ No Accelerators / No Cluster Access - Utilize a Kind Quickstart

Run the full standup -> smoketest -> run -> teardown lifecycle on a local Kind cluster using a simulated inference engine. No accelerators, no cloud account, no cluster operator required. It uses the same cicd/kind-sim scenario that CI runs on every PR, so if it works locally it works in CI.

  • Requirements: Docker (or Podman/Colima) with 4 CPUs / 8 GiB RAM and Python 3.11+
  • Continue with Quick Start Guide: Quickstart on Kind

🚀 Access to Compute cluster with Accelerators - full pipeline

Deploy against a Kubernetes cluster with Accelerators (OpenShift, GKE, EKS, CKS, etc.). Use one of the built-in specs or a well-lit path guide tuned for your hardware.

Choose a specification

Every command takes a --spec that selects the configuration for your cluster and GPU type. Specs are Jinja2 templates under config/specification/:

--spec gpu # NVIDIA GPU setup (config/specification/examples/gpu.yaml.j2)
--spec inference-scheduling # inference scheduling guide
--spec inference-scheduling-wva # inference scheduling + WVA autoscaling
--spec multi-model-wva # multi-model WVA: N pools, 1 gateway, 1 shared HTTPRoute
--spec pd-disaggregation # prefill-decode disaggregation guide
...
--spec /full/path/to/my-spec.yaml.j2 # custom spec

If the name is ambiguous or not found, the CLI lists all available specs and exits.

Deploy and benchmark (full pipeline)

Stand up the llm-d stack, run a quick sanity benchmark, and tear down:

# Preview what would be deployed (no cluster changes)
llmdbenchmark --spec gpu --dry-run standup

# Deploy for real
llmdbenchmark --spec gpu standup

# Run a sanity benchmark against the deployed endpoint
llmdbenchmark --spec gpu run -l inference-perf -w sanity_random.yaml

# Tear down when done
llmdbenchmark --spec gpu teardown

note

--dry-run renders all manifests and logs every command that would execute, without touching the cluster. Use it to review before deploying.

Each command renders Kubernetes manifests from your spec's templates and defaults, then applies them. The workspace directory captures rendered configs, manifests, and results for later inspection.

Deploy multiple models behind one gateway

The multi-model-wva scenario deploys N models under a single gateway, each with its own EPP + InferencePool + VariantAutoscaling + HPA, sharing one WVA controller and one HTTPRoute with N backendRefs:

# Standup - renders two stacks (qwen3-06b, llama-31-8b), installs shared
# infra once, deploys a per-model Helm release + VA + HPA for each.
llmdbenchmark --spec guides/multi-model-wva standup -p my-namespace

# Smoketest - runs stack-by-stack (sequential), hitting each pool at its
# routing prefix (/qwen3-06b/v1/models, /llama-31-8b/v1/models).
llmdbenchmark --spec guides/multi-model-wva smoketest -p my-namespace

# Run - iterates every stack, each harness pod targets its own pool's endpoint.
llmdbenchmark --spec guides/multi-model-wva run -p my-namespace

# See what's deployed: list detected endpoints + copy-paste run commands.
llmdbenchmark --spec guides/multi-model-wva run -p my-namespace --list-endpoints

# Benchmark just one pool (no --endpoint-url needed - auto-resolves):
llmdbenchmark --spec guides/multi-model-wva run -p my-namespace \
--stack qwen3-06b \
-l inference-perf -w sanity_random.yaml

# Teardown - removes both stacks and the shared infra.
llmdbenchmark --spec guides/multi-model-wva teardown -p my-namespace

Stack names (qwen3-06b, llama-31-8b) double as path prefixes on the shared HTTPRoute (/qwen3-06b/v1/..., /llama-31-8b/v1/...). Pick short descriptive names in your own scenario - --list-endpoints prints the rendered URLs so you rarely have to type them manually.
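The URL shape is simple enough to compute by hand; a hypothetical helper (not part of llmdbenchmark) mirroring the gateway-plus-prefix pattern:

```python
def pool_endpoint(gateway: str, stack: str, port: int = 80) -> str:
    # The stack name doubles as the HTTPRoute path prefix, so each pool's
    # endpoint is just gateway address + port + stack prefix.
    return f"http://{gateway}:{port}/{stack}"

# pool_endpoint("10.1.2.3", "qwen3-06b") -> "http://10.1.2.3:80/qwen3-06b"
```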

Discovering what's deployed (--list-endpoints)

After standup, --list-endpoints detects each pool's routing URL, prints a copy-paste-ready table, and exits without launching any harness pods:

llmdbenchmark --spec guides/multi-model-wva run -p my-namespace --list-endpoints
📋 Detected endpoints:
STACK MODEL ENDPOINT URL
----------- ------------------------- ------------------------------
qwen3-06b Qwen/Qwen3-0.6B http://10.1.2.3:80/qwen3-06b
llama-31-8b unsloth/Meta-Llama-3.1-8B http://10.1.2.3:80/llama-31-8b

💡 Copy-paste to benchmark one pool:

# qwen3-06b - Qwen/Qwen3-0.6B
llmdbenchmark --spec guides/multi-model-wva run \
--namespace my-namespace \
--endpoint-url http://10.1.2.3:80/qwen3-06b \
--model Qwen/Qwen3-0.6B \
-l <harness> -w <workload.yaml> -j <parallel-pods>
...

Example Targeting a single pool (--stack)

--stack NAME restricts any lifecycle command to one pool (or a comma-separated subset). Endpoint URL auto-resolves for the selected stack - no need to pass --endpoint-url manually:

# Benchmark qwen3-06b only with guidellm, two parallel harness pods
llmdbenchmark --spec guides/multi-model-wva run -p my-namespace \
--stack qwen3-06b \
-l guidellm \
-w sanity_random.yaml \
-j 2

Breakdown of the Example:

  • --stack qwen3-06b filters per-stack steps to that pool. Endpoint detection (step 03) runs only for that stack and auto-resolves to http://<gateway>:80/qwen3-06b - including the routing prefix - so every downstream step targets the qwen3-06b InferencePool.
  • -l guidellm selects the guidellm harness (workload/harnesses/guidellm-llm-d-benchmark.sh).
  • -j 2 launches two guidellm pods hitting the same endpoint. Both pods run the same treatment (-w) but write to distinct result subdirectories ({experiment_id}_1, {experiment_id}_2) on the workload PVC.
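The per-pod result layout described above can be sketched as follows (hypothetical helper; the real directory names come from the harness steps):

```python
def result_subdirs(experiment_id: str, parallelism: int) -> list[str]:
    # -j N launches N harness pods; each writes to its own
    # {experiment_id}_{index} subdirectory on the workload PVC.
    return [f"{experiment_id}_{i}" for i in range(1, parallelism + 1)]

# result_subdirs("exp42", 2) -> ["exp42_1", "exp42_2"]
```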

Want to compare pools side-by-side? Launch two invocations in parallel shells (different --workspace each):

# Terminal 1 - --workspace is a global option, placed before the subcommand
llmdbenchmark --spec guides/multi-model-wva --workspace /tmp/run-qwen run -p my-namespace \
--stack qwen3-06b \
-l guidellm -w sanity_random.yaml -j 2 &

# Terminal 2 (or same shell, backgrounded)
llmdbenchmark --spec guides/multi-model-wva --workspace /tmp/run-llama run -p my-namespace \
--stack llama-31-8b \
-l guidellm -w sanity_random.yaml -j 2

--stack also works on standup, smoketest, and teardown. Same flag, same semantics - restrict execution to the named subset of stacks without editing the scenario YAML. Scenario-wide steps (namespace creation, admin prereqs, shared infra, WVA controller install) always run; only the per-stack steps (06+ for standup) are filtered.

# Standup only pool qwen3-06b from the multi-model scenario - shared
# infra (istio, Gateway, WVA controller, model PVC) installs normally,
# but only qwen3-06b's ms/gaie/VA/HPA resources get created.
llmdbenchmark --spec guides/multi-model-wva standup -p my-namespace \
--stack qwen3-06b

# Standup two named pools out of a larger scenario:
llmdbenchmark --spec guides/multi-model-wva standup -p my-namespace \
--stack qwen3-06b,llama-31-8b

# Tear down just one pool later, leaving the other running:
llmdbenchmark --spec guides/multi-model-wva teardown -p my-namespace \
--stack qwen3-06b

Unknown stack names fail loudly with a list of valid ones.
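That validation amounts to a set check; a minimal sketch (not the CLI's actual code):

```python
def validate_stacks(requested: list[str], available: list[str]) -> list[str]:
    # Fail loudly on unknown stack names, listing the valid ones.
    unknown = sorted(set(requested) - set(available))
    if unknown:
        raise ValueError(
            f"unknown stack(s) {unknown}; valid stacks: {sorted(available)}"
        )
    return requested
```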

When --stack NAME selects exactly one stack, -m/--models scopes to that stack only - sibling stacks keep their scenario-defined models. Handy for "rerun pool A against a different model" without touching pool B:

llmdbenchmark --spec guides/multi-model-wva run -p my-namespace \
--stack qwen3-06b \
--model meta-llama/Llama-3.2-3B \
-l inference-perf -w sanity_random.yaml

Without --stack, -m applies to every stack and emits a warning.

Add a third model by copying a stack block in config/scenarios/guides/multi-model-wva.yaml and changing name + model. Scenario-wide config (gateway class, WVA controller image, shared HTTPRoute, EPP plugin config) lives in the top-level shared: block and is inherited by every stack. See Workload Variant Autoscaler for the full architecture.

Benchmark an existing endpoint (run-only mode)

Already have a model-serving endpoint running? Skip deployment entirely:

llmdbenchmark --spec gpu run \
--endpoint-url http://10.131.0.42:80 \
--model meta-llama/Llama-3.1-8B \
--namespace my-namespace \
--harness inference-perf \
--workload sanity_random.yaml

This uses the same harness, profile rendering, and result collection pipeline -- just without the standup and teardown phases.
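Before pointing run at an existing endpoint, it can help to confirm which models it actually serves. A sketch against an OpenAI-compatible /v1/models route (the same route the smoketests hit; this is not llmdbenchmark's verification code):

```python
import json
import urllib.request

def parse_model_ids(models_payload: dict) -> list[str]:
    # OpenAI-compatible /v1/models responses look like
    # {"object": "list", "data": [{"id": "..."}, ...]}
    return [m["id"] for m in models_payload.get("data", [])]

def served_models(endpoint_url: str) -> list[str]:
    # Query the endpoint and return the served model IDs.
    with urllib.request.urlopen(f"{endpoint_url}/v1/models", timeout=10) as resp:
        return parse_model_ids(json.load(resp))
```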

tip

run can also be used in debug mode (-d / --debug) which starts the harness pod with sleep infinity so you can exec into it and run commands interactively. See this example.

See workload/README.md for the full experiment file format and all pre-built experiments, as well as advanced functionality.

Next Steps

| Topic | Where to look |
| --- | --- |
| Configuration system, defaults, scenarios, overrides | config/README.md |
| Multi-model scenarios and the shared: block | config/README.md, developer-guide |
| Workload-variant-autoscaler, including multi-pool setup | docs/workload-variant-autoscaler.md |
| Workloads, harnesses, profiles, experiments | workload/README.md |
| Standup phase, deployment methods, step details | llmdbenchmark/standup/README.md |
| Smoketests, per-scenario validation, adding validators | llmdbenchmark/smoketests/README.md |
| Run phase, benchmark execution, result collection | llmdbenchmark/run/README.md |
| Teardown phase and deep clean | llmdbenchmark/teardown/README.md |
| Design of Experiments (DoE) orchestration | llmdbenchmark/experiment/README.md |
| Plan-phase rendering pipeline | llmdbenchmark/parser/README.md |
| Execution framework and step contribution guide | llmdbenchmark/executor/README.md |
| CLI reference (all flags, env vars) | CLI Reference below |

Prerequisites

Please refer to the official llm-d prerequisites for the most up-to-date requirements.

System Requirements

  • Python 3.11+
  • kubectl -- Kubernetes CLI
  • helm -- Helm package manager
  • curl, git -- Standard system tools
  • helmfile -- Required for modelservice deployments
  • kustomize, jq, yq -- Required for template rendering
  • skopeo, crane -- Required for container image management
  • oc (optional) -- Required for OpenShift clusters (either kubectl or oc must be present)

Administrative Requirements

info

Deploying the llm-d stack requires cluster-level admin privileges for configuring cluster-level resources. Namespace-level admin users can run the tool if Kubernetes infrastructure components are configured and the target namespace already exists. Use --non-admin to skip admin-only steps.

Installation

# One-liner -- auto-clones if needed
curl -sSL https://raw.githubusercontent.com/llm-d/llm-d-benchmark/main/install.sh | bash
cd llm-d-benchmark
source .venv/bin/activate

Or manually:

git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./install.sh # or: --uv / --no-uv
source .venv/bin/activate

The install script:

  1. Creates a Python virtual environment at .venv/ (via uv or python3 -m venv - see Install)
  2. Validates Python 3.11+ and pip
  3. Checks for required system tools (curl, git, kubectl or oc, helm, helmfile, kustomize, jq, yq, skopeo, crane)
  4. Installs the helm-diff plugin (required by helmfile)
  5. Installs llmdbenchmark and planner (from llm-d-planner)
  6. Verifies all Python packages are importable
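Step 3's tool validation boils down to PATH lookups; a simplified Python sketch of the same check (the actual installer is a shell script):

```python
import shutil

REQUIRED_TOOLS = ["curl", "git", "helm", "helmfile", "kustomize",
                  "jq", "yq", "skopeo", "crane"]

def missing_tools(tools=REQUIRED_TOOLS) -> list[str]:
    # Collect any required tool not found on PATH.
    missing = [t for t in tools if shutil.which(t) is None]
    # Either kubectl or oc satisfies the Kubernetes CLI requirement.
    if not any(shutil.which(t) for t in ("kubectl", "oc")):
        missing.append("kubectl or oc")
    return missing
```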

Manual Install w/o Install Script

git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
pip install "git+https://github.com/llm-d-incubation/llm-d-planner.git@f51812bebca30e0291ec541bd2ef2acf0572e8a4"

Verify Installation

llmdbenchmark --version

CLI Reference

Global Options

| Flag | Env Var | Description |
| --- | --- | --- |
| --spec SPEC | LLMDBENCH_SPEC | Specification name or path (bare name, category/name, or full path) |
| --workspace DIR / --ws | LLMDBENCH_WORKSPACE | Workspace directory for outputs (default: temp dir) |
| --base-dir DIR / --bd | LLMDBENCH_BASE_DIR | Base directory for templates/scenarios (default: .) |
| --non-admin / -i | LLMDBENCH_NON_ADMIN | Skip admin-only steps |
| --dry-run / -n | LLMDBENCH_DRY_RUN | Generate YAML without applying to cluster |
| --verbose / -v | LLMDBENCH_VERBOSE | Enable debug logging |
| --version | | Show version |

Plan Options

| Flag | Env Var | Description |
| --- | --- | --- |
| -p NS | LLMDBENCH_NAMESPACE | Namespace(s) to render into the plan |
| -m MODELS | LLMDBENCH_MODELS | Model to render the plan for |
| -t METHODS | LLMDBENCH_METHODS | Deployment method (standalone, modelservice) |
| -f / --monitoring | | Enable monitoring in rendered templates (PodMonitor, EPP verbosity) |
| -k FILE | LLMDBENCH_KUBECONFIG / KUBECONFIG | Kubeconfig path (used for cluster resource auto-detection) |

Standup Options

| Flag | Env Var | Description |
| --- | --- | --- |
| -s STEPS | | Step filter (e.g., 0,1,5 or 1-7) |
| -c FILE | LLMDBENCH_SCENARIO | Scenario file |
| -m MODELS | LLMDBENCH_MODELS | Models to deploy |
| -p NS | LLMDBENCH_NAMESPACE | Namespace(s) |
| -t METHODS | LLMDBENCH_METHODS | Deployment methods (standalone, modelservice) |
| -r NAME | LLMDBENCH_RELEASE | Helm release name |
| -k FILE | LLMDBENCH_KUBECONFIG / KUBECONFIG | Kubeconfig path |
| --parallel N | LLMDBENCH_PARALLEL | Max parallel stacks (default: 4) |
| --stack NAME[,NAME...] | LLMDBENCH_STACK | Restrict per-stack execution to the named subset. Useful in multi-stack scenarios (e.g. guides/multi-model-wva) to re-deploy a single pool without touching siblings. Unknown names fail loudly. |
| --monitoring | LLMDBENCH_MONITORING | Enable PodMonitor creation and EPP verbosity during standup |
| --skip-smoketest | | Skip automatic smoketest after standup completes |
| --affinity | LLMDBENCH_AFFINITY | Node affinity / tolerations label |
| --annotations | LLMDBENCH_ANNOTATIONS | Extra annotations for deployed resources |
| --wva | LLMDBENCH_WVA | Workload Variant Autoscaler config |

Teardown Options

| Flag | Env Var | Description |
| --- | --- | --- |
| -s STEPS | | Step filter |
| -m MODELS | LLMDBENCH_MODELS | Model that was deployed (for resource name resolution) |
| -t METHODS | LLMDBENCH_METHODS | Methods to tear down (standalone, modelservice) |
| -r NAME | LLMDBENCH_RELEASE | Helm release name (default: llmdbench) |
| -d / --deep | LLMDBENCH_DEEP_CLEAN | Deep clean: delete ALL resources in both namespaces |
| -p NS | LLMDBENCH_NAMESPACE | Comma-separated namespaces (model,harness) |
| --stack NAME[,NAME...] | LLMDBENCH_STACK | Restrict teardown to the named subset. Useful for removing one pool from a multi-stack scenario while leaving siblings in place. |
| -k FILE | LLMDBENCH_KUBECONFIG / KUBECONFIG | Kubeconfig path |

Experiment Options

| Flag | Env Var | Description |
| --- | --- | --- |
| -e FILE | LLMDBENCH_EXPERIMENTS | Experiment YAML with setup and run treatments (required) |
| -p NS | LLMDBENCH_NAMESPACE | Namespace(s) |
| -t METHODS | LLMDBENCH_METHODS | Deploy method |
| -m MODELS | LLMDBENCH_MODELS | Models to deploy |
| -k FILE | LLMDBENCH_KUBECONFIG / KUBECONFIG | Kubeconfig path |
| --parallel N | LLMDBENCH_PARALLEL | Max parallel stacks (default: 4) |
| -f / --monitoring | | Enable monitoring during standup and run phases |
| -l HARNESS | LLMDBENCH_HARNESS | Harness name |
| -w PROFILE | LLMDBENCH_WORKLOAD | Workload profile |
| -o OVERRIDES | LLMDBENCH_OVERRIDES | Workload parameter overrides |
| -r DEST | LLMDBENCH_OUTPUT | Results destination (local, gs://, s3://) |
| -j N | LLMDBENCH_PARALLELISM | Parallel harness pods |
| --wait-timeout N | LLMDBENCH_WAIT_TIMEOUT | Seconds to wait for harness completion |
| -x DATASET | LLMDBENCH_DATASET | Dataset URL for harness replay |
| -d / --debug | LLMDBENCH_DEBUG | Debug mode: start harness pods with sleep infinity |
| --stop-on-error | | Abort on first setup treatment failure |
| --skip-teardown | | Leave stacks running for debugging |

Run Options

| Flag | Env Var | Description |
| --- | --- | --- |
| -s STEPS | | Step filter (e.g., 0,1,5 or 2-6) |
| -m MODEL | LLMDBENCH_MODEL | Model name override (e.g. facebook/opt-125m) |
| -p NS | LLMDBENCH_NAMESPACE | Namespaces (deploy,benchmark) |
| -t METHODS | LLMDBENCH_METHODS | Deploy method used during standup |
| -k FILE | LLMDBENCH_KUBECONFIG / KUBECONFIG | Kubeconfig path |
| -l HARNESS | LLMDBENCH_HARNESS | Harness name (inference-perf, guidellm, vllm-benchmark) |
| -w PROFILE | LLMDBENCH_WORKLOAD | Workload profile YAML |
| -e FILE | LLMDBENCH_EXPERIMENTS | Experiment treatments YAML for parameter sweeping |
| -o OVERRIDES | LLMDBENCH_OVERRIDES | Workload parameter overrides (param=value,...) |
| -r DEST | LLMDBENCH_OUTPUT | Results destination (local, gs://, s3://) |
| -j N | LLMDBENCH_PARALLELISM | Parallel harness pods |
| -U URL | LLMDBENCH_ENDPOINT_URL | Explicit endpoint URL (run-only mode) |
| -c FILE | | Run config YAML (run-only mode) |
| --generate-config | | Generate config and exit |
| -x DATASET | LLMDBENCH_DATASET | Dataset URL for harness replay |
| --wait-timeout N | LLMDBENCH_WAIT_TIMEOUT | Seconds to wait for harness completion |
| --monitoring | | Enable metrics scraping and EPP log capture during benchmark |
| -q / --serviceaccount | LLMDBENCH_SERVICE_ACCOUNT | Service account name for harness pods |
| -g / --envvarspod | LLMDBENCH_HARNESS_ENVVARS_TO_YAML | Comma-separated env var names to propagate into harness pod |
| --analyze | | Run local analysis on results after collection |
| -z / --skip | LLMDBENCH_SKIP | Skip execution, only collect existing results |
| -d / --debug | LLMDBENCH_DEBUG | Debug mode: start harness pods with sleep infinity |
| --stack NAME[,NAME...] | LLMDBENCH_STACK | Restrict the benchmark to the named subset of stacks. Endpoint URL auto-resolves for the selected stack - no need for --endpoint-url. When --stack selects exactly one stack, -m/--models scopes to that stack only. |
| --list-endpoints | | Detect per-stack endpoint URLs, print a copy-paste table of llmdbenchmark run invocations, and exit without launching any harness pods. Useful after standup to discover what's deployed. |

Smoketest Options

Run post-deployment validation independently against an already-deployed stack.

llmdbenchmark --spec gpu smoketest -p my-namespace
llmdbenchmark --spec gpu smoketest -p my-namespace -s 2 # config validation only

| Flag | Env Var | Description |
| --- | --- | --- |
| -s STEPS | | Step filter (e.g., 0,1,2 or 0-2) |
| -p NS | LLMDBENCH_NAMESPACE | Namespace(s) |
| -t METHODS | LLMDBENCH_METHODS | Deployment methods (standalone, modelservice, fma) |
| -k FILE | LLMDBENCH_KUBECONFIG / KUBECONFIG | Kubeconfig path |
| --parallel N | LLMDBENCH_PARALLEL | Max parallel stacks (default: 4). Smoketest pins this to 1 regardless - parallel probes across stacks are confusing. |
| --stack NAME[,NAME...] | LLMDBENCH_STACK | Restrict smoketest to the named subset of stacks. |

Smoketests also run automatically after standup unless --skip-smoketest is passed. See llmdbenchmark/smoketests/README.md for details on what each step validates.

Environment Variables

Every CLI flag can be set via a LLMDBENCH_* environment variable (see tables above). The priority chain is:

  1. CLI flag (highest) -- explicitly passed on the command line
  2. Environment variable -- exported in the user's shell
  3. Rendered config (lowest) -- defaults.yaml + scenario YAML

This is useful for CI/CD pipelines, .bashrc configuration, or migrating from the original bash-based workflow.

# Example: set common defaults via env vars, override per-run via CLI
export LLMDBENCH_SPEC=inference-scheduling
export LLMDBENCH_NAMESPACE=my-team-ns
export LLMDBENCH_KUBECONFIG=~/.kube/my-cluster

# These use the env vars above; --dry-run overrides nothing, just adds a flag
llmdbenchmark standup --dry-run
llmdbenchmark standup # live deploy to my-team-ns
llmdbenchmark standup -p override-ns # CLI wins over env var

Boolean env vars accept 1, true, or yes (case-insensitive). Active LLMDBENCH_* overrides are logged at startup for debugging.
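The resolution order and boolean parsing can be sketched as follows (an illustrative sketch, not the package's actual implementation):

```python
import os

def resolve(cli_value, env_name: str, config_value):
    # Priority: CLI flag > LLMDBENCH_* env var > rendered config.
    if cli_value is not None:
        return cli_value
    if env_name in os.environ:
        return os.environ[env_name]
    return config_value

def env_bool(env_name: str, default: bool = False) -> bool:
    # Boolean env vars accept 1 / true / yes, case-insensitively.
    raw = os.environ.get(env_name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")
```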

Architecture

The tool operates in three phases, each composed of numbered steps executed by a shared StepExecutor framework.

Config Override Chain

Values flow through a merge pipeline during the plan phase:

defaults.yaml → scenario YAML → LLMDBENCH_* environment variables → CLI flags (highest priority)

Steps read from the rendered config.yaml and never define their own fallback defaults. If a required key is missing from the rendered config, the step raises a clear error. This ensures defaults.yaml is the single source of truth for all default values. Environment variables (LLMDBENCH_*) sit between scenario overrides and CLI flags in the priority chain.
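The merge chain behaves like a recursive dictionary merge in which later layers win; a minimal sketch (not the actual renderer):

```python
def deep_merge(base: dict, override: dict) -> dict:
    # Later layers win; nested dicts merge key by key.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical layers illustrating the chain (keys are made up):
defaults = {"standup": {"parallel": 4, "release": "llmdbench"}}
scenario = {"standup": {"parallel": 2}}
cli_overrides = {"standup": {"release": "my-run"}}

config = deep_merge(deep_merge(defaults, scenario), cli_overrides)
# config == {"standup": {"parallel": 2, "release": "my-run"}}
```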

See config/README.md for the full configuration reference, including how to override values.

Deployment Methods

The standup phase supports two deployment paths:

  • standalone -- Direct Kubernetes Deployments and Services for each model (step 06)
  • modelservice -- Helm-based deployment with gateway infrastructure, GAIE, and LWS support (steps 07-09)

Both paths share steps 00-05 (infrastructure, namespaces, secrets) and step 10 (smoketest).

Standup Steps

| Step | Name | Scope | Description |
| --- | --- | --- | --- |
| 00 | ensure_infra | Global | Validate dependencies, cluster connectivity, kubeconfig |
| 02 | admin_prerequisites | Global | Admin prerequisites (CRDs, gateway, LWS, namespaces) |
| 03 | workload_monitoring | Global | Workload monitoring, node resource discovery |
| 04 | model_namespace | Per-stack | Model namespace (PVCs, secrets, download job) |
| 05 | harness_namespace | Per-stack | Harness namespace (PVC, data access pod, preprocess) |
| 06 | standalone_deploy | Per-stack | Standalone vLLM deployment (Deployment + Service) |
| 07 | deploy_setup | Per-stack | Helm repos and gateway infrastructure (helmfile) |
| 08 | deploy_gaie | Per-stack | GAIE inference extension deployment |
| 09 | deploy_modelservice | Per-stack | Modelservice deployment (helmfile + LWS) |
| 10 | smoketest | Per-stack | Health check, inference test, per-scenario config validation |
| 11 | inference_test | Per-stack | Sample inference request with demo curl command |

Run Steps

| Step | Name | Scope | Description |
| --- | --- | --- | --- |
| 00 | preflight | Global | Validate cluster connectivity and run-phase prerequisites |
| 01 | cleanup_previous | Global | Remove leftover harness pods from previous runs |
| 02 | detect_endpoint | Per-stack | Discover or accept the model-serving endpoint |
| 03 | verify_model | Per-stack | Verify the expected model is served at the endpoint |
| 04 | render_profiles | Per-stack | Render workload profile templates with runtime values |
| 05 | create_profile_configmap | Per-stack | Create profile and harness-scripts ConfigMaps |
| 06 | deploy_harness | Per-stack | Deploy harness pod(s) and execute the full treatment cycle |
| 07 | wait_completion | Per-stack | Wait for harness pod(s) to complete |
| 08 | collect_results | Per-stack | Collect results from PVC to local workspace |
| 09 | upload_results | Global | Upload results to cloud storage (safety-net bulk upload) |
| 10 | cleanup_post | Global | Clean up harness pods and ConfigMaps |
| 11 | analyze_results | Global | Run local analysis on collected results |

Teardown Steps

| Step | Name | Description | Condition |
| --- | --- | --- | --- |
| 00 | preflight | Validate cluster connectivity, load config | Always |
| 01 | uninstall_helm | Uninstall Helm releases, delete routes and jobs | Modelservice only |
| 02 | clean_harness | Clean harness ConfigMaps, pods, secrets | Always |
| 03 | delete_resources | Delete namespaced resources (normal or deep) | Always |
| 04 | clean_cluster_roles | Clean cluster-scoped ClusterRoles/Bindings | Admin + modelservice only |

Project Structure

config/ Declarative configuration (all plan-phase inputs)
templates/
jinja/ Jinja2 templates for Kubernetes manifests
values/defaults.yaml Base configuration with all anchored defaults
scenarios/ Deployment overrides (guides/, examples/, cicd/)
specification/ Specification templates (guides/, examples/, cicd/)

llmdbenchmark/ Python package
cli.py Entry point, workspace setup, command dispatch
config.py Plan-phase workspace configuration singleton

interface/ CLI subcommand definitions (argparse)
commands.py Command enum (plan, standup, teardown, run, experiment)
env.py Environment variable helpers for CLI defaults
plan.py Plan subcommand
standup.py Standup subcommand
teardown.py Teardown subcommand
run.py Run subcommand
experiment.py Experiment subcommand (DoE orchestration)

parser/ Plan-phase template rendering (see parser/README.md)
render_specification.py Specification file parsing and validation
render_plans.py Jinja2 template rendering engine
render_result.py Structured error tracking for renders
config_schema.py Pydantic config validation (typo/type detection)
version_resolver.py Auto-resolve image tags and chart versions
cluster_resource_resolver.py Auto-detect accelerator/network values

experiment/ DoE experiment orchestration (see experiment/README.md)
parser.py Parse experiment YAML (setup + run treatments)
summary.py Per-treatment result tracking and summary output

executor/ Execution framework (see executor/README.md)
step.py Step ABC, Phase enum, result dataclasses
step_executor.py Step orchestrator (sequential + parallel)
command.py kubectl/helm/helmfile subprocess wrapper
context.py Shared state (ExecutionContext dataclass)
protocols.py Structural typing (LoggerProtocol)
deps.py System dependency checker

smoketests/ Post-deployment validation (see smoketests/README.md)
base.py Health checks, inference tests, pod inspection helpers
report.py CheckResult / SmoketestReport tracking
steps/ Smoketest step implementations (00-02)
validators/ Per-scenario config validators

standup/ Standup phase (see standup/README.md)
preprocess/ Scripts mounted as ConfigMaps in vLLM pods
steps/ Step implementations (00-11)

teardown/ Teardown phase (see teardown/README.md)
steps/ Step implementations (00-05)

run/ Run phase (see run/README.md)
steps/ Step implementations (00-11)

logging/ Structured logger with emoji support (see logging/README.md)
exceptions/ Error hierarchy (Template, Configuration, Execution)
utilities/ Shared helpers (see utilities/README.md)
cluster.py Kubernetes connection, platform detection
capacity_validator.py GPU capacity validation
huggingface.py HuggingFace model access checks
endpoint.py Endpoint discovery and model verification
profile_renderer.py Workload profile template rendering
kube_helpers.py Shared kubectl patterns (wait, collect, cleanup)
cloud_upload.py Unified cloud storage upload (GCS, S3)
os/
filesystem.py Workspace and directory management
platform.py Host OS detection

See the module-level READMEs for detailed documentation.

Well-Lit Path Guides

llm-d-benchmark supports all available Well-Lit Path Guides. Each guide has a corresponding specification:

llmdbenchmark --spec inference-scheduling standup # Inference scheduling
llmdbenchmark --spec pd-disaggregation standup # Prefill-decode disaggregation
llmdbenchmark --spec tiered-prefix-cache standup # Tiered prefix cache
llmdbenchmark --spec precise-prefix-cache-aware standup # Precise prefix cache-aware routing
llmdbenchmark --spec wide-ep-lws standup # Wide expert-parallel with LWS
warning

wide-ep-lws requires RDMA/RoCE networking and LeaderWorkerSet (LWS) controller. Verify your cluster has working RDMA HCAs before deploying.

Main Concepts

Model ID Label

Kubernetes resource names derived from model IDs use a hashed model_id_label format: {first8}-{sha256_8}-{last8}. This keeps resource names within DNS length limits while remaining identifiable. The label is computed automatically during the plan phase and used in template rendering for deployment names, service names, and route names. See config/README.md for details.
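As an illustration of that shape (the exact sanitization rules live in the plan phase; this sketch assumes lowercasing and replacing non-alphanumeric runs with dashes):

```python
import hashlib
import re

def model_id_label(model_id: str) -> str:
    # {first8}-{sha256_8}-{last8}: a readable prefix/suffix plus a short
    # content hash keeps names unique within DNS length limits.
    safe = re.sub(r"[^a-z0-9]+", "-", model_id.lower()).strip("-")
    digest = hashlib.sha256(model_id.encode()).hexdigest()[:8]
    return f"{safe[:8]}-{digest}-{safe[-8:]}"
```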

Scenarios

Cluster-specific configuration: GPU model, LLM, and llm-d parameters. Scenarios are YAML files under config/scenarios/ that override defaults.yaml for a particular deployment context.

Harnesses

Load generators that drive benchmark traffic. Supported: inference-perf, guidellm, vllm benchmarks, inferencemax, and nop (for model load time benchmarking).

(Workload) Profiles

Benchmark load specifications including LLM use case, traffic pattern, input/output distribution, and dataset. Found under workload/profiles.

info

The triplet <scenario>, <harness>, <(workload) profile>, combined with the standup/teardown capabilities, provides enough information to reproduce any single experiment.

Experiments

Design of Experiments (DOE) files describing parameter sweeps across standup and run configurations. The experiment command automates the full setup x run treatment matrix -- standing up a different infrastructure configuration for each setup treatment, running all workload variations, tearing down, and producing a summary. See llmdbenchmark/experiment/README.md for the full experiment lifecycle documentation.

Benchmark Report

Results are saved in the native format of each harness, as well as a universal Benchmark Report format (v0.1 and v0.2). The benchmark report is a standard data format describing the cluster configuration, workload, and results of a benchmark run. It acts as a common API for comparing results across different harnesses and configurations. See llmdbenchmark/analysis/benchmark_report/README.md for the full schema documentation and Python API.

Analysis

The analysis pipeline generates per-request distribution plots, cross-treatment comparison tables and charts, and Prometheus metric visualizations. Analysis runs both inside the harness container (automatically) and locally via --analyze. For interactive exploration, a Jupyter notebook is also available at docs/analysis/README.md.


Testing

Unit tests live under tests/ and run with pytest:

pytest tests/ -v

For integration testing against a live cluster, util/test-scenarios.sh runs standup/teardown cycles across scenarios:

util/test-scenarios.sh --stable # Run known-stable scenarios
util/test-scenarios.sh --trouble # Run scenarios that have had issues
util/test-scenarios.sh --all # Run all scenarios
util/test-scenarios.sh --ms-only # Modelservice scenarios only
util/test-scenarios.sh --sa-only # Standalone scenarios only

See tests/README.md for unit test details.

Developing

  • Developer Guide -- How to add new steps, analysis modules, harnesses, scenarios, and experiments
  • Package Architecture -- Overview of the llmdbenchmark package structure and submodules

License

Licensed under Apache License 2.0. See LICENSE for details.

Content Source

This content is automatically synced from README.md on the main branch of the llm-d/llm-d-benchmark repository.
