Autoscaling with Workload Variant Autoscaler (WVA)
Version Compatibility: This guide is tested and validated with WVA v0.5.1. Ensure that all version references in installation commands match this version for compatibility.
Breaking Changes in v0.5.1: If upgrading from v0.4.1 or earlier, see the Upgrading section below for required migration steps.
The Workload Variant Autoscaler (WVA) provides dynamic autoscaling capabilities for llm-d inference deployments, automatically adjusting replica counts based on inference server saturation.
Overview
WVA integrates with llm-d to:
- Dynamically scale inference replicas based on workload saturation
- Optimize resource utilization by adjusting to traffic patterns
- Reduce tail latency through saturation-based scaling decisions
Note: WVA currently supports only the Intelligent Inference Scheduling well-lit path. Other well-lit paths (such as Prefill/Decode Disaggregation or Wide Expert-Parallelism) are not currently supported.
Prerequisites
Before installing WVA, ensure you have:
1. Kubernetes cluster: A running Kubernetes cluster (v1.31+) with GPU support. WVA uses the Intelligent Inference Scheduling well-lit path, which requires GPUs. See Hardware Requirements for supported accelerator types. If you need to set up a local cluster:
   - Kind: For Kind clusters with GPU emulation, use the WVA Kind setup script, which creates a cluster and patches nodes with GPU capacity (required for pod scheduling if using GPU-requesting pods). Note: Saturation-based scaling does not require node patching; it only uses workload metrics. See Infrastructure Prerequisites for other cluster setup options.
   - Minikube: See the Minikube setup documentation for single-host development.
   - Production clusters: See Infrastructure Prerequisites for provider-specific setup (GKE, AKS, OpenShift 4.18+, etc.).
2. Gateway control plane: Configure and deploy your Gateway control plane before installation. `istio` remains the default path in llm-d, and `agentgateway` is also supported as a first-class self-installed inference gateway. The legacy `kgateway` path remains available only for migration.
3. Prometheus monitoring stack: WVA requires Prometheus to be accessible over HTTPS for metric collection. The monitoring setup depends on your platform:
   - OpenShift: User Workload Monitoring should be enabled (see the OpenShift monitoring docs).
   - GKE: An in-cluster Prometheus instance is required (GMP does not expose the HTTP API). See the GKE configuration below for setup instructions.
   - Kind/Minikube: Install Prometheus with TLS/HTTPS configuration. See the Kind/Minikube configuration below for installation and TLS setup instructions.
   - Other Kubernetes: A Prometheus stack must be installed with HTTPS support (see the monitoring documentation).
4. Installation namespace: Create the installation namespace:
   export NAMESPACE=llm-d-autoscaler
   kubectl create namespace ${NAMESPACE}
5. HuggingFace token secret: Create the `llm-d-hf-token` secret in your target namespace with the key `HF_TOKEN` set to a valid HuggingFace token so models can be pulled. See the example after this list.
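For example, one way to create this secret (a sketch; it assumes your token is already exported as HF_TOKEN in your shell and that you use the namespace created above):
# Create the HuggingFace token secret expected by the model pods
export NAMESPACE=llm-d-autoscaler
kubectl create secret generic llm-d-hf-token \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n "${NAMESPACE}"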
Installation
The workload-autoscaling helmfile supports two installation modes (see Step 5):
- Full Installation: Installs the complete llm-d Intelligent Inference Scheduling stack (infra, gaie, modelservice) plus WVA in a single `helmfile apply` command.
- WVA-Only Installation: Installs only WVA, connecting to an existing Intelligent Inference Scheduling deployment.
For full installation, `istio` remains the default environment, and `agentgateway` is also supported via `helmfile apply -e agentgateway`. The legacy `kgateway` environment remains available only as a deprecated migration path.
Install Prometheus Adapter separately in Step 6 after WVA installation.
Step 1: Configure WVA Values
Edit workload-autoscaling/values.yaml to configure WVA settings (controller, Prometheus, accelerator, SLOs, HPA). Set namespace:
export NAMESPACE=llm-d-autoscaler
Step 2: Platform-Specific Configuration
Update workload-autoscaling/values.yaml with platform-specific settings:
OpenShift
For OpenShift deployments, update the values file:
wva:
  prometheus:
    monitoringNamespace: openshift-user-workload-monitoring
    baseURL: "https://thanos-querier.openshift-monitoring.svc.cluster.local:9091"
    serviceAccountName: "prometheus-k8s"
    tls:
      insecureSkipVerify: true # or false for production
      caCertPath: "" # or set a CA cert path for production - "/etc/ssl/certs/prometheus-ca.crt"
Extract CA cert (if required): kubectl get secret thanos-querier-tls -n openshift-monitoring -o jsonpath='{.data.tls\.crt}' | base64 -d > ${TMPDIR:-/tmp}/prometheus-ca.crt
GKE
Google Managed Prometheus (GMP) does not expose the Prometheus HTTP API, so deploy an in-cluster Prometheus (the `llmd` release name matches the service name used in the configuration below):
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update
helm install llmd prometheus-community/kube-prometheus-stack -n llm-d-monitoring --create-namespace
Update workload-autoscaling/values.yaml:
wva:
  prometheus:
    monitoringNamespace: llm-d-monitoring
    baseURL: "http://llmd-kube-prometheus-stack-prometheus.llm-d-monitoring.svc.cluster.local:9090"
    tls:
      insecureSkipVerify: true
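As an optional sanity check, you can confirm the Prometheus HTTP API is reachable at the configured URL; this sketch assumes the `llmd` release name used above and runs a throwaway curl pod:
# Query the Prometheus build info endpoint from inside the cluster
kubectl run prom-check --rm -i --restart=Never -n llm-d-monitoring \
  --image=curlimages/curl -- \
  curl -sS http://llmd-kube-prometheus-stack-prometheus.llm-d-monitoring.svc.cluster.local:9090/api/v1/status/buildinfo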
Other Kubernetes Platforms (Kind, Minikube, etc.)
WVA HTTPS Requirement: WVA requires HTTPS for Prometheus connections. The default Prometheus installation on Kind/Minikube uses HTTP only. You must install Prometheus and configure it with TLS.
Install Prometheus:
export MON_NS=llm-d-monitoring
# Install Prometheus stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update
helm install llmd prometheus-community/kube-prometheus-stack -n ${MON_NS} --create-namespace
Enable TLS on Prometheus (Required):
WVA requires HTTPS for Prometheus. Configure Prometheus with TLS:
export MON_NS=llm-d-monitoring
# Create self-signed TLS certificate
openssl req -x509 -newkey rsa:2048 -nodes \
-keyout ${TMPDIR:-/tmp}/prometheus-tls.key -out ${TMPDIR:-/tmp}/prometheus-tls.crt -days 365 \
-subj "/CN=prometheus" \
-addext "subjectAltName=DNS:llmd-kube-prometheus-stack-prometheus.${MON_NS}.svc.cluster.local,DNS:llmd-kube-prometheus-stack-prometheus.${MON_NS}.svc,DNS:prometheus,DNS:localhost"
# Create TLS secret and upgrade Prometheus
kubectl create secret tls prometheus-web-tls --cert=${TMPDIR:-/tmp}/prometheus-tls.crt --key=${TMPDIR:-/tmp}/prometheus-tls.key -n ${MON_NS} --dry-run=client -o yaml | kubectl apply -f -
helm upgrade llmd prometheus-community/kube-prometheus-stack -n ${MON_NS} \
--set prometheus.prometheusSpec.web.tlsConfig.cert.secret.name=prometheus-web-tls \
--set prometheus.prometheusSpec.web.tlsConfig.cert.secret.key=tls.crt \
--set prometheus.prometheusSpec.web.tlsConfig.keySecret.name=prometheus-web-tls \
--set prometheus.prometheusSpec.web.tlsConfig.keySecret.key=tls.key \
--reuse-values
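Optionally, confirm that Prometheus is now serving HTTPS (a hedged check; `-k` skips verification of the self-signed certificate created above):
# Expect an HTTP response over TLS from the /-/ready endpoint
kubectl run prom-tls-check --rm -i --restart=Never -n ${MON_NS} \
  --image=curlimages/curl -- \
  curl -skI https://llmd-kube-prometheus-stack-prometheus.${MON_NS}.svc.cluster.local:9090/-/ready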
Note: For Kind clusters, consider using simulated accelerators if vLLM GPU detection fails. Saturation-based scaling does not require node patching; it only uses workload metrics (KV cache utilization, queue length). Simulator pods don't request GPUs, so they can schedule without node patching. If using regular vLLM with GPU resource requests on Kind, you must patch nodes for pods to schedule (e.g., `kubectl patch node <node-name> -p '{"status":{"capacity":{"nvidia.com/gpu":"8"}}}'`). WVA automatically discovers its namespace via `POD_NAMESPACE`.
Step 3: Create WVA Namespace (if needed)
The helmfile creates the `llm-d-autoscaler` namespace automatically if it doesn't exist. Note: if you already created the namespace in Prerequisites #4 (for example, to hold the HF token secret), this step is already done.
For OpenShift only, ensure the namespace has the monitoring label:
export NAMESPACE=${NAMESPACE:-llm-d-autoscaler}
kubectl label namespace "${NAMESPACE}" openshift.io/user-monitoring=true --overwrite
Step 4: Install WVA CRDs (Required)
Install WVA CRDs before deploying:
kubectl apply -f https://raw.githubusercontent.com/llm-d-incubation/workload-variant-autoscaler/v0.5.1/charts/workload-variant-autoscaler/crds/llmd.ai_variantautoscalings.yaml
kubectl get crd variantautoscalings.llmd.ai
Step 5: Install WVA (with llm-d Stack - if not deployed already)
Choose your installation mode:
Option A: Full Installation (Default)
Install the complete llm-d inference-scheduling stack (infra, gaie, modelservice) plus WVA:
export NAMESPACE=${NAMESPACE:-llm-d-autoscaler}
cd guides/workload-autoscaling
helmfile apply -n ${NAMESPACE}
To use agentgateway instead of the default istio environment:
export NAMESPACE=${NAMESPACE:-llm-d-autoscaler}
cd guides/workload-autoscaling
helmfile apply -e agentgateway -n ${NAMESPACE}
To use the deprecated kgateway migration path:
export NAMESPACE=${NAMESPACE:-llm-d-autoscaler}
cd guides/workload-autoscaling
helmfile apply -e kgateway -n ${NAMESPACE}
This installs the complete Intelligent Inference Scheduling stack:
- Infra (gateway infrastructure)
- GAIE (inference pool and endpoint picker)
- Model Service (vLLM inference pods)
- WVA (workload-variant-autoscaler) in the `llm-d-autoscaler` namespace
WVA automatically discovers its namespace via POD_NAMESPACE.
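A quick optional check that the releases deployed (release and pod names depend on your helmfile values):
# List Helm releases and pods in the installation namespace
helm list -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE}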
Option B: WVA-Only Installation (If utilizing existing stack)
If you already have the Intelligent Inference Scheduling stack installed, you can install only WVA and connect it to your existing deployment.
Prerequisites:
- An existing inference-scheduling deployment in your cluster
- The namespace where inference-scheduling is deployed (default: `llm-d-inference-scheduler`)
- The release name postfix used for inference-scheduling (default: `inference-scheduling`)
Configuration:
WVA will auto-detect the model service name based on common patterns, but you can override via environment variables or values file:
# Set the namespace where your inference-scheduling is deployed (default: llm-d-inference-scheduler)
export LLMD_NAMESPACE=llm-d-inference-scheduler
# Set the release name postfix used for inference-scheduling (default: inference-scheduling)
# This is used to auto-detect the model service name: ms-{RELEASE_NAME_POSTFIX}-llm-d-modelservice
export LLMD_RELEASE_NAME_POSTFIX=inference-scheduling
# Set the namespace for WVA installation (default: llm-d-autoscaler)
export WVA_NAMESPACE=llm-d-autoscaler
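Before installing, you can sanity-check that a deployment matching the auto-detected model service name actually exists (a sketch; the exact Deployment naming in your inference-scheduling install may differ, for example with a role-specific suffix):
# Look for Deployments matching the auto-detected model service name prefix
kubectl get deployments -n ${LLMD_NAMESPACE} | grep "ms-${LLMD_RELEASE_NAME_POSTFIX}"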
Optional: Explicit Configuration in values.yaml
For explicit control, you can override the auto-detected values in workload-autoscaling/values.yaml:
llmd:
  namespace: llm-d-inference-scheduler # Namespace of existing inference-scheduling deployment
  modelName: ms-inference-scheduling-llm-d-modelservice # Explicit model service name (optional)
  modelID: "Qwen/Qwen3-0.6B" # Must match the model in your inference-scheduling deployment
Install WVA only:
cd guides/workload-autoscaling
helmfile apply -e wva-only -n ${WVA_NAMESPACE}
Note: Use `WVA_NAMESPACE` (not `LLMD_NAMESPACE`) for the `-n` flag. This is the namespace where WVA will be installed. WVA will connect to your existing inference-scheduling deployment in the `LLMD_NAMESPACE`.
This installs only:
- WVA (workload-variant-autoscaler) in the namespace specified by `WVA_NAMESPACE` (default: `llm-d-autoscaler`)
WVA will connect to your existing inference-scheduling deployment using the configured namespace and model service name. The model service name is auto-detected as ms-{LLMD_RELEASE_NAME_POSTFIX}-llm-d-modelservice unless explicitly set in values.yaml.
Step 6: Install Prometheus Adapter (Required Dependency)
Prometheus Adapter exposes WVA's external metric to HPA/KEDA. Install after WVA installation (Step 5), which creates the required prometheus-ca ConfigMap.
Choose your platform and follow the corresponding section:
6.1: OpenShift
# Setup
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
export MON_NS=openshift-user-workload-monitoring
# Download OpenShift-specific values
curl -o ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml \
https://raw.githubusercontent.com/llm-d-incubation/workload-variant-autoscaler/v0.5.1/config/samples/prometheus-adapter-values-ocp.yaml
# Update Prometheus URL
sed -i.bak "s|url:.*|url: https://thanos-querier.openshift-monitoring.svc.cluster.local|" ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml || \
echo "Edit ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml to set prometheus.url"
# Install
helm upgrade -i prometheus-adapter prometheus-community/prometheus-adapter \
--version 5.2.0 -n ${MON_NS} --create-namespace -f ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml
# Verify RBAC permissions
kubectl auth can-i --list --as=system:serviceaccount:${MON_NS}:prometheus-adapter | grep -E "monitoring.coreos.com|prometheuses|namespaces"
# Create ClusterRole for Prometheus API access if needed
kubectl apply -f - <<'YAML'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: allow-thanos-querier-api-access
rules:
- nonResourceURLs: ["/api/v1/query", "/api/v1/query_range", "/api/v1/labels", "/api/v1/label/*/values", "/api/v1/series", "/api/v1/metadata", "/api/v1/rules", "/api/v1/alerts"]
  verbs: ["get"]
- apiGroups: ["monitoring.coreos.com"]
  resourceNames: ["k8s"]
  resources: ["prometheuses/api"]
  verbs: ["get", "create", "update"]
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["get"]
YAML
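If you created the ClusterRole above, it only takes effect once bound to the adapter's service account. A minimal sketch of such a binding (the service account name matches the one checked with kubectl auth can-i above; adjust if your release uses a different name):
# Bind the ClusterRole to the prometheus-adapter service account
kubectl apply -f - <<YAML
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: allow-thanos-querier-api-access
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: allow-thanos-querier-api-access
subjects:
- kind: ServiceAccount
  name: prometheus-adapter
  namespace: ${MON_NS}
YAML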
6.2: GKE/Generic Kubernetes
# Setup
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
export MON_NS=${MON_NS:-llm-d-monitoring}
# Download values
curl -o ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml \
https://raw.githubusercontent.com/llm-d-incubation/workload-variant-autoscaler/v0.5.1/config/samples/prometheus-adapter-values.yaml
# Update Prometheus URL
sed -i.bak "s|url:.*|url: http://llmd-kube-prometheus-stack-prometheus.${MON_NS}.svc.cluster.local:9090|" ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml || \
echo "Edit ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml to set prometheus.url"
# Install
helm upgrade -i prometheus-adapter prometheus-community/prometheus-adapter \
--version 5.2.0 -n ${MON_NS} --create-namespace -f ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml
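Optionally, confirm the adapter registered the external metrics API (this is the standard APIService name that prometheus-adapter registers when external metric rules are configured):
# The APIService should report Available=True once the adapter is up
kubectl get apiservice v1beta1.external.metrics.k8s.io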
6.3: Kind/HTTPS Prometheus
For Kind clusters with HTTPS Prometheus (configured in Step 2), the prometheus-ca ConfigMap is created by WVA (Step 5). Configure Prometheus Adapter to use it:
# Setup
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
export MON_NS=${MON_NS:-llm-d-monitoring}
# Download values
curl -o ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml \
https://raw.githubusercontent.com/llm-d-incubation/workload-variant-autoscaler/v0.5.1/config/samples/prometheus-adapter-values.yaml
# Configure values with CA cert (ConfigMap created by WVA in Step 5)
cat >> ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml <<EOF
prometheus:
  url: https://llmd-kube-prometheus-stack-prometheus.${MON_NS}.svc.cluster.local
  port: 9090
extraArguments:
  - --prometheus-ca-file=/etc/ssl/certs/prometheus-ca.crt
extraVolumeMounts:
  - name: prometheus-ca
    mountPath: /etc/ssl/certs/prometheus-ca.crt
    subPath: ca.crt
    readOnly: true
extraVolumes:
  - name: prometheus-ca
    configMap:
      name: prometheus-ca
EOF
# Install
helm upgrade -i prometheus-adapter prometheus-community/prometheus-adapter \
--version 5.2.0 -n ${MON_NS} --create-namespace -f ${TMPDIR:-/tmp}/prometheus-adapter-values.yaml
Note: WVA creates the `prometheus-ca` ConfigMap in the monitoring namespace using the `caCert` value. This ConfigMap is required for Prometheus Adapter (Step 6.3).
Verify installation: kubectl get pods -n ${MON_NS} -l app.kubernetes.io/name=prometheus-adapter
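If the external metric never appears, the adapter logs are usually the quickest way to spot TLS or connectivity errors against Prometheus (the deployment name below is assumed from the Helm release name used above):
kubectl logs -n ${MON_NS} deploy/prometheus-adapter --tail=50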
Step 7: Verify End-to-End Installation
export NAMESPACE=${NAMESPACE:-llm-d-autoscaler}
export MON_NS=${MON_NS:-llm-d-monitoring}
# Check pods
kubectl get pods -n ${MON_NS} -l app.kubernetes.io/name=prometheus-adapter
kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=workload-variant-autoscaler
# Verify metrics and HPA
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/${NAMESPACE}/inferno_desired_replicas" | jq
kubectl get hpa -n ${NAMESPACE}
kubectl get variantautoscalings -n ${NAMESPACE}
Configuration Checklist
Edit workload-autoscaling/values.yaml for WVA settings. Key configurations:
For WVA-Only Mode: If using wva-only installation, configure connection to existing inference-scheduling deployment:
llmd:
  namespace: llm-d-inference-scheduler # Namespace of existing inference-scheduling deployment
  modelName: ms-inference-scheduling-llm-d-modelservice # Optional: explicit model service name (auto-detected if not set)
  modelID: "Qwen/Qwen3-0.6B" # Must match model ID in your inference-scheduling deployment
For Full Installation: Model ID must match model configured in modelservice:
llmd:
  modelID: "Qwen/Qwen3-0.6B" # Must match model ID in ms-workload-autoscaling/values.yaml
Accelerator (L40S, A100, H100, Intel-Max-1550):
va:
  accelerator: L40S # your accelerator type
Prometheus (platform-specific):
wva:
  prometheus:
    monitoringNamespace: llm-d-monitoring
    baseURL: "https://..." # Platform-specific URL
    tls:
      insecureSkipVerify: true
See WVA chart documentation for all options.
Upgrading
Upgrading from v0.4.1 or Earlier
Important Breaking Change in v0.5.1: The scaleTargetRef field is now required in the VariantAutoscaling CRD. Existing VariantAutoscaling resources without scaleTargetRef must be updated before upgrading to v0.5.1.
Impact
- Scale-to-Zero: VariantAutoscalings without `scaleTargetRef` will not scale to zero properly, even with HPAScaleToZero enabled and HPA `minReplicas: 0`, because the HPA cannot reference the target deployment.
- Validation: After the CRD update, VariantAutoscalings without `scaleTargetRef` will fail validation.
Migration Steps
1. Update CRDs first (Helm does not automatically update CRDs during `helm upgrade`):
   kubectl apply -f https://raw.githubusercontent.com/llm-d-incubation/workload-variant-autoscaler/v0.5.1/charts/workload-variant-autoscaler/crds/llmd.ai_variantautoscalings.yaml
2. Update existing VariantAutoscaling resources to include the required `scaleTargetRef` field:
   # List all VariantAutoscalings
   kubectl get variantautoscalings -A
   # For each VariantAutoscaling, add scaleTargetRef
   kubectl edit variantautoscaling <name> -n <namespace>
   Add the following to the `spec` section:
   spec:
     scaleTargetRef:
       kind: Deployment
       name: <your-deployment-name> # Replace with your actual deployment name
     # ... rest of your existing spec
3. Verify the CRD update:
   kubectl get crd variantautoscalings.llmd.ai -o jsonpath='{.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties}' | jq 'keys'
   You should see `scaleTargetRef` in the list of properties.
4. Upgrade the Helm release:
   cd guides/workload-autoscaling
   helmfile apply -n ${NAMESPACE:-llm-d-autoscaler}
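If you have many VariantAutoscaling resources, a scripted patch can save editing each one by hand. The sketch below is an assumption-laden example: it assumes the target Deployment has the same name as the VariantAutoscaling, which may not hold in your cluster, so review it before running:
# Add scaleTargetRef to every VariantAutoscaling, pointing at a Deployment of the same name
kubectl get variantautoscalings -A -o json \
  | jq -r '.items[] | [.metadata.namespace, .metadata.name] | @tsv' \
  | while IFS=$'\t' read -r ns name; do
      kubectl patch variantautoscaling "$name" -n "$ns" --type merge \
        -p "{\"spec\":{\"scaleTargetRef\":{\"kind\":\"Deployment\",\"name\":\"$name\"}}}"
    done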
For more details, see the WVA breaking changes documentation.
Cleanup
Remove WVA and Prometheus Adapter:
For Full Installation:
# Remove WVA stack
cd guides/workload-autoscaling
helmfile destroy -n ${NAMESPACE:-llm-d-autoscaler}
If you installed with agentgateway or the deprecated kgateway migration path, destroy with the matching environment:
cd guides/workload-autoscaling
helmfile destroy -e agentgateway -n ${NAMESPACE:-llm-d-autoscaler}
# or: helmfile destroy -e kgateway -n ${NAMESPACE:-llm-d-autoscaler}
For WVA-Only Installation:
# Remove only WVA (existing inference-scheduling stack remains)
cd guides/workload-autoscaling
helmfile destroy -e wva-only -n ${NAMESPACE:-llm-d-autoscaler}
Remove Prometheus Adapter (if not needed by other components):
helm uninstall prometheus-adapter -n ${MON_NS:-llm-d-monitoring}
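The VariantAutoscaling CRD was installed with kubectl in Step 4, so helmfile destroy leaves it in place. For a full cleanup, remove it separately (this also deletes all VariantAutoscaling resources):
kubectl delete crd variantautoscalings.llmd.ai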
The llm-d stack (infra, gaie, modelservice) continues operating without autoscaling after WVA removal.