Workload-Variant-Autoscaler (WVA)
The Workload-Variant-Autoscaler (WVA) is a Kubernetes-based global autoscaler for inference model servers serving LLMs. WVA works alongside the standard Kubernetes HPA and external autoscalers such as KEDA to scale any object that supports the scale subresource. The high-level details of the algorithm are described in the docs directory. WVA determines optimal replica counts for a given request traffic load by considering constraints such as GPU count (cluster resources), energy budget, and performance budget (latency/throughput).
Key Features
- Intelligent Autoscaling: Optimizes replica count by observing the current state of the system
- Cost Optimization: Minimizes infrastructure costs by picking the correct accelerator variant
Quick Start
Prerequisites
- Kubernetes v1.31.0+ (or OpenShift 4.18+)
- Helm 3.x
- kubectl
Install with Helm (Recommended)
See the INSTALL (on OpenShift) section of the documentation for detailed steps.
Try it Locally with Kind (No GPU Required!)
```bash
# Deploy WVA with llm-d infrastructure on a local Kind cluster
make deploy-wva-emulated-on-kind CREATE_CLUSTER=true DEPLOY_LLM_D=true

# This creates a Kind cluster with emulated GPUs and deploys:
# - WVA controller
# - llm-d infrastructure (simulation mode)
# - Prometheus and monitoring stack
# - vLLM emulator for testing
```
Works on Mac (Apple Silicon/Intel) and Windows - no physical GPUs needed! Perfect for development and testing with GPU emulation.
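A quick smoke test once the deploy finishes. This sketch assumes the chart's default namespace `workload-variant-autoscaler-system` (the same one used in the upgrade commands later in this README):

```bash
# The WVA controller pod should be Running
kubectl get pods -n workload-variant-autoscaler-system

# List VariantAutoscaling resources across namespaces (empty until you create one)
kubectl get variantautoscalings.llmd.ai -A
```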
See the Installation Guide for detailed instructions.
Documentation
- User Guide
- Integrations
- Deployment Options
How It Works
1. Platform admin deploys llm-d infrastructure (including model servers) and waits for the servers to warm up and start serving requests
2. Platform admin creates a VariantAutoscaling CR for the running deployment
3. WVA continuously monitors request rates and server performance via Prometheus metrics
4. The capacity model obtains KV cache utilization and queue depth of inference servers with slack capacity to determine replicas
5. The actuator emits optimization metrics to Prometheus and updates the VariantAutoscaling status
6. An external autoscaler (HPA/KEDA) reads the metrics and scales the deployment accordingly (see the KEDA sketch below)
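For step 6, a minimal sketch of how a KEDA ScaledObject might consume WVA's output via the standard prometheus trigger. The metric name `wva_desired_replicas`, its labels, and the Prometheus service address are illustrative assumptions, not WVA's documented metric; substitute the metric WVA actually emits in your deployment:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-8b-scaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    name: llama-8b            # the Deployment that WVA optimizes
  minReplicaCount: 0          # KEDA supports scale-to-zero
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-k8s.monitoring.svc:9090  # assumed address
        query: wva_desired_replicas{model_id="meta/llama-3.1-8b"} # hypothetical metric
        threshold: "1"
```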
Important Notes:
- WVA handles the creation order gracefully - you can create the VA before or after the deployment
- If a deployment is deleted, the VA status is immediately updated to reflect the missing deployment
- When the deployment is recreated, the VA automatically resumes operation
- Configure the HPA stabilization window (120s or more is recommended) for gradual scaling behavior; a sketch follows these notes
- WVA updates the VA status with current and desired allocations every reconciliation cycle
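A minimal sketch of the stabilization-window setting from the note above, using the standard autoscaling/v2 API. The external metric name `wva_desired_replicas` is a placeholder (it assumes the Prometheus Adapter exposes WVA's emitted metric); the deployment name matches the example CR in the next section:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-8b-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-8b
  minReplicas: 1                        # 0 requires the HPAScaleToZero feature gate
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120   # 120s+ recommended for gradual scaling
  metrics:
    - type: External
      external:
        metric:
          name: wva_desired_replicas    # hypothetical metric name
        target:
          type: AverageValue
          averageValue: "1"
```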
Example
```yaml
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-autoscaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    kind: Deployment
    name: llama-8b
  modelID: "meta/llama-3.1-8b"
  variantCost: "10.0"  # Optional, defaults to "10.0"
```
More examples in config/samples/.
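Assuming the manifest above is saved as `variantautoscaling.yaml`, you can apply it and watch WVA populate the status with the current and desired allocations:

```bash
kubectl apply -f variantautoscaling.yaml

# The status is updated every reconciliation cycle
kubectl get variantautoscalings.llmd.ai llama-8b-autoscaler -n llm-inference -o yaml
```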
Upgrading
CRD Updates
Important: Helm does not automatically update CRDs during helm upgrade. When upgrading WVA to a new version with CRD changes, you must manually apply the updated CRDs first:
```bash
# Apply the latest CRDs before upgrading
kubectl apply -f charts/workload-variant-autoscaler/crds/

# Then upgrade the Helm release
helm upgrade workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  [your-values...]
```
Breaking Changes
v0.5.0 (upcoming)
- VariantAutoscaling CRD: Added `scaleTargetRef` as a required field. v0.4.1 VariantAutoscaling resources without `scaleTargetRef` must be updated before upgrading:
  - Impact on Scale-to-Zero: VAs without `scaleTargetRef` will not scale to zero properly, even with the HPAScaleToZero feature gate enabled and HPA `minReplicas: 0`, because the HPA cannot reference the target deployment.
  - Migration: Update existing VAs to include `scaleTargetRef`:
    ```yaml
    spec:
      scaleTargetRef:
        kind: Deployment
        name: <your-deployment-name>
    ```
  - Validation: After the CRD update, VAs without `scaleTargetRef` will fail validation.
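One possible migration path for an existing VA, using a standard `kubectl patch` (the resource and deployment names below are taken from the example earlier in this README; adjust them to your own resources):

```bash
kubectl patch variantautoscalings.llmd.ai llama-8b-autoscaler -n llm-inference \
  --type merge \
  -p '{"spec":{"scaleTargetRef":{"kind":"Deployment","name":"llama-8b"}}}'
```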
Verifying CRD Version
To check if your cluster has the latest CRD schema:
```bash
# Check the CRD fields
kubectl get crd variantautoscalings.llmd.ai \
  -o jsonpath='{.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties}' | jq 'keys'
```
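The exact key list depends on your chart version, but after upgrading to v0.5.0 it should include `scaleTargetRef` alongside the spec fields shown in the example above, for instance:

```json
[
  "modelID",
  "scaleTargetRef",
  "variantCost"
]
```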
Contributing
We welcome contributions! See the llm-d Contributing Guide for guidelines.
Join the llm-d autoscaling community meetings to get involved.
License
Apache 2.0 - see LICENSE for details.
For detailed documentation, visit the docs directory.