Autoscaling Workloads with HPA and IGW Metrics
This guide explains how to configure autoscaling for LLM workloads by integrating the Kubernetes Horizontal Pod Autoscaler (HPA) with metrics emitted by the Inference Gateway (IGW). By using gateway-level signals like queue depth and active request counts, you can achieve more responsive and model-aware scaling than with traditional CPU/Memory metrics.
Overview
Traditional autoscaling often relies on resource utilization (CPU/GPU). However, for LLM inference, resource usage is often "pegged" at 100% during active batches, making it a poor indicator of true load.
The llm-d architecture solves this by using the Endpoint Picker (EPP) flow control metrics. These metrics reflect the actual state of the inference queue and the health of the model pool, allowing the HPA to scale out before users experience high latency and scale in when capacity is idle.
Metric Definitions and Collection
Follow the Intelligent Inference Scheduling well-lit path to set up an LLM deployment. By default, llm-d deployments include the necessary ServiceMonitors to scrape EPP metrics.
- Metric Collection: For details on how to ensure scraping is active, see the llm-d Monitoring Guide.
- Metric Definitions: For a list of metrics emitted by EPP refer here.
Recommended Metrics for Scaling
| Metric Name | Description | Recommended Usage |
|---|---|---|
| inference_extension_flow_control_queue_size | The number of requests currently buffered in the gateway waiting for an available backend. | Scale-out signal: High queue size indicates that the existing replicas are saturated. |
| inference_objective_running_requests | The number of concurrent requests being processed by the model pool. | Capacity signal: Useful for tracking total throughput. |
Configuration Guide
1. Enable Flow Control in IGW
Enable the Flow Control layer by adding the flowControl FeatureGate to your EndpointPickerConfig:
apiVersion: config.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
featureGates:
- "flowControl"
# ...
Follow the flow control configuration guide to tune the saturation detector in your EPP deployment as needed.
2. Install the Prometheus Adapter
The Prometheus Adapter bridges Prometheus metrics to the Kubernetes External Metrics API, which the HPA uses to read IGW signals.
Add the Helm repository and install the adapter into your monitoring namespace:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter \
--namespace monitoring \
--create-namespace
Note: You must set prometheus.url to point to your Prometheus instance. If you are using kube-prometheus-stack, the default service is http://prometheus-operated.monitoring.svc:9090. Pass the URL at install time, using the command below in place of the bare install above, or in a values file:
helm install prometheus-adapter prometheus-community/prometheus-adapter \
--namespace monitoring \
--create-namespace \
--set prometheus.url=http://prometheus-operated.monitoring.svc \
--set prometheus.port=9090
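For the values-file route mentioned in the note, the same two settings could be written as follows. This is a minimal sketch: the file name prometheus-values.yaml is hypothetical, and the URL assumes the kube-prometheus-stack default service named above.

```yaml
# prometheus-values.yaml (hypothetical file name)
# Mirrors the two --set flags from the install command above.
prometheus:
  url: http://prometheus-operated.monitoring.svc  # assumed kube-prometheus-stack default service
  port: 9090
```

The file would then be passed to helm install with --values prometheus-values.yaml instead of the two --set flags.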
3. Configure Prometheus Adapter Rules
Create a values file igw-adapter-values.yaml with the following rules:
rules:
  external:
    - seriesQuery: 'inference_extension_flow_control_queue_size'
      resources:
        overrides:
          namespace:
            resource: "namespace"
        namespaced: false
      name:
        as: "igw_queue_depth"
      metricsQuery: 'sum(inference_extension_flow_control_queue_size{inference_pool="vllm-llama3-8b-instruct"})'
    - seriesQuery: 'inference_objective_running_requests'
      resources:
        overrides:
          namespace:
            resource: "namespace"
        namespaced: false
      name:
        as: "igw_running_requests"
      metricsQuery: 'sum(inference_objective_running_requests{top_level_controller_name="vllm-llama3-8b-instruct-epp"})'
Note: Replace vllm-llama3-8b-instruct and vllm-llama3-8b-instruct-epp with your own deployment names.
Apply the rules by upgrading the adapter:
helm upgrade prometheus-adapter prometheus-community/prometheus-adapter \
--namespace monitoring \
--reuse-values \
--values igw-adapter-values.yaml
Verify the metrics are visible to the Kubernetes API:
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/igw_queue_depth"
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/igw_running_requests"
A successful response returns a JSON object with the current metric value. A 404 means the adapter rules were not applied correctly or the Prometheus series does not exist yet. In that case, re-check the metricsQuery label values against your live Prometheus data.
4. Create the HPA Resource
Below is a sample HPA configuration hpa.yaml that uses the dual-metric setup to scale your model server based on both the queue depth and current request load.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-llama3-8b-instruct-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3-8b-instruct
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: External
      external:
        metric:
          name: igw_queue_depth
        target:
          type: Value
          value: "250"
    - type: External
      external:
        metric:
          name: igw_running_requests
        target:
          type: AverageValue
          averageValue: "250"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300 # 5 min cooldown to prevent flapping
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
Note 1: The target values (250) used here are examples and must be tuned to your model and hardware. A good starting point is to run your model server at a known concurrency level, observe the actual metric values using kubectl describe hpa, and set the target below the concurrency at which your model's latency begins to degrade.
Note 2: Although igw_queue_depth and igw_running_requests originate from the EPP pod, we use type: External rather than type: Pods. This is because type: Pods requires metrics to come from the pods being scaled, in this case the model server pods. Since the EPP is a separate deployment acting as a gateway and emitting metrics on behalf of the model server pool, we treat its metrics as external signals.
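As a rough illustration of how the HPA turns the two targets into a replica count, the arithmetic below follows the scaling formulas described in the Kubernetes HPA documentation. The readings (a queue depth of 600 and 400 running requests, observed at one replica) are made-up numbers for illustration, not measurements, and the default tolerance band is ignored.

```latex
\begin{aligned}
r_{\text{queue}} &= \left\lceil r_{\text{current}} \cdot \frac{\texttt{igw\_queue\_depth}}{250} \right\rceil
  = \left\lceil 1 \cdot \frac{600}{250} \right\rceil = 3
  && \text{(target type \texttt{Value})} \\
r_{\text{running}} &= \left\lceil \frac{\texttt{igw\_running\_requests}}{250} \right\rceil
  = \left\lceil \frac{400}{250} \right\rceil = 2
  && \text{(target type \texttt{AverageValue})} \\
r_{\text{desired}} &= \min\bigl(\texttt{maxReplicas},\ \max(\texttt{minReplicas},\ r_{\text{queue}},\ r_{\text{running}})\bigr)
  = \min\bigl(3,\ \max(1, 3, 2)\bigr) = 3
\end{aligned}
```

When several metrics are configured, the HPA acts on the largest proposal. The scaleUp policy in the sample (Percent 100 every 15 seconds) allows at most a doubling per period, so under these assumptions the move from 1 to 3 replicas would likely happen in two steps.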
5. Verify the HPA
Apply the manifest and confirm the HPA is reading metrics:
kubectl apply -f hpa.yaml
kubectl get hpa vllm-llama3-8b-instruct-hpa -n default
A successful deployment would look like this:
NAME                          REFERENCE                            TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
vllm-llama3-8b-instruct-hpa   Deployment/vllm-llama3-8b-instruct   0/250, 0/250 (avg)   1         3         1          5m
Scale to Zero
To unlock significant cost savings on GPU resources, you can scale your deployment to zero pods when there is no traffic. With the EPP Flow Control Layer, scale-from-zero is now seamless:
- Request Queueing: When traffic hits a deployment with 0 replicas, the EPP flow control layer automatically queues the requests in its internal buffers.
- Late Binding: The EPP "holds" these requests while the autoscaler provisions the pods. Once the model server becomes ready, the EPP immediately dispatches the queued requests.
- User Experience: Users will see a latency spike (corresponding to the pod's startup time) but will not receive 5xx errors during the scaling event.
There are a couple of options to leverage the scale to/from zero feature.
Option 1: Native HPA
HPA supports scaling to zero through the HPAScaleToZero alpha feature flag. This is the recommended path for a native Kubernetes experience.
- Enable Feature Gate: Follow the Kubernetes Alpha Feature Guide to enable the HPAScaleToZero feature gate on your cluster.
- Configure HPA: Set minReplicas: 0 in your HPA manifest.
- Outcome: The HPA will de-provision all pods when metrics hit zero and re-provision them as soon as igw_queue_depth > 0.
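The manifest change for this option might look like the sketch below. It reuses the names and target from the sample HPA in step 4 and assumes the HPAScaleToZero feature gate is already enabled on the cluster; without the gate, the API server is expected to reject minReplicas: 0.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-llama3-8b-instruct-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3-8b-instruct
  minReplicas: 0   # requires the HPAScaleToZero feature gate
  maxReplicas: 3
  metrics:
    # Scaling to zero needs at least one Object or External metric,
    # which the gateway queue depth satisfies.
    - type: External
      external:
        metric:
          name: igw_queue_depth
        target:
          type: Value
          value: "250"   # example target, tune as described in Note 1
  # The igw_running_requests metric and the behavior block from step 4
  # can be kept unchanged; they are omitted here for brevity.
```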
Option 2: KEDA
If your environment does not allow alpha feature gates, KEDA is a stable alternative.
- Set up KEDA: Install KEDA and follow the KEDA Prometheus Scaler guide. KEDA ships with its own built-in metrics adapter, enabled by default at install time, so unlike the HPA setup above it does not require the Prometheus Adapter.
- Configure Scaler: Use the same igw_queue_depth metric as a trigger.
- Outcome: KEDA scales the deployment from 0 to 1 as soon as a request is queued. Once at 1 pod, the standard HPA (configured with minReplicas: 1) takes over to scale up to N.
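A KEDA trigger for this option could be sketched as the ScaledObject below. Several details are assumptions rather than part of this guide: the object name is hypothetical, the Prometheus address is the kube-prometheus-stack default mentioned earlier, and the query repeats the PromQL behind igw_queue_depth, since KEDA queries Prometheus directly instead of reading the adapter's metric name.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-llama3-8b-instruct-scaler   # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    name: vllm-llama3-8b-instruct        # Deployment to scale
  minReplicaCount: 0                     # allow scale to zero
  maxReplicaCount: 1                     # assumption: KEDA only handles the 0 <-> 1 step
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090  # assumed Prometheus URL
        query: sum(inference_extension_flow_control_queue_size{inference_pool="vllm-llama3-8b-instruct"})
        threshold: "1"
        activationThreshold: "0"         # activate once the queue depth rises above zero
```

KEDA manages an HPA of its own for the target it scales, so it is worth checking the KEDA documentation on how that interacts with a separately defined HPA before combining the two on the same Deployment.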
This content is automatically synced from guides/workload-autoscaling/README.hpa-igw.md on the main branch of the llm-d/llm-d repository.