Predicted Latency-Based Routing
Overview
Route each inference request to the model server predicted to serve it fastest — and, optionally, only to a server predicted to meet its TTFT/TPOT SLO.
This path is for operators who want to adopt predicted latency-based scheduling in an existing llm-d deployment. For what the component is and how it works internally — the plugin pipeline, the ML model, scaling characteristics, the full metric list — see architecture/advanced/latency-predictor.md.
When to Pick This Path
Pick it when:
- Your workload has high variance in prompt and completion length, and queue depth alone is a poor proxy for true load.
- Your clients can express per-request latency SLOs (interactive vs. batch) and you want the gateway to enforce them.
- Static weight tuning between cache affinity and load has become fragile as traffic shifts.
Skip it when your pool is heterogeneous — mixed GPU types, model variants, or serving configurations in the same pool will produce inaccurate predictions, because the predictor assumes a single pod shape.
OpenShift is not yet reliably supported by this guide. The latency-predictor sidecars used by predicted-latency scheduling may require OpenShift-specific runtime adjustments beyond the manifests in this guide. Until that is resolved, prefer GKE or CoreWeave, which are the tested platforms.
Prerequisites
- Have the proper client tools installed on your local system to use this guide.
- Check out the llm-d repo:
export branch="main" # branch, tag, or commit hash
git clone https://github.com/llm-d/llm-d.git && cd llm-d && git checkout ${branch}
- Set the following environment variables:
export GAIE_VERSION=v1.5.0
export GUIDE_NAME="predicted-latency-routing"
export NAMESPACE=llm-d-predicted-latency
export MODEL_NAME="Qwen/Qwen3-32B"
- Install the Gateway API Inference Extension CRDs:
kubectl apply -k "https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=${GAIE_VERSION}"
- Create a target namespace for the installation:
kubectl create namespace ${NAMESPACE}
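Every later command in this guide expands these variables, so an unset one tends to fail in confusing ways (for example, a Helm release with an empty name). The following is a minimal sketch of a pre-flight check, assuming a POSIX-compatible shell. The helper name require_env is hypothetical and is not shipped with llm-d.

```shell
# Hypothetical helper (not part of llm-d): report variables that are unset or empty.
# Returns 0 when every named variable has a value, 1 otherwise.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name; only pass trusted names to eval.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example:
#   require_env GAIE_VERSION GUIDE_NAME NAMESPACE MODEL_NAME || echo "export the variables above first"
```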
Installation Instructions
1. Deploy the llm-d Router
Two ready-to-use values files ship with this guide:
| File | When to use |
|---|---|
| router/predicted-latency.values.yaml | Default: the predictor trains on end-to-end latency. Routing only, no SLO header support. |
| router/predicted-latency-slo.values.yaml | SLO-aware: assumes x-slo-ttft-ms / x-slo-tpot-ms are set on requests. Every request must be sent with "stream": true. |
Both files select model server pods labeled llm-d.ai/guide=optimized-baseline, because the next step reuses the model server manifests from the optimized-baseline guide.
Standalone Mode
This deploys the llm-d Router with an Envoy sidecar; it does not set up a Kubernetes Gateway.
export REPO_ROOT=$(realpath $(git rev-parse --show-toplevel))
helm install ${GUIDE_NAME} \
oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
-f ${REPO_ROOT}/guides/recipes/router/base.values.yaml \
-f ${REPO_ROOT}/guides/${GUIDE_NAME}/router/predicted-latency.values.yaml \
-n ${NAMESPACE} --version ${GAIE_VERSION}
Nightly CI also passes --set inferenceExtension.monitoring.prometheus.auth.enabled=false so that the validation job can scrape EPP metrics without a bearer token. That override is CI-only; leave metrics auth enabled for normal deployments unless you explicitly need unauthenticated scraping.
For SLO-aware scheduling, swap the values file: -f guides/${GUIDE_NAME}/router/predicted-latency-slo.values.yaml.
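Because the two modes differ only in which values file is passed, the choice can be wrapped in a small function. This is a sketch under the assumptions of this guide: it uses the same chart, base values file, and variables as the standalone command, and it expects REPO_ROOT, GUIDE_NAME, NAMESPACE, and GAIE_VERSION to be exported already. The function names router_values_file and install_standalone are hypothetical.

```shell
# Hypothetical helper: map a routing mode to the values file shipped with this guide.
router_values_file() {
  case "$1" in
    default) echo "guides/${GUIDE_NAME}/router/predicted-latency.values.yaml" ;;
    slo)     echo "guides/${GUIDE_NAME}/router/predicted-latency-slo.values.yaml" ;;
    *)       echo "unknown mode: $1 (expected default or slo)" >&2; return 1 ;;
  esac
}

# Standalone install for the chosen mode; same chart and flags as the standalone command.
install_standalone() {
  values_file="$(router_values_file "$1")" || return 1
  helm install "${GUIDE_NAME}" \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/standalone \
    -f "${REPO_ROOT}/guides/recipes/router/base.values.yaml" \
    -f "${REPO_ROOT}/${values_file}" \
    -n "${NAMESPACE}" --version "${GAIE_VERSION}"
}

# Example:
#   install_standalone slo
```

Keeping the mode as an explicit argument makes it harder to install the default file by accident on a cluster whose clients already send SLO headers.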
Gateway Mode
To use a Kubernetes Gateway managed proxy rather than the standalone version, follow these steps instead of applying the previous Helm chart:
- Deploy a Kubernetes Gateway by following one of the gateway guides.
- Deploy the llm-d Router and an HTTPRoute that connects it to the Gateway as follows:
export REPO_ROOT=$(realpath $(git rev-parse --show-toplevel))
export PROVIDER_NAME=gke # options: none, gke, agentgateway, istio
helm install ${GUIDE_NAME} \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
-f ${REPO_ROOT}/guides/recipes/router/base.values.yaml \
-f ${REPO_ROOT}/guides/recipes/router/features/httproute-flags.yaml \
-f ${REPO_ROOT}/guides/${GUIDE_NAME}/router/predicted-latency.values.yaml \
--set provider.name=${PROVIDER_NAME} \
-n ${NAMESPACE} --version ${GAIE_VERSION}
2. Deploy the Model Server
This guide reuses the model server manifests from the optimized-baseline guide (the values files above already select pods labeled llm-d.ai/guide=optimized-baseline). Apply the default NVIDIA GPU / vLLM overlay:
export INFRA_PROVIDER=base # base | gke
kubectl apply -n ${NAMESPACE} -k guides/optimized-baseline/modelserver/gpu/vllm/${INFRA_PROVIDER}/
For other backends (AMD GPU, Intel XPU, Gaudi, TPU, CPU), see optimized-baseline → Deploy the Model Server.
3. Enable monitoring (optional)
Follow optimized-baseline → Enable monitoring — the same steps apply since this guide reuses the same model server manifests.
Send Requests
Once enabled, latency-based scheduling works on every request — no header changes needed. The proxy picks the endpoint with the lowest predicted latency.
To opt an individual request into SLO-aware routing, add one or both headers:
- x-slo-ttft-ms: time-to-first-token SLO in milliseconds.
- x-slo-tpot-ms: time-per-output-token SLO in milliseconds.
1. Get the IP of the Proxy
Standalone Mode
export IP=$(kubectl get service ${GUIDE_NAME}-epp -n ${NAMESPACE} -o jsonpath='{.spec.clusterIP}')
Gateway Mode
export IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')
2. Send a Test Request
Open a temporary interactive shell inside the cluster:
kubectl run curl-debug --rm -it \
--image=cfmanteiga/alpine-bash-curl-jq \
--env="IP=$IP" \
--env="NAMESPACE=$NAMESPACE" \
--env="MODEL_NAME=$MODEL_NAME" \
-- /bin/bash
Send a completion request:
curl -X POST http://${IP}/v1/completions \
-H 'Content-Type: application/json' \
-H 'x-slo-ttft-ms: 200' \
-H 'x-slo-tpot-ms: 50' \
-d '{
"model": "'${MODEL_NAME}'",
"prompt": "Explain the difference between prefill and decode.",
"max_tokens": 200,
"temperature": 0,
"stream": true,
"stream_options": {"include_usage": true}
}'
Sheddable requests (priority < 0) are rejected at admission when no endpoint can meet the SLO, rather than routed to a guaranteed miss.
Verify
Once traffic is flowing, confirm three things in Prometheus (see the architecture doc for the metric reference):
- Predictions are being produced. inference_objective_request_ttft_prediction_duration_seconds has non-zero samples. If it stays empty, the predictor sidecar is not being called; tail the EPP logs for predicted-latency-producer errors.
- Predictions track reality. Compare inference_objective_request_predicted_ttft_seconds against inference_objective_request_ttft_seconds over a rolling window. A healthy deployment converges to within a few percent after warmup.
- SLOs are being honored. If you're sending SLO-annotated traffic, inference_objective_request_ttft_slo_violation_total and ..._tpot_slo_violation_total should increment only under genuine saturation.
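These checks can be scripted against the Prometheus HTTP API. The sketch below makes several assumptions: PROM_URL is a hypothetical variable pointing at a reachable Prometheus (for example through a port-forward), and both helper names are made up for illustration. This guide does not say whether the TTFT metrics are exported as gauges or histograms, so the right PromQL expression for comparing predicted and actual TTFT is left to you. The second helper only turns two numbers you have already read into a relative error.

```shell
# Assumption: PROM_URL points at a reachable Prometheus server.
# Runs an instant query and prints the raw JSON response.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Relative error of a prediction, as a percentage of the actual value.
# Prints "undefined" and returns non-zero when the actual value is 0.
relative_error_pct() {
  awk -v p="$1" -v a="$2" 'BEGIN {
    if (a == 0) { print "undefined"; exit 1 }
    d = p - a; if (d < 0) d = -d
    printf "%.1f\n", 100 * d / a
  }'
}

# Examples:
#   prom_query 'inference_objective_request_ttft_slo_violation_total'
#   relative_error_pct 0.21 0.20   # prints 5.0
```

A result that stays in the low single digits after warmup matches the "within a few percent" expectation above. A value that keeps growing points at the drift causes listed under Troubleshooting.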
Cleanup
To remove the deployed components:
helm uninstall ${GUIDE_NAME} -n ${NAMESPACE}
kubectl delete -n ${NAMESPACE} -k guides/optimized-baseline/modelserver/gpu/vllm/${INFRA_PROVIDER}
kubectl delete namespace ${NAMESPACE}
Troubleshooting
| Symptom | Likely cause |
|---|---|
| Prediction duration metrics empty | Predictor sidecar unreachable — EPP falls back to composite heuristic scoring. Check sidecar readiness and PREDICTION_SERVER_URL. |
| Large, persistent drift between predicted and actual TTFT | streamingMode mismatch (set to false on a streaming workload, or vice versa), or workload drifted outside the training window. |
| High TPOT SLO violation rate at low QPS | streamingMode: false — TPOT is not being trained. Flip it to true and restart. |
| SLO violations cluster on a few pods during spikes | Scoring strategy is least; try most for more headroom at the cost of utilization. |
| Prediction-based routing degrades to baseline | Predictor error or sidecar restart — expected fallback, not a failure. Investigate sidecar logs. |
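Most rows in the table start with the same two questions: is the sidecar ready, and what do the EPP logs say about the predictor? The following is a rough sketch for answering both. It assumes the EPP Deployment is named ${GUIDE_NAME}-epp, which is inferred from the Service name used earlier in this guide and may differ in your install. The helper names are hypothetical.

```shell
# Assumption: the EPP Deployment is named "${GUIDE_NAME}-epp" (inferred from the Service name).
# Prints recent logs from every container in the EPP pod, sidecars included.
epp_logs() {
  kubectl logs -n "${NAMESPACE}" "deployment/${GUIDE_NAME}-epp" --all-containers --tail="${1:-500}"
}

# Keep only log lines that mention the predictor plugin; reads from stdin.
filter_predictor_lines() {
  grep -F 'predicted-latency-producer'
}

# Examples:
#   kubectl get pods -n "${NAMESPACE}"        # READY column shows whether all containers are up
#   epp_logs 1000 | filter_predictor_lines
```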
Related
- Latency Predictor Architecture — plugin pipeline, ML model, scaling characteristics, metric reference.
- llm-d/llm-d-router — source for the EPP plugins and per-plugin configuration references.
- llm-d/llm-d-latency-predictor — source for the training and prediction server Python code.
- Predicted Latency-Based Scheduling for LLMs — design rationale and benchmark results.