
Predicted Latency-Based Scheduling

llm-d's optimized baseline guide leverages load signals and prefix-cache affinity to schedule requests, combining these signals with heuristics.

This path is for operators who want to adopt predicted latency-based scheduling, which uses an XGBoost model trained online to drive scheduling decisions. This strategy is useful when:

  • Your workload has high variance in prompt and completion length, and queue depth alone is a poor proxy for true load.
  • Your clients can express per-request latency SLOs (interactive vs. batch) and you want the gateway to enforce them (see the header sketch after the note below).
  • Static weight tuning between cache affinity and load has become fragile as traffic shifts.
Note

Predicted latency is not a good fit when the pool is heterogeneous: mixed GPU types, model variants (e.g., prefill vs. decode), or serving configurations in the same pool will produce inaccurate predictions, because the predictor assumes a single pod shape.
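For illustration, a client could attach its latency targets to each request so the gateway can enforce them. Below is a minimal Go sketch; the header names (`x-slo-ttft-ms`, `x-slo-tpot-ms`) and gateway URL are hypothetical placeholders, not the actual llm-d request contract — consult the Predicted Latency guide for the real header names.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	body := bytes.NewBufferString(`{"model": "example-model", "prompt": "Hello", "max_tokens": 64}`)
	req, err := http.NewRequest(http.MethodPost, "http://gateway.example/v1/completions", body)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	// Hypothetical per-request SLO hints: target time-to-first-token and
	// time-per-output-token budgets in milliseconds. The real header names
	// are defined by the Predicted Latency guide, not here.
	req.Header.Set("x-slo-ttft-ms", "500") // interactive: fast first token
	req.Header.Set("x-slo-tpot-ms", "50")  // steady streaming rate

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```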

Deploy

See the Predicted Latency guide for manifests and step-by-step deployment.

Architecture

Latency Predictor

The setup deploys an EPP (Endpoint Picker) with the predicted-latency sidecar containers:

  • Training Server - trains the XGBoost model to predict time-per-output-token (TPOT) and time-to-first-token (TTFT) from observed traffic
  • Prediction Servers - predict the TPOT and TTFT of a request based on the current server state (a sketch of this interaction follows this list)
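As a rough illustration of that interaction, the EPP could call a prediction sidecar per candidate endpoint. The `/predict` path, port, and payload fields below are assumptions made for the sketch, not the sidecar's documented API:

```go
package predictor

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// PredictRequest carries the features the predictor scores on. This field
// set is illustrative; the real sidecar may expect different features.
type PredictRequest struct {
	PromptTokens    int     `json:"prompt_tokens"`
	QueueDepth      int     `json:"queue_depth"`
	KVCacheUsage    float64 `json:"kv_cache_usage"`
	RunningRequests int     `json:"running_requests"`
}

// PredictResponse holds the model's latency predictions in milliseconds.
type PredictResponse struct {
	TTFTMs float64 `json:"ttft_ms"`
	TPOTMs float64 `json:"tpot_ms"`
}

// Predict asks the sidecar (assumed here to listen on localhost:8001) for
// TTFT/TPOT predictions given the current state of one vLLM endpoint.
func Predict(client *http.Client, req PredictRequest) (*PredictResponse, error) {
	payload, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	httpResp, err := client.Post("http://localhost:8001/predict", "application/json", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	defer httpResp.Body.Close()
	if httpResp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("predict: unexpected status %s", httpResp.Status)
	}
	var out PredictResponse
	if err := json.NewDecoder(httpResp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}
```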

During the standard request flow:

  • A request arrives at the proxy, which forwards it to the EPP
  • The EPP queries the prediction server for TTFT and TPOT estimates
  • The EPP (using the latency-scorer) selects the optimal endpoint based on the predictions (see the scoring sketch after this list)
  • The proxy forwards the request to the selected vLLM endpoint
  • The vLLM endpoint processes the request and returns the response to the proxy
  • The proxy reports the observed latencies to the training server, which uses these samples to update the model
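To make the endpoint-selection step concrete, here is a simplified sketch of SLO-aware scoring: rank candidates by predicted headroom against the request's targets and skip predicted violators. This illustrates the idea only and is not the actual latency-scorer implementation:

```go
package scorer

// Prediction is the predicted TTFT/TPOT for one candidate endpoint, in ms.
type Prediction struct {
	Endpoint string
	TTFTMs   float64
	TPOTMs   float64
}

// SLO is the per-request latency target, in ms.
type SLO struct {
	TTFTMs float64
	TPOTMs float64
}

// Pick returns the endpoint with the most headroom against the SLO,
// skipping endpoints predicted to violate it. It returns "" if no endpoint
// qualifies (the caller could then queue, shed, or fall back to load-based
// scoring).
func Pick(preds []Prediction, slo SLO) string {
	best, bestScore := "", -1.0
	for _, p := range preds {
		if p.TTFTMs > slo.TTFTMs || p.TPOTMs > slo.TPOTMs {
			continue // predicted to miss the SLO
		}
		// Headroom: fraction of the SLO budget left unused, averaged
		// across TTFT and TPOT. Higher is better.
		score := ((slo.TTFTMs-p.TTFTMs)/slo.TTFTMs + (slo.TPOTMs-p.TPOTMs)/slo.TPOTMs) / 2
		if score > bestScore {
			best, bestScore = p.Endpoint, score
		}
	}
	return best
}
```

Skipping predicted violators outright, rather than merely down-weighting them, keeps SLO enforcement strict in this sketch; a real scheduler would also need a policy for the case where no endpoint qualifies.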

Further Reading