Architecture
High-level guide to llm-d architecture. Start here, then dive into specific guides.
Core Components
The llm-d architecture is built around three primary concepts: the Router, the InferencePool, and the Model Server.
- llm-d Router - The intelligent entry point for inference requests. It provides LLM-aware load balancing, request queuing, and policy enforcement. It is composed of two functional parts:
  - Proxy: A high-performance L7 proxy (typically Envoy) that accepts user requests and consults the EPP via the ext-proc protocol to determine the optimal destination.
  - Endpoint Picker (EPP): The routing engine that scores and selects model server pods based on real-time metrics, KV-cache affinity, and configured policies.
- InferencePool - The API that defines a group of Model Server Pods sharing the same model and compute configuration. Conceptualized as an "LLM-optimized Service", it serves as the discovery target for the Router.
- Model Server - The inference engine (such as vLLM or SGLang) that executes the model on hardware accelerators (GPUs, TPUs, HPUs).
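To make the EPP's "score and select" role concrete, here is a minimal sketch of weighted endpoint scoring. It is an illustration only, not the actual EPP implementation: the `Endpoint` fields, the `WEIGHTS` values, `MAX_QUEUE`, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical snapshot of one model server pod's state."""
    name: str
    queue_depth: int             # requests waiting on this pod
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0
    prefix_hit_ratio: float      # estimated share of the prompt already cached


# Hypothetical weights; a real deployment would take these from configured policy.
WEIGHTS = {"queue": 1.0, "kv": 1.0, "prefix": 2.0}
MAX_QUEUE = 16  # assumed queue depth at which the queue score bottoms out


def score(ep: Endpoint) -> float:
    """Higher is better: short queue, free KV cache, high prefix affinity."""
    queue_score = 1.0 - min(ep.queue_depth, MAX_QUEUE) / MAX_QUEUE
    kv_score = 1.0 - ep.kv_cache_utilization
    return (WEIGHTS["queue"] * queue_score
            + WEIGHTS["kv"] * kv_score
            + WEIGHTS["prefix"] * ep.prefix_hit_ratio)


def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    """Return the highest-scoring endpoint."""
    if not endpoints:
        raise ValueError("no endpoints available")
    return max(endpoints, key=score)
```

With these assumed weights, a pod that is slightly busier but already holds most of the prompt's prefix in cache can outrank an idle pod with a cold cache.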
Advanced Patterns
llm-d's core design can be extended with optional advanced patterns:
KV Cache Management
llm-d provides a comprehensive ecosystem for managing and reusing the KV cache across the inference pool. This includes:
- Prefix-Cache Aware Routing: Heuristic and precise techniques to maximize cache hits.
- KV-Cache Indexing: Event-driven tracking of cache state across all model servers.
- KV Offloading: Tiered storage hierarchy (CPU, SSD) for extending cache capacity.
See KV Cache Management for an overview of how these components compose.
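As a rough illustration of how prefix-cache aware routing and KV-cache indexing fit together, the sketch below hashes a prompt into chained token blocks and counts how many leading blocks each pod already holds. Everything here is an assumption made for illustration: the block size, the hashing scheme, the `PrefixIndex` class, and the `record` method, which stands in for the cache events a real indexer would consume.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; hypothetical value for illustration


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Chain-hash full blocks so each hash identifies the whole prefix up to it."""
    hashes = []
    prev = b""
    full_length = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full_length, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes


class PrefixIndex:
    """Hypothetical index mapping each pod to the block hashes it has cached."""

    def __init__(self) -> None:
        self._blocks: dict[str, set[bytes]] = {}

    def record(self, pod: str, tokens: list[int]) -> None:
        """Stand-in for a cache event: the pod now holds these blocks."""
        self._blocks.setdefault(pod, set()).update(block_hashes(tokens))

    def matched_blocks(self, pod: str, tokens: list[int]) -> int:
        """Count consecutive leading blocks of the prompt cached on the pod."""
        cached = self._blocks.get(pod, set())
        count = 0
        for h in block_hashes(tokens):
            if h not in cached:
                break
            count += 1
        return count

    def best_pod(self, pods: list[str], tokens: list[int]) -> str:
        """Pick the pod with the longest cached prefix."""
        return max(pods, key=lambda p: self.matched_blocks(p, tokens))
```

Chaining each hash to the previous one means a block only matches when the entire prefix before it matches too, which is what makes the count a usable estimate of reusable cache.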
Disaggregated Serving
In disaggregated serving, a single inference request is split into multiple phases (e.g., Prefill and Decode) handled by specialized workers. The llm-d Router orchestrates this flow by selecting both a prefill and a decode endpoint and coordinating the KV-cache transfer between them.
See Disaggregation for complete details.
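The selection step described above can be sketched as choosing one worker per phase. This is a simplified illustration under stated assumptions: the `Worker` and `DisaggregatedPlan` types, the `role` labels, and the least-loaded selection rule are hypothetical, and the KV-cache transfer itself is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    """Hypothetical view of a specialized worker."""
    name: str
    role: str  # "prefill" or "decode"
    load: int  # in-flight requests


@dataclass(frozen=True)
class DisaggregatedPlan:
    """The pair of endpoints chosen for one request."""
    prefill: str
    decode: str


def plan_request(workers: list[Worker]) -> DisaggregatedPlan:
    """Pick the least-loaded prefill worker and the least-loaded decode worker."""
    prefill_pool = [w for w in workers if w.role == "prefill"]
    decode_pool = [w for w in workers if w.role == "decode"]
    if not prefill_pool or not decode_pool:
        raise ValueError("need at least one prefill and one decode worker")
    prefill = min(prefill_pool, key=lambda w: w.load)
    decode = min(decode_pool, key=lambda w: w.load)
    return DisaggregatedPlan(prefill=prefill.name, decode=decode.name)
```

Returning both choices as one plan reflects the idea that a single routing decision covers both phases, so whatever coordinates the KV-cache transfer knows the source and destination up front.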
Predicted Latency-Based Routing
The llm-d Router can be extended with "consultant" sidecars that provide advanced signals for routing decisions. The primary implementation is the Latency Predictor, which enables routing based on predicted inter-token latency (ITL) and time to first token (TTFT).
- Latency Predictor: Trains an XGBoost model online to predict request latency for better endpoint scoring and SLO enforcement.
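The sketch below shows how predicted TTFT and ITL could feed endpoint selection and SLO enforcement. Note the substitution: it uses a hand-written linear formula with made-up coefficients as a stand-in for the trained XGBoost model, and `PodState`, both predictor functions, and `pick_within_slo` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PodState:
    """Hypothetical per-pod features a predictor might consume."""
    name: str
    queue_depth: int
    running_requests: int


def predict_ttft_ms(pod: PodState, prompt_tokens: int) -> float:
    """Stand-in predictor (NOT XGBoost): linear in prompt size and queue depth."""
    return 20.0 + 0.05 * prompt_tokens + 30.0 * pod.queue_depth


def predict_itl_ms(pod: PodState) -> float:
    """Stand-in predictor (NOT XGBoost): linear in concurrent decodes."""
    return 10.0 + 2.0 * pod.running_requests


def pick_within_slo(pods: list[PodState], prompt_tokens: int,
                    ttft_slo_ms: float, itl_slo_ms: float) -> Optional[PodState]:
    """Keep pods predicted to meet both SLOs, then take the lowest predicted TTFT."""
    candidates = []
    for pod in pods:
        ttft = predict_ttft_ms(pod, prompt_tokens)
        itl = predict_itl_ms(pod)
        if ttft <= ttft_slo_ms and itl <= itl_slo_ms:
            candidates.append((ttft, pod))
    if not candidates:
        return None  # no pod is predicted to meet the SLO; the caller decides what to do
    return min(candidates, key=lambda c: c[0])[1]
```

Returning `None` when no pod qualifies leaves the policy choice (queue, shed, or fall back to best effort) to the caller rather than baking it into the scorer.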
Batch Inference
Batch and offline inference workloads are handled by two modules that can be deployed independently or together. The Batch Gateway provides an OpenAI-compatible Batch API for job management, while the Async Processor dispatches queued requests with flow-control gating. When composed, the Batch Gateway delegates dispatch to the Async Processor.
See Batch Inference for details on the batch inference design.
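One way to picture flow-control gating is a dispatcher that releases queued requests only while a capacity signal allows it. The sketch below assumes the gate is a simple cap on in-flight requests. That is an assumption for illustration, not a description of the Async Processor's actual gating signal, and `GatedDispatcher` and its methods are hypothetical.

```python
from collections import deque


class GatedDispatcher:
    """Hypothetical dispatcher that holds requests back behind an in-flight cap."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.queue: deque = deque()

    def submit(self, request: str) -> None:
        """Queue a request; nothing is dispatched yet."""
        self.queue.append(request)

    def dispatch_ready(self) -> list:
        """Release queued requests in FIFO order until the gate closes."""
        dispatched = []
        while self.queue and self.in_flight < self.max_in_flight:
            dispatched.append(self.queue.popleft())
            self.in_flight += 1
        return dispatched

    def complete(self) -> None:
        """Mark one dispatched request as finished, reopening the gate."""
        if self.in_flight == 0:
            raise RuntimeError("no request in flight")
        self.in_flight -= 1
```

A caller would alternate `dispatch_ready()` and `complete()` as responses arrive, so batch traffic drains steadily instead of flooding the model servers.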
Autoscaling
llm-d supports proactive, SLO-aware autoscaling through two complementary approaches:
- HPA/KEDA: Standard Kubernetes-native scaling using metrics exported by the EPP (like queue depth).
- Workload Variant Autoscaler (WVA): Globally optimized scaling that minimizes cost by routing traffic across different model variants (e.g., different hardware or quantization) while meeting latency targets.
See Autoscaling for complete details.
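For the HPA path, the replica calculation follows the ratio rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1. The sketch below applies that rule to an average queue depth per pod. The function name, the replica bounds, and the example metric values are hypothetical; the 0.1 tolerance mirrors the HPA default.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Apply the HPA ratio rule, then clamp to the configured replica bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, with 4 replicas, an observed average queue depth of 15, and a target of 10, the ratio is 1.5 and the function returns 6.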