Prefix-Cache Aware Routing
Prefix-cache aware routing is a core technique managed by the llm-d Router (specifically via its Endpoint Picker (EPP) component) to reduce tail latency and increase throughput. By routing requests to model server replicas that already contain the relevant Key-Value (KV) cache for a prompt's prefix, the system avoids redundant "prefill" computation, saving both time and accelerator (GPU/TPU) resources. This technique expects the underlying model servers to support KV-caching across requests, such as vLLM's Automatic Prefix Caching feature.
llm-d provides two distinct implementations of this capability, catering to different operational requirements and precision needs.
1. Approximate Implementation​
The approximate implementation is designed to be lightweight and requires no external dependencies beyond the standard EPP deployment.
Components​
approx-prefix-cache-producer(DataProducer plugin)prefix-cache-scorer(Scorer plugin)
How it Works​
- Approximation: Since the EPP does not natively contain a tokenizer, it approximates tokens using character-to-token ratios.
- Hashing: The
approx-prefix-cache-producersplits the incoming prompt into fixed-size blocks (e.g., 16 tokens approximated as characters) and builds a rolling hash chain. - Local Index: The EPP maintains an in-memory LRU index of which prefix hashes were recently sent to which Pods.
- Scoring: The
prefix-cache-scorerreads the match information and assigns a score based on the ratio of matched blocks to total prompt blocks. - Learning: After a routing decision is made, the EPP updates its local index, assuming the selected Pod will now host that prefix.
Pros & Cons​
- Pros: Extremely lightweight; no need for a tokenizer sidecar; no network connectivity required to model servers (ZMQ); does not require explicit model server integration as it doesn't expect the model servers to communicate KV-cache events.
- Cons: Can diverge from actual model server state (e.g., if a Pod evicts a prefix due to memory pressure); less precise than token-based matching.
2. Precise Implementation​
The precise implementation provides 100% accuracy by leveraging actual token data and real-time state updates from the model servers.
Components​
tokenizer(DataProducer plugin)precise-prefix-cache-scorer(Scorer plugin)- KV-Cache Indexer (EPP Data Layer component)
How it Works​
- Exact Tokenization: The
tokenizerplugin sends the prompt to a high-performance tokenizer service (typically running as a sidecar or a local UDS service) to get exact Token IDs. - Real-time Events: Model servers (like vLLM) are configured to emit
KVEventsover ZeroMQ (ZMQ) whenever their internal KV cache changes (blocks added or evicted). - Global Index: The KV-Cache Indexer subscribes to these events and maintains a precise, globally consistent view of exactly which token blocks reside on which Pods.
- Precise Matching: The
precise-prefix-cache-scorermatches the exact Token IDs against this global index. - Speculative Indexing: To close the "blind spot" between a routing decision and the arrival of the subsequent
KVEvent, the plugin can proactively add "speculative" entries to the index immediately after routing.
Pros & Cons​
- Pros: 100% precision; handles complex cache eviction policies; natively supports Prefill/Decode disaggregation (by identifying specific blocks for transfer).
- Cons: Requires additional infrastructure (tokenizer service, ZMQ connectivity); slightly higher resource overhead; requires model server support for emitting KV-cache events.
Comparison Summary​
| Feature | Approximate | Precise |
|---|---|---|
| Precision | Heuristic (Character-based) | 100% (Token-based) |
| State Source | Local EPP assumptions | Real-time KVEvents from Model Servers |
| Dependencies | None | Tokenizer Service, ZMQ |
| Use Case | Simple, homogeneous workloads | Complex, high-scale production serving |
| P/D Disagg Support | Basic | Advanced/Native |
Composition with KV Cache Management​
Both implementations are part of the broader KV Cache Management ecosystem in llm-d. While the Approximate implementation is self-contained, the Precise implementation relies on the KV-Cache Indexer and can work in tandem with KV Offloading to manage cache state across accelerator and host memory boundaries.