
Prefix-Cache Aware Routing

Prefix-cache aware routing is a core technique managed by the llm-d Router (specifically via its Endpoint Picker (EPP) component) to reduce tail latency and increase throughput. By routing requests to model server replicas that already contain the relevant Key-Value (KV) cache for a prompt's prefix, the system avoids redundant "prefill" computation, saving both time and accelerator (GPU/TPU) resources. The savings are largest when long prefixes recur across requests, as in multi-turn chat (each turn resends the growing conversation history) or shared system prompts. This technique expects the underlying model servers to support KV-caching across requests, such as vLLM's Automatic Prefix Caching feature.

llm-d provides two distinct implementations of this capability, catering to different operational requirements and precision needs.
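
Conceptually, both implementations play the same role inside the EPP: produce a per-endpoint score for how much of the prompt's prefix is already cached, which the picker then weighs when selecting a replica. The Go sketch below illustrates that contract; the type and function names (`Endpoint`, `PrefixScorer`, `pickEndpoint`) are hypothetical stand-ins, not the actual EPP plugin API, and a real picker combines this score with other signals such as load.

```go
// Hypothetical sketch of the shared scoring contract; not llm-d's real API.
package sketch

// Endpoint stands in for a model server replica (a Pod).
type Endpoint struct{ Name string }

// PrefixScorer is the role both implementations play: estimate the fraction
// of a prompt's prefix blocks already cached on an endpoint.
type PrefixScorer interface {
	Score(prompt string, ep Endpoint) float64 // result in [0, 1]
}

// pickEndpoint routes to the endpoint with the highest prefix-cache score.
// In the real EPP this score is one input among several (e.g., load).
func pickEndpoint(scorer PrefixScorer, prompt string, eps []Endpoint) (Endpoint, bool) {
	if len(eps) == 0 {
		return Endpoint{}, false
	}
	best, bestScore := eps[0], -1.0
	for _, ep := range eps {
		if s := scorer.Score(prompt, ep); s > bestScore {
			best, bestScore = ep, s
		}
	}
	return best, true
}
```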


1. Approximate Implementation

The approximate implementation is designed to be lightweight and requires no external dependencies beyond the standard EPP deployment.

Components

  • approx-prefix-cache-producer – splits prompts into blocks and builds the hash chain
  • prefix-cache-scorer – scores each endpoint by its matched prefix blocks
  • An in-memory LRU index, local to the EPP, tracking which prefixes went to which Pods

How it Works

  1. Approximation: Since the EPP does not natively contain a tokenizer, it approximates tokens using character-to-token ratios.
  2. Hashing: The approx-prefix-cache-producer splits the incoming prompt into fixed-size blocks (e.g., 16 tokens, approximated as a character count) and builds a rolling hash chain over them.
  3. Local Index: The EPP maintains an in-memory LRU index of which prefix hashes were recently sent to which Pods.
  4. Scoring: The prefix-cache-scorer reads the match information and assigns a score based on the ratio of matched blocks to total prompt blocks.
  5. Learning: After a routing decision is made, the EPP updates its local index, assuming the selected Pod will now host that prefix. (Steps 1–5 are condensed in the sketch below.)
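
The following sketch condenses steps 1–5 into one self-contained Go file. It is an illustration under stated assumptions, not llm-d's code: the 4-characters-per-token ratio, FNV-1a hashing, the hashicorp/golang-lru dependency, and the `Index`/`Score`/`Learn` names are all invented here.

```go
package approx

import (
	"encoding/binary"
	"hash/fnv"

	lru "github.com/hashicorp/golang-lru/v2"
)

const (
	charsPerToken = 4                           // assumed character-to-token ratio
	blockTokens   = 16                          // tokens per KV-cache block
	blockChars    = blockTokens * charsPerToken // block size in characters
)

// Index is the EPP-local LRU map from a chained block hash to the set of
// pods that recently received a prompt containing that prefix.
type Index struct {
	cache *lru.Cache[uint64, map[string]struct{}]
}

func NewIndex(size int) (*Index, error) {
	c, err := lru.New[uint64, map[string]struct{}](size)
	if err != nil {
		return nil, err
	}
	return &Index{cache: c}, nil
}

// blockHashes splits the prompt into fixed-size character blocks and chains
// the hashes: each block's hash folds in the previous one, so a single hash
// identifies the whole prefix up to and including that block.
func blockHashes(prompt string) []uint64 {
	var hashes []uint64
	var prev [8]byte
	for start := 0; start+blockChars <= len(prompt); start += blockChars {
		h := fnv.New64a()
		h.Write(prev[:])
		h.Write([]byte(prompt[start : start+blockChars]))
		sum := h.Sum64()
		binary.LittleEndian.PutUint64(prev[:], sum)
		hashes = append(hashes, sum)
	}
	return hashes
}

// Score returns matched blocks / total blocks: how deep into the prompt's
// block chain this pod has a recorded matching prefix.
func (ix *Index) Score(prompt, pod string) float64 {
	hashes := blockHashes(prompt)
	if len(hashes) == 0 {
		return 0
	}
	matched := 0
	for _, h := range hashes {
		pods, ok := ix.cache.Get(h)
		if !ok {
			break // chain broken: no longer prefix can match either
		}
		if _, onPod := pods[pod]; !onPod {
			break
		}
		matched++
	}
	return float64(matched) / float64(len(hashes))
}

// Learn records, after routing, the assumption that the chosen pod now
// holds every prefix block of this prompt.
func (ix *Index) Learn(prompt, pod string) {
	for _, h := range blockHashes(prompt) {
		pods, ok := ix.cache.Get(h)
		if !ok {
			pods = map[string]struct{}{}
		}
		pods[pod] = struct{}{}
		ix.cache.Add(h, pods)
	}
}
```

Note that `Learn` records an assumption rather than an observation, so the index can drift from reality, e.g., when a Pod evicts a prefix under memory pressure; that is exactly the divergence noted under Cons below.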

Pros & Cons

  • Pros: Extremely lightweight; no tokenizer sidecar needed; no ZMQ connectivity to model servers; no explicit model server integration, since it never expects the servers to report KV-cache events.
  • Cons: Can diverge from actual model server state (e.g., if a Pod evicts a prefix due to memory pressure); less precise than token-based matching.

2. Precise Implementation

The precise implementation provides 100% accuracy by leveraging actual token data and real-time state updates from the model servers.

Components

  • tokenizer plugin + tokenizer service – produces exact Token IDs for each prompt
  • KV-Cache Indexer – consumes KVEvents and maintains the global block index
  • precise-prefix-cache-scorer – matches Token IDs against the global index

How it Works

  1. Exact Tokenization: The tokenizer plugin sends the prompt to a high-performance tokenizer service (typically running as a sidecar or a local UDS service) to get exact Token IDs.
  2. Real-time Events: Model servers (like vLLM) are configured to emit KVEvents over ZeroMQ (ZMQ) whenever their internal KV cache changes (blocks added or evicted).
  3. Global Index: The KV-Cache Indexer subscribes to these events and maintains a precise, globally consistent view of exactly which token blocks reside on which Pods.
  4. Precise Matching: The precise-prefix-cache-scorer matches the exact Token IDs against this global index.
  5. Speculative Indexing: To close the "blind spot" between a routing decision and the arrival of the subsequent KVEvent, the plugin can proactively add "speculative" entries to the index immediately after routing. (Steps 2–5 are sketched after this list.)
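
The sketch below mirrors steps 2–5 in the same hypothetical style; the real KV-Cache Indexer's event schema and ZMQ wire format differ. Events are modeled as a channel of already-decoded messages, exact Token IDs (step 1's tokenizer output) arrive as a plain `[]int`, and the `KVEvent`/`Speculate` names are assumptions.

```go
package precise

import (
	"encoding/binary"
	"hash/fnv"
	"sync"
)

// KVEvent models the cache-change notifications a model server emits when
// KV blocks are stored or evicted (delivered over ZMQ in llm-d).
type KVEvent struct {
	Pod         string
	BlockHashes []uint64
	Removed     bool // true for eviction events
}

// Index is the global view: token-block hash -> pods holding that block.
type Index struct {
	mu     sync.RWMutex
	byHash map[uint64]map[string]struct{}
}

func NewIndex() *Index {
	return &Index{byHash: map[uint64]map[string]struct{}{}}
}

// Run keeps the index consistent with model server state; the ZMQ
// subscription is abstracted here as a channel of decoded events.
func (ix *Index) Run(events <-chan KVEvent) {
	for ev := range events {
		ix.mu.Lock()
		for _, h := range ev.BlockHashes {
			if ev.Removed {
				delete(ix.byHash[h], ev.Pod)
				continue
			}
			if ix.byHash[h] == nil {
				ix.byHash[h] = map[string]struct{}{}
			}
			ix.byHash[h][ev.Pod] = struct{}{}
		}
		ix.mu.Unlock()
	}
}

// tokenBlockHashes chains hashes over fixed-size blocks of exact token IDs
// returned by the tokenizer service.
func tokenBlockHashes(tokens []int, blockSize int) []uint64 {
	var hashes []uint64
	var prev, buf [8]byte
	for start := 0; start+blockSize <= len(tokens); start += blockSize {
		h := fnv.New64a()
		h.Write(prev[:]) // fold in the previous block's hash
		for _, t := range tokens[start : start+blockSize] {
			binary.LittleEndian.PutUint64(buf[:], uint64(t))
			h.Write(buf[:])
		}
		sum := h.Sum64()
		binary.LittleEndian.PutUint64(prev[:], sum)
		hashes = append(hashes, sum)
	}
	return hashes
}

// Score returns the fraction of the prompt's token blocks present on a pod.
func (ix *Index) Score(tokens []int, blockSize int, pod string) float64 {
	hashes := tokenBlockHashes(tokens, blockSize)
	if len(hashes) == 0 {
		return 0
	}
	ix.mu.RLock()
	defer ix.mu.RUnlock()
	matched := 0
	for _, h := range hashes {
		if _, ok := ix.byHash[h][pod]; !ok {
			break
		}
		matched++
	}
	return float64(matched) / float64(len(hashes))
}

// Speculate inserts entries for the chosen pod right after routing, covering
// the blind spot before the server's own KVEvent arrives.
func (ix *Index) Speculate(tokens []int, blockSize int, pod string) {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	for _, h := range tokenBlockHashes(tokens, blockSize) {
		if ix.byHash[h] == nil {
			ix.byHash[h] = map[string]struct{}{}
		}
		ix.byHash[h][pod] = struct{}{}
	}
}
```

Because the same per-block hashes identify exactly which blocks a pod holds, this index can also pinpoint which specific blocks to transfer in a Prefill/Decode disaggregation setup, which is why this path supports it natively.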

Pros & Cons

  • Pros: 100% precision; handles complex cache eviction policies; natively supports Prefill/Decode disaggregation (by identifying specific blocks for transfer).
  • Cons: Requires additional infrastructure (tokenizer service, ZMQ connectivity); slightly higher resource overhead; requires model server support for emitting KV-cache events.

Comparison Summary

| Feature | Approximate | Precise |
|---|---|---|
| Precision | Heuristic (Character-based) | 100% (Token-based) |
| State Source | Local EPP assumptions | Real-time KVEvents from Model Servers |
| Dependencies | None | Tokenizer Service, ZMQ |
| Use Case | Simple, homogeneous workloads | Complex, high-scale production serving |
| P/D Disagg Support | Basic | Advanced/Native |

Composition with KV Cache Management

Both implementations are part of the broader KV Cache Management ecosystem in llm-d. While the Approximate implementation is self-contained, the Precise implementation relies on the KV-Cache Indexer and can work in tandem with KV Offloading to manage cache state across accelerator and host memory boundaries.