KV-Cache Manager
Introduction
LLM inference can be computationally expensive due to the sequential nature of token generation. KV-caching plays a critical role in optimizing this process. By storing previously computed key and value attention vectors, KV-cache reuse avoids redundant computations during inference, significantly reducing latency and resource consumption. This is particularly beneficial for long-context, multi-turn conversations and agentic (and RAG) applications, where previously computed information can be leveraged effectively. Efficient KV-cache management and routing are essential for scaling LLM inference and delivering a responsive user experience.
llm-d-kv-cache-manager is a pluggable KV-cache Manager for KV-cache Aware Routing in LLM serving platforms.
This initial work will expand in scope and capability as development continues.
See the docs folder in the repository for more information on goals, architecture and more.
Goals
The KV-Cache-Manager is designed to connect high-level serving-stack goals with concrete system capabilities through a layered objective structure:
- Improve user experience
  - By reducing Time-To-First-Token (TTFT)
    - Enabled through higher KVCache hit rates and reduced tensor transfers
    - Supported by smart routing and distributed cache availability
    - Optimized by proactive pre-placement of hot caches and session duplication/migration
- Reduce serving costs
  - By improving compute utilization
    - Minimize re-compute via KVCache reuse and locality-aware request handling
    - Leverage zero-copy cache transfers across nodes
Vision
The goal structure above is shaped by our vision for emerging use cases like RAG and agentic workflows, which involve heavy context reuse across sessions and instances. Shared documents, tool prompts, and workflow steps create overlapping token streams that benefit significantly from cross-instance KVCache coordination.
To implement this vision, the KVCache-Manager incorporates proactive cache placement, session duplication, and cluster-level cache APIs - bridging gaps in current serving stacks where KVCache management and utilization are not yet treated as first-class concerns.
Architecture Overview
The code defines a kvcache.Indexer module that efficiently maintains a global view of KV-cache states and localities. In the current state of vLLM, the only available information on KV-cache availability is for tensors offloaded to KV-cache Engines via the Connector API.
The kvcache.Indexer module is a pluggable Go package designed for use by orchestrators to enable KV-cache-aware scheduling decisions.
This overview greatly simplifies the actual architecture and combines steps across several submodules.
Architecture
For a more detailed architecture, refer to the architecture document.
The architecture is designed to efficiently maintain a global view of KV-cache states and localities, enabling KV-cache-aware scheduling decisions.
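As an illustration of how an orchestrator might consume the kvcache.Indexer module, here is a minimal sketch. The import path, constructor, config helper, Run method, and GetPodScores signature are assumptions for illustration and may differ from the actual package API.

```go
package main

import (
	"context"
	"fmt"

	// Assumed import path based on the module name; check the repository for the actual layout.
	"github.com/llm-d/llm-d-kv-cache-manager/pkg/kvcache"
)

func main() {
	ctx := context.Background()

	// Hypothetical constructor and default config for the indexer.
	indexer, err := kvcache.NewKVCacheIndexer(kvcache.NewDefaultConfig())
	if err != nil {
		panic(err)
	}
	go indexer.Run(ctx) // background workers (e.g., the tokenization pool)

	// Score candidate pods for a prompt; a higher score means more cached blocks.
	scores, err := indexer.GetPodScores(ctx,
		"You are a helpful assistant. Summarize the following document ...",
		"meta-llama/Llama-3.1-8B-Instruct",
		[]string{"pod-a", "pod-b", "pod-c"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(scores) // e.g. map[pod-a:7 pod-b:2]
}
```

A router would then pick the highest-scoring pod (or feed the scores into a broader scheduling policy), which is exactly the flow detailed in the next section.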
Detailed System Flow
Explanation
The main blocking sequence of steps that occurs when a user (e.g., a router) sends a request to the kvcache.Indexer is as follows:
- The user sends a request to the kvcache.Indexer with a prompt, model name, and relevant pods.
- kvcache.Indexer:
  - Finds the longest tokenized prefix for the prompt and model name using the PrefixStore.
    - Depending on the store type (LRU or Trie), it gets the tokenization of the longest cached prefix.
    - Adds a tokenization task to the TokenizersPool, which is handled asynchronously by a worker; this flow is explained below.
- kvcache.Indexer queries the TokenProcessor to get block keys for the tokens of the longest prefix.
- TokenProcessor:
  - Chunks the tokens and generates keys for the token blocks. The chunking and key calculation must be aligned with the source that feeds the key -> pods backend (Redis).
  - Returns the block keys to the kvcache.Indexer.
- kvcache.Indexer queries the KVBlockIndexer for pods that have the block keys.
  - The KVBlockIndexer queries the Redis backend for the mappings with MGet.
  - The Redis backend efficiently returns the key -> pods mapping.
- kvcache.Indexer uses the configured KVBlockScorer to score the pods based on block hits, as illustrated in the sketch after this list:
  - LongestPrefixMatch: scores by the longest consecutive (ordered) block hits in a single pod.
  - HighestBlockHit: scores by the index of the highest block hit in a single pod.
  - CoverageBasedMatching: scores by the total number of block hits in a single pod.
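To make the three strategies concrete, here is a small, self-contained sketch (not the package's implementation) that computes all three scores, given the ordered block keys of a prompt and the pods known to hold each block:

```go
package main

import "fmt"

// blockHits[i] is the set of pods that hold the i-th block key of the prompt.
type podSet map[string]bool

// longestPrefixMatch: length of the longest consecutive run of hits
// starting at block 0, per pod.
func longestPrefixMatch(blockHits []podSet, pods []string) map[string]int {
	scores := map[string]int{}
	for _, pod := range pods {
		for i, hits := range blockHits {
			if !hits[pod] {
				break
			}
			scores[pod] = i + 1
		}
	}
	return scores
}

// highestBlockHit: 1-based index of the highest block the pod holds.
func highestBlockHit(blockHits []podSet, pods []string) map[string]int {
	scores := map[string]int{}
	for _, pod := range pods {
		for i, hits := range blockHits {
			if hits[pod] {
				scores[pod] = i + 1
			}
		}
	}
	return scores
}

// coverageBasedMatching: total number of blocks the pod holds, in any order.
func coverageBasedMatching(blockHits []podSet, pods []string) map[string]int {
	scores := map[string]int{}
	for _, pod := range pods {
		for _, hits := range blockHits {
			if hits[pod] {
				scores[pod]++
			}
		}
	}
	return scores
}

func main() {
	pods := []string{"pod-a", "pod-b"}
	// pod-a holds blocks 0, 1, 3; pod-b holds blocks 1, 2, 3.
	blockHits := []podSet{
		{"pod-a": true},
		{"pod-a": true, "pod-b": true},
		{"pod-b": true},
		{"pod-a": true, "pod-b": true},
	}
	fmt.Println(longestPrefixMatch(blockHits, pods))    // map[pod-a:2]
	fmt.Println(highestBlockHit(blockHits, pods))       // map[pod-a:4 pod-b:4]
	fmt.Println(coverageBasedMatching(blockHits, pods)) // map[pod-a:3 pod-b:3]
}
```

Note how the same hit pattern ranks pods differently under each strategy, which is why the scorer is configurable.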
Asynchronous tokenization flow:
- A worker fetches the task from the TokenizersPool.
- The worker tokenizes the prompt using the HuggingFaceTokenizer.
  - The HuggingFaceTokenizer retrieves the cached in-memory tokenizer for the model.
    - If the tokenizer is not cached, it gets created and cached.
  - The HuggingFaceTokenizer returns the tokens to the worker.
- The worker adds the tokens to the PrefixStore.
  - Depending on the store type (LRU or Trie), it adds the tokens to the appropriate store:
    - LRUStore: an LRU hash table of prompt chunks to tokens
    - TrieStore: a trie of characters to tokens
    - Because of how tokenizers operate, the tokenization of a prefix of a prompt is itself a prefix of the tokenization of the full prompt. One challenge is that different chunks of a prompt map to different tokens. Therefore, when a prompt is chunked, the [_, end] offset associated with each token determines which chunk contains it. The implication of this design is that the tokens contained in a chunk are only correct if all previous chunks are also considered, since a single token may span the edge characters of two consecutive chunks. The sketch below illustrates this.
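Here is a minimal sketch of the [_, end] rule described above. The token IDs, offsets, and chunk size are made up for illustration and do not reflect the package's actual data structures:

```go
package main

import "fmt"

// token holds a token ID with its [start, end) character offsets in the prompt,
// as returned by an offset-aware tokenizer.
type token struct {
	id         int
	start, end int
}

const chunkSize = 8 // illustrative chunk size, in characters

// chunkIndex assigns a token to the chunk that contains its end offset,
// mirroring the "[_, end]" rule described above.
func chunkIndex(t token) int {
	return (t.end - 1) / chunkSize
}

func main() {
	// Hypothetical tokenization of "The quick brown fox" with character offsets.
	tokens := []token{
		{id: 464, start: 0, end: 3},    // "The"
		{id: 2068, start: 3, end: 9},   // " quick"
		{id: 7586, start: 9, end: 15},  // " brown"
		{id: 4419, start: 15, end: 19}, // " fox"
	}

	chunks := map[int][]int{}
	for _, t := range tokens {
		c := chunkIndex(t)
		chunks[c] = append(chunks[c], t.id)
	}

	// " quick" starts inside chunk 0 (chars 0-7) but ends in chunk 1 (chars 8-15),
	// so it lands in chunk 1; this is why a chunk's tokens are only valid when all
	// preceding chunks are considered as well.
	fmt.Println(chunks) // map[0:[464] 1:[2068 7586] 2:[4419]]
}
```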
Maintenance of Redis for KVBlock -> Pods Mapping
Currently, indexing information is updated from vLLM for the offloaded tokens using the Connector API, specifically leveraging the LMCache connector.
Future enhancements will enable the llm-d-kv-cache-manager component to process KV-cache events across all memory layers of vLLM, ensuring an accurate, holistic view of KV-cache localities throughout the system.
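For illustration, here is a sketch of the kind of MGet lookup the KVBlockIndexer performs against the Redis backend. The `kvblock:<hash>` key format and the comma-separated pod-list values are assumptions for this sketch, not the actual schema used by the backend:

```go
package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Block keys produced by the TokenProcessor for the longest tokenized prefix.
	// The "kvblock:<hash>" format is assumed here for illustration.
	keys := []string{"kvblock:9f2c0a17", "kvblock:41ab88e3", "kvblock:77d05c21"}

	// One round trip: MGET returns the value (or nil) for every key, in order.
	values, err := rdb.MGet(ctx, keys...).Result()
	if err != nil {
		panic(err)
	}

	keyToPods := map[string][]string{}
	for i, v := range values {
		if v == nil {
			continue // no pod currently holds this block
		}
		// Assumed value encoding: a comma-separated list of pod identifiers.
		keyToPods[keys[i]] = strings.Split(v.(string), ",")
	}
	fmt.Println(keyToPods)
}
```

Whatever component writes these entries (today, the offloading path via the Connector API) must use the same chunking and key derivation as the TokenProcessor, which is the alignment requirement noted in the flow above.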
Examples
- KV-cache Indexer: a reference implementation of using the kvcache.Indexer module.
- KV-cache Aware Scorer: a reference implementation of integrating the kvcache.Indexer module into llm-d-inference-scheduler as a KV-cache aware scorer; a sketch of this kind of integration follows below.
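As a hint of what such an integration can look like, here is a hedged sketch that normalizes raw block-hit scores from the indexer into the [0, 1] range a scheduler scorer typically expects. The actual scorer interface of llm-d-inference-scheduler is not reproduced here:

```go
package main

import "fmt"

// normalizeScores converts raw block-hit scores (e.g., the indexer's output) into
// the [0, 1] range commonly used by scheduler scorer plugins. Pods absent from
// the raw map score 0.
func normalizeScores(raw map[string]int, pods []string) map[string]float64 {
	maxHits := 0
	for _, s := range raw {
		if s > maxHits {
			maxHits = s
		}
	}
	out := make(map[string]float64, len(pods))
	for _, pod := range pods {
		if maxHits == 0 {
			out[pod] = 0 // no cache hits anywhere: all pods score equally
			continue
		}
		out[pod] = float64(raw[pod]) / float64(maxHits)
	}
	return out
}

func main() {
	raw := map[string]int{"pod-a": 7, "pod-b": 2} // e.g., output of the indexer
	fmt.Println(normalizeScores(raw, []string{"pod-a", "pod-b", "pod-c"}))
	// pod-a scores highest; pod-c has no cached blocks and scores 0
}
```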