
llm-d Inference Router Architecture

Overview

llm-d is an extensible architecture designed to route inference requests efficiently across model-serving pods. A central component of this architecture is the Inference Gateway, which builds on the Kubernetes-native Gateway API Inference Extension (GIE) to enable scalable, flexible, and pluggable routing of requests.

The design enables:

  • Support for multiple base models and LoRA adapters within a shared cluster [not supported in Phase 1]
  • Efficient routing based on KV cache locality, prefix, session affinity, load, and model metadata
  • Disaggregated Prefill/Decode (P/D) execution
  • Pluggable filters, scorers, and scrapers for extensible routing

Core Goals

  • Route inference requests to optimal pods based on:
    • Base model compatibility
    • KV cache reuse
    • Load balancing
  • Support multi-model deployments on heterogeneous hardware
  • Enable runtime extensibility with pluggable logic (filters, scorers, scrapers)
  • Community-aligned implementation using GIE and Envoy + External Processing (EPP)

Architecture Design

Inference Gateway Architecture

The inference scheduler is built on top of:

  • Envoy as a programmable data plane
  • EPP (External Processing Plugin) using GIE

Pluggability

Pluggability Architecture

Routing decisions are governed by dynamic components:

  • Filters: Exclude pods based on static or dynamic criteria
  • Scorers: Assign scores to candidate pods
  • Scrapers: Collect pod metadata and metrics for scorers

These components are maintained in the llm-d-inference-scheduler repository and can evolve independently.
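
To make the contract concrete, the sketch below models the three component kinds as small Go interfaces. This is only an illustration under assumed names (Pod, Request, Datastore, Filter, Scorer, Scraper); the actual types in llm-d-inference-scheduler may differ.

package scheduler

// Illustrative plugin contracts (assumed names, not the repository's API).

// Pod is the scheduler's view of a candidate model-serving endpoint.
type Pod struct {
    Name    string
    Address string
}

// Request carries the request fields routing logic may inspect.
type Request struct {
    Model   string
    Prompt  string
    Session string
}

// Datastore is the shared store that scrapers populate and scorers read.
type Datastore interface {
    Get(podName, key string) (string, bool)
    Set(podName, key, value string)
}

// Filter excludes pods based on static or dynamic criteria.
type Filter interface {
    Filter(req *Request, pods []Pod) []Pod
}

// Scorer assigns a score to each candidate pod; higher is better.
type Scorer interface {
    Score(req *Request, pods []Pod, store Datastore) map[string]float64
    Weight() float64
}

// Scraper collects pod metadata and metrics into the shared datastore.
type Scraper interface {
    Scrape(pod Pod, store Datastore) error
}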


Filters, Scorers, and Scrapers

Core Design Principles

  • Pluggability: No core changes are needed to add new scorers or filters
  • Isolation: Each component operates independently

Routing Flow

  1. Filtering

    • Pods in an InferencePool go through a sequential chain of filters
    • Pods may be excluded based on criteria like model compatibility, resource usage, or custom logic
  2. Scoring

    • Filtered pods are scored using a weighted set of scorers
    • Scorers currently run sequentially (future: parallel execution)
    • Scorers access a shared datastore populated by scrapers
  3. Pod Selection

    • The highest-scored pod is selected
    • If multiple pods share the same score, one is selected at random
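
This flow can be sketched as a single selection function. The snippet continues the assumed types from the pluggability sketch above and is not the scheduler's actual implementation:

import (
    "errors"
    "math"
    "math/rand"
)

// SelectPod walks the routing flow: filter the pool, apply weighted
// scorers, then pick the highest-scored pod (ties broken at random).
func SelectPod(req *Request, pool []Pod, filters []Filter,
    scorers []Scorer, store Datastore) (Pod, error) {

    // 1. Filtering: each filter narrows the candidate set in turn.
    candidates := pool
    for _, f := range filters {
        candidates = f.Filter(req, candidates)
    }
    if len(candidates) == 0 {
        return Pod{}, errors.New("no pod passed the filter chain")
    }

    // 2. Scoring: accumulate weighted scores per pod (sequential today).
    totals := make(map[string]float64, len(candidates))
    for _, s := range scorers {
        for name, score := range s.Score(req, candidates, store) {
            totals[name] += s.Weight() * score
        }
    }

    // 3. Pod selection: keep the top-scored pods, then pick one at random.
    var best []Pod
    bestScore := math.Inf(-1)
    for _, p := range candidates {
        switch score := totals[p.Name]; {
        case score > bestScore:
            bestScore, best = score, []Pod{p}
        case score == bestScore:
            best = append(best, p)
        }
    }
    return best[rand.Intn(len(best))], nil
}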

Lifecycle Hooks

  • Pre-call
  • Scoring
  • Post-choice
  • After-response
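
One possible shape for these hooks, again as an assumed sketch on top of the types above (the Scoring phase corresponds to the Scorer interface already shown):

// Hypothetical optional interfaces for the remaining lifecycle phases;
// the names mirror the list above, not the repository's actual API.
type PreCallHook interface {
    // PreCall runs before filtering and scoring, e.g. to prepare request state.
    PreCall(req *Request, candidates []Pod)
}

type PostChoiceHook interface {
    // PostChoice runs once a pod has been selected, e.g. to record session affinity.
    PostChoice(req *Request, chosen Pod)
}

type AfterResponseHook interface {
    // AfterResponse runs after the model responds, e.g. to update scorer state.
    AfterResponse(req *Request, chosen Pod, statusCode int)
}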

Scorers & Configuration

  • Session-aware: prefers pods that served the same session. Env vars: ENABLE_SESSION_AWARE_SCORER, SESSION_AWARE_SCORER_WEIGHT, PREFILL_ENABLE_SESSION_AWARE_SCORER, PREFILL_SESSION_AWARE_SCORER_WEIGHT
  • Prefix-aware: matches the prompt prefix. Env vars: ENABLE_PREFIX_AWARE_SCORER, PREFIX_AWARE_SCORER_WEIGHT, PREFILL_ENABLE_PREFIX_AWARE_SCORER, PREFILL_PREFIX_AWARE_SCORER_WEIGHT
  • KVCache-aware: optimizes for KV cache reuse. Env vars: ENABLE_KVCACHE_AWARE_SCORER, KVCACHE_INDEXER_REDIS_ADDR, PREFILL_ENABLE_KVCACHE_AWARE_SCORER, PREFILL_KVCACHE_INDEXER_REDIS_ADDR, HF_TOKEN
  • Load-aware: avoids busy pods. Env vars: ENABLE_LOAD_AWARE_SCORER, LOAD_AWARE_SCORER_WEIGHT, PREFILL_ENABLE_LOAD_AWARE_SCORER, PREFILL_LOAD_AWARE_SCORER_WEIGHT
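
As an illustration of how these toggles might be consumed, the snippet below reads the enable/weight variable pairs from the environment at startup. The helper and the default weight are assumptions; only the variable names come from the list above.

package main

import (
    "fmt"
    "os"
    "strconv"
)

// scorerWeight returns the configured weight when a scorer's ENABLE_* flag
// is "true", or 0 when the scorer is disabled. Falling back to defaultWeight
// when the weight variable is unset is an assumption for this sketch.
func scorerWeight(enableVar, weightVar string, defaultWeight float64) float64 {
    if os.Getenv(enableVar) != "true" {
        return 0
    }
    if w, err := strconv.ParseFloat(os.Getenv(weightVar), 64); err == nil {
        return w
    }
    return defaultWeight
}

func main() {
    weights := map[string]float64{
        "session-aware": scorerWeight("ENABLE_SESSION_AWARE_SCORER", "SESSION_AWARE_SCORER_WEIGHT", 1),
        "prefix-aware":  scorerWeight("ENABLE_PREFIX_AWARE_SCORER", "PREFIX_AWARE_SCORER_WEIGHT", 1),
        "load-aware":    scorerWeight("ENABLE_LOAD_AWARE_SCORER", "LOAD_AWARE_SCORER_WEIGHT", 1),
    }
    fmt.Println("decode-side scorer weights:", weights)
}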

Prefill / Decode Configuration

If Disaggregated Prefill is enabled, also define the following environment variables:

  • Toggle P/D mode: PD_ENABLED=true
  • Threshold: PD_PROMPT_LEN_THRESHOLD=<value>

Prefill Scorers:

export PREFILL_ENABLE_SESSION_AWARE_SCORER=true
export PREFILL_SESSION_AWARE_SCORER_WEIGHT=1
export PREFILL_ENABLE_KVCACHE_AWARE_SCORER=true
export PREFILL_KVCACHE_AWARE_SCORER_WEIGHT=1
export PREFILL_ENABLE_LOAD_AWARE_SCORER=true
export PREFILL_LOAD_AWARE_SCORER_WEIGHT=1
export PREFILL_ENABLE_PREFIX_AWARE_SCORER=true
export PREFILL_PREFIX_AWARE_SCORER_WEIGHT=1

Metric Scraping

  • Scrapers collect metrics (e.g., memory usage, active adapters)
  • Data is injected into the shared datastore for scorers
  • Scoring can rely on numerical metrics or metadata (model ID, adapter tags)
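
A toy scraper, continuing the assumed types from the pluggability sketch; the queue_length key and the queueDepth callback are illustrative, not real metric names:

import "strconv"

// queueLengthScraper records a pod's current queue depth into the shared
// datastore so that a load-aware scorer can penalize busy pods.
type queueLengthScraper struct {
    // queueDepth stands in for however the metric is actually fetched,
    // e.g. from the pod's metrics endpoint.
    queueDepth func(p Pod) (int, error)
}

func (s *queueLengthScraper) Scrape(p Pod, store Datastore) error {
    depth, err := s.queueDepth(p)
    if err != nil {
        return err
    }
    store.Set(p.Name, "queue_length", strconv.Itoa(depth))
    return nil
}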

Disaggregated Prefill/Decode (P/D)

When enabled, the router:

  • Selects one pod for Prefill (prompt processing)
  • Selects another pod for Decode (token generation)

The vLLM sidecar handles orchestration between the Prefill and Decode stages, providing:

  • Queuing
  • Local memory management
  • Experimental protocol compatibility
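
Putting the toggles above together, the P/D split could look roughly like this. SelectPod and the surrounding types come from the earlier sketches, and treating PD_PROMPT_LEN_THRESHOLD as a character count is an assumption made for illustration (it may be measured in tokens in practice):

import (
    "os"
    "strconv"
)

// routePD decides whether to disaggregate: short prompts (or PD_ENABLED
// unset) are served by a single pod; long prompts get separate prefill
// and decode pods, each scored with its own scorer set.
func routePD(req *Request, pool []Pod, filters []Filter,
    prefillScorers, decodeScorers []Scorer, store Datastore) (prefill, decode Pod, err error) {

    pdEnabled := os.Getenv("PD_ENABLED") == "true"
    threshold, _ := strconv.Atoi(os.Getenv("PD_PROMPT_LEN_THRESHOLD"))

    if !pdEnabled || len(req.Prompt) < threshold {
        // Single-pod path: the same pod performs prefill and decode.
        decode, err = SelectPod(req, pool, filters, decodeScorers, store)
        return decode, decode, err
    }

    // Disaggregated path: pick a prefill pod with the PREFILL_*-configured
    // scorers, then an independent decode pod.
    prefill, err = SelectPod(req, pool, filters, prefillScorers, store)
    if err != nil {
        return Pod{}, Pod{}, err
    }
    decode, err = SelectPod(req, pool, filters, decodeScorers, store)
    return prefill, decode, err
}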

Note: The detailed P/D design is available in this document: Disaggregated Prefill/Decode in llm-d


InferencePool & InferenceModel Design

Current Assumptions

  • Single InferencePool and single EPP due to Envoy limitations
  • Model-based filtering can be handled within EPP
  • Currently only one base model is supported

References