Feature Matrix

llm-d supports multiple model servers, accelerator backends, and infrastructure providers at various levels of maturity.

This page describes the current coverage as validated in the v0.7.0 release and nightly CI.

Well-Lit Paths × Model Server × Accelerator

Optimized Baseline

| Accelerator | vLLM | SGLang |
| --- | --- | --- |
| NVIDIA GPU | | |
| AMD ROCm | | |
| Intel XPU | | |
| Intel Gaudi (HPU) | | |
| Google TPU | | |
| CPU | | |

Nightly CI: OpenShift (CUDA), GKE (CUDA), CoreWeave (CUDA), XPU (PR-triggered), HPU (PR-triggered)

Precise Prefix-Cache-Aware Routing

| Accelerator | vLLM | SGLang |
| --- | --- | --- |
| NVIDIA GPU | | |
| Intel XPU | | |

Nightly CI: OpenShift (CUDA), CoreWeave (CUDA), GKE (CUDA)

Prefill/Decode Disaggregation

| Accelerator | vLLM | SGLang |
| --- | --- | --- |
| NVIDIA GPU | | |
| AMD ROCm | | |
| Intel XPU | | |
| Google TPU | | |

Nightly CI: OpenShift (CUDA), GKE (CUDA), CoreWeave (CUDA)

Wide Expert-Parallelism

| Accelerator | vLLM | SGLang |
| --- | --- | --- |
| NVIDIA GPU | | |

Nightly CI: OpenShift (CUDA), GKE (CUDA), CoreWeave (CUDA)

Wide EP requires LeaderWorkerSet (LWS) for multi-node orchestration. A DP-aware scheduling variant is under development.

Tiered Prefix Cache

| Accelerator | CPU Offload (vLLM) | Storage Offload (vLLM) | SGLang |
| --- | --- | --- | --- |
| NVIDIA GPU | | | |
| Intel XPU | | | |
| Google TPU | Coming soon | | |

Nightly CI: OpenShift (CUDA)

Workload Autoscaling

| Variant | vLLM | SGLang |
| --- | --- | --- |
| HPA + IGW Metrics | | |
| Workload Variant Autoscaler (WVA) | | |

Nightly CI: OpenShift (WVA), CoreWeave (WVA)

Predicted Latency-Based Scheduling

| Accelerator | vLLM | SGLang |
| --- | --- | --- |
| NVIDIA GPU | | |

Nightly CI: GKE (CUDA), CoreWeave (CUDA)

Accelerator-agnostic: validated only on NVIDIA GPUs, but the scheduler logic does not depend on the accelerator type and should work on any backend supported by vLLM or SGLang.
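The accelerator-agnostic claim follows from the shape of the decision: the scheduler compares predicted latencies across replicas, not device properties. As a purely illustrative toy (this is NOT llm-d's actual scheduler; the predictor, its coefficients, and the replica fields below are all made up for the sketch):

```python
# Toy illustration of predicted latency-based replica selection.
# The linear predictor and its coefficients are hypothetical; a real
# predictor would be fit to observed serving metrics. Note that nothing
# here depends on the accelerator type.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queue_depth: int             # requests currently waiting
    kv_cache_utilization: float  # 0.0 - 1.0

def predict_latency_ms(r: Replica, prompt_tokens: int) -> float:
    # Hypothetical linear model: queue pressure + cache pressure + prompt size.
    return 5.0 * r.queue_depth + 40.0 * r.kv_cache_utilization + 0.01 * prompt_tokens

def pick_replica(replicas: list[Replica], prompt_tokens: int) -> Replica:
    # Route to the replica with the lowest predicted latency.
    return min(replicas, key=lambda r: predict_latency_ms(r, prompt_tokens))

replicas = [
    Replica("pod-a", queue_depth=4, kv_cache_utilization=0.9),
    Replica("pod-b", queue_depth=1, kv_cache_utilization=0.3),
]
print(pick_replica(replicas, prompt_tokens=512).name)  # prints pod-b
```

Because the inputs are generic serving signals (queue depth, cache utilization, prompt length), the same selection logic applies regardless of which backend produces them.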

Asynchronous Processing

| Backend | vLLM | SGLang |
| --- | --- | --- |
| Redis | | |
| GCP Pub/Sub | | |

Nightly CI: None

Batch Gateway

| Backend | vLLM | SGLang |
| --- | --- | --- |
| PostgreSQL + Redis | | |
| S3 + Redis | | |

Nightly CI: None

Provides an OpenAI-compatible Batch API (/v1/batches, /v1/files) for offline inference workloads.
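Since the Batch API is OpenAI-compatible, clients build a JSONL input file, upload it via /v1/files, then create a batch via /v1/batches referencing the file. A minimal sketch of constructing the JSONL payload (the model name and custom IDs are placeholders; the line format follows the OpenAI Batch API convention):

```python
# Sketch: build the JSONL contents for an OpenAI-compatible batch submission.
# Each line is one request; the file is uploaded via POST /v1/files, and the
# returned file ID is passed as input_file_id to POST /v1/batches.
import json

def make_batch_line(custom_id: str, prompt: str, model: str) -> str:
    """One JSONL line: a single chat-completion request within the batch."""
    return json.dumps({
        "custom_id": custom_id,                # caller-chosen ID, echoed in results
        "method": "POST",
        "url": "/v1/chat/completions",         # per-line target endpoint
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

lines = [make_batch_line(f"req-{i}", p, "example-model")
         for i, p in enumerate(["hello", "world"])]
batch_file = "\n".join(lines)
print(batch_file.count("\n") + 1)  # prints 2
```

Results arrive asynchronously as an output file keyed by `custom_id`, which is what makes the API suitable for offline workloads.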

Infrastructure Providers

| Provider | Optimized Baseline | P/D Disaggregation | Wide EP | Tiered Prefix Cache | Precise Prefix Cache | WVA |
| --- | --- | --- | --- | --- | --- | --- |
| OpenShift | Nightly | Nightly | Nightly | Nightly | Nightly | Nightly |
| GKE | Nightly | Nightly | Nightly | | Nightly | |
| CoreWeave (CKS) | Nightly | Nightly | Nightly | | Nightly | Nightly |
| Minikube | Manual | | | | | |
| DigitalOcean | Manual | | | | | |
| AKS | Manual | | | | | |

Gateway Providers

| Provider | Status | Notes |
| --- | --- | --- |
| Istio | Default | Used in all well-lit paths |
| AgentGateway | Supported | Preferred for new self-installed deployments |
| GKE Gateway | Supported | Externally managed; used in GKE guides |
| kgateway | Deprecated | Will be removed in the next release |

Support Matrix

Supported Hardware

For accelerator maintainer contacts and contribution requirements, see Accelerator Support. The information below is also maintained in that document and will be consolidated into this feature matrix in a future docs revision.

| Accelerator | Supported Devices | Notes |
| --- | --- | --- |
| NVIDIA GPU | A100, H100, H200, B200 | Primary platform. All well-lit paths validated. |
| AMD ROCm | MI250, MI300X | Optimized baseline and P/D disaggregation. |
| Google TPU | v5e, v6e, v7 | GKE only. Optimized baseline and P/D disaggregation. |
| Intel XPU | Data Center GPU Max 1550, BMG (Battlemage) | Uses DRA. Optimized baseline, P/D disaggregation, and precise prefix cache. |
| Intel Gaudi (HPU) | Gaudi 2, Gaudi 3 | Uses DRA. Optimized baseline. |
| CPU | Intel Xeon (Sapphire Rapids+), AMD EPYC | 64 cores, 64 GB RAM per replica. |
Note: All Operational Excellence (e.g., observability, flow control) and Batch well-lit paths are supported on all accelerator types.

Software Requirements

| Component | Minimum Version | Notes |
| --- | --- | --- |
| Kubernetes | 1.30+ | Gateway API v1 support required |
| Gateway API CRDs | v1.5.1 | |
| Gateway API Inference Extension CRDs | v1.5.0 | |
| Helm | 3.x | For helmfile-based guides |
| Helmfile | 0.x | For helmfile-based guides |
| kubectl | 1.30+ | |
| kustomize | 5.x | For kustomize-based guides (tiered prefix cache, wide EP) |
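When scripting preflight checks against these minimums, version strings have to be compared numerically rather than lexically ("1.9" sorts after "1.30" as text). A small sketch, assuming plain dotted numeric versions (an "x" placeholder like "3.x" would need extra handling):

```python
# Sketch: compare an installed component version against a minimum from the
# table above. Handles a leading "v" (e.g. "v1.5.1") and a trailing "+"
# (e.g. "1.30+"); purely numeric dotted versions only.
def parse(version: str) -> tuple:
    return tuple(int(p) for p in version.lstrip("v").rstrip("+").split("."))

def meets_minimum(installed: str, minimum: str) -> bool:
    inst, req = parse(installed), parse(minimum)
    # Zero-pad to equal length so "1.30" vs "1.30.2" compares sanely.
    n = max(len(inst), len(req))
    return inst + (0,) * (n - len(inst)) >= req + (0,) * (n - len(req))

print(meets_minimum("1.29.4", "1.30"))    # prints False (fails the 1.30+ floor)
print(meets_minimum("v1.5.1", "v1.5.0"))  # prints True
```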

Installation Methods

llm-d guides use two deployment methods. Both produce the same Kubernetes resources.

| Method | Notes |
| --- | --- |
| Helm | Used to deploy the llm-d router in standalone and gateway modes, the async processor, etc. |
| Kustomize | Used to deploy declarative overlays for model servers, gateways, etc. Reusable base layers live in guides/recipes/. |

The project is migrating from helmfile to kustomize-first installation (tracking issue). New guides should prefer kustomize.

Guide Maturity

Each well-lit path guide is assigned a maturity level reflecting its testing and documentation coverage.

| Level | Definition |
| --- | --- |
| High | Tested nightly across multiple infrastructure providers (OpenShift, GKE, CoreWeave). Benchmarked and documented. |
| Medium | Tested nightly on at least one infrastructure provider. Documented with a deployment guide. |
| Experimental | Functional but not regularly tested by maintainers. May have known limitations. |

| Guide | Maturity | Nightly Providers |
| --- | --- | --- |
| Optimized Baseline (vLLM, CUDA) | High | OpenShift, GKE, CoreWeave |
| Optimized Baseline (SGLang, CUDA) | Medium | |
| Optimized Baseline (AMD, XPU, HPU, TPU, CPU) | Experimental | XPU, HPU (PR-triggered) |
| Precise Prefix-Cache-Aware Routing | Medium | OpenShift |
| Prefill/Decode Disaggregation (vLLM, CUDA) | High | OpenShift, GKE, CoreWeave |
| Prefill/Decode Disaggregation (SGLang, CUDA) | Experimental | |
| Prefill/Decode Disaggregation (AMD, XPU, TPU) | Experimental | |
| Wide Expert-Parallelism | Experimental | OpenShift, GKE, CoreWeave |
| Tiered Prefix Cache | Medium | OpenShift |
| Workload Autoscaling (WVA) | Experimental | OpenShift, CoreWeave |
| Workload Autoscaling (HPA + IGW) | Experimental | |
| Predicted Latency-Based Scheduling | Medium | GKE, CoreWeave |
| Asynchronous Processing | Experimental | |