
Well-lit Path: P/D Disaggregation

Overview

This guide demonstrates how to deploy Llama-70B using vLLM's P/D disaggregation support with NIXL. This guide has been validated on:

  • an 8xH200 cluster with InfiniBand networking
  • an 8xH200 cluster on GKE with RoCE networking

WARNING: We are still investigating and optimizing performance for other hardware and networking configurations.

In this example, we will demonstrate a deployment of Llama-3.3-70B-Instruct-FP8 with:

  • 4 TP=1 Prefill Workers
  • 1 TP=4 Decode Worker

P/D Best Practices

P/D disaggregation can benefit overall throughput by:

  • Specializing P and D workers for the compute-bound prefill phase vs the latency-bound decode phase
  • Reducing the number of copies of the model (increasing KV cache RAM) with wide parallelism

However, P/D disaggregation is not a fit for every workload. We suggest exploring P/D disaggregation for workloads with:

  • Large models (e.g. Llama-70B+, not Llama-8B)
  • Longer input sequence lengths (e.g. 10k ISL | 1k OSL, not 200 ISL | 200 OSL)
  • Sparse MoE architectures with opportunities for wide-EP

As a result, as you tune your P/D deployments, we suggest focusing on the following parameters:

  • Heterogeneous Parallelism: deploy P workers with less parallelism and more replicas, and D workers with more parallelism and fewer replicas
  • xPyD Ratios: tuning the ratio of P workers to D workers to ensure balance for your ISL|OSL ratio
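As a rough starting point for choosing an xPyD ratio, you can balance the time a single request occupies each worker type. The sketch below uses hypothetical per-worker throughput numbers purely for illustration — benchmark your own prefill and decode workers, then refine the ratio under realistic load.

```shell
# Back-of-envelope xPyD sizing. All throughput figures are hypothetical
# placeholders -- measure your own workers before tuning.
ISL=10000            # input sequence length (tokens)
OSL=1000             # output sequence length (tokens)
PREFILL_TOK_S=5000   # hypothetical per-worker prefill throughput (tok/s)
DECODE_TOK_S=2000    # hypothetical per-worker decode throughput (tok/s)

# Time one request occupies each worker type (milliseconds, integer math)
PREFILL_MS=$(( ISL * 1000 / PREFILL_TOK_S ))
DECODE_MS=$(( OSL * 1000 / DECODE_TOK_S ))

# Balance occupancy: prefill workers needed per decode worker
P_PER_D=$(( PREFILL_MS / DECODE_MS ))
echo "start with roughly ${P_PER_D} prefill workers per decode worker"
```

With these placeholder numbers the estimate lands at 4P:1D, which matches the topology deployed in this guide; real workloads should treat this only as a first guess.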

For very large models leveraging wide-EP, traffic for KV cache transfer may contend with expert parallelism when the ISL|OSL ratio is also high. We recommend starting with RDMA for KV cache transfer before attempting to leverage TCP, as TCP transfer requires more tuning of UCX under NIXL.
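If you do need to fall back to TCP, the tuning happens through UCX environment variables on the workers. The snippet below shows the kind of knobs involved; the transport list and interface name are assumptions to explore for your environment, not a validated configuration.

```shell
# Hypothetical starting point for forcing NIXL's UCX layer onto TCP.
# Exact values depend on your NICs and UCX build -- treat these as
# knobs to experiment with, not recommended settings.
export UCX_TLS=tcp,cuda_copy   # restrict transports to TCP plus GPU copies
export UCX_NET_DEVICES=eth0    # hypothetical interface name; list yours with `ucx_info -d`
```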

Hardware Requirements

This guide expects 8 Nvidia GPUs of any kind, and RDMA via InfiniBand or RoCE between all pods in the workload.

Prerequisites

Installation

Use the helmfile to compose and install the stack. The namespace in which the stack will be deployed is derived from the ${NAMESPACE} environment variable; if you have not set it, this example defaults to llm-d-pd.

export NAMESPACE=llm-d-pd # Or any namespace your heart desires
cd guides/pd-disaggregation
helmfile apply -n ${NAMESPACE}

NOTE: You can set the $RELEASE_NAME_POSTFIX env variable to change the release names. This is how we support concurrent installs. Ex: RELEASE_NAME_POSTFIX=pd-2 helmfile apply -n ${NAMESPACE}

NOTE: This uses Istio as the default provider, see Gateway Options for installing with a specific provider.

Gateway options

To specify your gateway choice, use the -e <gateway option> flag, ex:

helmfile apply -e kgateway -n ${NAMESPACE}

To see what gateway options are supported, refer to our gateway provider prereq doc. Gateway configurations per provider are tracked in the gateway-configurations directory.

You can also customize your gateway, for more information on how to do that see our gateway customization docs.

Infrastructure provider specifics

This guide uses RDMA via InfiniBand or RoCE for disaggregated serving kv-cache transfer. The resource attributes required to configure accelerator networking are not yet standardized via Kubernetes Dynamic Resource Allocation and so are parameterized per infra provider in the Helm charts. If your provider has a custom setting you will need to update the charts before deploying.

Install HTTPRoute

Follow provider specific instructions for installing HTTPRoute.

Install for "kgateway" or "istio"

kubectl apply -f httproute.yaml

Install for "gke"

kubectl apply -f httproute.gke.yaml
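For orientation, the HTTPRoute in either file wires the inference gateway to the InferencePool. The sketch below is only a rough illustration of that shape — the httproute.yaml shipped in guides/pd-disaggregation is the source of truth, and the resource names here are assumed from the default release names (infra-pd, gaie-pd).

```shell
# Illustrative sketch of the HTTPRoute shape -- NOT a replacement for the
# guide's httproute.yaml; names are assumptions based on default releases.
cat <<'EOF' > httproute-sketch.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ms-pd-route
spec:
  parentRefs:
    - name: infra-pd-inference-gateway   # Gateway installed by the infra-pd release (assumed name)
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool            # installed by the gaie-pd chart
          name: gaie-pd
EOF
```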

Verify the Installation

  • First, list the helm releases to verify that the 3 charts were installed into your chosen namespace:
helm list -n ${NAMESPACE}
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gaie-pd llm-d-pd 1 2025-08-24 12:54:51.231537 -0700 PDT deployed inferencepool-v0.5.1 v0.5.1
infra-pd llm-d-pd 1 2025-08-24 12:54:46.983361 -0700 PDT deployed llm-d-infra-v1.2.4 v0.2.0
ms-pd llm-d-pd 1 2025-08-24 12:54:56.736873 -0700 PDT deployed llm-d-modelservice-v0.2.7 v0.2.0
  • Out of the box with this example you should have the following resources:
kubectl get all -n ${NAMESPACE}
NAME READY STATUS RESTARTS AGE
pod/gaie-pd-epp-54444ddc66-qv6ds 1/1 Running 0 2m35s
pod/infra-pd-inference-gateway-istio-56d66db57f-zwtzn 1/1 Running 0 2m41s
pod/ms-pd-llm-d-modelservice-decode-84bf6d5bdd-jzfjn 2/2 Running 0 2m30s
pod/ms-pd-llm-d-modelservice-prefill-86f6fb7cdc-8kfb8 1/1 Running 0 2m30s
pod/ms-pd-llm-d-modelservice-prefill-86f6fb7cdc-g6wmp 1/1 Running 0 2m30s
pod/ms-pd-llm-d-modelservice-prefill-86f6fb7cdc-jx2w2 1/1 Running 0 2m30s
pod/ms-pd-llm-d-modelservice-prefill-86f6fb7cdc-vzcb8 1/1 Running 0 2m30s

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/gaie-pd-epp ClusterIP 10.16.0.255 <none> 9002/TCP,9090/TCP 2m35s
service/gaie-pd-ip-bb618139 ClusterIP None <none> 54321/TCP 2m35s
service/infra-pd-inference-gateway-istio LoadBalancer 10.16.3.74 10.16.4.3 15021:31707/TCP,80:34096/TCP 2m41s

NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/gaie-pd-epp 1/1 1 1 2m36s
deployment.apps/infra-pd-inference-gateway-istio 1/1 1 1 2m42s
deployment.apps/ms-pd-llm-d-modelservice-decode 1/1 1 1 2m31s
deployment.apps/ms-pd-llm-d-modelservice-prefill 4/4 4 4 2m31s

NAME DESIRED CURRENT READY AGE
replicaset.apps/gaie-pd-epp-54444ddc66 1 1 1 2m36s
replicaset.apps/infra-pd-inference-gateway-istio-56d66db57f 1 1 1 2m42s
replicaset.apps/ms-pd-llm-d-modelservice-decode-84bf6d5bdd 1 1 1 2m31s
replicaset.apps/ms-pd-llm-d-modelservice-prefill-86f6fb7cdc 4 4 4 2m31s

NOTE: This assumes no other guide deployments in your given ${NAMESPACE} and that you have not changed the default release names via the ${RELEASE_NAME_POSTFIX} environment variable.

Using the stack

For instructions on getting started making inference requests, see our docs.
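As a quick smoke test, you can send an OpenAI-style completion request through the gateway. This assumes the vLLM OpenAI-compatible API is exposed at /v1/completions; the IP below is the sample EXTERNAL-IP from the service listing above, and the model id is an assumption — substitute your gateway's address and the model your deployment actually registered.

```shell
# Smoke-test request through the inference gateway. GATEWAY_IP and the
# model id are assumptions for illustration -- substitute your own values
# (EXTERNAL-IP from `kubectl get svc -n ${NAMESPACE}`).
GATEWAY_IP=10.16.4.3
BODY='{"model":"Llama-3.3-70B-Instruct-FP8","prompt":"Hello","max_tokens":64}'

curl -s --max-time 5 "http://${GATEWAY_IP}/v1/completions" \
  -H 'Content-Type: application/json' \
  -d "${BODY}" || echo "gateway not reachable from this machine"
```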

Cleanup

To remove the deployment:

# Remove the releases deployed by the helmfile
helmfile destroy -n ${NAMESPACE}

# Or uninstall them individually
helm uninstall ms-pd -n ${NAMESPACE}
helm uninstall gaie-pd -n ${NAMESPACE}
helm uninstall infra-pd -n ${NAMESPACE}

NOTE: If you set the $RELEASE_NAME_POSTFIX environment variable, your release names will be different from the command above: infra-$RELEASE_NAME_POSTFIX, gaie-$RELEASE_NAME_POSTFIX and ms-$RELEASE_NAME_POSTFIX.

NOTE: You do not need to specify your environment with the -e <environment> flag to helmfile when removing an installation of the guide, even if you used a non-default option. You do, however, need to pass -n ${NAMESPACE}; otherwise it may not clean up the releases in the proper namespace.

Cleanup HTTPRoute

Follow provider specific instructions for deleting HTTPRoute.

Cleanup for "kgateway" or "istio"

kubectl delete -f httproute.yaml

Cleanup for "gke"

kubectl delete -f httproute.gke.yaml

Customization

For information on customizing a guide and tips to build your own, see our docs.

Content Source

This content is automatically synced from guides/pd-disaggregation/README.md in the llm-d/llm-d repository.