High performance distributed inference on Kubernetes with llm-d
Our guides provide tested and benchmarked recipes and Helm charts to serve large language models (LLMs) at peak performance, following best practices common to production deployments. Familiarity with basic Kubernetes deployment and operation is assumed.
If you want to learn by doing, follow the step-by-step first deployment in QUICKSTART.md.
Who are these guides (and llm-d) for?
These guides are aimed at startups and enterprises deploying production LLM serving who want the best possible performance while minimizing operational complexity. State-of-the-art LLM inference involves multiple optimizations that offer meaningful tradeoffs depending on the use case. The guides help you identify those key optimizations, understand their tradeoffs, and verify the gains against your own workload.
We focus on the following use cases:
- Deploying a self-hosted LLM behind a single workload across tens or hundreds of nodes
- Running a production model-as-a-service platform that supports many users and workloads sharing one or more LLM deployments
Well-Lit Path Guides
A well-lit path is a documented, tested, and benchmarked solution of choice to reduce adoption risk and maintenance cost. These are the central best practices common to production deployments of large language model serving.
We currently offer three tested and benchmarked paths to help you deploy large models:
- Intelligent Inference Scheduling - Deploy vLLM behind the Inference Gateway (IGW) to decrease latency and increase throughput via precise prefix-cache aware routing and customizable scheduling policies.
- Prefill/Decode Disaggregation - Reduce time to first token (TTFT) and get more predictable time per output token (TPOT) by splitting inference into prefill servers that handle prompts and decode servers that generate responses; most beneficial for large models such as Llama-70B and for very long prompts.
- Wide Expert-Parallelism - Deploy very large Mixture-of-Experts (MoE) models like DeepSeek-R1 and significantly reduce end-to-end latency and increase throughput by scaling up with Data Parallelism and Expert Parallelism over fast accelerator networks.
These guides are intended as a starting point for your own configuration and deployment of model servers. The Helm charts used in these guides provide basic, reusable building blocks for vLLM deployments and inference scheduler configuration, but they do not cover the full range of possible configurations. Both the guides and the charts depend on features provided and supported by the vLLM and inference gateway open source projects.
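For the Intelligent Inference Scheduling path, the essential wiring is an HTTPRoute on the gateway whose backend is an InferencePool of vLLM pods, with an endpoint-picker extension applying the scheduling policies. The sketch below assumes the Gateway API Inference Extension CRDs; the API versions, field names, and all resource names, labels, and ports shown here are illustrative assumptions and may differ from what the guide's Helm charts actually render.

```yaml
# Sketch only: route traffic from an inference gateway to a pool of vLLM pods.
# API versions, field names, and the resource names/labels/ports are assumptions;
# the Intelligent Inference Scheduling guide and its charts are authoritative.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway          # the Inference Gateway (IGW) instance
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool            # pool of vLLM endpoints, not a plain Service
      name: vllm-llama3-8b
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b
spec:
  selector:
    app: vllm-llama3-8b              # matches the labels on the vLLM pods
  targetPortNumber: 8000             # port the vLLM server listens on
  extensionRef:
    name: vllm-llama3-8b-epp         # endpoint picker that applies prefix-cache
                                     # aware routing and scheduling policies
```

The endpoint picker referenced by the pool is where prefix-cache aware routing and customizable scheduling policies are applied; the guide's charts install and configure it for you.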
Supporting Guides
Our supporting guides address common operational challenges with model serving at scale:
- Simulating model servers deploys a vLLM model server simulator that allows testing inference scheduling and orchestration at scale, since individual instances do not need accelerators (see the sketch below).
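A simulator deployment is ordinary Kubernetes; because the pods request no GPUs, replicas can be scaled up cheaply. The sketch below is illustrative only: the image name, arguments, and port are placeholders, and the simulation guide documents the actual simulator image and its supported flags.

```yaml
# Sketch only: a CPU-only deployment of a vLLM model server simulator.
# The image, args, and port are placeholders; see the simulation guide for
# the real simulator image and its flags.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-sim
spec:
  replicas: 16                       # scale freely; no accelerators are requested
  selector:
    matchLabels:
      app: vllm-sim
  template:
    metadata:
      labels:
        app: vllm-sim
    spec:
      containers:
      - name: simulator
        image: example.com/llm-d/vllm-simulator:latest               # placeholder image
        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]  # placeholder flags
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: 250m                # modest CPU/memory instead of GPUs
            memory: 256Mi
```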
Other Guides
The following guides have been contributed by the community; they do not yet fully integrate with the llm-d configuration structure and are not supported as well-lit paths:
- Coming Soon!
New guides added to this list enable at least one of the core well-lit paths but may include prerequisite steps specific to new hardware or infrastructure providers without full abstraction. A guide added here is expected to eventually become part of an existing well-lit path.
This content is automatically synced from guides/README.md in the llm-d/llm-d repository.
📝 To suggest changes, please edit the source file or create an issue.