Rollout Guides
Rollout guides show how to introduce new versions of your inference infrastructure incrementally, with minimal service disruption.
Overview
These guides cover rollout strategies for LLM inference deployments, helping you choose the right approach based on your requirements.
Rollout Strategies
Rolling Update
A Rolling Update is the standard Kubernetes deployment strategy that updates pods gradually within a single InferencePool. This approach works in both standalone and llm-d router gateway modes.
How it works:
- Updates pods incrementally (e.g., 25% at a time)
- Old pods continue serving traffic until new pods are healthy
- Built into Kubernetes Deployments
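As a rough illustration of the incremental step size, the sketch below resolves percentage-based `maxSurge` and `maxUnavailable` settings (the Deployment `rollingUpdate` fields) into pod counts. Kubernetes rounds a `maxSurge` percentage up and a `maxUnavailable` percentage down. The function name and the returned keys are illustrative only, not part of any API.

```python
def rolling_update_bounds(replicas: int, max_surge_pct: int = 25,
                          max_unavailable_pct: int = 25) -> dict:
    """Resolve percentage-based rollingUpdate settings into pod counts."""
    if replicas < 0 or max_surge_pct < 0 or not 0 <= max_unavailable_pct <= 100:
        raise ValueError("invalid rollout settings")
    # A maxSurge percentage rounds up (ceiling division on integers).
    max_surge = -((-replicas * max_surge_pct) // 100)
    # A maxUnavailable percentage rounds down (floor division).
    max_unavailable = (replicas * max_unavailable_pct) // 100
    return {
        "max_surge": max_surge,
        "max_unavailable": max_unavailable,
        # Most pods that can exist at once during the rollout.
        "peak_pods": replicas + max_surge,
        # Fewest pods guaranteed available during the rollout.
        "min_available": replicas - max_unavailable,
    }


if __name__ == "__main__":
    print(rolling_update_bounds(8))
    print(rolling_update_bounds(3))
```

With 8 replicas and 25% for both settings, the rollout runs at most 10 pods and keeps at least 6 available. With 3 replicas it surges by one pod and takes none offline.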
Use Rolling Updates for:
- General, non-critical updates where strict traffic percentages do not matter
- Scenarios where you want to conserve compute resources
- Development and staging environments
Learn more: Kubernetes Rolling Update Tutorial
Blue-Green Update (HTTPRoute Traffic Splitting)
A Blue-Green Update creates a second complete InferencePool and uses HTTPRoute to control traffic distribution between the old (blue) and new (green) versions. This strategy requires llm-d router gateway mode.
How it works:
- Deploy a complete new InferencePool alongside the existing one
- Use HTTPRoute to gradually shift traffic (e.g., 1% → 5% → 10% → 50% → 100%)
- Instant rollback by adjusting HTTPRoute weights
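The traffic shift above can be sketched as a small helper that builds an HTTPRoute manifest with weighted `backendRefs`. This is a minimal sketch, not the manifest from the Blue-Green guide: the route, gateway, and pool names are hypothetical, and the InferencePool API group `inference.networking.k8s.io` is an assumption that depends on the Gateway API Inference Extension version installed in your cluster.

```python
import json

# Assumption: adjust to the InferencePool API group your cluster serves.
INFERENCE_POOL_GROUP = "inference.networking.k8s.io"


def http_route_split(name: str, gateway: str, blue_pool: str,
                     green_pool: str, green_percent: int) -> dict:
    """Build an HTTPRoute that sends green_percent of traffic to green_pool."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")

    def pool_ref(pool: str, weight: int) -> dict:
        return {"group": INFERENCE_POOL_GROUP, "kind": "InferencePool",
                "name": pool, "weight": weight}

    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name},
        "spec": {
            "parentRefs": [{"name": gateway}],
            # Each backend receives weight / sum(weights) of the requests;
            # a weight of 0 sends it no traffic.
            "rules": [{"backendRefs": [
                pool_ref(blue_pool, 100 - green_percent),
                pool_ref(green_pool, green_percent),
            ]}],
        },
    }


if __name__ == "__main__":
    # Hypothetical names; the first stage of a 1% -> 5% -> ... rollout.
    route = http_route_split("llm-route", "inference-gateway",
                             "pool-blue", "pool-green", 1)
    print(json.dumps(route, indent=2))
```

Rolling back under this sketch means regenerating the route with `green_percent=0` and applying it again; no pods need to be recreated.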
Use Blue-Green Updates for:
- Critical, high-risk production deployments that require gradual canary rollouts
- Scenarios requiring fast rollbacks
- Header-based routing (e.g., routing beta users to new version)
- Updates that need precise traffic control
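For the header-based case, one possible shape is an HTTPRoute with two rules: one that matches a request header and targets the green pool, and a default rule that targets the blue pool. The sketch below builds only the `rules` list. The header name and value, the pool names, and the InferencePool API group are assumptions for illustration.

```python
def header_canary_rules(blue_pool: str, green_pool: str,
                        header_name: str, header_value: str) -> list:
    """Build HTTPRoute rules that route one header cohort to green_pool."""

    def pool_ref(pool: str) -> dict:
        # Assumption: API group served by your Inference Extension install.
        return {"group": "inference.networking.k8s.io",
                "kind": "InferencePool", "name": pool}

    return [
        {
            # Requests carrying the header exactly as given go to green.
            "matches": [{"headers": [{
                "type": "Exact",
                "name": header_name,
                "value": header_value,
            }]}],
            "backendRefs": [pool_ref(green_pool)],
        },
        {
            # No matches: this rule catches every other request.
            "backendRefs": [pool_ref(blue_pool)],
        },
    ]


if __name__ == "__main__":
    # Hypothetical header used to tag beta users.
    for rule in header_canary_rules("pool-blue", "pool-green",
                                    "x-rollout-cohort", "beta"):
        print(rule)
```

Unlike a percentage split, this keeps each user on one version for as long as their requests carry the same header.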
Guide: Blue-Green Update
Comparison:
| Feature | Rolling Update | Blue-Green Update |
|---|---|---|
| Routing Control | Random/Even across all healthy pods | Precise Percentage (e.g., exactly 1% or 10%) |
| Blast Radius | High (Any request can land on a new pod) | Low (Isolated to specified target weight) |
| Rollback Speed | Slow (Requires a second rolling update back to the old version) | Instant (Flip HTTPRoute weight back to 0) |
| Resource Costs | Low (Only temporary surge of pods) | High (Requires running two full environments) |
| Version Coexistence | Simultaneously active inside one Service | Strictly separated across two distinct Services |
| Deployment Mode | Standalone and Gateway | Gateway only |
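Note: Available capacity also affects the choice. A Blue-Green Update needs enough compute to run two full InferencePools at once, while a Rolling Update needs only the temporary surge of pods.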
LoRA Adapter Rollout
LoRA (Low-Rank Adaptation) adapter rollouts allow you to update model customizations without changing the base model or infrastructure. This works in both standalone and llm-d router gateway modes.
How it works:
- Use `InferenceModelRewrite` to map model names to specific adapter versions
- Gradually shift traffic between adapter versions
- No infrastructure changes required
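Conceptually, the mapping behaves like a weighted rewrite table keyed by the requested model name. The sketch below only illustrates that idea in plain Python. It does not reproduce the `InferenceModelRewrite` schema, and the model and adapter names are hypothetical.

```python
import random


def rewrite_model(requested: str, rewrites: dict, rng: random.Random) -> str:
    """Pick the adapter version that should serve a requested model name.

    rewrites maps a model name to a list of (adapter_name, weight) pairs.
    Names without an entry pass through unchanged.
    """
    targets = rewrites.get(requested)
    if not targets:
        return requested
    names = [name for name, _ in targets]
    weights = [weight for _, weight in targets]
    # Weighted random choice: each adapter is picked in proportion to its weight.
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical adapters: 90% of traffic stays on v1, 10% tries v2.
    table = {"food-review": [("food-review-v1", 90), ("food-review-v2", 10)]}
    rng = random.Random()
    sample = [rewrite_model("food-review", table, rng) for _ in range(1000)]
    print(sample.count("food-review-v2"), "of 1000 requests went to v2")
```

Shifting traffic then amounts to changing the weights, while clients keep sending the same model name.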
Use LoRA Adapter Rollouts when:
- You need to deploy new versions of LoRA adapters without disrupting service
- You want to test adapter changes with a subset of traffic
- You need to maintain multiple adapter versions simultaneously
Guide: LoRA Adapter Rollout
General Rollout Pattern
All rollout guides follow a similar pattern:
- Deploy the new version - Create it alongside the existing one (for example, a second InferencePool or a new adapter version)
- Configure traffic splitting - Gradually shift traffic to the new version (e.g., 10% → 50% → 100%)
- Monitor and validate - Verify the new version performs correctly at each stage
- Complete rollout - Direct 100% of traffic to the new version
- Clean up - Remove the old version once the new version is stable
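The steps above can be sketched as a small control loop. This is a sketch under stated assumptions, not tooling shipped with the guides: `set_new_weight` and `is_healthy` are hypothetical callbacks that you would back with your own traffic-splitting change and monitoring checks, and the rollback to 0% on a failed check is one possible policy.

```python
from typing import Callable, Iterable


def staged_rollout(stages: Iterable[int],
                   set_new_weight: Callable[[int], None],
                   is_healthy: Callable[[], bool]) -> bool:
    """Walk through traffic stages, rolling back on the first failed check.

    Returns True if every stage passed, False if the rollout was rolled back.
    """
    for percent in stages:
        set_new_weight(percent)      # configure traffic splitting
        if not is_healthy():         # monitor and validate this stage
            set_new_weight(0)        # send all traffic back to the old version
            return False
    return True


if __name__ == "__main__":
    applied = []
    ok = staged_rollout([10, 50, 100], applied.append, lambda: True)
    print(ok, applied)
```

Cleanup of the old version is left out on purpose, so it stays a deliberate manual step after the new version has proven stable.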
Prerequisites
Before following these guides, ensure you have:
- A working llm-d deployment (see getting started guide)
- Access to kubectl and the Kubernetes cluster
- Understanding of Kubernetes Gateway API concepts (for gateway mode)
- Familiarity with your model serving infrastructure (vLLM, etc.)