InferencePool
The InferencePool is the central resource that connects the Gateway, the Endpoint Picker (EPP), and the collection of model server instances. It serves as the source of truth for both endpoint discovery and service mesh/gateway integration.
Functional Overview
The InferencePool performs two primary roles in the inference infrastructure:
- Endpoint Discovery for the EPP: It defines how the EPP should find and monitor the model server Pods that are eligible to serve requests.
- Service Integration for the Gateway: It provides the necessary metadata for the Gateway controller to locate the EPP and connect it to the proxy as an external processing (ext-proc) service.
Architecture and Relations
The following diagram visualizes how the InferencePool resource is involved in the control path of both the EPP and Gateway Controller:
1. Endpoint Discovery (EPP Perspective)
The EPP uses the InferencePool to discover which Pods it can pick from.
- Selector-based Discovery: The InferencePool defines a selector (label matching). The EPP watches for Pods that match these labels within the same namespace.
- Dynamic Membership: As model server Pods are scaled up or down, or as their readiness state changes, the EPP automatically updates its internal list of healthy candidates.
- Port Mapping: The targetPorts in the InferencePool tell the EPP which ports on the discovered Pods are listening for inference traffic (e.g., port 8000 for vLLM).
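The discovery rules above can be sketched in a few lines of Python. This is only an illustrative model, not the EPP's actual implementation: the `Pod` and `Pool` classes, the `discover_endpoints` function, and the sample names and IPs are all hypothetical, and a real EPP watches the Kubernetes API rather than filtering a static list.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    """Hypothetical, simplified view of a model server Pod."""
    name: str
    namespace: str
    ip: str
    labels: dict
    ready: bool = True


@dataclass
class Pool:
    """Hypothetical stand-in for the discovery-related InferencePool fields."""
    namespace: str
    selector: dict      # label key -> required value
    target_ports: list  # ports that serve inference traffic, e.g. [8000]


def matches(selector, labels):
    # A Pod matches when every selector label is present with the same value.
    return all(labels.get(key) == value for key, value in selector.items())


def discover_endpoints(pool, pods):
    """Return (ip, port) pairs for ready, matching Pods in the pool's namespace."""
    endpoints = []
    for pod in pods:
        if pod.namespace != pool.namespace:
            continue  # discovery never crosses namespaces
        if not pod.ready:
            continue  # unready Pods are not candidates
        if not matches(pool.selector, pod.labels):
            continue
        for port in pool.target_ports:
            endpoints.append((pod.ip, port))
    return endpoints


pool = Pool(namespace="inference", selector={"app": "vllm"}, target_ports=[8000])
pods = [
    Pod("vllm-0", "inference", "10.0.0.1", {"app": "vllm"}),
    Pod("vllm-1", "inference", "10.0.0.2", {"app": "vllm"}, ready=False),
    Pod("other-0", "inference", "10.0.0.3", {"app": "web"}),
    Pod("vllm-2", "staging", "10.0.0.4", {"app": "vllm"}),
]
endpoints = discover_endpoints(pool, pods)
```

In this sample only `vllm-0` qualifies: `vllm-1` is not ready, `other-0` has different labels, and `vllm-2` lives in another namespace. Re-running `discover_endpoints` after a readiness change models the dynamic membership described above.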
2. Gateway Integration (Controller Perspective)
When an InferencePool is used as a backendRef in an HTTPRoute, the Gateway controller uses the resource to configure the underlying proxy.
- EPP Connectivity: The endpointPickerRef (or extensionRef) in the InferencePool points to the EPP service. The Gateway controller uses this information to configure the proxy's ext_proc filter, ensuring that every request directed to the pool is first processed by the EPP.
- Routing Logic: The proxy is configured to "park" the request and wait for the EPP's decision. The EPP then instructs the proxy, via the ext_proc protocol, on which specific Pod IP from the discovered pool should receive the request.
- Failure Handling: The failureMode defined in the InferencePool (e.g., FailOpen or FailClose) tells the Gateway controller how to configure the proxy's behavior if the EPP becomes unresponsive.
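The request flow and failure handling described above can be modeled roughly as follows. This is a sketch under stated assumptions rather than the proxy's real behavior: `route_request`, `EPPUnavailable`, and the sample pickers are hypothetical names, and FailOpen is assumed here to mean that the request is forwarded to some pool endpoint without an EPP decision, while FailClose rejects it.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_OPEN = "FailOpen"
    FAIL_CLOSE = "FailClose"


class EPPUnavailable(Exception):
    """Hypothetical error raised when the EPP does not answer in time."""


def route_request(endpoints, pick, failure_mode):
    """Return the endpoint that should receive the request, or None to reject it.

    `pick` stands in for the ext_proc round trip to the EPP.
    """
    try:
        # Normal path: the request is parked until the EPP picks an endpoint.
        return pick(endpoints)
    except EPPUnavailable:
        if failure_mode is FailureMode.FAIL_OPEN and endpoints:
            # Assumption: fall back to any endpoint in the pool.
            return endpoints[0]
        # FailClose (or an empty pool): reject the request.
        return None


def healthy_epp(endpoints):
    # Placeholder policy; a real EPP applies its own scheduling logic.
    return endpoints[-1]


def broken_epp(endpoints):
    raise EPPUnavailable("EPP did not respond")


endpoints = [("10.0.0.1", 8000), ("10.0.0.2", 8000)]
picked = route_request(endpoints, healthy_epp, FailureMode.FAIL_CLOSE)
fallback = route_request(endpoints, broken_epp, FailureMode.FAIL_OPEN)
rejected = route_request(endpoints, broken_epp, FailureMode.FAIL_CLOSE)
```

The trade-off is availability versus control: FailOpen keeps traffic flowing when the EPP is down but loses its endpoint selection, while FailClose guarantees that no request bypasses the EPP.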
Key Relationships
- One-to-One Mapping: Typically, one InferencePool corresponds to one logical deployment of a model (e.g., Gemma4) and is served by one EPP deployment.
- Decoupled Scaling: The model servers can scale independently of the EPP. The InferencePool ensures the EPP is always aware of the current set of available endpoints.
- Namespace Scoped: All discovery and references (Pods, EPP Service, and the InferencePool itself) are strictly contained within the same Kubernetes namespace to maintain security and isolation boundaries.