Flow Control
The Flow Control feature enables intelligent request queuing, which is useful for several reasons:
Multi-Tenant Deployments
In comparison to a single workload deployment, operators of multi-tenant workloads have additional considerations:
- Certain tenants are higher-priority than others (e.g. paid vs unpaid)
- Certain requests have different SLOs than others (e.g. batch vs online)
- Certain tenants are more active than others - we want fairness between them
Flow control introduces intelligent queuing to the EPP, allowing operators to factor traffic dynamics into scheduling decisions. This capability addresses noisy-neighbor problems when mixing high- and low-priority traffic; furthermore, it ensures fairness among equal-priority tenants, preventing any single user from starving others of shared pool resources.
SINGLE TENANT                    MULTI-TENANT
─────────────                    ────────────
[A] ──▶ [ GPUs ]                 [A] ╲
                                 [B] ──▶ [ GPUs ]
[B] ──▶ [ GPUs ]                 [C] ╱
[C] ──▶ [ GPUs ]
One deployment per customer      One deployment, many customers
Single Workload "No-Regret" Scheduling
In addition to inter-tenant prioritization and fairness, flow control also enables "no-regret" scheduling by holding requests during peak saturation. Rather than committing a request to a specific server's queue, where it becomes stuck, the EPP delays dispatch until load subsides, so requests land on the best available resource.
┌───┐  req   ┌────────────────────────┐         ┌─────────────┐
│ A │───────▶│ ┌────────────────────┐ │--------▶│ Server 1    │
└───┘        │ │   Request Queue    │ │         │ [█████] FULL│
             │ │  ───────────────   │ │         └─────────────┘
┌───┐        │ │  [R][R][R][R][R]   │ │         ┌─────────────┐
│ B │───────▶│ └────────────────────┘ │--------▶│ Server 2    │
└───┘        │  • checks load         │         │ [█████] FULL│
             │  • queues reqs if      │         └─────────────┘
┌───┐        │    detects saturation  │         ┌─────────────┐
│ C │───────▶│  • releases reqs when  │────────▶│ Server 3    │
└───┘        │    capacity opens      │         │ [███░░] 60% │
             └────────────────────────┘         └─────────────┘
Deploy
For detailed step-by-step instructions on how to deploy and configure Flow Control, see the Flow Control Architecture.
Architecture
Requests arrive at the proxy with headers expressing their tenant ID and traffic priority. The EPP uses these headers to assign a FlowKey (a tuple of FairnessID and Priority) to each request and maintains a separate in-memory queue for each FlowKey. Each FlowKey belongs to a PriorityBand, which groups tenants that share the same priority.
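The bookkeeping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the EPP's implementation: the header names `x-tenant-id` and `x-priority`, the fallback to a `default` flow at priority 0, and the `FlowRegistry` class are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

# Hypothetical header names; the real names depend on your deployment.
TENANT_HEADER = "x-tenant-id"
PRIORITY_HEADER = "x-priority"


@dataclass(frozen=True)
class FlowKey:
    """Identifies one flow: a (FairnessID, Priority) tuple."""
    fairness_id: str
    priority: int


def flow_key_from_headers(headers: dict[str, str]) -> FlowKey:
    # Assumption for this sketch: requests without headers share a
    # "default" flow at priority 0.
    return FlowKey(
        fairness_id=headers.get(TENANT_HEADER, "default"),
        priority=int(headers.get(PRIORITY_HEADER, "0")),
    )


class FlowRegistry:
    """Keeps one in-memory queue per FlowKey."""

    def __init__(self) -> None:
        self.queues: dict[FlowKey, deque] = defaultdict(deque)

    def enqueue(self, headers: dict[str, str], request: object) -> FlowKey:
        key = flow_key_from_headers(headers)
        self.queues[key].append(request)
        return key

    def band(self, priority: int) -> list[FlowKey]:
        # All flows that share a priority form one PriorityBand.
        return [k for k in self.queues if k.priority == priority]
```

Because `FlowKey` is a frozen dataclass it is hashable, so two tenants at the same priority get distinct queues while still appearing in the same band.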
Then, in each scheduling cycle, the EPP traverses the queues in 3 tiers:
- Priority - the system always services the highest PriorityBand first
- Fairness - within a PriorityBand, the Fairness Policy determines which flow (i.e. tenant) is dispatched next
- Ordering - within a flow (i.e. tenant), the Ordering Policy determines which request to serve (e.g. FCFS or SLO-aware)
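The three tiers can be sketched as a single selection function. This is a minimal illustration, assuming round-robin as the Fairness Policy, FCFS as the Ordering Policy, and that a larger number means higher priority; the `bands` and `cursors` data shapes are hypothetical and are not the EPP's internal types.

```python
from collections import deque


def pick_next(bands: dict[int, dict[str, deque]], cursors: dict[int, int]):
    """Return (priority, tenant, request) for the next dispatch, or None.

    bands maps priority -> {tenant: queue of requests in arrival order};
    cursors remembers the round-robin position inside each band.
    """
    # Tier 1, Priority: visit bands from highest priority to lowest.
    for priority in sorted(bands, reverse=True):
        flows = bands[priority]
        tenants = sorted(flows)  # stable order for the rotation
        if not tenants:
            continue
        start = cursors.get(priority, 0)
        # Tier 2, Fairness: round-robin over the tenants in this band.
        for offset in range(len(tenants)):
            idx = (start + offset) % len(tenants)
            queue = flows[tenants[idx]]
            if queue:
                cursors[priority] = idx + 1  # next tenant goes first next time
                # Tier 3, Ordering: FCFS takes the oldest request.
                return priority, tenants[idx], queue.popleft()
    return None


bands = {
    1: {"a": deque(["a1", "a2", "a3"]), "b": deque(["b1"])},
    0: {"c": deque(["c1"])},
}
cursors: dict[int, int] = {}
order = []
while (picked := pick_next(bands, cursors)) is not None:
    order.append(picked[2])
print(order)  # ['a1', 'b1', 'a2', 'a3', 'c1']
```

Tenant `b` is served after only one request from the busier tenant `a`, and the lower-priority tenant `c` waits until the higher band is empty.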
In the background, the EPP monitors the model servers for saturation. If it detects saturation, requests are held in the queues until saturation subsides.
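A minimal sketch of this hold-and-release behavior follows. It assumes a per-server in-flight request count compared against a fixed `capacity` as the saturation signal, which is a stand-in for whatever metrics the EPP actually monitors; the `dispatch_ready` function and its parameters are hypothetical.

```python
from collections import deque
from typing import Callable


def dispatch_ready(
    queue: deque,
    inflight: dict[str, int],
    capacity: int,
    send: Callable[[str, object], None],
) -> int:
    """Dispatch queued requests while at least one server has headroom.

    Returns the number of requests dispatched; the rest stay queued.
    """
    dispatched = 0
    while queue:
        # A server counts as saturated once it reaches capacity (assumption).
        open_servers = [s for s, n in inflight.items() if n < capacity]
        if not open_servers:
            break  # every server is saturated: keep holding the rest
        # Late binding: choose the least-loaded server at dispatch time.
        best = min(open_servers, key=lambda s: inflight[s])
        send(best, queue.popleft())
        inflight[best] += 1
        dispatched += 1
    return dispatched


sent: list[tuple[str, object]] = []
inflight = {"server-1": 5, "server-2": 5, "server-3": 3}
queue = deque(["r1", "r2", "r3", "r4"])

first = dispatch_ready(queue, inflight, 5, lambda s, r: sent.append((s, r)))
inflight["server-1"] -= 1  # a request finishes, so capacity opens
second = dispatch_ready(queue, inflight, 5, lambda s, r: sent.append((s, r)))
```

Requests that cannot be placed remain in the queue instead of being committed to a full server, so the placement decision is made only when capacity is known.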
Trust Boundary: Allowing end users to self-assert their tenant ID or traffic priority (e.g. premium-traffic) is an abuse vector. In production, these headers should be stripped from external requests and injected by a trusted upstream API gateway, identity provider, or Envoy AuthZ filter based on the API key.
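The strip-then-inject pattern can be sketched as follows. This is an illustration of what a trusted upstream component might do, not a real gateway or Envoy configuration: the header names (`x-api-key`, `x-tenant-id`, `x-priority`), the key table, and the `sanitize_headers` function are all hypothetical.

```python
# Hypothetical names for the headers that only a trusted component may set.
TRUSTED_HEADERS = ("x-tenant-id", "x-priority")

# Hypothetical lookup table; a real gateway would query its identity provider.
API_KEYS = {
    "key-free": {"x-tenant-id": "tenant-free", "x-priority": "0"},
    "key-paid": {"x-tenant-id": "tenant-paid", "x-priority": "10"},
}


def sanitize_headers(
    headers: dict[str, str], api_keys: dict[str, dict[str, str]]
) -> dict[str, str]:
    """Drop client-supplied identity headers, then inject trusted values."""
    # Header names are case-insensitive, so normalize before filtering.
    lowered = {k.lower(): v for k, v in headers.items()}
    clean = {k: v for k, v in lowered.items() if k not in TRUSTED_HEADERS}
    identity = api_keys.get(lowered.get("x-api-key", ""))
    if identity is None:
        raise PermissionError("unknown API key")
    clean.update(identity)  # values come from the key, never from the client
    return clean


# A free-tier client trying to claim a high priority is overridden.
spoofed = {"X-API-Key": "key-free", "X-Priority": "100", "accept": "application/json"}
result = sanitize_headers(spoofed, API_KEYS)
```

The client-supplied `X-Priority: 100` is discarded, and the values tied to the API key are what reach the EPP.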
Further Reading
See Flow Control architecture for full details of the design.