EPP gRPC APIs Reference
This document lists the gRPC APIs that the Endpoint Picker (EPP) supports for inference traffic. gRPC requests flow through the gateway as cleartext HTTP/2 (h2c) traffic. The EPP decodes the gRPC frames and protobuf payloads so that it can perform prefix-cache-aware routing, make plugin decisions, and track response usage.
Unlike the HTTP APIs, gRPC parsing is not enabled by default: the matching parser plugin must be configured in the EndpointPickerConfig.
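As background for the frame decoding mentioned above: on the wire, each gRPC message is carried as a length-prefixed frame, made of a 1-byte compression flag, a 4-byte big-endian length, and the serialized protobuf message. The Python sketch below illustrates that framing only. It is not the EPP's implementation, and the function names and the sample payload (a string in protobuf field 1) are assumptions made for the example.

```python
import struct

def encode_frame(message: bytes, compressed: bool = False) -> bytes:
    """Wrap serialized protobuf bytes in gRPC length-prefixed framing."""
    # 1-byte compressed flag, then 4-byte big-endian message length.
    return struct.pack(">BI", int(compressed), len(message)) + message

def decode_frames(data: bytes):
    """Split a buffer into (compressed, message) pairs.

    Returns the complete frames plus any trailing bytes of an incomplete
    frame, which a caller would keep until more data arrives.
    """
    frames = []
    offset = 0
    while len(data) - offset >= 5:
        compressed, length = struct.unpack_from(">BI", data, offset)
        end = offset + 5 + length
        if end > len(data):
            break  # message body not fully received yet
        frames.append((bool(compressed), data[offset + 5:end]))
        offset = end
    return frames, data[offset:]

# Hypothetical payload: protobuf string "req-1" in field 1 (field number assumed).
payload = b"\x0a\x05req-1"
# One full frame followed by the first 3 bytes of another frame's header.
buffer = encode_frame(payload) + encode_frame(b"")[:3]
frames, remainder = decode_frames(buffer)
print(frames)
print(remainder)
```

Returning the undecoded remainder matters because HTTP/2 DATA frames do not have to align with gRPC message boundaries, so a decoder may see a message split across several reads.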
Supported gRPC APIs
| gRPC Method | Source | Parser Plugin | Supported |
|---|---|---|---|
| vllm.grpc.engine.VllmEngine/Generate | vLLM gRPC engine API | vllmgrpc-parser | ✅ |
| vllm.grpc.engine.VllmEngine/Embed | vLLM gRPC engine API | vllmgrpc-parser | ✅ |
The gRPC API is currently token-out only for Generate: responses carry token IDs (chunk.token_ids, complete.output_ids) rather than decoded text, and clients are responsible for detokenization.
Parser Configuration
Parsers are configured via the requestHandler.parsers section of the EndpointPickerConfig. Instantiate the parser plugin in plugins, then reference it by name:
apiVersion: llm-d.ai/v1alpha1
kind: EndpointPickerConfig
plugins:
- name: maxScore
  type: max-score-picker
- name: vllmgrpcParser
  type: vllmgrpc-parser
schedulingProfiles:
# ... omitted for brevity ...
requestHandler:
  parsers:
  - pluginRef: vllmgrpcParser
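Because a parser is referenced by name, a typo in pluginRef leaves the reference pointing at nothing. The Python sketch below shows one way to check this before applying a config. It works on a dict that mirrors the YAML above, and the helper name unresolved_parser_refs is hypothetical. How the EPP itself validates the config is not covered here.

```python
def unresolved_parser_refs(config: dict) -> list:
    """Return pluginRef values under requestHandler.parsers that do not
    match the name of any entry declared in plugins."""
    declared = {plugin["name"] for plugin in config.get("plugins", [])}
    parsers = config.get("requestHandler", {}).get("parsers", [])
    return [p["pluginRef"] for p in parsers if p["pluginRef"] not in declared]

# Dict form of the EndpointPickerConfig shown above (schedulingProfiles omitted).
config = {
    "apiVersion": "llm-d.ai/v1alpha1",
    "kind": "EndpointPickerConfig",
    "plugins": [
        {"name": "maxScore", "type": "max-score-picker"},
        {"name": "vllmgrpcParser", "type": "vllmgrpc-parser"},
    ],
    "requestHandler": {"parsers": [{"pluginRef": "vllmgrpcParser"}]},
}

# Same config with a misspelled reference, to show what the check reports.
broken = dict(config, requestHandler={"parsers": [{"pluginRef": "vllmGrpcParser"}]})

print(unresolved_parser_refs(config))
print(unresolved_parser_refs(broken))
```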
InferencePool Configuration
gRPC requires HTTP/2 end to end. For the gateway to connect to the model server pods with HTTP/2 cleartext (h2c), the InferencePool must set appProtocol: kubernetes.io/h2c.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-grpc-qwen3-32b
spec:
  targetPorts:
  - number: 8000
  appProtocol: kubernetes.io/h2c
  selector:
    matchLabels:
      app: vllm-grpc-qwen3-32b
  endpointPickerRef:
    name: vllm-grpc-qwen3-32b-epp
    port:
      number: 9002
When deploying with the llm-d-router Helm charts, setting router.modelServers.protocol=grpc configures this automatically.
Request Examples
The examples below use grpcurl and address the proxy endpoint as ${IP}, which is set as described in the verification steps of the relevant guide. They require the vllm_engine.proto definition and a model server that exposes the vLLM gRPC engine API.
vLLM VllmEngine/Generate
Request (text input; alternatively pass pre-tokenized input via the tokenized field):
grpcurl -plaintext -proto vllm_engine.proto \
-d '{
"request_id": "req-1",
"text": "Hello",
"sampling_params": {"max_tokens": 10}
}' \
${IP}:80 vllm.grpc.engine.VllmEngine/Generate
Response:
{
"complete": {
"outputIds": [17993, 1894, 7332, 198, 286, 2415, 1140, 259, 4580, 892],
"finishReason": "length",
"promptTokens": 1,
"completionTokens": 10
}
}
Streaming request (set "stream": true; the server returns a stream of GenerateResponse messages with incremental chunk payloads followed by a final complete payload):
grpcurl -plaintext -proto vllm_engine.proto \
-d '{
"request_id": "req-2",
"text": "Hello",
"sampling_params": {"max_tokens": 10},
"stream": true
}' \
${IP}:80 vllm.grpc.engine.VllmEngine/Generate
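Streaming response, one GenerateResponse message per block (the "Response contents:" labels are what grpcurl prints in verbose mode, -v):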
Response contents:
{
"chunk": {
"tokenIds": [
883336980
],
"promptTokens": 10,
"completionTokens": 1
}
}
Response contents:
{
"chunk": {
"tokenIds": [
186949092
],
"promptTokens": 10,
"completionTokens": 1
}
}
Response contents:
{
"chunk": {
"tokenIds": [
446163293
],
"promptTokens": 10,
"completionTokens": 1
}
}
Response contents:
{
"chunk": {
"tokenIds": [
186949092
],
"promptTokens": 10,
"completionTokens": 1
}
}
Response contents:
{
"chunk": {
"tokenIds": [
3509523577
],
"promptTokens": 10,
"completionTokens": 1
}
}
Response contents:
{
"chunk": {
"tokenIds": [
1690122482
],
"promptTokens": 10,
"completionTokens": 1
}
}
Response contents:
{
"complete": {
"finishReason": "stop",
"promptTokens": 10
}
}
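Since Generate returns token IDs only, a client typically accumulates the IDs from each chunk before detokenizing. The Python sketch below does this for JSON output like the streaming response above. It assumes the output was captured without the "Response contents:" labels, and that a complete message carries outputIds only when no chunk messages preceded it (as in the non-streaming example). The function names are hypothetical, and detokenization is left to whichever tokenizer matches the served model.

```python
import json

def parse_json_stream(text: str):
    """Yield each JSON object from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it first.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

def collect_tokens(messages):
    """Accumulate token IDs and the finish reason from GenerateResponse JSON."""
    token_ids = []
    finish_reason = None
    for msg in messages:
        if "chunk" in msg:
            token_ids.extend(msg["chunk"].get("tokenIds", []))
        elif "complete" in msg:
            complete = msg["complete"]
            # Assumption: use outputIds only if no chunks were streamed.
            if not token_ids:
                token_ids = list(complete.get("outputIds", []))
            finish_reason = complete.get("finishReason")
    return token_ids, finish_reason

sample = """
{"chunk": {"tokenIds": [883336980], "promptTokens": 10, "completionTokens": 1}}
{"chunk": {"tokenIds": [186949092], "promptTokens": 10, "completionTokens": 1}}
{"complete": {"finishReason": "stop", "promptTokens": 10}}
"""
ids, reason = collect_tokens(parse_json_stream(sample))
print(ids, reason)
```

The same collect_tokens helper also handles the non-streaming case, where a single complete message carries all output IDs.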
vLLM VllmEngine/Embed
This method requires pre-tokenized input and an embedding model deployment.
Request:
grpcurl -plaintext -proto vllm_engine.proto \
-d '{
"request_id": "req-3",
"tokenized": {"original_text": "Hello", "input_ids": [9906]}
}' \
${IP}:80 vllm.grpc.engine.VllmEngine/Embed
Response (embedding vector truncated for readability):
{
"embedding": [-0.01350, -0.02152, -0.01368, "..."],
"promptTokens": 1,
"embeddingDim": 1024
}
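Because Embed takes pre-tokenized input, the client has to assemble the tokenized field itself. The Python sketch below builds the JSON payload used in the request above from a list of token IDs. The helper name build_embed_request and the empty-input check are assumptions made for the example, and producing the IDs is left to a tokenizer that matches the deployed embedding model.

```python
import json

def build_embed_request(request_id: str, text: str, input_ids) -> str:
    """Return the JSON body for VllmEngine/Embed, suitable for grpcurl -d."""
    ids = [int(i) for i in input_ids]
    if not ids:
        # Local sanity check (an assumption, not a documented server rule).
        raise ValueError("Embed needs at least one token ID")
    return json.dumps({
        "request_id": request_id,
        "tokenized": {"original_text": text, "input_ids": ids},
    })

body = build_embed_request("req-3", "Hello", [9906])
print(body)
```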
HTTP Headers
The EPP HTTP headers (request classification, flow control, and SLO headers such as x-llm-d-inference-objective and x-llm-d-inference-fairness-id) work for gRPC requests exactly as they do for HTTP.
Specify them as gRPC metadata on the call. With grpcurl, use -H:
grpcurl -plaintext -proto vllm_engine.proto \
-H 'x-llm-d-inference-objective: my-objective' \
-H 'x-llm-d-inference-fairness-id: tenant-a' \
-d '{
"request_id": "req-4",
"text": "Hello",
"sampling_params": {"max_tokens": 10}
}' \
${IP}:80 vllm.grpc.engine.VllmEngine/Generate
In a Go client, attach the metadata to the outgoing context:
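// Requires: import "google.golang.org/grpc/metadata"
// client is the generated VllmEngine client and req is the Generate request.
ctx = metadata.AppendToOutgoingContext(ctx,
	"x-llm-d-inference-objective", "my-objective",
	"x-llm-d-inference-fairness-id", "tenant-a")
resp, err := client.Generate(ctx, req)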
In Python, pass metadata on the call:
stub.Generate(request, metadata=(
    ("x-llm-d-inference-objective", "my-objective"),
    ("x-llm-d-inference-fairness-id", "tenant-a"),
))