
EPP HTTP APIs Reference

This document lists the HTTP APIs that the Endpoint Picker (EPP) supports for inference traffic. Depending on the API, the EPP may parse fields from the request body to make prefix-cache-aware routing decisions and other plugin decisions.

Supported HTTP APIs

Endpoint                  Source                        Supported
/v1/completions           OpenAI Completions API        Yes
/v1/chat/completions      OpenAI Chat Completions API   Yes
/v1/responses             OpenAI Responses API          Yes
/v1/embeddings            OpenAI Embeddings API         Yes
/v1/messages              Anthropic Messages API        Yes
/inference/v1/generate    vLLM Generate API             Yes
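
To make "parsing fields from the request body" concrete, here is a minimal Python sketch that maps each supported path to the body field carrying the prompt in the request examples in this document. The mapping and the function name extract_routing_input are hypothetical and only for illustration; this is not the EPP's implementation, and the EPP may read different or additional fields.

```python
import json

# Hypothetical mapping from route to the body field that carries the prompt.
# Derived from the request examples in this document, not from EPP source code.
PROMPT_FIELD = {
    "/v1/completions": "prompt",
    "/v1/chat/completions": "messages",
    "/v1/responses": "input",
    "/v1/embeddings": "input",
    "/v1/messages": "messages",
    "/inference/v1/generate": "token_ids",
}


def extract_routing_input(path, raw_body):
    """Return (model, prompt_value) for a listed path, or None otherwise."""
    field = PROMPT_FIELD.get(path)
    if field is None:
        return None
    body = json.loads(raw_body)
    return body.get("model"), body.get(field)


body = json.dumps({"model": "m", "prompt": "Hello", "max_tokens": 10}).encode()
print(extract_routing_input("/v1/completions", body))
```

A prefix-cache-aware scorer would then hash or tokenize the returned value; that step is left out here because it depends on the plugin configuration.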

Request Examples

The examples below parameterize the model as ${MODEL_NAME} and the proxy endpoint as ${IP}. Set ${MODEL_NAME} to Qwen/Qwen3-VL-32B-Instruct from the multimodal optimized-baseline guide, and set ${IP} to the proxy endpoint IP retrieved per that guide's verification steps.

export MODEL_NAME=Qwen/Qwen3-VL-32B-Instruct

The /v1/embeddings section overrides ${MODEL_NAME} since chat/instruct models do not expose that route.

OpenAI /v1/completions

Request:

curl -X POST http://${IP}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "prompt": "Hello",
    "max_tokens": 10
  }' | jq

Response:

{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "created": 1781036021,
  "model": "Qwen/Qwen3-VL-32B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": "! I am trying to write a story, and",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "system_fingerprint": "vllm-0.21.0-tp2-5054d0df",
  "usage": {
    "prompt_tokens": 1,
    "total_tokens": 11,
    "completion_tokens": 10
  }
}
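
For callers that prefer a script over curl, the same request can be built with Python's standard library. This is a sketch under stated assumptions: IP and MODEL_NAME are read from the environment as in the curl examples, the fallback values are placeholders, and the helper names build_completion_request and send are hypothetical.

```python
import json
import os
import urllib.request

# Assumptions: IP and MODEL_NAME are exported as in the curl examples.
# The fallback values are placeholders for illustration only.
IP = os.environ.get("IP", "127.0.0.1")
MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen3-VL-32B-Instruct")


def build_completion_request(prompt, max_tokens=10):
    """Build (but do not send) the POST request from the curl example."""
    body = {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"http://{IP}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


req = build_completion_request("Hello")
print(req.get_method(), req.full_url)
```

Calling send(req) against a running proxy should return a dictionary shaped like the response above; the generated text is then at result["choices"][0]["text"].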

Streaming request (set stream: true; the response is server-sent events, so drop jq and use curl -N to flush chunks as they arrive):

curl -N -X POST http://${IP}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "prompt": "Hello",
    "max_tokens": 10,
    "stream": true
  }'

OpenAI /v1/chat/completions

Request:

curl -X POST http://${IP}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://picsum.photos/640/360"}}
        ]
      }
    ],
    "max_tokens": 10
  }' | jq

Streaming request:

curl -N -X POST http://${IP}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://picsum.photos/640/360"}}
        ]
      }
    ],
    "max_tokens": 10,
    "stream": true
  }'
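
Streaming response (SSE). The captured chunks below are text_completion objects, which is the format of a streaming /v1/completions response; a /v1/chat/completions stream instead carries chat.completion.chunk objects with the text under choices[0].delta.content:
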
data: {"id":"cmpl-abc124","object":"text_completion","created":1781036045,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"text":"!","logprobs":null,"finish_reason":null,"stop_reason":null}]}

data: {"id":"cmpl-abc124","object":"text_completion","created":1781036045,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"text":" I","logprobs":null,"finish_reason":null,"stop_reason":null}]}

data: {"id":"cmpl-abc124","object":"text_completion","created":1781036045,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"text":"'m","logprobs":null,"finish_reason":null,"stop_reason":null}]}

data: {"id":"cmpl-abc124","object":"text_completion","created":1781036045,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"text":" a","logprobs":null,"finish_reason":null,"stop_reason":null}]}

data: {"id":"cmpl-abc124","object":"text_completion","created":1781036045,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"text":" student","logprobs":null,"finish_reason":null,"stop_reason":null}]}

data: {"id":"cmpl-abc124","object":"text_completion","created":1781036045,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"text":" of","logprobs":null,"finish_reason":null,"stop_reason":null}]}

data: {"id":"cmpl-abc124","object":"text_completion","created":1781036045,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"text":" the","logprobs":null,"finish_reason":null,"stop_reason":null}]}

data: {"id":"cmpl-abc124","object":"text_completion","created":1781036045,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"text":" ","logprobs":null,"finish_reason":null,"stop_reason":null}]}

data: {"id":"cmpl-abc124","object":"text_completion","created":1781036045,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"text":"1","logprobs":null,"finish_reason":null,"stop_reason":null}]}

data: {"id":"cmpl-abc124","object":"text_completion","created":1781036045,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"text":"0","logprobs":null,"finish_reason":"length","stop_reason":null}],"system_fingerprint":"vllm-0.21.0-tp2-5054d0df"}

data: [DONE]
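
Each event in the stream above is a data: line holding one JSON chunk, and the stream ends with data: [DONE]. The following Python sketch reassembles the generated text from text_completion chunks like these. The function name collect_completion_text is hypothetical, and the sample lines are abbreviated copies of the chunks shown above.

```python
import json


def collect_completion_text(sse_lines):
    """Concatenate choices[0].text from 'data:' lines until [DONE]."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        parts.append(chunk["choices"][0]["text"])
    return "".join(parts)


sample = [
    'data: {"object":"text_completion","choices":[{"index":0,"text":"!"}]}',
    "",
    'data: {"object":"text_completion","choices":[{"index":0,"text":" I"}]}',
    "",
    'data: {"object":"text_completion","choices":[{"index":0,"text":"\'m"}]}',
    "",
    "data: [DONE]",
]
print(collect_completion_text(sample))
```

Stripping is applied only to the SSE line, not to the decoded text field, so leading spaces inside tokens such as " I" are preserved.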

OpenAI /v1/responses

Request:

curl -X POST http://${IP}/v1/responses \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "input": "Hello",
    "max_output_tokens": 10
  }' | jq

Response:

{
  "id": "resp_abc127",
  "created_at": 1781036107,
  "incomplete_details": {"reason": "max_output_tokens"},
  "model": "Qwen/Qwen3-VL-32B-Instruct",
  "object": "response",
  "output": [
    {
      "id": "msg_abc128",
      "type": "message",
      "role": "assistant",
      "status": "completed",
      "content": [
        {
          "type": "output_text",
          "text": "Hello! How can I help you today?",
          "annotations": []
        }
      ]
    }
  ],
  "status": "incomplete",
  "max_output_tokens": 10,
  "usage": {
    "input_tokens": 9,
    "output_tokens": 10,
    "total_tokens": 19
  }
}
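
The generated text in a Responses payload is nested under output[].content[], and truncation is signaled by status and incomplete_details rather than by a finish_reason. A small Python sketch for reading both, based only on the fields visible in the response above; the helper names response_text and hit_token_limit are hypothetical.

```python
def response_text(resp):
    """Join every output_text part found in message items of the output."""
    texts = []
    for item in resp.get("output", []):
        if item.get("type") != "message":
            continue  # ignore any non-message output items
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                texts.append(part["text"])
    return "".join(texts)


def hit_token_limit(resp):
    """True when the response stopped because max_output_tokens was reached."""
    details = resp.get("incomplete_details") or {}
    return (
        resp.get("status") == "incomplete"
        and details.get("reason") == "max_output_tokens"
    )


# Trimmed copy of the response shown above.
sample = {
    "incomplete_details": {"reason": "max_output_tokens"},
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [
                {"type": "output_text", "text": "Hello! How can I help you today?"}
            ],
        }
    ],
    "status": "incomplete",
}
print(response_text(sample), hit_token_limit(sample))
```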

OpenAI /v1/embeddings

This endpoint requires an embedding model deployment (for example Qwen/Qwen3-Embedding-0.6B). Chat/instruct models do not expose this route.

Request:

export MODEL_NAME=Qwen/Qwen3-Embedding-0.6B
curl -X POST http://${IP}/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "input": "Hello"
  }' | jq

Response (embedding vector truncated for readability):

{
  "model": "Qwen/Qwen3-Embedding-0.6B",
  "object": "list",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [-0.01350, -0.02152, -0.01368, -0.03032, 0.00941, "..."]
    }
  ],
  "usage": {
    "prompt_tokens": 2,
    "total_tokens": 2,
    "completion_tokens": 0
  }
}
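
Each entry in data carries an index that ties the vector back to its input. The Python sketch below pulls the vectors out in index order and compares two of them with cosine similarity, a common way to consume embeddings. The helper names are hypothetical, and the two-dimensional vectors are made-up toy values, not model output.

```python
import math


def vectors_in_order(resp):
    """Return the embedding vectors sorted by each item's 'index' field."""
    items = sorted(resp["data"], key=lambda d: d["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy payload with made-up vectors, deliberately listed out of order.
sample = {
    "object": "list",
    "data": [
        {"index": 1, "object": "embedding", "embedding": [0.0, 1.0]},
        {"index": 0, "object": "embedding", "embedding": [1.0, 0.0]},
    ],
}
first, second = vectors_in_order(sample)
print(cosine_similarity(first, second))
```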

Anthropic /v1/messages

Request:

# Restore the chat model; the embeddings example above overrode MODEL_NAME.
export MODEL_NAME=Qwen/Qwen3-VL-32B-Instruct
curl -X POST http://${IP}/v1/messages \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image", "source": {"type": "url", "url": "https://picsum.photos/640/360"}}
        ]
      }
    ],
    "max_tokens": 10
  }' | jq

Response:

{
  "id": "chatcmpl-abc125",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "This image is a close-up, shallow-focus photograph"
    }
  ],
  "model": "Qwen/Qwen3-VL-32B-Instruct",
  "stop_reason": "max_tokens",
  "usage": {
    "input_tokens": 234,
    "output_tokens": 10
  }
}
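
The /v1/chat/completions and /v1/messages request examples in this document describe the same image with different content-part shapes: image_url with a nested url in the OpenAI form, and image with a source object in the Anthropic form. A minimal Python sketch of that translation follows, covering only the two part types used in these examples; the function name openai_part_to_anthropic is hypothetical.

```python
def openai_part_to_anthropic(part):
    """Translate one OpenAI chat content part into the Anthropic shape.

    Handles only the 'text' and 'image_url' parts used in this document.
    """
    if part["type"] == "text":
        return {"type": "text", "text": part["text"]}
    if part["type"] == "image_url":
        return {
            "type": "image",
            "source": {"type": "url", "url": part["image_url"]["url"]},
        }
    raise ValueError(f"unsupported content part type: {part['type']!r}")


openai_content = [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://picsum.photos/640/360"}},
]
anthropic_content = [openai_part_to_anthropic(p) for p in openai_content]
print(anthropic_content)
```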

Streaming request:

curl -N -X POST http://${IP}/v1/messages \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image", "source": {"type": "url", "url": "https://picsum.photos/640/360"}}
        ]
      }
    ],
    "max_tokens": 10,
    "stream": true
  }'

Streaming response (SSE):

event: message_start
data: {"type":"message_start","message":{"id":"chatcmpl-abc126","content":[],"model":"Qwen/Qwen3-VL-32B-Instruct","stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":234,"output_tokens":0}}}

event: content_block_start
data: {"type":"content_block_start","content_block":{"type":"text","text":""},"index":0}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"This"},"index":0}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" image"},"index":0}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" captures"},"index":0}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" a"},"index":0}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" serene"},"index":0}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" and"},"index":0}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" atmospheric"},"index":0}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" urban"},"index":0}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" landscape"},"index":0}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" at"},"index":0}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"max_tokens"},"usage":{"input_tokens":234,"output_tokens":10}}

event: message_stop
data: {"type":"message_stop"}
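
Unlike the OpenAI-style stream, each event here pairs an event: line with a data: line, and the text arrives in content_block_delta events. The Python sketch below reassembles the text and the stop reason from such a transcript. The function name parse_message_stream is hypothetical, and the sample is an abbreviated copy of the events shown above.

```python
import json


def parse_message_stream(sse_text):
    """Return (text, stop_reason) from an Anthropic-style SSE transcript."""
    text_parts = []
    stop_reason = None
    # Events are separated by blank lines; each has 'event:' and 'data:' lines.
    for block in sse_text.strip().split("\n\n"):
        event, data = None, None
        for line in block.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data = json.loads(line[len("data:"):].strip())
        if data is None:
            continue
        if event == "content_block_delta" and data["delta"].get("type") == "text_delta":
            text_parts.append(data["delta"]["text"])
        elif event == "message_delta":
            stop_reason = data["delta"].get("stop_reason")
    return "".join(text_parts), stop_reason


sample = "\n".join([
    "event: content_block_delta",
    'data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"This"},"index":0}',
    "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" image"},"index":0}',
    "",
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"max_tokens"}}',
    "",
    "event: message_stop",
    'data: {"type":"message_stop"}',
])
print(parse_message_stream(sample))
```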

vLLM /inference/v1/generate

This endpoint requires the model server to be vLLM. Sampling controls must be nested inside a sampling_params object rather than placed at the top level.

Request:

curl -X POST http://${IP}/inference/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"${MODEL_NAME}"'",
    "token_ids": [9906],
    "sampling_params": {"max_tokens": 10}
  }' | jq

Response:

{
  "request_id": "abc129",
  "choices": [
    {
      "index": 0,
      "logprobs": null,
      "finish_reason": "length",
      "token_ids": [17993, 1894, 7332, 198, 286, 2415, 1140, 259, 4580, 892]
    }
  ]
}
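
Because sampling controls must sit inside sampling_params, a small builder helps avoid placing them at the top level by mistake. This Python sketch mirrors the request body above; the function name build_generate_body is hypothetical, and only max_tokens is taken from this document (any other keyword passed in would be an assumption about what the server accepts).

```python
import json


def build_generate_body(model, token_ids, **sampling):
    """Build a generate request body with sampling controls nested.

    Every keyword argument is placed under 'sampling_params', never at
    the top level of the body.
    """
    return {
        "model": model,
        "token_ids": list(token_ids),
        "sampling_params": dict(sampling),
    }


body = build_generate_body("Qwen/Qwen3-VL-32B-Instruct", [9906], max_tokens=10)
print(json.dumps(body))
```

The request and response both carry token IDs rather than text, so the caller is responsible for tokenizing the prompt and detokenizing the returned token_ids.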