Batch Inference

llm-d supports batch and offline inference workloads through two components that can be deployed independently or together:

  • Batch Gateway — an OpenAI-compatible Batch API (/v1/batches, /v1/files) for submitting, tracking, and managing batch inference jobs. It includes an API server and a processor that dispatches individual inference requests to the llm-d Router. (A submission sketch follows this list.)

  • Async Processor — a lightweight dispatch agent that pulls individual inference requests from message queues (such as Redis or Google Pub/Sub) and sends them to the llm-d Router, throttling dispatch based on system metrics.
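
A batch job is submitted in two steps: upload a JSONL file of requests, then create a batch that references it. The sketch below shows this flow in Python; the gateway address and API key are placeholders, and the request and response shapes follow the OpenAI Batch API that the gateway is compatible with.

```python
# Sketch: submitting a batch job through the OpenAI-compatible Batch API.
# BASE_URL and the API key are hypothetical; the endpoints and payloads
# follow the OpenAI Batch API spec.
import requests

BASE_URL = "http://batch-gateway.example.local"  # placeholder gateway address
HEADERS = {"Authorization": "Bearer sk-placeholder"}

# 1. Upload a JSONL file of requests. Each line is one self-contained
#    request, e.g.:
#    {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions",
#     "body": {"model": "...", "messages": [...]}}
with open("requests.jsonl", "rb") as f:
    upload = requests.post(
        f"{BASE_URL}/v1/files",
        headers=HEADERS,
        files={"file": f},
        data={"purpose": "batch"},
    ).json()

# 2. Create the batch job, pointing at the uploaded file.
batch = requests.post(
    f"{BASE_URL}/v1/batches",
    headers=HEADERS,
    json={
        "input_file_id": upload["id"],
        "endpoint": "/v1/chat/completions",
        "completion_window": "24h",
    },
).json()

# 3. Poll for completion.
status = requests.get(f"{BASE_URL}/v1/batches/{batch['id']}", headers=HEADERS).json()
print(status["status"])  # e.g. "validating", "in_progress", "completed"
```

Because every JSONL line carries a custom_id, results in the batch's output file can be matched back to the requests that produced them.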

When deployed together, the Batch Gateway hands individual requests off to the Async Processor instead of dispatching them to the Router directly, so batch jobs benefit from metric-based flow control.
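
As a rough illustration of that handoff, the sketch below implements the same pull-and-dispatch pattern: consume requests from a queue, check a load signal, and forward each request to the Router. The queue name, router URL, payload shape, and saturation check are all hypothetical stand-ins, not llm-d's actual implementation.

```python
# Sketch of the Async Processor's dispatch pattern: pull requests from a
# queue and forward them to the llm-d Router, backing off when a load
# signal says the serving tier is saturated. All names here are illustrative.
import json
import time

import redis
import requests

ROUTER_URL = "http://llm-d-router.example.local/v1/chat/completions"  # placeholder
QUEUE_KEY = "inference-requests"  # hypothetical queue name

r = redis.Redis()


def router_is_saturated() -> bool:
    """Placeholder for a real metrics check (e.g. router queue depth)."""
    return False


while True:
    if router_is_saturated():
        time.sleep(1.0)  # back off instead of dispatching more work
        continue

    # Block until a request is available, then dispatch it to the Router.
    _, raw = r.blpop(QUEUE_KEY)
    request = json.loads(raw)
    resp = requests.post(ROUTER_URL, json=request["body"], timeout=300)
    print(request["custom_id"], resp.status_code)
```

The point of the pattern is that dispatch rate is decoupled from submission rate: the queue absorbs bursts, and the metrics gate decides when the Router can take more work.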