
Async Processor Architecture

The Async Processor is a lightweight dispatch agent that pulls inference requests from message queues and forwards them to the llm-d Router. It uses dispatch gates to regulate the dispatch rate based on system metrics, so that dispatched workloads do not overload the inference servers.

How It Works

  1. Poll — workers pull requests from one or more message queues.
  2. Gate — before dispatching, each request passes through a dispatch gate that checks whether the system has capacity. If the gate is closed (budget = 0), the request waits.
  3. Dispatch — the worker sends an HTTP request to the llm-d Router with deadline propagation.
  4. Result — on success, results are written back to a queue. On retryable failure (rate limiting, transient errors), the request is re-queued with exponential backoff.
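The four steps above can be sketched as a single worker cycle. This is a minimal illustration, not the Async Processor's actual code: the names `Request`, `RetryableError`, `process_one`, `budget`, and `send` are hypothetical, the queue is an in-memory `deque`, and the backoff delay on re-queue is left out for brevity.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    payload: str
    deadline: float   # absolute time (seconds) by which the request must finish
    attempts: int = 0


class RetryableError(Exception):
    """Hypothetical stand-in for rate limiting or a transient failure."""


def process_one(queue, results, budget, send, now=time.monotonic):
    """Run one poll -> gate -> dispatch -> result cycle and report the outcome."""
    if not queue:
        return "empty"
    req = queue.popleft()                 # 1. Poll
    if budget() <= 0:                     # 2. Gate: closed, so the request waits
        queue.appendleft(req)
        return "waiting"
    remaining = req.deadline - now()
    if remaining <= 0:                    # deadline already passed
        return "expired"
    try:
        # 3. Dispatch: pass the remaining time so the callee can honor the deadline
        response = send(req.payload, timeout=remaining)
    except RetryableError:
        req.attempts += 1
        queue.append(req)                 # 4. Result: re-queue on retryable failure
        return "retried"
    results.append(response)              # 4. Result: write back on success
    return "done"
```

A real worker would call something like `process_one` in a loop and sleep briefly when the outcome is `"empty"` or `"waiting"`.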

Dispatch Gates

The dispatch gate controls the rate at which the processor sends requests. Each queue can have its own gate, allowing independent dispatch control per workload.

| Gate type | Behavior |
| --- | --- |
| constant | Always open (no throttling). |
| redis | Reads a budget value from a Redis key, allowing external systems to control dispatch rate. |
| prometheus-saturation | Queries Prometheus for model server saturation metrics. Dispatches when saturation is below a configurable threshold. |
| prometheus-budget | Computes available capacity from downstream metrics. |
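To make the gate behaviors concrete, here is a minimal sketch of three of them behind one interface. The class names and the `budget()` method are hypothetical and not taken from the Async Processor's code. The metric and key lookups are injected as plain callables so that no Redis or Prometheus client is needed. Treating a missing or malformed budget value as 0 (gate closed) is an assumption of this sketch.

```python
from typing import Callable, Optional, Protocol


class Gate(Protocol):
    def budget(self) -> float:
        """Return how much may be dispatched right now; 0 means closed."""
        ...


class ConstantGate:
    """Always open: every call reports a positive budget."""

    def budget(self) -> float:
        return 1.0


class SaturationGate:
    """Open only while the reported saturation is below a threshold."""

    def __init__(self, read_saturation: Callable[[], float], threshold: float):
        self._read_saturation = read_saturation   # e.g. wraps a Prometheus query
        self._threshold = threshold

    def budget(self) -> float:
        return 1.0 if self._read_saturation() < self._threshold else 0.0


class ExternalBudgetGate:
    """Budget is whatever an external system last wrote (e.g. to a Redis key)."""

    def __init__(self, read_value: Callable[[], Optional[str]]):
        self._read_value = read_value

    def budget(self) -> float:
        try:
            return max(0.0, float(self._read_value()))
        except (TypeError, ValueError):
            return 0.0   # assumption: missing or malformed value closes the gate
```

Because each gate is a separate object, a per-queue configuration reduces to a mapping from queue name to gate instance.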

Message Queue Integrations

| Implementation | Characteristics |
| --- | --- |
| Redis Sorted Set | Persisted, priority-ordered by deadline. Supports per-queue gate configuration. |
| Redis Pub/Sub | Ephemeral, fan-out delivery. Single global gate. |
| GCP Pub/Sub | Cloud-native, scalable. Supports per-subscription gating. |
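The "priority-ordered by deadline" behavior of the sorted-set queue can be illustrated with an in-memory stand-in. This sketch uses a heap keyed by deadline; the `DeadlineQueue` class is hypothetical. With Redis, one plausible mapping (an assumption, not confirmed by this page) is to use the deadline as the sorted-set score, adding with ZADD and taking the earliest entry with ZPOPMIN.

```python
import heapq
import itertools


class DeadlineQueue:
    """In-memory stand-in for a sorted set scored by request deadline."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker: equal deadlines pop in insertion order

    def push(self, payload, deadline):
        heapq.heappush(self._heap, (deadline, next(self._counter), payload))

    def pop(self):
        """Return (payload, deadline) for the earliest deadline, or None if empty."""
        if not self._heap:
            return None
        deadline, _, payload = heapq.heappop(self._heap)
        return payload, deadline

    def __len__(self):
        return len(self._heap)
```

Requests pushed in any order come back earliest-deadline first, so the most urgent work is dispatched before requests with more slack.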

Concurrency and Retries

  • Worker pool — configurable number of concurrent workers (default 8) process requests in parallel.
  • Deadline enforcement — each request carries a deadline from the queue message. Workers abandon requests that cannot complete before their deadline.
  • Exponential backoff — retryable failures are re-queued with backoff (base 2s, max 60s, with jitter). Fatal errors (bad payload, unrecoverable failures) are not retried.
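The backoff and deadline rules above can be sketched as two small functions. The base (2s) and cap (60s) come from the text; the function names are hypothetical, and the jitter strategy shown (a uniform draw between 0 and the capped exponential value, often called "full jitter") is an assumption, since the page does not say which jitter scheme is used.

```python
import random


def backoff_delay(attempt, base=2.0, cap=60.0, rng=random.random):
    """Delay in seconds before retry number `attempt` (1-based).

    The exponential term doubles per attempt and is capped at `cap`;
    the result is scaled by a random factor in [0, 1) as jitter.
    """
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return ceiling * rng()


def can_retry(now, deadline, delay):
    """A retry is only worthwhile if the backoff ends before the deadline."""
    return now + delay < deadline
```

Without jitter, the ceilings for attempts 1 through 6 are 2, 4, 8, 16, 32, and 60 seconds. A request whose next delay would run past its deadline is dropped rather than re-queued.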

Observability

Prometheus metrics include request totals, success/failure counts, retry counts, deadline-exceeded counts, shed request counts, and request latency histograms.