[Experimental] Asynchronous Processing with Async Processor

The Async Processor provides a way to process inference requests asynchronously using a queue-based architecture. This is ideal for latency-insensitive workloads or for filling "slack" capacity in your inference pool.

Overview

Async Processor integrates with llm-d to:

Decouple submission from execution: Clients submit requests to a queue and retrieve results later.
Optimize resource utilization: Fill idle accelerator time with background tasks.
Provide resilience: Retry failed requests automatically without impacting real-time traffic.

Supported Queue Implementations

GCP Pub/Sub: Cloud-native, scalable messaging service.
Redis Sorted Set: High-performance, persisted, and prioritized queue implementation.

Prerequisites

Before installing Async Processor, ensure you have:

Kubernetes cluster: A running Kubernetes cluster (v1.31+).
- For local development, you can use Kind or Minikube.
- For production, GKE, AKS, or OpenShift are supported.
Gateway control plane: Configure and deploy your Gateway control plane (e.g., Istio) before installation.
llm-d Inference Stack: Async Processor requires an existing optimized baseline stack to dispatch requests to.

Installation

Async Processor can be installed via Helm. We recommend following the pattern used in the optimized baseline guide.

Step 1: Deploy llm-d Router

Apply the optimized baseline guide and get the llm-d Router's IP address:

# If using Standalone Mode:
export IP=$(kubectl get service optimized-baseline-epp -n llm-d-optimized-baseline -o jsonpath='{.spec.clusterIP}')

# If using Gateway Mode:
export IP=$(kubectl get gateway llm-d-inference-gateway -n llm-d-optimized-baseline -o jsonpath='{.status.addresses[0].value}')
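
Either lookup above can return an empty string, for example when the service or gateway does not exist yet or the gateway has no address assigned. A minimal sketch of a guard that catches this before the value reaches Helm is shown below. The `require_address` helper is hypothetical and not part of llm-d; it only checks that the value is non-empty and contains no whitespace, since a gateway address may be a hostname rather than an IP.

```shell
# Hypothetical helper (an assumption, not part of llm-d):
# fail fast when the address lookup returned nothing usable.
require_address() {
    local addr="$1"

    # An empty value usually means the resource or its address does not exist yet.
    if [ -z "${addr}" ]; then
        echo "IP is empty: check the service or gateway name and namespace" >&2
        return 1
    fi

    # Whitespace would break the base URL built from this value later.
    case "${addr}" in
        *[[:space:]]*)
            echo "IP contains whitespace: '${addr}'" >&2
            return 1
            ;;
    esac

    echo "Using router address: ${addr}"
}
```

After exporting `IP`, running `require_address "${IP}"` prints the address on success and returns a non-zero status otherwise, so it can gate the later steps in a script.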

Step 2: Configure Values

Choose your queue implementation (GCP Pub/Sub or Redis) and configure the corresponding values.yaml file:

guides/asynchronous-processing/gcp-pubsub/values.yaml
guides/asynchronous-processing/redis/values.yaml

Step 3: Deploy Async Processor

Deploy the Async Processor using the selected queue implementation's configuration:

export NAMESPACE=llm-d-async
export MQ_PROVIDER=gcp-pubsub # options are gcp-pubsub or redis
export ASYNC_VERSION=0.6.1

helm install async-processor \
    oci://ghcr.io/llm-d-incubation/charts/async-processor \
    -f guides/asynchronous-processing/${MQ_PROVIDER}/values.yaml \
    --set ap.igwBaseURL=http://${IP}:80 \
    -n ${NAMESPACE} --create-namespace --version ${ASYNC_VERSION}
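
A typo in `MQ_PROVIDER` or running from the wrong directory makes the `-f` path above point at a file that does not exist. The sketch below checks both before the install runs. The `preflight_check` function is a hypothetical helper, not part of the chart, and it assumes the command is run from the directory that contains `guides/`.

```shell
# Hypothetical pre-flight check (an assumption, not part of the chart):
# validate the provider name and confirm its values file is present.
preflight_check() {
    local provider="$1"
    local values_file="guides/asynchronous-processing/${provider}/values.yaml"

    # Only the two queue implementations named in this guide are accepted.
    case "${provider}" in
        gcp-pubsub|redis) ;;
        *)
            echo "Unsupported MQ_PROVIDER '${provider}': expected gcp-pubsub or redis" >&2
            return 1
            ;;
    esac

    # The path is relative, so this fails when run outside the expected directory.
    if [ ! -f "${values_file}" ]; then
        echo "Values file not found: ${values_file}" >&2
        return 1
    fi

    echo "OK: ${values_file}"
}
```

Running `preflight_check "${MQ_PROVIDER}"` before `helm install` surfaces these mistakes early. Once the install completes, `helm status async-processor -n ${NAMESPACE}` and `kubectl get pods -n ${NAMESPACE}` show whether the release deployed and whether its pods are running.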

Testing

Testing instructions vary by queue implementation. Refer to the guide for your chosen implementation (GCP Pub/Sub or Redis) for detailed testing steps.

Cleanup

helm uninstall async-processor -n ${NAMESPACE}