[Experimental] Asynchronous Processing with Async Processor
The Async Processor provides a way to process inference requests asynchronously using a queue-based architecture. This is ideal for latency-insensitive workloads or for filling "slack" capacity in your inference pool.
Overview
Async Processor integrates with llm-d to:
- Decouple submission from execution: Clients submit requests to a queue and retrieve results later.
- Optimize resource utilization: Fill idle accelerator time with background tasks.
- Provide resilience: Retry failed requests automatically without impacting real-time traffic.
Supported Queue Implementations
- GCP Pub/Sub: Cloud-native, scalable messaging service.
- Redis Sorted Set: High-performance, persisted, and prioritized queue implementation.
Prerequisites
Before installing Async Processor, ensure you have:
- Kubernetes cluster: A running Kubernetes cluster (v1.31+).
  - For local development, you can use Kind or Minikube.
  - For production, GKE, AKS, or OpenShift are supported.
- Gateway control plane: Configure and deploy your Gateway control plane (e.g., Istio) before installation.
- llm-d Inference Stack: Async Processor requires an existing optimized baseline stack to dispatch requests to.
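The version requirement above can be checked before installing. The following is a minimal sketch, not part of the Async Processor tooling: `k8s_version_ok` is a hypothetical helper name, and it assumes you pass in the major and minor version strings reported by your cluster.

```shell
# Hypothetical helper: succeeds when the given server version is v1.31 or newer.
# Some providers report the minor version with a suffix such as "31+",
# so non-digit characters are stripped before comparing.
k8s_version_ok() {
  k8s_major=$(printf '%s' "$1" | tr -cd '0-9')
  k8s_minor=$(printf '%s' "$2" | tr -cd '0-9')
  # Reject empty input rather than comparing against nothing.
  [ -n "$k8s_major" ] && [ -n "$k8s_minor" ] || return 1
  [ "$k8s_major" -gt 1 ] || { [ "$k8s_major" -eq 1 ] && [ "$k8s_minor" -ge 31 ]; }
}

# Example with literal values; in practice pass the values from your cluster.
k8s_version_ok 1 31 && echo "Kubernetes version is supported"
```

One way to obtain the two values, assuming `jq` is installed, is to read `.serverVersion.major` and `.serverVersion.minor` from the output of `kubectl version -o json`.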
Installation
Async Processor can be installed via Helm. We recommend following the pattern used in the optimized baseline guide.
Step 1: Deploy llm-d Router
Apply the optimized baseline guide and get the llm-d Router's IP address:
# If using Standalone Mode:
export IP=$(kubectl get service optimized-baseline-epp -n llm-d-optimized-baseline -o jsonpath='{.spec.clusterIP}')
# If using Gateway Mode:
export IP=$(kubectl get gateway llm-d-inference-gateway -n llm-d-optimized-baseline -o jsonpath='{.status.addresses[0].value}')
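Either lookup can return an empty string, for example when the resource name or namespace differs from the ones shown, or when the Gateway has not yet been assigned an address. A small guard helps catch this before the value is passed to Helm. This is an illustrative sketch; `check_router_ip` is a hypothetical helper, not part of llm-d.

```shell
# Hypothetical helper: fail early when the lookup returned nothing.
check_router_ip() {
  if [ -z "$1" ]; then
    echo "Router address is empty; check the resource name and namespace." >&2
    return 1
  fi
  echo "Using llm-d Router address: $1"
}

# Example with a placeholder address; in practice pass "${IP}".
check_router_ip "10.0.0.1"
```

In a real session, run `check_router_ip "${IP}"` right after the `export` and only continue to the next step if it succeeds.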
Step 2: Configure Values
Choose your queue implementation (GCP Pub/Sub or Redis) and configure the corresponding values.yaml file:
- guides/asynchronous-processing/gcp-pubsub/values.yaml
- guides/asynchronous-processing/redis/values.yaml
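Both files follow the pattern guides/asynchronous-processing/<provider>/values.yaml, so a typo in the provider name results in a path that does not exist. The sketch below maps a provider name to its values file and rejects anything other than the two supported names. `values_file_for` is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper: map a queue provider name to its values.yaml path.
values_file_for() {
  case "$1" in
    gcp-pubsub|redis)
      echo "guides/asynchronous-processing/$1/values.yaml"
      ;;
    *)
      echo "Unknown queue provider: '$1' (expected gcp-pubsub or redis)" >&2
      return 1
      ;;
  esac
}

# Prints the path of the Redis values file.
values_file_for redis
```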
Step 3: Deploy Async Processor
Deploy the Async Processor using the selected queue implementation's configuration:
export NAMESPACE=llm-d-async
export MQ_PROVIDER=gcp-pubsub # options are gcp-pubsub or redis
export ASYNC_VERSION=0.6.1
helm install async-processor \
oci://ghcr.io/llm-d-incubation/charts/async-processor \
-f guides/asynchronous-processing/${MQ_PROVIDER}/values.yaml \
--set ap.igwBaseURL=http://${IP}:80 \
-n ${NAMESPACE} --create-namespace --version ${ASYNC_VERSION}
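After the install returns, it can be useful to confirm that the release exists and its pods become ready. The following sketch wraps standard `helm` and `kubectl` commands in a hypothetical `verify_async_processor` function. It assumes both tools are on your PATH and point at the target cluster, and that every pod in the namespace is expected to reach the Ready condition. The 120-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: report the Helm release status and wait for the pods.
# Takes the namespace as its first argument (defaults to llm-d-async).
verify_async_processor() {
  verify_ns="${1:-llm-d-async}"
  # Fails if the release is not installed in this namespace.
  helm status async-processor -n "$verify_ns" || return 1
  # Blocks until all pods in the namespace are Ready, or the timeout expires.
  kubectl wait --for=condition=Ready pods --all -n "$verify_ns" --timeout=120s || return 1
  kubectl get pods -n "$verify_ns"
}
```

Call it as `verify_async_processor "${NAMESPACE}"` once the `helm install` command has completed.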
Testing
Testing instructions vary with the chosen queue implementation. Refer to the guide for your implementation (GCP Pub/Sub or Redis) for detailed testing steps.
Cleanup
helm uninstall async-processor -n ${NAMESPACE}