# Experimental Feature: Asynchronous Processing with Async Processor
The Async Processor provides a way to process inference requests asynchronously using a queue-based architecture. This is ideal for latency-insensitive workloads or for filling "slack" capacity in your inference pool.
## Overview
Async Processor integrates with llm-d to:
- Decouple submission from execution: Clients submit requests to a queue and retrieve results later.
- Optimize resource utilization: Fill idle accelerator time with background tasks.
- Provide resilience: Automatic retries for failed requests without impacting real-time traffic.
## Supported Queue Implementations
- GCP Pub/Sub: Cloud-native, scalable messaging service.
- Redis Sorted Set: High-performance, persisted, and prioritized queue implementation.
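The Redis option stores pending requests in a sorted set, where each member carries a numeric score and the lowest score is popped first; that score ordering is what makes prioritization possible. The semantics can be sketched in plain shell (the queue file, scores, and request IDs below are illustrative only, not the processor's actual schema):

```shell
#!/bin/sh
# Minimal sketch of a score-ordered queue, mirroring how a Redis sorted
# set (ZADD score member / ZPOPMIN) prioritizes requests. All names here
# are hypothetical.
QUEUE_FILE=$(mktemp)

enqueue() {  # enqueue <score> <request-id>
  printf '%s %s\n' "$1" "$2" >> "$QUEUE_FILE"
}

dequeue() {  # pop the lowest-score (highest-priority) entry, like ZPOPMIN
  sort -n -k1,1 "$QUEUE_FILE" | head -n 1
  sort -n -k1,1 "$QUEUE_FILE" | tail -n +2 > "${QUEUE_FILE}.tmp" \
    && mv "${QUEUE_FILE}.tmp" "$QUEUE_FILE"
}

enqueue 5 req-batch
enqueue 1 req-urgent
enqueue 3 req-normal
dequeue   # prints "1 req-urgent"
```

Lower score wins regardless of insertion order, so urgent work submitted late still jumps the queue; in Redis the same effect comes from `ZADD` and `ZPOPMIN` on a single key.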
## Prerequisites
Before installing Async Processor, ensure you have:
- Kubernetes cluster: A running Kubernetes cluster (v1.31+).
  - For local development, you can use Kind or Minikube.
  - For production, GKE, AKS, or OpenShift are supported.
- Gateway control plane: Configure and deploy your Gateway control plane (e.g., Istio) before installation.
- llm-d Inference Stack: Async Processor requires an existing Intelligent Inference Scheduling stack to dispatch requests to.
## Installation
Async Processor can be installed via Helm. We provide a helmfile for easy deployment.
### Step 1: Configure Inference Gateway URL
The Async Processor needs to know where to send the requests it pulls from the queue. This is configured via the `IGW_BASE_URL` environment variable.

By default, it is set to `http://infra-inference-scheduling-inference-gateway-istio.llm-d-inference-scheduler.svc.cluster.local:80`, which assumes you have deployed the Intelligent Inference Scheduling stack in the `llm-d-inference-scheduler` namespace.
If your Inference Gateway is deployed elsewhere, or if you are using a different service name (e.g., based on the Gateway Provider guide), export the variable before running helmfile:

```bash
export IGW_BASE_URL="<your-inference-gateway-service-url>"
```
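For example, if the Intelligent Inference Scheduling stack lives in a namespace other than `llm-d-inference-scheduler` (the namespace below is hypothetical), the in-cluster URL can be assembled from the service name and namespace:

```shell
# Hypothetical namespace; substitute your own deployment's namespace.
export GATEWAY_NAMESPACE=my-llm-d

# Same default service name, re-pointed at the custom namespace using the
# standard Kubernetes service DNS form: <service>.<namespace>.svc.cluster.local.
export IGW_BASE_URL="http://infra-inference-scheduling-inference-gateway-istio.${GATEWAY_NAMESPACE}.svc.cluster.local:80"

echo "$IGW_BASE_URL"
```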
### Step 2: Choose your Queue Implementation
Decide whether you want to use GCP Pub/Sub or Redis, then follow the setup instructions in the corresponding implementation subdirectory.
### Step 3: Configure Async Processor Values

Edit the `values.yaml` in the chosen implementation folder to match your environment.
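A typical edit points the processor at your queue backend. The keys below are illustrative only; the `values.yaml` shipped in the implementation folder is the authoritative reference for the actual schema:

```yaml
# Illustrative sketch only -- key names vary by implementation; check the
# values.yaml in the chosen implementation folder before editing.
queue:
  # e.g., the Redis endpoint backing the sorted-set queue (hypothetical host)
  address: redis.my-redis-ns.svc.cluster.local:6379
```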
### Step 4: Deploy

```bash
export NAMESPACE=llm-d-async
cd guides/asynchronous-processing
helmfile apply -n ${NAMESPACE}
```
## Testing

Testing instructions vary depending on the chosen queue implementation; refer to the specific implementation guide for detailed testing steps.
## Cleanup

```bash
cd guides/asynchronous-processing
helmfile destroy -n ${NAMESPACE}
```
This content is automatically synced from guides/asynchronous-processing/README.md on the main branch of the llm-d/llm-d repository.
📝 To suggest changes, please edit the source file or create an issue.