# Experimental Feature: Asynchronous Processing with Async Processor
The Async Processor provides a way to process inference requests asynchronously using a queue-based architecture. This is ideal for latency-insensitive workloads or for filling "slack" capacity in your inference pool.
## Overview
Async Processor integrates with llm-d to:
- Decouple submission from execution: Clients submit requests to a queue and retrieve results later.
- Optimize resource utilization: Fill idle accelerator time with background tasks.
- Provide resilience: Automatic retries for failed requests without impacting real-time traffic.
## Supported Queue Implementations
- GCP Pub/Sub: Cloud-native, scalable messaging service.
- Redis Sorted Set: High-performance, persisted, and prioritized queue implementation.
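The Redis option stores pending requests in a sorted set, where each member carries a numeric score and the lowest score is popped first; that score ordering is what makes prioritization possible. The semantics can be sketched in plain shell (the queue file, scores, and request IDs below are illustrative only, not the processor's actual schema):

```shell
#!/bin/sh
# Minimal sketch of a score-ordered queue, mirroring how a Redis sorted
# set (ZADD score member / ZPOPMIN) prioritizes requests. All names here
# are hypothetical.
QUEUE_FILE=$(mktemp)

enqueue() {  # enqueue <score> <request-id>
  printf '%s %s\n' "$1" "$2" >> "$QUEUE_FILE"
}

dequeue() {  # pop the lowest-score (highest-priority) entry, like ZPOPMIN
  sort -n -k1,1 "$QUEUE_FILE" | head -n 1
  sort -n -k1,1 "$QUEUE_FILE" | tail -n +2 > "${QUEUE_FILE}.tmp" \
    && mv "${QUEUE_FILE}.tmp" "$QUEUE_FILE"
}

enqueue 5 req-batch
enqueue 1 req-urgent
enqueue 3 req-normal
dequeue   # prints "1 req-urgent"
```

Lower score wins regardless of insertion order, so urgent work submitted late still jumps the queue; in Redis the same effect comes from `ZADD` and `ZPOPMIN` on a single key.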
## Prerequisites
Before installing Async Processor, ensure you have:
- Kubernetes cluster: A running Kubernetes cluster (v1.31+).
  - For local development, you can use Kind or Minikube.
  - For production, GKE, AKS, or OpenShift are supported.
- Gateway control plane: Configure and deploy your Gateway control plane (e.g., Istio) before installation.
- llm-d Inference Stack: Async Processor requires an existing Intelligent Inference Scheduling stack to dispatch requests to.
## Installation
Async Processor can be installed via Helm. We provide a helmfile for easy deployment.
### Step 1: Configure Inference Gateway URL
The Async Processor needs to know where to send the requests it pulls from the queue. This is configured via the `IGW_BASE_URL` environment variable.

By default, it is set to `http://infra-inference-scheduling-inference-gateway-istio.llm-d-inference-scheduler.svc.cluster.local:80`, which assumes you have deployed the Intelligent Inference Scheduling stack in the `llm-d-inference-scheduler` namespace.
If your Inference Gateway is deployed elsewhere, or if you are using a different service name (e.g., based on the Gateway Provider guide), export the variable before running helmfile:

```bash
export IGW_BASE_URL="<your-inference-gateway-service-url>"
```
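For example, if the Intelligent Inference Scheduling stack lives in a namespace other than `llm-d-inference-scheduler` (the namespace below is hypothetical), the in-cluster URL can be assembled from the service name and namespace:

```shell
# Hypothetical namespace; substitute your own deployment's namespace.
export GATEWAY_NAMESPACE=my-llm-d

# Same default service name, re-pointed at the custom namespace using the
# standard Kubernetes service DNS form: <service>.<namespace>.svc.cluster.local.
export IGW_BASE_URL="http://infra-inference-scheduling-inference-gateway-istio.${GATEWAY_NAMESPACE}.svc.cluster.local:80"

echo "$IGW_BASE_URL"
```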
### Step 2: Choose your Queue Implementation
Decide whether you want to use GCP Pub/Sub or Redis, then follow the setup instructions in the corresponding implementation subdirectory.
### Step 3: Configure Async Processor Values

Edit the `values.yaml` in the chosen implementation folder to match your environment.
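A typical edit points the processor at your queue backend. The keys below are illustrative only; the `values.yaml` shipped in the implementation folder is the authoritative reference for the actual schema:

```yaml
# Illustrative sketch only -- key names vary by implementation; check the
# values.yaml in the chosen implementation folder before editing.
queue:
  # e.g., the Redis endpoint backing the sorted-set queue (hypothetical host)
  address: redis.my-redis-ns.svc.cluster.local:6379
```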
### Step 4: Deploy

```bash
export NAMESPACE=llm-d-async
cd guides/asynchronous-processing
helmfile apply -n ${NAMESPACE}
```
## Testing

Testing instructions vary depending on the chosen queue implementation; refer to the specific implementation guide for detailed testing steps.
## Cleanup

```bash
cd guides/asynchronous-processing
helmfile destroy -n ${NAMESPACE}
```
This content is automatically synced from guides/asynchronous-processing/README.md on the main branch of the llm-d/llm-d repository.
📝 To suggest changes, please edit the source file or create an issue.