vLLM Simulator
To help with development and testing, we have developed a lightweight vLLM simulator. It does not truly run inference, but it emulates responses to vLLM's HTTP REST endpoints. Currently it supports a partial OpenAI-compatible API (see the example request after the list):
- /v1/chat/completions
- /v1/completions
- /v1/models
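For example, assuming the simulator is running locally on port 8000 and was started with the model Qwen/Qwen2.5-1.5B-Instruct (both values are illustrative, matching the run examples later in this document), a chat completion can be requested as follows:

```bash
# Send a non-streaming chat completion request to a locally running
# simulator (address and model name are illustrative).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
          {"role": "user", "content": "Hello, simulator!"}
        ]
      }'
```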
In addition, it supports a subset of vLLM's Prometheus metrics, exposed via the /metrics HTTP REST endpoint (see the example query after the list). Currently the following metrics are supported:
- vllm:lora_requests_info
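The metrics can be inspected with a plain HTTP GET. The sketch below (assuming the simulator listens on localhost:8000) filters the output for the supported LoRA metric:

```bash
# Fetch the Prometheus metrics exposed by the simulator and keep only
# the vllm:lora_requests_info lines (exact labels follow vLLM's metric
# definition).
curl -s http://localhost:8000/metrics | grep "vllm:lora_requests_info"
```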
The simulated inference has no connection to the model and LoRA adapters specified in the command line parameters. The /v1/models endpoint returns simulated results based on those same command line parameters.
The simulator supports two modes of operation (an echo-mode example follows the list):
- echo mode: the response contains the same text that was received in the request. For /v1/chat/completions the last message with role=user is used.
- random mode: the response is randomly chosen from a set of pre-defined sentences.
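As a quick illustration of echo mode, the simulator can be started with the mode parameter set to echo and queried via the text completions endpoint; the returned text should match the prompt. The binary path, port, model name, and flag spellings below are assumptions based on the build and run instructions later in this document:

```bash
# Start the simulator in echo mode (binary path and flags are assumed
# from the build/run instructions below).
./bin/llm-d-inference-sim --model my_model --port 8000 --mode echo &

# The completion text should echo the prompt that was sent.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "prompt": "echo this text back"}'
```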
Timing of the response is defined by two parameters: time-to-first-token and inter-token-latency.
For a request with stream=true: time-to-first-token defines the delay before the first token is returned, and inter-token-latency defines the delay between subsequent tokens in the stream.
For a request with stream=false: the response is returned after a delay of <time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1)).
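For example, with time-to-first-token set to 200 ms and inter-token-latency set to 50 ms, a non-streaming response with 10 output tokens is returned after roughly 200 + 50 * (10 - 1) = 650 ms. The sketch below (binary path, port, model name, and flag values are illustrative) uses a streaming request to observe the delays directly:

```bash
# Start the simulator with artificial delays: 200 ms to the first token,
# 50 ms per additional token (values are illustrative).
./bin/llm-d-inference-sim --model my_model --port 8000 \
  --time-to-first-token 200 --inter-token-latency 50 &

# With stream=true the first chunk should arrive after ~200 ms and the
# following chunks should be spaced ~50 ms apart.
curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "stream": true,
       "messages": [{"role": "user", "content": "Hi"}]}'
```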
The simulator can be run standalone or in a Pod for testing under tools such as Kind.
Limitations
API responses contain a subset of the fields provided by the OpenAI API.
The structure of the supported requests and responses is shown below:
/v1/chat/completions
- request
    - stream
    - model
    - messages
        - role
        - content
- response
    - id
    - created
    - model
    - choices
        - index
        - finish_reason
        - message
/v1/completions
- request
    - stream
    - model
    - prompt
    - max_tokens (for future usage)
- response
    - id
    - created
    - model
    - choices
        - text
/v1/models
- response
    - object (list)
    - data
        - id
        - object (model)
        - created
        - owned_by
        - root
        - parent
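As an illustration, the model list can be retrieved and the returned identifiers extracted with jq (the local address and the jq dependency are assumptions of this sketch):

```bash
# List the 'loaded' model and LoRA adapters reported by the simulator;
# the response fields follow the structure listed above.
curl -s http://localhost:8000/v1/models | jq '.data[].id'
```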
For more details see the vLLM documentation
Command line parameters
- port: the port the simulator listens on, mandatory
- model: the currently 'loaded' model, mandatory
- lora: a list of available LoRA adapters, separated by commas, optional, by default empty
- mode: the simulator mode, optional, by default random
    - echo: returns the same text that was sent in the request
    - random: returns a sentence chosen at random from a set of pre-defined sentences
- time-to-first-token: the time to the first token (in milliseconds), optional, by default zero
- inter-token-latency: the time to 'generate' each additional token (in milliseconds), optional, by default zero
- max-loras: maximum number of LoRAs in a single batch, optional, default is one
- max-cpu-loras: maximum number of LoRAs to store in CPU memory, optional, must be >= max-loras, default is max-loras
- max-running-requests: maximum number of inference requests that can be processed at the same time
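Putting several of these parameters together, a possible invocation could look like the sketch below (all values are illustrative; the model and LoRA names are taken from the Docker example later in this document):

```bash
# Run the simulator with two LoRA adapters, random responses, and
# artificial latencies; all values are illustrative.
./bin/llm-d-inference-sim \
  --port 8000 \
  --model "Qwen/Qwen2.5-1.5B-Instruct" \
  --lora "tweet-summary-0,tweet-summary-1" \
  --mode random \
  --time-to-first-token 100 \
  --inter-token-latency 20 \
  --max-loras 2 \
  --max-running-requests 16
```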
Working with the Docker image
Building
To build a Docker image of the vLLM Simulator, run:
make build-llm-d-inference-sim-image
Running
To run the vLLM Simulator image under Docker, run:
docker run --rm --publish 8000:8000 ai-aware-router/llm-d-inference-sim:0.0.1 /ai-aware-router/llm-d-inference-sim --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1"
Note: The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.
Standalone testing
Building
To build the vLLM simulator, run:
make build-llm-d-inference-sim
Running
To run the vLLM simulator in a standalone test environment, run:
./bin/llm-d-inference-sim --model my_model --port 8000