vLLM Simulator
To help with development and testing, we have developed a lightweight vLLM simulator. It does not truly run inference, but it emulates responses to vLLM's HTTP REST endpoints. Currently it supports a partial OpenAI-compatible API (see the example request after the list):
- /v1/chat/completions
- /v1/completions
- /v1/models
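For example, assuming the simulator is running locally on port 8000 and was started with the model Qwen/Qwen2.5-1.5B-Instruct (both values are illustrative, matching the run examples later in this document), a chat completion can be requested as follows:

```bash
# Send a non-streaming chat completion request to a locally running
# simulator (address and model name are illustrative).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
          {"role": "user", "content": "Hello, simulator!"}
        ]
      }'
```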
In addition, it supports a subset of vLLM's Prometheus metrics, exposed via the /metrics HTTP REST endpoint (see the example query after the list). Currently the following metrics are supported:
- vllm:lora_requests_info
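The metrics can be inspected with a plain HTTP GET. The sketch below (assuming the simulator listens on localhost:8000) filters the output for the supported LoRA metric:

```bash
# Fetch the Prometheus metrics exposed by the simulator and keep only
# the vllm:lora_requests_info lines (exact labels follow vLLM's metric
# definition).
curl -s http://localhost:8000/metrics | grep "vllm:lora_requests_info"
```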
The simulated inference has no connection to the model and LoRA adapters specified in the command line parameters. The /v1/models endpoint returns simulated results based on those same command line parameters.
The simulator supports two modes of operation (an echo-mode example follows the list):
- echo mode: the response contains the same text that was received in the request. For /v1/chat/completions the last message with role=user is used.
- random mode: the response is randomly chosen from a set of pre-defined sentences.
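As a quick illustration of echo mode, the simulator can be started with the mode parameter set to echo and queried via the text completions endpoint; the returned text should match the prompt. The binary path, port, model name, and flag spellings below are assumptions based on the build and run instructions later in this document:

```bash
# Start the simulator in echo mode (binary path and flags are assumed
# from the build/run instructions below).
./bin/llm-d-inference-sim --model my_model --port 8000 --mode echo &

# The completion text should echo the prompt that was sent.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "prompt": "echo this text back"}'
```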
Timing of the response is defined by two parameters: time-to-first-token and inter-token-latency.
For a request with stream=true: time-to-first-token defines the delay before the first token is returned, and inter-token-latency defines the delay between subsequent tokens in the stream.
For a request with stream=false: the response is returned after a delay of <time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1)).
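For example, with time-to-first-token set to 200 ms and inter-token-latency set to 50 ms, a non-streaming response with 10 output tokens is returned after roughly 200 + 50 * (10 - 1) = 650 ms. The sketch below (binary path, port, model name, and flag values are illustrative) uses a streaming request to observe the delays directly:

```bash
# Start the simulator with artificial delays: 200 ms to the first token,
# 50 ms per additional token (values are illustrative).
./bin/llm-d-inference-sim --model my_model --port 8000 \
  --time-to-first-token 200 --inter-token-latency 50 &

# With stream=true the first chunk should arrive after ~200 ms and the
# following chunks should be spaced ~50 ms apart.
curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "stream": true,
       "messages": [{"role": "user", "content": "Hi"}]}'
```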
The simulator can be run standalone or in a Pod for testing under tools such as Kind.
Limitations
API responses contain a subset of the fields provided by the OpenAI API.
The structure of the supported requests and responses is shown below:
/v1/chat/completions
- request
    - stream
    - model
    - messages
        - role
        - content
- response
    - id
    - created
    - model
    - choices
        - index
        - finish_reason
        - message
/v1/completions
- request
    - stream
    - model
    - prompt
    - max_tokens (for future usage)
- response
    - id
    - created
    - model
    - choices
        - text
/v1/models
- response
    - object (list)
    - data
        - id
        - object (model)
        - created
        - owned_by
        - root
        - parent
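As an illustration, the model list can be retrieved and the returned identifiers extracted with jq (the local address and the jq dependency are assumptions of this sketch):

```bash
# List the 'loaded' model and LoRA adapters reported by the simulator;
# the response fields follow the structure listed above.
curl -s http://localhost:8000/v1/models | jq '.data[].id'
```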
For more details see the vLLM documentation
Command line parameters
- port: the port the simulator listens on, mandatory
- model: the currently 'loaded' model, mandatory
- lora: a list of available LoRA adapters, separated by commas, optional, by default empty
- mode: the simulator mode, optional, by default random
    - echo: returns the same text that was sent in the request
    - random: returns a sentence chosen at random from a set of pre-defined sentences
- time-to-first-token: the time to the first token (in milliseconds), optional, by default zero
- inter-token-latency: the time to 'generate' each additional token (in milliseconds), optional, by default zero
- max-loras: maximum number of LoRAs in a single batch, optional, default is one
- max-cpu-loras: maximum number of LoRAs to store in CPU memory, optional, must be >= max-loras, default is max-loras
- max-running-requests: maximum number of inference requests that can be processed at the same time
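Putting several of these parameters together, a possible invocation could look like the sketch below (all values are illustrative; the model and LoRA names are taken from the Docker example later in this document):

```bash
# Run the simulator with two LoRA adapters, random responses, and
# artificial latencies; all values are illustrative.
./bin/llm-d-inference-sim \
  --port 8000 \
  --model "Qwen/Qwen2.5-1.5B-Instruct" \
  --lora "tweet-summary-0,tweet-summary-1" \
  --mode random \
  --time-to-first-token 100 \
  --inter-token-latency 20 \
  --max-loras 2 \
  --max-running-requests 16
```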
Working with the Docker image
Building
To build a Docker image of the vLLM Simulator, run:
make build-llm-d-inference-sim-image
Running
To run the vLLM Simulator image under Docker, run:
docker run --rm --publish 8000:8000 ai-aware-router/llm-d-inference-sim:0.0.1 /ai-aware-router/llm-d-inference-sim --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1"
Note: The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.
Standalone testing
Building
To build the vLLM simulator, run:
make build-llm-d-inference-sim
Running
To run the vLLM simulator in a standalone test environment, run:
./bin/llm-d-inference-sim --model my_model --port 8000