vLLM Simulator

To help with development and testing, we have developed a lightweight vLLM simulator. It does not actually run inference, but it emulates responses to vLLM's HTTP REST endpoints. It currently supports a subset of the OpenAI-compatible API:

  • /v1/chat/completions
  • /v1/completions
  • /v1/models

In addition, it supports a subset of vLLM's Prometheus metrics, which are exposed via the /metrics HTTP REST endpoint. The following metrics are currently supported:

  • vllm:lora_requests_info
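
Assuming the simulator is listening on port 8000 (as in the examples below), these metrics can be inspected with any HTTP client, for example:

curl http://localhost:8000/metrics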

The simulated inference has no connection to the model and LoRA adapters specified in the command line parameters. The /v1/models endpoint returns simulated results based on those same command line parameters.

The simulator supports two modes of operation:

  • echo mode: the response contains the same text that was received in the request. For /v1/chat/completions, the content of the last message with role=user is used, as in the example below.
  • random mode: the response is randomly chosen from a set of pre-defined sentences.
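
For example, in echo mode a /v1/chat/completions request whose last role=user message has the content "Hello" produces a response whose message content is also "Hello".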

Timing of the response is defined by two parameters: time-to-first-token and inter-token-latency.

For a request with stream=true: time-to-first-token defines the delay before the first token is returned, and inter-token-latency defines the delay between subsequent tokens in the stream.

For a request with stream=false: the response is returned after a delay of <time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1))
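
For example, with time-to-first-token=1000 and inter-token-latency=100 (both in milliseconds), a non-streaming response with 10 output tokens is returned after roughly 1000 + (100 * 9) = 1900 milliseconds.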

The simulator can be run standalone or in a Pod for testing under tools such as Kind.

Limitations

API responses contain a subset of the fields provided by the OpenAI API.

Structure of supported requests/responses:
  • /v1/chat/completions
    • request
      • stream
      • model
      • messages
        • role
        • content
    • response
      • id
      • created
      • model
      • choices
        • index
        • finish_reason
        • message
  • /v1/completions
    • request
      • stream
      • model
      • prompt
      • max_tokens (for future usage)
    • response
      • id
      • created
      • model
      • choices
        • text
  • /v1/models
    • response
      • object (list)
      • data
        • id
        • object (model)
        • created
        • owned_by
        • root
        • parent
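
As an illustration, a minimal /v1/chat/completions request using the fields above might look as follows (assuming the simulator listens on port 8000 and was started with --model "Qwen/Qwen2.5-1.5B-Instruct"):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "stream": false, "messages": [{"role": "user", "content": "Hello"}]}'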

For more details, see the vLLM documentation.

Command line parameters

  • port: the port the simulator listens on, mandatory
  • model: the currently 'loaded' model, mandatory
  • lora: a list of available LoRA adapters, separated by commas, optional, by default empty
  • mode: the simulator mode, optional, by default random
    • echo: returns the same text that was sent in the request
    • random: returns a sentence chosen at random from a set of pre-defined sentences
  • time-to-first-token: the time to the first token (in milliseconds), optional, by default zero
  • inter-token-latency: the time to 'generate' each additional token (in milliseconds), optional, by default zero
  • max-loras: maximum number of LoRAs in a single batch, optional, default is one
  • max-cpu-loras: maximum number of LoRAs to store in CPU memory, optional, must be >= max-loras, default is max-loras
  • max-running-requests: maximum number of inference requests that can be processed at the same time
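
As a sketch, these parameters can be combined into a single invocation such as the following (the flag values are illustrative only, and the remaining flags are assumed to follow the same --name value syntax shown for --port, --model and --lora):

./bin/llm-d-inference-sim --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1" --mode echo --time-to-first-token 1000 --inter-token-latency 100 --max-loras 2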

Working with docker image

Building

To build a Docker image of the vLLM Simulator, run:

make build-llm-d-inference-sim-image

Running

To run the vLLM Simulator image under Docker, run:

docker run --rm --publish 8000:8000 ai-aware-router/llm-d-inference-sim:0.0.1 /ai-aware-router/llm-d-inference-sim  --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1"

Note: The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.
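
Once the container is running, a quick sanity check is to query the simulated model list (assuming nothing else is bound to port 8000 on the host):

curl http://localhost:8000/v1/models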

Standalone testing

Building

To build the vLLM simulator, run:

make build-llm-d-inference-sim

Running

To run the simulator in a standalone test environment, run:

./bin/llm-d-inference-sim --model my_model --port 8000
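
The running simulator can then be exercised with, for example, a /v1/completions request (my_model matches the --model value used above):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "prompt": "Hello"}'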