
Request Handler

Functionality

The Request Handler manages the lifecycle of an inference request before and after the request scheduling phase within the EPP. It handles parsing the request payload, preparing and managing state for the Request Scheduler, interacting with Flow Control and processing the response from the model server.

Design

Architecture Overview

Core Components

  • Parser: Responsible for parsing the incoming request into a structured internal representation consumable by the Request Scheduler, and for parsing the response to extract usage data when the model server reports it.
  • DataProducer: A pluggable extension that allows customizing request pre-processing and producing per-request state needed for scheduling, such as tokenization, prefix-cache matches, and predicted processing latency.
  • Admitter: Decides whether to admit a request based on criteria like latency SLOs. Runs after the DataProducer stage but before request scheduling. Requests failing admission are rejected, while admitted requests proceed to the request scheduling phase.

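To make the division of labor concrete, here is a minimal Go sketch of how these three components could be expressed as interfaces. The type and method names (Request, ParseRequest, Produce, Admit, and the Data map) are illustrative assumptions, not the actual EPP plugin API.

```go
package requestcontrol

import "context"

// Request is a simplified stand-in for the internal, scheduler-facing
// representation produced by the parser (illustrative only).
type Request struct {
	Model    string
	Prompt   string
	Priority int
	// Data holds per-request state contributed by data producers,
	// e.g. token counts, prefix-cache matches, latency predictions.
	Data map[string]any
}

// Parser turns a raw request payload into the internal representation
// and extracts usage data from responses when the model server reports it.
type Parser interface {
	ParseRequest(ctx context.Context, body []byte) (*Request, error)
	ParseResponse(ctx context.Context, body []byte) (usage map[string]int, err error)
}

// DataProducer enriches a parsed request with state the scheduler needs.
type DataProducer interface {
	Produce(ctx context.Context, req *Request) error
}

// Admitter decides whether a request may proceed to scheduling;
// a non-nil error rejects the request.
type Admitter interface {
	Admit(ctx context.Context, req *Request) error
}
```
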
Advanced Hooks

The framework also supports advanced, auto-resolved hooks in the request control layer. If a plugin implements these interfaces, it is automatically wired into the execution flow:

  • PreRequest: Executes before the request is processed (e.g., for incrementing in-flight counts).
  • ResponseHeaderProcessor: Executes when response headers are received from the backend.
  • ResponseBodyProcessor: Executes during response streaming (e.g., for usage tracking on completion).

> [!NOTE]
> In practice, these interfaces are often implemented by Data Producers to maintain state or track metrics across the request lifecycle. For example, the predicted-latency-producer implements these hooks to track request latency.
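
Continuing the sketch above, the hooks might look like the following interfaces. The signatures, in particular the string endpoint argument and the endOfStream flag, are assumptions for illustration, not the real API.

```go
package requestcontrol

import "context"

// PreRequest runs before the request is dispatched, e.g. to bump
// in-flight counters for the endpoint the scheduler picked.
type PreRequest interface {
	PreRequest(ctx context.Context, req *Request, endpoint string)
}

// ResponseHeaderProcessor runs once response headers arrive from the backend.
type ResponseHeaderProcessor interface {
	ProcessResponseHeaders(ctx context.Context, req *Request, headers map[string]string)
}

// ResponseBodyProcessor runs for each chunk of a streamed response body;
// endOfStream is true on the final chunk, which is where usage tracking
// and counter decrements typically happen.
type ResponseBodyProcessor interface {
	ProcessResponseBody(ctx context.Context, req *Request, chunk []byte, endOfStream bool)
}
```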


Concrete Plugins

Parsers

Parser plugins understand the payloads of requests and responses. This is key for features like prefix-cache aware scheduling and response usage tracking.

  • openai-parser: The default parser supporting the OpenAI API (a rough sketch of its parsing step follows this list). Supported endpoints: /conversations, /responses, /chat/completions, /completions, /embeddings.
  • vllmgrpc-parser: Handles requests for the vLLM gRPC API. Supported methods: Generate, Embed.
  • passthrough-parser: Model-agnostic parser that passes content through without interpretation. Note that payload-related scheduling (e.g., prefix-cache-scorer) is not supported with this parser.
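
As a rough illustration of the work the openai-parser does on a /chat/completions request, the sketch below (building on the Request type from the earlier sketch) unmarshals the payload and flattens the messages into a prompt string that features like prefix-cache scoring can consume. The struct and function names are hypothetical; only the JSON field names (model, messages, role, content) come from the OpenAI schema.

```go
package requestcontrol

import (
	"encoding/json"
	"fmt"
	"strings"
)

// chatCompletionsBody captures just the fields the parser needs from a
// POST /v1/chat/completions payload (a subset of the OpenAI schema).
type chatCompletionsBody struct {
	Model    string `json:"model"`
	Messages []struct {
		Role    string `json:"role"`
		Content string `json:"content"`
	} `json:"messages"`
}

// parseChatCompletions extracts the model name and a flattened prompt.
func parseChatCompletions(body []byte) (*Request, error) {
	var in chatCompletionsBody
	if err := json.Unmarshal(body, &in); err != nil {
		return nil, fmt.Errorf("invalid chat/completions payload: %w", err)
	}
	var prompt strings.Builder
	for _, m := range in.Messages {
		prompt.WriteString(m.Role)
		prompt.WriteString(": ")
		prompt.WriteString(m.Content)
		prompt.WriteString("\n")
	}
	return &Request{Model: in.Model, Prompt: prompt.String(), Data: map[string]any{}}, nil
}
```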

Admitter Plugins

  • latency-slo-admitter: Rejects sheddable requests (priority < 0) when no endpoint can meet latency SLO constraints.
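
A minimal sketch of that decision rule, assuming per-endpoint TTFT/TPOT predictions have already been attached to the request by a data producer. The type names and the "latency-predictions" key are hypothetical, not the actual plugin's implementation.

```go
package requestcontrol

import (
	"context"
	"errors"
)

// endpointPrediction is an illustrative view of per-endpoint latency
// predictions produced earlier in the request lifecycle.
type endpointPrediction struct {
	PredictedTTFT float64 // seconds to first token
	PredictedTPOT float64 // seconds per output token
}

// sloAdmitter mimics the latency-slo-admitter's rule: sheddable requests
// (priority < 0) are rejected when no endpoint is predicted to meet the
// request's TTFT/TPOT SLOs.
type sloAdmitter struct {
	ttftSLO float64
	tpotSLO float64
}

func (a *sloAdmitter) Admit(_ context.Context, req *Request) error {
	if req.Priority >= 0 {
		return nil // non-sheddable requests are always admitted
	}
	preds, _ := req.Data["latency-predictions"].(map[string]endpointPrediction)
	for _, p := range preds {
		if p.PredictedTTFT <= a.ttftSLO && p.PredictedTPOT <= a.tpotSLO {
			return nil // at least one endpoint has SLO headroom
		}
	}
	return errors.New("rejected: no endpoint can meet the latency SLO")
}
```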

Data Producers

  • predicted-latency-producer: Trains XGBoost models via a sidecar and generates per-endpoint TTFT/TPOT predictions. It calculates SLO headroom, collects training data, and tracks per-endpoint running request queues.
  • inflight-load-producer: Tracks the number of in-flight requests and estimated tokens for each endpoint. It increments counts in PreRequest and decrements them in ResponseBodyProcessor on end-of-stream (see the sketch after this list).
  • approx-prefix-cache-producer: Prepares data for approximate prefix cache aware scheduling by hashing prompts in blocks and matching them against an indexer of cached prefixes on servers.
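
As an example of a data producer that also implements the advanced hooks, here is a rough sketch in the spirit of the inflight-load-producer: it exposes a per-endpoint in-flight count to the scheduler, increments it in PreRequest, and releases it when the response stream ends. All names are illustrative, and the real plugin additionally tracks estimated tokens.

```go
package requestcontrol

import (
	"context"
	"sync"
)

// inflightLoad tracks running requests per endpoint (illustrative only).
type inflightLoad struct {
	mu       sync.Mutex
	inflight map[string]int // endpoint -> number of running requests
}

func newInflightLoad() *inflightLoad {
	return &inflightLoad{inflight: map[string]int{}}
}

// Produce exposes a snapshot of the current per-endpoint load to the
// scheduler via the request's Data map.
func (p *inflightLoad) Produce(_ context.Context, req *Request) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	snapshot := make(map[string]int, len(p.inflight))
	for ep, n := range p.inflight {
		snapshot[ep] = n
	}
	req.Data["inflight"] = snapshot
	return nil
}

// PreRequest increments the count for the endpoint the request was routed to.
func (p *inflightLoad) PreRequest(_ context.Context, _ *Request, endpoint string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.inflight[endpoint]++
}

// ProcessResponseBody decrements the count once the response stream completes.
func (p *inflightLoad) ProcessResponseBody(_ context.Context, req *Request, _ []byte, endOfStream bool) {
	if !endOfStream {
		return
	}
	endpoint, _ := req.Data["endpoint"].(string) // illustrative: routed endpoint stashed earlier
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.inflight[endpoint] > 0 {
		p.inflight[endpoint]--
	}
}
```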

Metrics & Observability

The Request Handling subsystem exposes metrics tracking request volume, success, latency, and token usage. Unless otherwise noted, these metrics carry the labels model_name and target_model_name.

Request Volume & Success

| Metric | Type | Description | Labels |
| --- | --- | --- | --- |
| inference_objective_request_total | Counter | Total request count per model | model_name, target_model_name, priority |
| inference_objective_request_error_total | Counter | Total error count per model | model_name, target_model_name, error_code |
| inference_objective_running_requests | Gauge | Currently active requests per model | model_name |

Latency & SLOs

| Metric | Type | Description | Labels |
| --- | --- | --- | --- |
| inference_objective_request_duration_seconds | Distribution | End-to-end response latency | model_name, target_model_name |
| inference_objective_normalized_time_per_output_token_seconds | Distribution | Normalized Time Per Output Token (NTPOT) | model_name, target_model_name |
| inference_objective_request_ttft_seconds | Distribution | Time to first token (TTFT) | model_name, target_model_name |
| inference_objective_request_predicted_ttft_seconds | Distribution | Predicted TTFT | model_name, target_model_name |
| inference_objective_request_ttft_prediction_duration_seconds | Distribution | Time spent predicting TTFT | model_name, target_model_name |
| inference_objective_request_predicted_tpot_seconds | Distribution | Predicted TPOT | model_name, target_model_name |
| inference_objective_request_tpot_prediction_duration_seconds | Distribution | Time spent predicting TPOT | model_name, target_model_name |
| inference_objective_request_slo_violation_total | Counter | Total count of requests violating SLO | model_name, target_model_name, type |

Sizes & Token Usage

| Metric | Type | Description | Labels |
| --- | --- | --- | --- |
| inference_objective_request_sizes | Distribution | Request size in bytes | model_name, target_model_name |
| inference_objective_response_sizes | Distribution | Response size in bytes | model_name, target_model_name |
| inference_objective_input_tokens | Distribution | Input token count per request | model_name, target_model_name |
| inference_objective_output_tokens | Distribution | Output token count per request | model_name, target_model_name |
| inference_objective_prompt_cached_tokens | Distribution | Number of prompt cached tokens | model_name, target_model_name |

Note: Response-level metrics (response sizes, output tokens, NTPOT) require Envoy body mode to be set to Buffered or Streamed. For vLLM streaming responses with usage data, include stream_options: {"include_usage": true} in the request.
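
For instance, a client asking a streaming endpoint to report usage might send a body like the one built in this minimal Go sketch; the gateway URL and model name are placeholders, and only the stream_options.include_usage field comes from the OpenAI/vLLM API described above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Streaming chat completion that asks the server to append a final
	// usage chunk at end-of-stream.
	body, _ := json.Marshal(map[string]any{
		"model":  "my-model",
		"stream": true,
		"stream_options": map[string]any{
			"include_usage": true,
		},
		"messages": []map[string]string{
			{"role": "user", "content": "Hello"},
		},
	})
	resp, err := http.Post("http://my-gateway/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```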

Other Metrics

| Metric | Type | Description | Labels |
| --- | --- | --- | --- |
| inference_objective_inference_request_metric | Gauge | Consolidated gauge for request metrics | model_name, target_model_name, type |
| inference_extension_model_rewrite_decisions_total | Counter | Total number of model rewrite decisions | model_rewrite_name, model_name, target_model |