Metrics

This page covers how to enable and interpret metrics from an llm-d deployment. For Prometheus and Grafana installation, see Observability Setup first.

Note: Commands on this page use ${NAMESPACE} for the namespace where your llm-d workload runs. Set it before following along:

export NAMESPACE=<your-llm-d-namespace>

Prerequisites

  • A running llm-d deployment with an InferencePool and model servers — see the quickstart if needed
  • Prometheus and Grafana installed — see Observability Setup

Step 1: Enable Model Server Metrics

Model server metrics are enabled by default. Configuration varies by deployment method.

Kustomize Deployments

If you deployed your model server using kustomize build, add the monitoring component to your kustomization.yaml:

components:
- ../../../recipes/modelserver/components/monitoring # decode PodMonitor
# - ../../../recipes/modelserver/components/monitoring-pd # add for prefill/decode disaggregation

The monitoring component creates PodMonitors that scrape model server metrics. See guides/recipes/modelserver/components/monitoring/ for details.

Verify PodMonitors

Verify the PodMonitors exist:

kubectl get podmonitors -n ${NAMESPACE}

Expected output:

NAME AGE
decode-podmonitor 5m
prefill-podmonitor 5m

Key vLLM Metrics

Metric | What it measures | Why it matters
vllm:num_requests_running | Active requests being processed | High values indicate GPU saturation; new requests will queue. Watch for sustained spikes
vllm:num_requests_waiting | Requests queued, waiting to be processed | Non-zero means pods are saturated. Primary signal for autoscaling decisions
vllm:kv_cache_usage_perc | KV cache utilization (0.0 to 1.0) | Above 0.9 means GPU memory is nearly full and requests may get preempted or rejected
vllm:time_to_first_token_seconds (histogram) | Time from request arrival to first generated token (TTFT) | Directly impacts user experience. Use histogram_quantile() to query percentiles
vllm:inter_token_latency_seconds (histogram) | Time between consecutive generated tokens (ITL) | Affects streaming response speed. High ITL causes choppy output. Use histogram_quantile() to query percentiles
vllm:prefix_cache_hits_total | Number of prefix cache hits | Compare with prefix_cache_queries_total to get hit rate. Low hit rate suggests the EPP is not routing effectively
vllm:prefix_cache_queries_total | Total prefix cache lookups | Divide prefix_cache_hits_total by this to get hit rate. A dropping ratio indicates routing or prompt pattern changes
vllm:prompt_tokens_total | Total input tokens processed | Use rate() to get tokens/sec per pod. Compare across pods to spot uneven load distribution
vllm:generation_tokens_total | Total output tokens generated | Use rate() alongside prompt tokens to get total throughput. A drop signals degraded model performance
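The table refers to histogram_quantile() and to dividing hits by queries without showing the resulting expressions. The sketch below prints PromQL for both cases. The helper names quantile_query and ratio_query are hypothetical, and the 5m window and the 0.95 quantile are assumptions to adjust. The _bucket suffix and the le label are standard for Prometheus histograms.

```shell
# Hypothetical helpers that print PromQL expressions for the vLLM metrics above.
# They only build query strings; nothing here contacts Prometheus.

# quantile_query QUANTILE HISTOGRAM [WINDOW]
# Prometheus histograms expose a <name>_bucket series with an "le" label.
quantile_query() {
  printf 'histogram_quantile(%s, sum by (le) (rate(%s_bucket[%s])))\n' \
    "$1" "$2" "${3:-5m}"
}

# ratio_query NUMERATOR_COUNTER DENOMINATOR_COUNTER [WINDOW]
# Divides the per-second rates of two counters, e.g. cache hits over lookups.
ratio_query() {
  printf 'sum(rate(%s[%s])) / sum(rate(%s[%s]))\n' \
    "$1" "${3:-5m}" "$2" "${3:-5m}"
}

# p95 TTFT, p95 ITL, and prefix cache hit rate over an assumed 5m window.
quantile_query 0.95 vllm:time_to_first_token_seconds
quantile_query 0.95 vllm:inter_token_latency_seconds
ratio_query vllm:prefix_cache_hits_total vllm:prefix_cache_queries_total
```

The printed expressions can be pasted into the Prometheus UI or a Grafana panel. Aggregating with sum by (le) merges all pods into one distribution; add further labels to the by clause if a per-pod breakdown is needed.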

Key SGLang Metrics

Metric | What it measures | Why it matters
sglang:num_running_reqs | Active requests being processed | High values indicate GPU saturation; new requests will queue
sglang:num_queue_reqs | Requests queued, waiting to be processed | Non-zero means pods are saturated. Primary signal for autoscaling decisions
sglang:token_usage | KV cache token utilization (0.0 to 1.0) | Above 0.9 means GPU memory is nearly full
sglang:time_to_first_token_seconds (histogram) | Time from request arrival to first generated token (TTFT) | Directly impacts user experience. Use histogram_quantile() to query percentiles
sglang:inter_token_latency_seconds (histogram) | Time between consecutive generated tokens (ITL) | Affects streaming response speed. Use histogram_quantile() to query percentiles
sglang:prompt_tokens_total | Total input tokens processed | Use rate() to get tokens/sec per pod
sglang:generation_tokens_total | Total output tokens generated | Use rate() alongside prompt tokens to get total throughput
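For the two token counters, a per-pod throughput expression might look like the sketch below. The helper name throughput_query is hypothetical, the 5m window is an assumption, and the query assumes the scraped series carry a pod label, which is typical for targets discovered through a PodMonitor but should be confirmed in your setup.

```shell
# Hypothetical helper: prints a PromQL expression for per-pod token throughput
# (prompt tokens/sec plus generation tokens/sec).
# Usage: throughput_query ENGINE [WINDOW]
#   ENGINE is the metric prefix (sglang or vllm); WINDOW defaults to 5m.
# Assumes every series has a "pod" label.
throughput_query() {
  printf 'sum by (pod) (rate(%s:prompt_tokens_total[%s])) + sum by (pod) (rate(%s:generation_tokens_total[%s]))\n' \
    "$1" "${2:-5m}" "$1" "${2:-5m}"
}

throughput_query sglang
```

Each side is aggregated by pod before the addition so that the two vectors match on the same label set. Passing vllm as the first argument produces the equivalent expression for the vLLM counters.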

Step 3: Enable EPP Metrics

EPP (Endpoint Picker) metrics are enabled by default. To verify or enable manually, see the Monitoring & Tracing Configuration section in the llm-d-router Helm chart docs.

Verify the ServiceMonitor exists:

kubectl get servicemonitors -n ${NAMESPACE}

Expected output:

NAME AGE
epp-servicemonitor 5m

Key llm-d Router EPP Metrics

Metric | What it measures | Why it matters
llm_d_router_epp_request_total | Total request count per model | Baseline for calculating error rate and throughput per model
llm_d_router_epp_request_error_total | Total error count per model | Rising errors signal backend failures. Alert when error rate exceeds 5%
llm_d_router_epp_request_duration_seconds | End-to-end response latency | The SLO metric. Tracks full round-trip time from request to response
llm_d_router_epp_input_tokens | Input token count per request | Helps identify expensive requests. Long prompts cost more compute
llm_d_router_epp_output_tokens | Output token count per request | Combined with duration, gives normalized cost per token
llm_d_router_epp_normalized_time_per_output_token_seconds | Normalized time per output token (NTPOT) | Key efficiency metric (lower is better). Compare across pods to find stragglers
llm_d_router_epp_running_requests | Currently active requests per model | Shows real-time load distribution. Uneven distribution suggests the EPP may need tuning
llm_d_router_epp_average_kv_cache_utilization | Average KV cache utilization across the pool | Pool-wide memory pressure indicator. Above 0.8, consider scaling up to avoid preemptions
llm_d_router_epp_average_queue_size | Average queue depth across the pool | Pool-wide saturation signal. Non-zero means requests are waiting
llm_d_router_epp_ready_endpoints | Number of ready endpoints in the pool | If this drops below expected count, pods are crashing or not scheduling
llm_d_router_epp_scheduler_attempts_total | Scheduling attempt counts and outcomes | Track failed scheduling attempts. High failure rate indicates filter/scorer misconfiguration
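The 5% error-rate guidance can be checked with simple arithmetic on the two request counters. The sketch below works on made-up sample counts rather than live data. The helper names error_rate and exceeds_threshold are hypothetical, and the PromQL in the comment assumes a 5m window.

```shell
# error_rate ERRORS TOTAL: prints errors/total with four decimals,
# or 0.0000 when TOTAL is 0 (avoids division by zero).
error_rate() {
  awk -v e="$1" -v t="$2" \
    'BEGIN { if (t > 0) printf "%.4f\n", e / t; else print "0.0000" }'
}

# exceeds_threshold RATE [THRESHOLD]: exit status 0 when RATE > THRESHOLD.
# THRESHOLD defaults to 0.05, the 5% figure from the table above.
exceeds_threshold() {
  awk -v r="$1" -v th="${2:-0.05}" 'BEGIN { exit !(r > th) }'
}

# A comparable PromQL alert condition (the 5m window is an assumption):
#   sum(rate(llm_d_router_epp_request_error_total[5m]))
#     / sum(rate(llm_d_router_epp_request_total[5m])) > 0.05

# Made-up sample: 12 errors out of 200 requests in the window.
rate=$(error_rate 12 200)
if exceeds_threshold "$rate"; then
  echo "error rate ${rate} is above the 5% threshold"
else
  echo "error rate ${rate} is within the 5% threshold"
fi
```

With the sample counts the rate is 0.0600, so the first branch runs. In practice the two inputs would be the increase of each counter over the same time window, not the raw counter values.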

Flow Control Metrics

When flow control is enabled, these additional metrics are exposed:

Metric | What it measures | Why it matters
llm_d_router_epp_flow_control_queue_size | Requests currently queued | Growing queue means the pool cannot keep up. Consider scaling or adjusting priority bands
llm_d_router_epp_flow_control_queue_bytes | Total size of queued requests in bytes | Large queued payloads can exhaust EPP memory. Monitor alongside maxBytes config
llm_d_router_epp_flow_control_request_queue_duration_seconds | Time a request spends in the queue | Directly impacts user-perceived latency. High values mean flow control is holding requests too long
llm_d_router_epp_flow_control_pool_saturation | Pool saturation level (0.0 to 1.0+) | Above 1.0 means demand exceeds capacity and flow control is actively throttling. Scale up or shed load

Step 4: View Dashboards

llm-d provides pre-built Grafana dashboards for common monitoring scenarios.

Access Grafana

Note: The commands below use namespace and service names from the bundled install script. If you use an existing Prometheus or Grafana instance, adjust the namespace and service names accordingly.

kubectl port-forward -n llm-d-monitoring svc/llmd-grafana 3000:80
# Open http://localhost:3000
# Default login: admin / admin

Import Dashboards

Load all llm-d dashboards into Grafana:

./guides/recipes/observability/load-llm-d-dashboards.sh

Verify dashboards were imported:

kubectl get configmaps -n llm-d-monitoring -l grafana_dashboard=1

Expected output:

NAME DATA AGE
llm-d-vllm-overview 1 30s
llm-d-sglang-overview 1 30s
llm-d-failure-saturation-dashboard 1 30s
llm-d-diagnostic-drilldown-dashboard 1 30s
llm-d-performance-kv-cache 1 30s
llm-d-pd-coordinator-metrics 1 30s

Or import individual dashboard JSON files manually from guides/recipes/observability/grafana/dashboards/:

Dashboard | What it shows
llm-d-vllm-overview.json | General vLLM metrics overview
llm-d-sglang-overview.json | General SGLang metrics overview
llm-d-failure-saturation-dashboard.json | Failure and saturation indicators
llm-d-diagnostic-drilldown-dashboard.json | Detailed diagnostic metrics for troubleshooting
llm-d-performance-kv-cache.json | Performance metrics including KV cache utilization
llm-d-pd-coordinator-metrics.json | Prefill/decode disaggregation metrics

Step 5: Query Metrics

Access the Prometheus UI:

kubectl port-forward -n llm-d-monitoring svc/llmd-kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090 (or https://localhost:9090 if TLS is enabled)
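Besides the web UI, queries can be sent to the Prometheus HTTP API from the command line. The sketch below wraps the instant-query endpoint (/api/v1/query) in a small function. The function name prom_query and the PROM_URL variable are hypothetical, and the default URL assumes the port-forward above is running without TLS.

```shell
# Base URL of the port-forwarded Prometheus. Override PROM_URL when TLS is
# enabled, for example: export PROM_URL=https://localhost:9090
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query EXPR: runs an instant query through the Prometheus HTTP API
# and prints the raw JSON response.
# -G sends the data as URL query parameters; --data-urlencode escapes the
# PromQL expression so characters such as ':' and '{' are transmitted safely.
prom_query() {
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example invocations (run while the port-forward is active):
#   prom_query 'vllm:num_requests_waiting'
#   prom_query 'llm_d_router_epp_ready_endpoints'
```

An empty result array in the response usually means the metric name is wrong or the target is not being scraped. With a self-signed certificate, curl may additionally need the CA certificate passed via --cacert.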

Cleanup

Remove the monitoring stack installed by the bundled script:

./guides/recipes/observability/install-prometheus-grafana.sh -u -n llm-d-monitoring

Troubleshooting

Autoscaler reports "http: server gave HTTP response to HTTPS client"

The autoscaler is configured for HTTPS but Prometheus is serving HTTP. Remove the existing installation, then reinstall it with TLS enabled:

./guides/recipes/observability/install-prometheus-grafana.sh -u
./guides/recipes/observability/install-prometheus-grafana.sh --enable-tls

Metrics not appearing in Prometheus

  1. Check that PodMonitors and ServiceMonitors exist:

    kubectl get podmonitors,servicemonitors -n ${NAMESPACE}

  2. Verify Prometheus is scraping the targets. Open http://localhost:9090/targets (after port-forwarding) and check that vLLM and EPP targets show UP.

  3. Confirm pods expose metrics. Adjust the app=my-model label selector to match your model server pods:

    VLLM_POD=$(kubectl get pods -n ${NAMESPACE} -l app=my-model -o jsonpath='{.items[0].metadata.name}')
    kubectl port-forward -n ${NAMESPACE} ${VLLM_POD} 8000:8000
    # port-forward stays in the foreground; run curl in a second terminal
    curl -s http://localhost:8000/metrics | head -20
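Because head -20 may show only generic process metrics, it can help to count how many model server metric families the pod actually exposes. The sketch below reads Prometheus text exposition format on stdin and counts the TYPE declarations whose name starts with a given prefix. The function name count_metric_families is hypothetical, and the sample input is made up for illustration.

```shell
# count_metric_families PREFIX: reads Prometheus text exposition format on
# stdin and prints how many "# TYPE <name> <type>" lines have a metric name
# starting with PREFIX. Prints 0 when none match.
count_metric_families() {
  awk -v p="$1" \
    '$1 == "#" && $2 == "TYPE" && index($3, p) == 1 { n++ } END { print n + 0 }'
}

# Made-up sample input, for illustration only. Prints 2.
count_metric_families "vllm:" <<'EOF'
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example"} 1
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="example"} 0
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1.5
EOF
```

Against a live pod, the same function can be fed from the port-forwarded endpoint: curl -s http://localhost:8000/metrics | count_metric_families vllm: (or sglang: for SGLang pods). A result of 0 suggests the endpoint is reachable but is not serving model server metrics.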

Grafana dashboards show "No data"

  1. Verify the Grafana datasource points to the correct Prometheus URL
  2. Check that metrics are flowing in Prometheus first (use the Prometheus UI)
  3. If using TLS, ensure the Grafana datasource is configured for HTTPS with the correct CA certificate