Trying llm-d via the Quick Start installer
Getting started with llm-d on Kubernetes. For specific instructions on installing llm-d on minikube, see the README-minikube.md instructions.
For more information on llm-d in general, see the llm-d git repository here and website here.
Overview
This guide walks you through installing and deploying llm-d on a Kubernetes cluster, using an opinionated flow to get up and running as quickly as possible.
Prerequisites
First, ensure you have all the tools and resources described in Prerequisites.
llm-d Installation
- Change to the directory holding your clone of the llm-d-deployer code.
- Navigate to the quickstart directory, e.g.:
cd llm-d-deployer/quickstart
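If you have not yet cloned the repository, a typical flow looks like the following; the repository URL here is an assumption, so substitute the location of your clone:
# Clone the llm-d-deployer repository (URL assumed) and enter the quickstart directory
git clone https://github.com/llm-d/llm-d-deployer.git
cd llm-d-deployer/quickstart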
Only a single installation of llm-d on a cluster is currently supported. In the future, multiple model services will be supported. Until then, uninstall llm-d before reinstalling.
The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the llmd-installer.sh script is provided. This script populates the necessary manifests in the manifests directory and then applies them all to bring up the cluster.
The llmd-installer.sh script's main function is to simplify the installation of llm-d using the llm-d-deployer. It scripts as many of the steps as possible to make the installation process more streamlined. This includes:
- Installing the GAIE infrastructure
- Creating the namespace with any special configurations
- Creating the pull secret to download the images
- Creating the model service CRDs
- Applying the helm charts
- Deploying the sample app (model service)
It also supports uninstalling the llm-d infrastructure and the sample app.
Before proceeding with the installation, ensure you have completed the prerequisites and are able to issue kubectl or oc commands to your cluster, either by configuring your ~/.kube/config file or by using the oc login command.
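A quick way to confirm cluster access before running the installer is with standard kubectl commands (use the oc equivalents on OpenShift):
# Verify that your kubeconfig points at the right cluster and that it is reachable
kubectl cluster-info
kubectl get nodes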
Usage
The installer needs to be run from the llm-d-deployer/quickstart directory as a cluster admin with CLI access to the cluster.
./llmd-installer.sh [OPTIONS]
Flags
| Flag | Description | Example |
|------|-------------|---------|
| -z, --storage-size SIZE | Size of the storage volume | ./llmd-installer.sh --storage-size 15Gi |
| -c, --storage-class CLASS | Storage class to use (default: efs-sc) | ./llmd-installer.sh --storage-class ocs-storagecluster-cephfs |
| -n, --namespace NAME | K8s namespace (default: llm-d) | ./llmd-installer.sh --namespace foo |
| -f, --values-file PATH | Path to Helm values.yaml file (default: values.yaml) | ./llmd-installer.sh --values-file /path/to/values.yaml |
| -u, --uninstall | Uninstall the llm-d components from the current cluster | ./llmd-installer.sh --uninstall |
| -d, --debug | Add debug mode to the helm install | ./llmd-installer.sh --debug |
| -i, --skip-infra | Skip the infrastructure components of the installation | ./llmd-installer.sh --skip-infra |
| -t, --download-timeout | Timeout for the model download job | ./llmd-installer.sh --download-timeout |
| -D, --download-model | Download the model to PVC from Hugging Face | ./llmd-installer.sh --download-model |
| -m, --disable-metrics-collection | Disable metrics collection (Prometheus will not be installed) | ./llmd-installer.sh --disable-metrics-collection |
| -h, --help | Show this help and exit | ./llmd-installer.sh --help |
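Flags can be combined. For example, a hypothetical install into a custom namespace using one of the provided example values files, a larger storage volume, and a model download could look like:
./llmd-installer.sh --namespace foo --values-file ./examples/base.yaml --storage-size 15Gi --download-model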
Examples
Install llm-d on an Existing Kubernetes Cluster
export HF_TOKEN="your-token"
./llmd-installer.sh
Install on OpenShift
Before running the installer, ensure you have logged into the cluster as a cluster administrator. For example:
oc login --token=sha256~yourtoken --server=https://api.yourcluster.com:6443
export HF_TOKEN="your-token"
./llmd-installer.sh
Validation
The inference-gateway serves as the HTTP ingress point for all inference requests in our deployment. It is implemented as a Kubernetes Gateway (gateway.networking.k8s.io/v1) using either kgateway or istio as the gatewayClassName, and sits in front of your inference pods to handle path-based routing, load balancing, retries, and metrics. This example validates that the gateway itself is routing your completion requests correctly.
You can execute the test-request.sh script in the quickstart folder to test on the cluster.
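If you prefer to issue a request by hand, a minimal sketch is shown below; the gateway service name, port, and model name are assumptions, so check kubectl get svc -n llm-d and your sampleApplication.model.modelName for the actual values:
# Port-forward the inference gateway locally (service name and port assumed)
kubectl port-forward -n llm-d svc/llm-d-inference-gateway 8000:80
# In another terminal, send a completion request through the gateway
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-1B", "prompt": "Hello, ", "max_tokens": 16}'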
If you receive an error indicating PodSecurity "restricted" violations when running the test-request.sh script, you need to remove the restrictive PodSecurity labels from the namespace with the command below. Once these labels are removed, re-run the script and it should proceed without PodSecurity errors.
kubectl label namespace <NAMESPACE> \
pod-security.kubernetes.io/warn- \
pod-security.kubernetes.io/warn-version- \
pod-security.kubernetes.io/audit- \
pod-security.kubernetes.io/audit-version-
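You can confirm the labels were removed by checking the namespace (replace the namespace name as needed):
kubectl get namespace <NAMESPACE> --show-labels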
Customizing your deployment
The helm charts can be customized by modifying the values.yaml file. However, it is recommended to override values in values.yaml by creating a custom YAML file and passing it to the installer using the --values-file flag.
Several examples are provided in the examples directory. You would invoke the installer with the following command:
./llmd-installer.sh --values-file ./examples/base.yaml
These files are designed to be used as a starting point to customize your deployment. Refer to the values.yaml file for all the possible options.
Sample Application and Model Configuration
Some of the more common options for changing the sample application model are:
- sampleApplication.model.modelArtifactURI - The URI of the model to use. This is the path to the model, either on Hugging Face (hf://meta-llama/Llama-3.2-3B-Instruct) or on a persistent volume claim (PVC) (pvc://model-pvc/meta-llama/Llama-3.2-1B-Instruct). Using a PVC can be paired with the --download-model flag to download the model to the PVC.
- sampleApplication.model.modelName - The name of the model to use. This will be used in the naming of deployed resources and also as the model ID when using the API.
- sampleApplication.baseConfigMapRefName - The name of the preset base configuration to use. This will depend on the features you want to enable.
- sampleApplication.prefill.replicas - The number of prefill replicas to deploy.
- sampleApplication.decode.replicas - The number of decode replicas to deploy.
sampleApplication:
  model:
    modelArtifactURI: hf://meta-llama/Llama-3.2-1B-Instruct
    modelName: "llama3-1B"
  baseConfigMapRefName: basic-gpu-with-nixl-and-redis-lookup-preset
  prefill:
    replicas: 1
  decode:
    replicas: 1
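If modelArtifactURI points at a PVC rather than Hugging Face, you can pair your override file with the --download-model flag so the installer downloads the model onto the PVC first (my-values.yaml is a hypothetical file name):
./llmd-installer.sh --values-file ./my-values.yaml --download-model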
Feature Flags
- redis.enabled - Whether to enable Redis, which is needed for the KV Cache Aware Scorer.
- modelservice.epp.defaultEnvVarsOverride - The environment variables to override for the model service. For each feature flag, you can set the value to true or false to enable or disable the feature.
redis:
  enabled: true
modelservice:
  epp:
    defaultEnvVarsOverride:
      - name: ENABLE_KVCACHE_AWARE_SCORER
        value: "false"
      - name: ENABLE_PREFIX_AWARE_SCORER
        value: "true"
      - name: ENABLE_LOAD_AWARE_SCORER
        value: "true"
      - name: ENABLE_SESSION_AWARE_SCORER
        value: "false"
      - name: PD_ENABLED
        value: "false"
      - name: PD_PROMPT_LEN_THRESHOLD
        value: "10"
      - name: PREFILL_ENABLE_KVCACHE_AWARE_SCORER
        value: "false"
      - name: PREFILL_ENABLE_LOAD_AWARE_SCORER
        value: "false"
      - name: PREFILL_ENABLE_PREFIX_AWARE_SCORER
        value: "false"
      - name: PREFILL_ENABLE_SESSION_AWARE_SCORER
        value: "false"
Metrics Collection
llm-d includes built-in support for metrics collection using Prometheus and Grafana. This feature is enabled by default but can be disabled using the --disable-metrics-collection flag during installation. In OpenShift, llm-d applies ServiceMonitors for the llm-d components, which create Prometheus scrape targets in the built-in user workload monitoring Prometheus stack.
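On OpenShift, you can confirm the ServiceMonitors were created in the install namespace (llm-d is assumed here; substitute your namespace):
oc get servicemonitors -n llm-d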
Accessing the Metrics UIs
If running in OpenShift, skip to Option 3: OpenShift.
Option 1: Port Forwarding (Default)
Once installed, you can access the metrics UIs through port-forwarding:
- Prometheus UI (port 9090):
kubectl port-forward -n llm-d-monitoring --address 0.0.0.0 svc/prometheus-kube-prometheus-prometheus 9090:9090
- Grafana UI (port 3000):
kubectl port-forward -n llm-d-monitoring --address 0.0.0.0 svc/prometheus-grafana 3000:80
Access the UIs at:
- Prometheus: <http://YOUR_IP:9090>
- Grafana: <http://YOUR_IP:3000> (default credentials: admin/admin)
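Once the port-forwards are running, a quick way to confirm Prometheus is up and scraping is to hit its HTTP API:
# Check Prometheus health and list which scrape targets are up
curl -s http://localhost:9090/-/healthy
curl -s 'http://localhost:9090/api/v1/query?query=up'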
Option 2: Ingress (Optional)
For production environments, you can configure ingress for both Prometheus and Grafana. Add the following to your values.yaml:
prometheus:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    hosts:
      - prometheus.your-domain.com
    tls:
      - secretName: prometheus-tls
        hosts:
          - prometheus.your-domain.com
grafana:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    hosts:
      - grafana.your-domain.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.your-domain.com
Option 3: OpenShift
If you're using OpenShift with user workload monitoring enabled, you can access the metrics through the OpenShift console:
- Navigate to the OpenShift console
- In the left navigation bar, click on "Observe"
- You can access:
- Metrics: Click on "Metrics" to view and query metrics using the built-in Prometheus UI
- Targets: Click on "Targets" to see all monitored endpoints and their status
The metrics are automatically integrated into the OpenShift monitoring stack, providing a seamless experience for viewing and analyzing your llm-d metrics. The llm-d-deployer does not install Grafana in OpenShift, but it's recommended that users install Grafana to view metrics and import dashboards.
Follow the OpenShift Grafana setup guide. The guide includes manifests to install the following:
- Grafana instance
- Grafana Prometheus datasource from user workload monitoring stack
- Grafana llm-d dashboard
Available Metrics
The metrics collection includes:
- Model inference performance metrics
- Request latency and throughput
- Resource utilization (CPU, memory, GPU)
- Cache hit/miss rates
- Error rates and types
Security Note
When running in a cloud environment (like EC2), make sure to:
- Configure your security groups to allow inbound traffic on ports 9090 and 3000 (if using port-forwarding)
- Use the --address 0.0.0.0 flag with port-forward to allow external access
- Consider setting up proper authentication for production environments
- If using ingress, ensure proper TLS configuration and authentication
- For OpenShift, consider using the built-in OAuth integration for Grafana
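For example, on AWS the inbound rules for the port-forwards could be opened with the CLI; the security group ID and source CIDR below are placeholders, not values from this deployment:
# Allow inbound access to the Prometheus and Grafana ports (illustrative values)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 9090 --cidr 203.0.113.10/32
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 3000 --cidr 203.0.113.10/32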
Troubleshooting
The various images can take some time to download depending on your connectivity. Watching events and logs of the prefill and decode pods is a good place to start. Here are some examples to help you get started.
# View the status of the pods in the default llm-d namespace. Replace "llm-d" if you used a custom namespace on install
kubectl get pods -n llm-d
# Describe all prefill pods:
kubectl describe pods -l llm-d.ai/role=prefill -n llm-d
# Fetch logs from each prefill pod:
kubectl logs -l llm-d.ai/role=prefill --all-containers=true -n llm-d --tail=200
# Describe all decode pods:
kubectl describe pods -l llm-d.ai/role=decode -n llm-d
# Fetch logs from each decode pod:
kubectl logs -l llm-d.ai/role=decode --all-containers=true -n llm-d --tail=200
# Describe all endpoint-picker pods:
kubectl describe pod -n llm-d -l llm-d.ai/epp
# Fetch logs from each endpoint-picker pod:
kubectl logs -n llm-d -l llm-d.ai/epp --all-containers=true --tail=200
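Cluster events are also worth watching, since image pulls and scheduling issues often show up there before they appear in pod logs:
# List recent events in the llm-d namespace, newest last
kubectl get events -n llm-d --sort-by=.lastTimestamp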
More examples of debugging logs can be found here.
Uninstall
This will remove the llm-d resources from the cluster. This is especially useful for test/dev: if you want to make a change, simply uninstall and then run the installer again with your changes.
./llmd-installer.sh --uninstall
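To confirm the cleanup completed, you can check that no llm-d resources remain in the namespace (replace llm-d if you used a custom namespace):
kubectl get all -n llm-d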