llm-d-modelservice

ModelService is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing Kubernetes resources for serving base models. It enables reproducible, scalable, and tunable model deployments through modular presets, and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, LeaderWorkerSet). It provides an opinionated but flexible path for deploying, benchmarking, and tuning LLM inference workloads.

The ModelService Helm chart proposal was accepted on June 10, 2025. Read more about the roadmap, motivation, and alternatives considered here.

TL;DR:

Active scenarios supported (a values sketch follows the list below):

  • P/D disaggregation
  • Multi-node inference, utilizing data parallelism
  • One pod per node (see llm-d-infra for the ModelService values file)
  • One pod per DP rank
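
As a rough illustration of these scenarios, the values sketch below enables P/D disaggregation with multi-node decode via data parallelism. The keys mirror the Values table later in this document; the replica and parallelism numbers are arbitrary placeholders, not recommendations.

# Hypothetical values override -- adjust the numbers to your hardware
multinode: true          # use LeaderWorkerSets instead of Deployments
decode:
  replicas: 1
  parallelism:
    data: 4              # data-parallel ranks for decode
    tensor: 1
prefill:
  replicas: 2            # separate prefill workers (P/D disaggregation)
  parallelism:
    data: 1
    tensor: 1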

Integration with llm-d components:

  • Quickstart guide in llm-d-infra depends on ModelService
  • Flexible configuration of llm-d-inference-scheduler for routing
  • Uses llm-d-routing-sidecar for P/D disaggregation
  • Utilized in benchmarking experiments in llm-d-benchmark
  • Effortless use of llm-d-inference-sim for CPU-only workloads

Getting started

Add this repository to Helm.

helm repo add llm-d-modelservice https://llm-d-incubation.github.io/llm-d-modelservice/
helm repo update

ModelService assumes that llm-d-infra has been installed in the Kubernetes cluster; llm-d-infra installs the required prerequisites and CRDs. Read the llm-d-infra Quickstart for more information.

At a minimum, follow these steps to install the required external CRDs, since the ModelService Helm chart depends on them.
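
As a quick sanity check (not part of the official instructions), you can confirm that the external CRDs are present before installing. The CRD names below are assumptions based on the components mentioned above (Gateway API, Gateway API Inference Extension, LeaderWorkerSet); verify them against your cluster.

# Verify the CRDs exist in the cluster (names are assumptions)
kubectl get crd httproutes.gateway.networking.k8s.io
kubectl get crd inferencepools.inference.networking.x-k8s.io
kubectl get crd inferencemodels.inference.networking.x-k8s.io
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io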

Note that Helm hooks are used so that HTTPRoute objects are created last. As a consequence, these objects are not deleted when helm delete is executed; delete them manually (for example, as sketched below) to avoid unexpected routing problems.
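
For instance, a cleanup along these lines (the namespace and object names are placeholders) removes leftover HTTPRoute objects after uninstalling a release:

# List HTTPRoute objects left behind in the release namespace
kubectl get httproute -n <namespace>

# Delete the ones created by this chart for the uninstalled release
kubectl delete httproute <httproute-name> -n <namespace>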

Examples

See examples for how to use this Helm chart. Some examples contain placeholders for components such as the gateway name. Use the --set flag to override placeholders. For example,

helm install cpu-only llm-d-modelservice -f examples/values-cpu.yaml --set prefill.replicas=0 --set "routing.parentRefs[0].name=MYGATEWAY"
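
Alternatively, the same overrides can be kept in a small values file and passed with an additional -f flag; the sketch below mirrors the --set flags above (MYGATEWAY and my-overrides.yaml are placeholders, and later -f files take precedence in Helm):

# my-overrides.yaml (placeholder file name)
prefill:
  replicas: 0
routing:
  parentRefs:
    - name: MYGATEWAY   # replace with your gateway's name

helm install cpu-only llm-d-modelservice -f examples/values-cpu.yaml -f my-overrides.yaml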

Check Helm's official docs for more guidance.

Values

Below are the values you can set.

| Key | Description | Type | Default |
|-----|-------------|------|---------|
| modelArtifacts.name | Name of the model in the form namespace/modelId. Required. | string | N/A |
| modelArtifacts.uri | Model artifacts URI. Currently supported formats include hf://, pvc://, and oci:// | string | N/A |
| modelArtifacts.size | Size used to create an emptyDir volume for downloading the model. | string | N/A |
| modelArtifacts.authSecretName | Name of the Secret containing HF_TOKEN for hf:// artifacts that require a token for downloading a model. | string | N/A |
| modelArtifacts.mountPath | Path to mount the volume created to store models | string | /model-cache |
| multinode | Determines whether to create P/D using Deployments (false) or LeaderWorkerSets (true) | bool | false |
| routing.servicePort | The port the routing proxy sidecar listens on. If there is no sidecar, this is the port the request goes to. | int | N/A |
| routing.proxy.image | Image used for the sidecar | string | ghcr.io/llm-d/llm-d-routing-sidecar:0.0.6 |
| routing.proxy.targetPort | The port the vLLM decode container listens on. If the proxy is present, it forwards requests to this port. | string | N/A |
| routing.proxy.debugLevel | Debug level of the routing proxy | int | 5 |
| routing.proxy.parentRefs[*].name | The name of the inference gateway | string | N/A |
| routing.inferencePool.create | If true, creates an InferencePool object | bool | true |
| routing.inferencePool.extensionRef | Name of an EPP service to use instead of the default one created by this chart. | string | N/A |
| routing.inferenceModel.create | If true, creates an InferenceModel object | bool | false |
| routing.httpRoute.create | If true, creates an HTTPRoute object | bool | true |
| routing.httpRoute.backendRefs | Override for HTTPRoute.backendRefs | List | [] |
| routing.httpRoute.matches | Override for HTTPRoute.backendRefs[*].matches where backendRefs are created by this chart. | Dict | |
| routing.epp.create | If true, creates EPP objects | bool | true |
| routing.epp.service.permissions | Role to be bound to the EPP service account in place of the default created by this chart. | string | N/A |
| routing.epp.service.type | Type of Service created for the Inference Scheduler (Endpoint Picker) deployment | string | ClusterIP |
| routing.epp.service.port | The port the Inference Scheduler listens on | int | 9002 |
| routing.epp.service.targetPort | The target port the Inference Scheduler listens on | int | 9002 |
| routing.epp.service.appProtocol | The app protocol the Inference Scheduler uses | int | 9002 |
| routing.epp.image | Image to be used for the EPP container | string | ghcr.io/llm-d/llm-d-inference-scheduler:0.0.4 |
| routing.epp.replicas | Number of replicas for the Inference Scheduler pod | int | 1 |
| routing.epp.debugLevel | Debug level used to start the Inference Scheduler pod | int | 4 |
| routing.epp.disableReadinessProbe | Disable readiness probe creation for the Inference Scheduler pod. Set this to true if you want to debug on Kind. | bool | false |
| routing.epp.disableLivenessProbe | Disable liveness probe creation for the Inference Scheduler pod. Set this to true if you want to debug on Kind. | bool | false |
| routing.epp.env | List of environment variables | List | [] |
| decode.create | If true, creates the decode Deployment or LeaderWorkerSet | bool | true |
| decode.annotations | Annotations added to the Deployment or LeaderWorkerSet | Dict | |
| decode.tolerations | Tolerations added to the Deployment or LeaderWorkerSet | List | [] |
| decode.replicas | Number of replicas for decode pods | int | 1 |
| decode.containers[*].name | Name of the container for the decode Deployment/LWS | string | N/A |
| decode.containers[*].image | Image of the container for the decode Deployment/LWS | string | N/A |
| decode.containers[*].args | List of arguments for the decode container. | List[string] | [] |
| decode.containers[*].modelCommand | Nature of the command. One of vllmServe, imageDefault, or custom | string | imageDefault |
| decode.containers[*].command | List of commands for the decode container. | List[string] | [] |
| decode.containers[*].ports | List of ports for the decode container. | List[Port] | [] |
| decode.parallelism.data | Amount of data parallelism | int | 1 |
| decode.parallelism.tensor | Amount of tensor parallelism | int | 1 |
| decode.acceleratorTypes.labelKey | Key of the node label that identifies the hosted GPU type | string | N/A |
| decode.acceleratorTypes.labelValue | Value of the node label that identifies the hosted GPU type | string | N/A |
| prefill | Same fields as supported in decode | See above | See above |
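
To tie these keys together, here is a minimal, hypothetical values file for a single-node hf:// deployment. Every concrete value below (model ID, image, secret name, ports, sizes) is an illustrative placeholder, not a default shipped with the chart.

# Hypothetical values file -- replace the placeholders before use
modelArtifacts:
  name: <namespace>/<modelId>         # required, in namespace/modelId form
  uri: hf://<org>/<model>             # hf://, pvc://, or oci://
  size: 20Gi                          # emptyDir size for the downloaded model
  authSecretName: <hf-token-secret>   # Secret holding HF_TOKEN, if the model is gated
routing:
  servicePort: 8000                   # port the sidecar (or decode pod) serves on
  proxy:
    targetPort: 8200                  # port the vLLM decode container listens on
decode:
  replicas: 1
  containers:
    - name: vllm                      # placeholder container name
      image: <vllm-image>             # placeholder image reference
      modelCommand: vllmServe         # one of vllmServe, imageDefault, custom
prefill:
  replicas: 1

Such a file could then be installed with helm install <release> llm-d-modelservice -f <your-values>.yaml, where the release and file names are placeholders.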

Contribute

We welcome contributions in the form of a GitHub issue or pull request. Please open an issue if you see a gap for your use case as we continue to evolve this project.

Contact

Get involved or ask questions in the #sig-model-service channel in the llm-d Slack workspace! Details on how to join the workspace can be found here.