3 posts tagged with "Inference"

LLM inference serving and optimization

Serving Hybrid Models at Scale in llm-d

· 14 min read
Kfir Toledo
Research Staff Member, IBM
Or Ozeri
Research Staff Member, IBM
Danny Harnik
Senior Technical Staff Member, IBM
Itay Etelis
Research Staff Member, IBM
Rachel Brill
Senior Technical Staff Member, IBM
Maroon Ayoub
Senior Principal Machine Learning Engineer, Red Hat

For most of the transformer era, the KV cache rested on a quiet assumption: one model, one uniform cache. Every layer attended the same way, every block was the same size, and everything built on top of the cache (allocators, offload connectors, schedulers) could treat it as a single pool.

Hybrid models broke this assumption. Many recent frontier and open-weight models mix attention types within a single model (full attention next to sliding-window, linear, or Mamba layers), making the cache heterogeneous: different layers now hold different amounts of state, in different shapes, with different reuse rules. A cache block that used to be allocated as one uniform unit now consists of several distinct parts.

To serve a hybrid model efficiently, an AI inference platform has to handle that heterogeneity in at least three aspects of the stack:

  • GPU Memory Allocation: How the cache is laid out and allocated on the GPU. vLLM solved this with its Hybrid Memory Allocator (HMA), rebuilt around a unified allocator (see Hybrid Models as First-Class Citizens in vLLM).
  • KV Offloading: Extending the KV cache to CPU and storage. An offloading connector that is not HMA-aware turns the HMA off, forfeiting its GPU memory improvements and any potential data-movement savings.
  • KV-Aware Routing: Sending each request to the right model-server replica. A router that ignores the hybrid memory structure may wrongly conclude that a node has, or lacks, the required KV data, because it sees cache information from only some of the layers.

vLLM's HMA solved hybrid GPU memory allocation when handling a single vLLM instance. This post shows how llm-d extends that to tiered KV cache management - including KV offloading to CPU and storage, and KV-aware request routing - significantly improving throughput and latency at scale for hybrid models.
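The routing pitfall described in the list above can be illustrated with a minimal sketch. It is not llm-d's implementation: the `ReplicaCache` shape, the layer-group names, and both function names are hypothetical, and it simplifies reuse to "a block counts only if every layer group holds it". Real reuse rules differ per attention type, which this sketch ignores.

```python
from typing import Dict, List, Set

# Hypothetical view of one replica: layer-group name -> block hashes it holds.
ReplicaCache = Dict[str, Set[str]]


def usable_prefix_blocks(prompt_blocks: List[str], cache: ReplicaCache) -> int:
    """Count leading prompt blocks that every layer group can serve from cache."""
    if not cache:
        return 0  # no cache information at all means no reusable prefix
    usable = 0
    for block in prompt_blocks:
        # Simplifying assumption: a block is reusable only if no group lacks it.
        if not all(block in held for held in cache.values()):
            break
        usable += 1
    return usable


def pick_replica(prompt_blocks: List[str], replicas: Dict[str, ReplicaCache]) -> str:
    """Return the name of the replica with the longest fully cached prefix."""
    return max(
        replicas,
        key=lambda name: usable_prefix_blocks(prompt_blocks, replicas[name]),
    )
```

A router that looked only at the full-attention group would prefer a replica holding three blocks there, even if its sliding-window group held just one; scoring across all groups avoids that false hit.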

Heterogeneous inference serving across three GPU vendors with llm-d

· 10 min read
Pravein Govindan Kannan
Staff Research Scientist, IBM
Praveen Jayachandran
Senior Technical Staff Member, IBM
Jaikrishnan Hari
Research Partnerships & BD Executive, IBM
Varun Raste
Solution Architect, IBM
Prasad Mukhedkar
Associate Principal AI Architect, Red Hat
Vinod Pathangay
Chief Architect, Field CTO Organization, Red Hat
Jayanth Babu Reddy
Principal Architect, NxtGen Cloud Technologies
Abhisyant Anasapurapu
VP, NxtGen Cloud Technologies

Most production inference clusters today are single-vendor because that is often the simplest way to configure and operate a cluster.

That is starting to change. Procurement cycles bring new generations alongside older ones, supply planning spans multiple accelerator options, and cost/performance profiles differ by workload. Real production fleets are accumulating heterogeneity whether or not the architecture planned for it.

This is an opportunity to unlock real value: different accelerator classes can be matched to workload requirements, stranded capacity gets reclaimed, and operators gain more flexibility in capacity planning. The case is stronger still for sovereign and on-premise deployments, where data residency, regulatory alignment, and the long-term economics of high-volume inference make local fleet optimization especially important.

Making that work in practice is a non-trivial systems problem. Each accelerator stack brings its own optimized drivers, firmware, container images, runtime settings, and attention kernels. A coherent serving layer needs to preserve those platform-specific optimizations while still giving operators one control plane for routing, observability, and policy.

Predicted-Latency Based Scheduling for LLMs

· 28 min read
Kaushik Mitra
Software Engineer, Google
Benjamin Braun
Software Engineer, Google
Abdullah Gharaibeh
Senior Staff Software Engineer, Google
Clayton Coleman
Distinguished Engineer, Google

Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.
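As a rough illustration of the idea, the sketch below trains a tiny linear model online and routes each request to the candidate with the lowest predicted latency. Everything in it is an assumption made for illustration: the excerpt does not specify the model, so the `OnlineLatencyModel` class, the `route` function, and the two features (normalized prompt length and queue depth) are hypothetical stand-ins.

```python
from typing import Dict, List


class OnlineLatencyModel:
    """Tiny linear model updated by stochastic gradient descent (illustrative only)."""

    def __init__(self, n_features: int, lr: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.lr = lr

    def predict(self, features: List[float]) -> float:
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features: List[float], observed_latency: float) -> None:
        # One SGD step on the squared error between prediction and observation.
        error = self.predict(features) - observed_latency
        self.bias -= self.lr * error
        self.weights = [
            w - self.lr * error * x for w, x in zip(self.weights, features)
        ]


def route(model: OnlineLatencyModel, candidates: Dict[str, List[float]]) -> str:
    """Pick the server whose feature vector yields the lowest predicted latency."""
    return min(candidates, key=lambda server: model.predict(candidates[server]))
```

In use, the router would call `route` before dispatch and `update` once the request's actual latency is known, so the predictions track live traffic rather than hand-tuned weights.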