
Introducing Disaggregated Inference on AWS powered by llm-d

TL;DR

AWS introduces disaggregated inference on Amazon SageMaker HyperPod EKS, powered by the open-source llm-d project.

Key Points

  • Prefill and decode phases are split across separate compute resources, improving GPU utilization and throughput.
  • Intelligent request scheduling routes traffic dynamically based on the current load of the prefill and decode workers.
  • Expert Parallelism enables more efficient use of Mixture-of-Experts models across multiple nodes.
  • The setup runs on Kubernetes and integrates into existing SageMaker workflows.
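The core idea behind the first two points can be sketched in a few lines: prefill work is routed to one worker pool, decode work to another, each using least-loaded selection. This is a minimal illustration only; the class and worker names are hypothetical and do not reflect the actual llm-d scheduler API.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Worker:
    load: int                          # outstanding token-work on this worker
    name: str = field(compare=False)   # name is not part of heap ordering

class DisaggregatedScheduler:
    """Toy least-loaded router over separate prefill and decode pools.

    Hypothetical sketch: illustrates the split, not the llm-d implementation.
    """

    def __init__(self, prefill_names, decode_names):
        self.prefill = [Worker(0, n) for n in prefill_names]
        self.decode = [Worker(0, n) for n in decode_names]
        heapq.heapify(self.prefill)
        heapq.heapify(self.decode)

    def _assign(self, pool, cost):
        # Pop the least-loaded worker, charge it, and push it back.
        w = heapq.heappop(pool)
        w.load += cost
        heapq.heappush(pool, w)
        return w.name

    def route(self, prompt_tokens, max_new_tokens):
        # Prefill cost scales with prompt length, decode cost with output
        # length -- which is why splitting the phases pays off.
        p = self._assign(self.prefill, prompt_tokens)
        d = self._assign(self.decode, max_new_tokens)
        return p, d

sched = DisaggregatedScheduler(["prefill-0", "prefill-1"], ["decode-0"])
print(sched.route(prompt_tokens=512, max_new_tokens=64))
```

In a real deployment the handoff between the two pools also involves transferring the KV cache produced by prefill to the decode worker, which is where most of the engineering complexity lives.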

Nauti's Take

Disaggregated inference is not a buzzword – it is a real architectural shift that researchers have discussed for some time, and AWS is now packaging it into a managed product. On the plus side: the approach is technically sound, llm-d is open source, and the Kubernetes integration makes this more portable than a pure AWS lock-in.

The catch is that SageMaker HyperPod EKS is not cheap – anyone actually deploying this is already running inference at enterprise scale. For smaller teams it remains mostly theoretical for now, but the concepts will inevitably trickle down to more accessible setups.

Sources