Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock
TL;DR
In this post, we explain how we implemented multi-LoRA inference for Mixture of Experts (MoE) models in vLLM, describe the kernel-level optimizations we performed, and show you how you can benefit from this work. We use GPT-OSS 20B as our primary example throughout this post.
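As context for what follows, vLLM exposes multi-LoRA serving as launch-time options on its OpenAI-compatible server. A minimal sketch for GPT-OSS 20B follows; the adapter names and filesystem paths are illustrative placeholders, not artifacts from this post:

```shell
# Serve the base GPT-OSS 20B model with LoRA support enabled.
# --max-loras caps how many adapters can be active in a batch;
# the adapter names and paths below are placeholders.
vllm serve openai/gpt-oss-20b \
  --enable-lora \
  --max-loras 4 \
  --lora-modules customer-a=/adapters/customer-a customer-b=/adapters/customer-b
```

Clients can then select an adapter per request by passing its registered name (e.g. `customer-a`) in the `model` field of the OpenAI-compatible API, while requests naming the base model bypass the adapters.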