Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock
TL;DR
In this post, we explain how we implemented multi-LoRA inference for Mixture of Experts (MoE) models in vLLM, describe the kernel-level optimizations we performed, and show you how you can benefit from this work. We use GPT-OSS 20B as our primary example throughout this post.
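As context for what follows, vLLM exposes multi-LoRA serving as launch-time options on its OpenAI-compatible server. A minimal sketch for GPT-OSS 20B follows; the adapter names and filesystem paths are illustrative placeholders, not artifacts from this post:

```shell
# Serve the base GPT-OSS 20B model with LoRA support enabled.
# --max-loras caps how many adapters can be active in a batch;
# the adapter names and paths below are placeholders.
vllm serve openai/gpt-oss-20b \
  --enable-lora \
  --max-loras 4 \
  --lora-modules customer-a=/adapters/customer-a customer-b=/adapters/customer-b
```

Clients can then select an adapter per request by passing its registered name (e.g. `customer-a`) in the `model` field of the OpenAI-compatible API, while requests naming the base model bypass the adapters.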