Here is a fact that should change how you price a product: a LoRA fine-tune is not a model. It's a low-rank delta of a few megabytes bolted onto frozen, multi-gigabyte base weights. And that size asymmetry, megabytes of adapter against gigabytes of shared base, is the entire reason you can serve hundreds of distinct fine-tuned "models" from a single GPU at the same time.
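
To put rough numbers on that asymmetry, here is a back-of-envelope sketch. The shape is an assumption for illustration (a 7B-class decoder with hidden size 4096 and 32 layers, LoRA applied to two square projections per layer, fp16 weights); none of these figures come from the papers discussed below.

```python
def lora_adapter_bytes(hidden_size, num_layers, rank,
                       targets_per_layer=2, bytes_per_param=2):
    """Approximate size of a LoRA adapter on square (hidden x hidden) projections.

    Each adapted matrix gets two low-rank factors, A (hidden x rank) and
    B (rank x hidden), so it adds 2 * rank * hidden parameters.
    """
    params_per_matrix = 2 * rank * hidden_size
    return params_per_matrix * targets_per_layer * num_layers * bytes_per_param


# Assumed base model: roughly 7 billion parameters stored in fp16.
base_bytes = 7_000_000_000 * 2

for rank in (8, 16, 64):
    size = lora_adapter_bytes(hidden_size=4096, num_layers=32, rank=rank)
    print(f"rank {rank:>2}: {size / 2**20:.0f} MiB adapter, "
          f"base is about {base_bytes / size:,.0f}x larger")
```

Under these assumptions a rank-8 adapter comes to 8 MiB against roughly 14 GB of base weights. Adapting more projection matrices or raising the rank scales the adapter linearly, but it stays orders of magnitude below the base.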

Multi-LoRA serving is the infrastructure that exploits this. Load the base model once; keep a pool of adapters in memory; swap the right one in per request; batch many different adapters together. The marginal cost of an additional fine-tune collapses from a reserved GPU to near-zero storage plus a cheap on-demand load. That's what turns per-customer fine-tuning from a premium feature into something you can offer at commodity scale — the same shift that LoRA itself made for training, now extended to the serving side.

The kernel trick that makes it real

The naive objection is obvious: if every request in a batch wants a different adapter, you can't share the matmul, so you've just rebuilt one-model-per-request with extra steps.

The research answer is the interesting part.

Punica reports a negligible performance difference between batching identical adapters and batching different ones. That single result is the whole ballgame.

Punica introduced the Segmented Gather Matrix-Vector multiplication (SGMV) kernel: it groups requests by adapter, fuses their heterogeneous low-rank deltas into one batched operation, and raises arithmetic intensity enough to keep the Tensor Cores fed. Punica reports roughly 12x higher throughput than prior multi-tenant systems while adding only about 2ms of latency per token.
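
The math the kernel has to produce is easy to state, even if the fused CUDA version is not. The sketch below is a plain-Python reference for the semantics only: sort requests so those sharing an adapter form contiguous segments, fetch each adapter's factors once per segment, and compute the low-rank delta x·A·B per request. It is not Punica's implementation, and the function names and toy matrices are made up for illustration.

```python
from itertools import groupby


def vec_mat(x, M):
    """Row vector x (length n) times matrix M (n rows, m columns)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def sgmv_reference(xs, adapter_ids, adapters):
    """Per-request LoRA delta x @ A @ B, computed segment by segment.

    xs:          one input vector per request
    adapter_ids: adapter_ids[i] names the adapter that request i wants
    adapters:    dict mapping adapter id -> (A, B), A is d x r, B is r x d_out
    """
    # Gather: order request indices so each adapter's requests are contiguous.
    order = sorted(range(len(xs)), key=lambda i: adapter_ids[i])
    out = [None] * len(xs)
    for aid, segment in groupby(order, key=lambda i: adapter_ids[i]):
        A, B = adapters[aid]  # factors are looked up once per segment
        for i in segment:
            out[i] = vec_mat(vec_mat(xs[i], A), B)
    return out


# Toy rank-1 adapters on 2-dimensional activations (hypothetical values).
adapters = {
    "a": ([[1], [2]], [[1, 0]]),
    "b": ([[0], [1]], [[3, 3]]),
}
deltas = sgmv_reference([[1, 0], [0, 1], [1, 1]], ["a", "b", "a"], adapters)
print(deltas)  # [[1, 0], [3, 3], [3, 0]]
```

The result is identical to running each request alone against its own adapter. What the real kernel adds is doing all segments in a single launch, so the GPU is not left idle between many tiny matmuls.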

S-LoRA then solved the memory side. Its Unified Paging manages variable-rank adapter weights and variable-length KV-cache in one pooled allocator to fight fragmentation, while custom kernels handle the mixed-adapter batch. S-LoRA reports up to 4x higher throughput than HuggingFace PEFT and vLLM's naive LoRA support, while scaling the number of served adapters by orders of magnitude, to thousands on a single GPU. (Both papers are MLSys 2024; the headline numbers come from the abstracts and project READMEs, evaluated on Llama-family models at adapter ranks 8–64.)
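
The pooled-allocator idea can be shown with a toy. The class below is a minimal sketch, not S-LoRA's code: adapter weights and KV-cache both draw fixed-size pages from one free list, so memory released by a finished request can immediately hold a newly loaded adapter. The class name, page counts, and keys are hypothetical.

```python
class UnifiedPagePool:
    """Toy pool: adapters and KV-cache share one free list of fixed-size pages."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))
        self.owner = {}  # page index -> (kind, key), kept for inspection

    def alloc(self, kind, key, num_pages):
        """Take num_pages pages for an adapter or a request's KV-cache."""
        if num_pages > len(self.free):
            raise MemoryError(f"need {num_pages} pages, only {len(self.free)} free")
        pages = [self.free.pop() for _ in range(num_pages)]
        for page in pages:
            self.owner[page] = (kind, key)
        return pages

    def release(self, pages):
        """Return pages to the shared free list."""
        for page in pages:
            del self.owner[page]
            self.free.append(page)


pool = UnifiedPagePool(num_pages=8)
adapter_a = pool.alloc("adapter", "customer-a", 4)  # page count would grow with rank
kv_1 = pool.alloc("kv", "request-1", 3)             # page count would grow with sequence length
pool.release(kv_1)                                  # request-1 finished
adapter_b = pool.alloc("adapter", "customer-b", 4)  # fits only because KV pages were returned
```

With two separate fixed-size regions, the second adapter would have been rejected even though KV memory sat idle. Sharing one pool is what lets the adapter count and the batch size trade off against each other at runtime.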

Every production tool below stands on these two ideas.

The tools

predibase/lorax (Python, ★ 3.8k): Multi-LoRA inference server that scales to thousands of fine-tuned LLMs via just-in-time adapter loading

LoRAX is the one that treats multi-tenancy as the product, not a feature. Its Dynamic Adapter Loading fetches each adapter from the HF Hub, S3, local disk, or Predibase just-in-time per request, without blocking concurrent requests, then batches the heterogeneous set together. If you're building a platform where every customer gets their own fine-tune, start here.
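
The "without blocking concurrent requests" part is the piece worth picturing. Here is a minimal asyncio sketch of the pattern, not LoRAX's actual code: the first request for an adapter starts the fetch, later requests for the same adapter await that same fetch, and requests for other adapters proceed independently. `AdapterCache` and `fake_fetch` are hypothetical names, and the sketch leaves out eviction and error handling.

```python
import asyncio


class AdapterCache:
    """Fetch each adapter at most once; concurrent requests share the fetch."""

    def __init__(self, fetch):
        self._fetch = fetch  # async callable: adapter_id -> weights
        self._tasks = {}     # adapter_id -> asyncio.Task for the in-flight or finished fetch

    async def get(self, adapter_id):
        task = self._tasks.get(adapter_id)
        if task is None:
            # First request for this adapter: start the fetch in the background.
            task = asyncio.create_task(self._fetch(adapter_id))
            self._tasks[adapter_id] = task
        return await task


fetch_log = []


async def fake_fetch(adapter_id):
    # Stand-in for a download from a hub, object store, or local disk.
    fetch_log.append(adapter_id)
    await asyncio.sleep(0.01)
    return f"weights:{adapter_id}"


async def demo():
    cache = AdapterCache(fake_fetch)
    # Three requests arrive together; two of them want the same adapter.
    return await asyncio.gather(cache.get("a"), cache.get("b"), cache.get("a"))


results = asyncio.run(demo())
print(results, fetch_log)
```

Three requests produce only two fetches, and the request for adapter "b" never waits on adapter "a". A production server would add an eviction policy and retry failed loads instead of caching them.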

vllm-project/vllm (Python, ★ 83k): High-throughput LLM serving engine; multi-LoRA via --enable-lora

If you already run vLLM, multi-LoRA is a set of flags: --enable-lora, --max-loras (how many adapters can be co-resident in one batch), --max-lora-rank (set it to the highest rank you actually use, not arbitrarily high, because you pay memory for the ceiling), and --max-cpu-loras for the host-side pool. Runtime hot-swapping via POST /v1/load_lora_adapter requires the server to be started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True. It's the path of least resistance and the biggest ecosystem.
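
As a sketch of how those pieces fit together, the snippet below assembles a launch command from the flags above and builds (without sending) a runtime load request. The base model, adapter name, path, port, and limits are placeholders, and the `lora_name` / `lora_path` body fields reflect my reading of vLLM's documentation, so check them against the version you run.

```python
import json
import urllib.request

# Launch command using the flags named above; model and limits are placeholders.
# The server also needs VLLM_ALLOW_RUNTIME_LORA_UPDATING=True in its environment
# for the runtime load endpoint to be enabled.
serve_cmd = [
    "vllm", "serve", "meta-llama/Llama-2-7b-hf",
    "--enable-lora",
    "--max-loras", "8",        # adapters co-resident in one batch
    "--max-lora-rank", "64",   # highest rank actually in use
    "--max-cpu-loras", "64",   # host-side adapter pool
]


def load_adapter_request(base_url, name, path):
    """Build the POST request that asks a running server to load one adapter."""
    body = json.dumps({"lora_name": name, "lora_path": path}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/load_lora_adapter",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = load_adapter_request("http://localhost:8000", "customer-a", "/adapters/customer-a")
# urllib.request.urlopen(req) would send it; omitted here since no server is running.
print(" ".join(serve_cmd))
print(req.get_method(), req.full_url, req.data.decode())
```

Once an adapter is loaded, requests select it by passing its name as the model in the usual OpenAI-compatible endpoints, so routing per customer becomes a string lookup on your side.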

sgl-project/sglang (Python, ★ 29.6k): High-performance LLM/multimodal serving framework with S-LoRA/Punica-style LoRA support

SGLang's distinguishing move is attacking cold-start directly. --enable-lora-overlap-loading transfers an adapter to the GPU while compute proceeds, which the docs claim cuts median time-to-first-token by roughly 35% under adapter-loading-bound conditions. It also exposes dynamic /load_lora_adapter and /unload_lora_adapter.

Rust/Python text-generation server; multi-LoRA via LORA_ADAPTERS

TGI lists adapters at startup with LORA_ADAPTERS and is explicitly built on the punica/lorax kernels. Hugging Face's own framing — "Deploy Once, Serve 30 Models" — is the clearest one-line statement of the economics.

punica-ai/punica (Python, ★ 1.2k): Research system: "serving multiple LoRA finetuned LLM as one" (origin of the SGMV kernel)

S-LoRA/S-LoRA (Python, ★ 1.9k): Research system for serving thousands of concurrent LoRA adapters

The two research repos are where the ideas live in their purest form. You probably won't deploy them, but reading the SGMV and Unified Paging code is the fastest way to understand what your production server is doing under the hood. On the closed side, NVIDIA NIM ships an adapter store with per-request model selection, and Friendli's container takes --adapter-model over a single base copy — same pattern, vendor-managed.

What actually constrains you

The economics are seductive, so be honest about the walls:

One more caveat worth stating plainly: nearly every benchmark above is self-reported by the project that ships it. I found no neutral, identical-hardware head-to-head of LoRAX vs vLLM vs SGLang vs TGI. The mechanism is sound and the wins are directionally real, but if you're choosing between them, benchmark on your adapters and your traffic before you believe anyone's multiplier.

The decision tree is short. Already on vLLM and want the feature? Flip the flag. Cold-start is your bottleneck? Look hard at SGLang's overlap-loading. Building a multi-tenant fine-tuning platform from scratch? LoRAX was designed for exactly your problem. And if you're serving on dedicated GPUs you've already paid for, the inference engine you picked probably already supports this — you're leaving the cheapest model-proliferation strategy in the box.