Here is a fact that should change how you price a product: a LoRA fine-tune is not a model. It's a low-rank delta, usually a few megabytes, bolted onto frozen, multi-gigabyte base weights. That size asymmetry, megabytes of adapter against gigabytes of shared base, is the entire reason you can serve hundreds of distinct fine-tuned "models" from a single GPU at the same time.
Multi-LoRA serving is the infrastructure that exploits this. Load the base model once; keep a pool of adapters in memory; swap the right one in per request; batch many different adapters together. The marginal cost of an additional fine-tune collapses from a reserved GPU to near-zero storage plus a cheap on-demand load. That's what turns per-customer fine-tuning from a premium feature into something you can offer at commodity scale — the same shift that LoRA itself made for training, now extended to the serving side.
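The size asymmetry is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a 7B-class decoder (hidden size 4096, 32 layers, 16-bit weights) and a rank-8 adapter on two projections per layer; those shapes are illustrative assumptions, not figures from any of the projects discussed here.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_bytes(hidden: int, layers: int, rank: int,
                  targets_per_layer: int = 2, bytes_per_param: int = 2) -> int:
    """Adapter size when `targets_per_layer` square (hidden x hidden)
    projections get a LoRA pair in every layer, stored in 16-bit precision."""
    per_layer = targets_per_layer * lora_param_count(hidden, hidden, rank)
    return layers * per_layer * bytes_per_param


# Assumed shapes: roughly a 7B-class decoder with a rank-8 adapter.
BASE_BYTES = 7_000_000_000 * 2          # ~7B parameters at 2 bytes each
ADAPTER_BYTES = adapter_bytes(hidden=4096, layers=32, rank=8)

print(f"adapter: {ADAPTER_BYTES / 2**20:.0f} MiB")
print(f"base:    {BASE_BYTES / 2**30:.1f} GiB")
print(f"ratio:   {BASE_BYTES / ADAPTER_BYTES:.0f}x")
```

Under these assumptions the adapter comes to 8 MiB against roughly 13 GiB of base weights, a gap of more than three orders of magnitude. Raising the rank or targeting more projections scales the adapter linearly, so the gap stays wide.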
The kernel trick that makes it real
The naive objection is obvious: if every request in a batch wants a different adapter, you can't share the matmul, so you've just rebuilt one-model-per-request with extra steps.
The research answer is the interesting part.
Punica introduced the Segmented Gather Matrix-Vector multiplication (SGMV) kernel: it groups requests by adapter, fuses their heterogeneous low-rank deltas into one batched operation, and raises arithmetic intensity enough to keep the Tensor Cores fed. The paper reports a negligible performance difference between batching identical adapters and batching different ones. That single result is the whole ballgame.
Punica also reports roughly 12x higher throughput than prior multi-tenant systems while adding only about 2 ms of latency per token.
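To make the grouping step concrete, here is a pure-Python reference for what a segmented gather computes over a mixed-adapter batch. It is a sketch of the semantics only, not Punica's CUDA kernel: the names `segmented_lora`, `lora_delta`, and `matvec` are made up for illustration, and the Python loop stands in for work a real kernel does in a single launch.

```python
from collections import defaultdict


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_delta(x, A, B):
    """Low-rank update B @ (A @ x): shrink to rank r, then expand."""
    return matvec(B, matvec(A, x))


def segmented_lora(xs, adapter_ids, adapters):
    """Reference semantics of a segmented gather over a mixed-adapter batch.

    xs          : one input vector per request
    adapter_ids : adapter name for each request
    adapters    : name -> (A, B) low-rank pair
    Returns the per-request deltas in the original request order.
    """
    # Gather: collect request indices into one segment per adapter.
    segments = defaultdict(list)
    for i, name in enumerate(adapter_ids):
        segments[name].append(i)

    out = [None] * len(xs)
    # Each segment shares one (A, B) pair. A real kernel handles all
    # segments in one launch; this loop only shows the data flow.
    for name, idxs in segments.items():
        A, B = adapters[name]
        for i in idxs:
            out[i] = lora_delta(xs[i], A, B)
    return out


# Two toy rank-1 adapters on a 2-dimensional hidden state.
adapters = {
    "tenant_a": ([[1.0, 0.0]], [[2.0], [0.0]]),   # A is 1x2, B is 2x1
    "tenant_b": ([[0.0, 1.0]], [[0.0], [3.0]]),
}
xs = [[1.0, 5.0], [2.0, 7.0], [3.0, 9.0]]
ids = ["tenant_a", "tenant_b", "tenant_a"]
deltas = segmented_lora(xs, ids, adapters)
print(deltas)  # [[2.0, 0.0], [0.0, 21.0], [6.0, 0.0]]
```

The result matches what you would get by running each request through its own adapter one at a time. What the kernel changes is how the work is scheduled, not what it computes.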
S-LoRA then solved the memory side. Its Unified Paging manages variable-rank adapter weights and variable-length KV cache in one pooled allocator to fight fragmentation, while custom kernels handle the mixed-adapter batch. S-LoRA reports up to 4x higher throughput than either HuggingFace PEFT or vLLM's naive LoRA path, while scaling the number of served adapters by orders of magnitude, to thousands on a single GPU. (Both papers are MLSys 2024; the headline numbers come from the abstracts and project READMEs, evaluated on Llama-family models at adapter ranks 8–64.)
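The pooled-allocator idea fits in a few lines. This toy version hands out fixed-size pages from one free list to both adapter weights and KV cache. It assumes a rank-R adapter takes R pages and a sequence of S tokens takes S pages. The class and helper names are hypothetical, and none of the real system's GPU memory handling is modeled.

```python
class UnifiedPagePool:
    """Toy pool: fixed-size pages shared by adapter weights and KV cache."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))   # indices of unused pages
        self.owned = {}                      # owner key -> list of page indices

    def alloc(self, owner: str, pages_needed: int) -> list:
        """Give `owner` any free pages; they need not be contiguous."""
        if pages_needed > len(self.free):
            raise MemoryError(f"need {pages_needed}, have {len(self.free)}")
        pages = [self.free.pop() for _ in range(pages_needed)]
        self.owned.setdefault(owner, []).extend(pages)
        return pages

    def release(self, owner: str) -> None:
        """Return all of an owner's pages to the shared free list."""
        self.free.extend(self.owned.pop(owner, []))


def pages_for_adapter(rank: int) -> int:
    # Assumption: one page holds one rank's worth of adapter weights.
    return rank


def pages_for_kv(seq_len: int) -> int:
    # Assumption: one page holds one token's keys and values.
    return seq_len


pool = UnifiedPagePool(num_pages=100)
pool.alloc("adapter:tenant_a", pages_for_adapter(rank=8))
pool.alloc("adapter:tenant_b", pages_for_adapter(rank=64))
pool.alloc("kv:request_1", pages_for_kv(seq_len=20))
pool.release("adapter:tenant_b")                       # evict the large adapter
pool.alloc("kv:request_2", pages_for_kv(seq_len=70))   # reuses the freed pages
print(len(pool.free))  # 2
```

Because pages are interchangeable, memory freed by an evicted rank-64 adapter can immediately back a long sequence's KV cache. With separate, contiguous regions for weights and cache, that reuse would not be possible.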
Every production tool below stands on these two ideas.
The tools
LoRAX is the one that treats multi-tenancy as the product, not a feature. Its Dynamic Adapter Loading fetches each adapter from the HF Hub, S3, local disk, or Predibase just-in-time per request, without blocking concurrent requests, then batches the heterogeneous set together. If you're building a platform where every customer gets their own fine-tune, start here.
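The load-on-first-use pattern can be sketched as a small LRU pool. This is not LoRAX's code: `AdapterPool` and `fake_fetch` are hypothetical names, and the sketch is synchronous, whereas LoRAX loads adapters without blocking other requests.

```python
from collections import OrderedDict


class AdapterPool:
    """Toy LRU pool: load an adapter on first use, evict the stalest one."""

    def __init__(self, capacity: int, fetch):
        self.capacity = capacity
        self.fetch = fetch               # hypothetical loader: adapter_id -> weights
        self.resident = OrderedDict()    # adapter_id -> weights, oldest first
        self.loads = 0                   # count of cold loads, for inspection

    def get(self, adapter_id: str):
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)    # mark as recently used
            return self.resident[adapter_id]
        weights = self.fetch(adapter_id)             # cold path: just-in-time load
        self.loads += 1
        self.resident[adapter_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)        # evict least recently used
        return weights


def fake_fetch(adapter_id: str) -> str:
    # Stand-in for downloading from a hub, object store, or local disk.
    return f"weights-for-{adapter_id}"


pool = AdapterPool(capacity=2, fetch=fake_fetch)
for request_adapter in ["a", "b", "a", "c", "b"]:
    pool.get(request_adapter)

print(pool.loads, list(pool.resident))  # 4 ['c', 'b']
```

With room for two adapters and three tenants in rotation, four of the five requests take the cold path. On a real platform, the gap between pool capacity and the active tenant set is what drives cold-start latency.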
If you already run vLLM, multi-LoRA is a set of flags: --enable-lora, --max-loras (how many adapters can be co-resident in one batch), --max-lora-rank (set it to the highest rank you actually use, not arbitrarily high, because you pay memory for the ceiling), and --max-cpu-loras for the host-side pool. Runtime hot-swapping via POST /v1/load_lora_adapter requires the environment variable VLLM_ALLOW_RUNTIME_LORA_UPDATING=True. It's the path of least resistance and the biggest ecosystem.
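A client-side sketch of the hot-swap flow looks roughly like this, using only the standard library. It builds the HTTP requests without sending them. The JSON field names (`lora_name`, `lora_path`) and the use of the `model` field to select an adapter are my reading of vLLM's documentation, so check them against your version. The server address, adapter name, and path are placeholders.

```python
import json
import urllib.request


def load_adapter_request(base_url: str, name: str, path: str) -> urllib.request.Request:
    """Build the POST that registers an adapter with a running server."""
    body = json.dumps({"lora_name": name, "lora_path": path}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/load_lora_adapter",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def completion_request(base_url: str, adapter_name: str, prompt: str) -> urllib.request.Request:
    """Build a completion call that selects the adapter through `model`."""
    body = json.dumps(
        {"model": adapter_name, "prompt": prompt, "max_tokens": 32}
    ).encode()
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder server address, adapter name, and adapter path.
BASE_URL = "http://localhost:8000"
load_req = load_adapter_request(BASE_URL, "tenant_a", "/adapters/tenant_a")
gen_req = completion_request(BASE_URL, "tenant_a", "Hello")

# Sending needs a live server started with --enable-lora and
# VLLM_ALLOW_RUNTIME_LORA_UPDATING=True in its environment:
# urllib.request.urlopen(load_req)
# print(urllib.request.urlopen(gen_req).read().decode())
```

From the caller's side each adapter looks like a separate model name, which is what lets per-customer fine-tunes sit behind an ordinary OpenAI-style API.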
SGLang's distinguishing move is attacking cold-start directly. --enable-lora-overlap-loading transfers an adapter to the GPU while compute proceeds, which the docs claim cuts median time-to-first-token by roughly 35% under adapter-loading-bound conditions. It also exposes dynamic /load_lora_adapter and /unload_lora_adapter.
TGI lists adapters at startup with LORA_ADAPTERS and is explicitly built on the punica/lorax kernels. Hugging Face's own framing — "Deploy Once, Serve 30 Models" — is the clearest one-line statement of the economics.
The two research repos, Punica and S-LoRA, are where the ideas live in their purest form. You probably won't deploy them, but reading the SGMV and Unified Paging code is the fastest way to understand what your production server is doing under the hood. On the closed side, NVIDIA NIM ships an adapter store with per-request model selection, and Friendli's container takes --adapter-model over a single base copy: same pattern, vendor-managed.
What actually constrains you
The economics are seductive, so be honest about the walls:
- One base model per deployment. Every adapter must target the same frozen base. NIM enforces one foundation model per microservice; the constraint is universal. A library of fine-tunes on three different base models is three deployments.
- Rank is a budgeted ceiling. You declare a max adapter rank and pay GPU memory for it whether or not every adapter uses it.
- The pool isn't free. Host and GPU memory for resident adapters is real; S-LoRA's entire design exists to manage that fragmentation.
- Diversity still has a cost. Throughput degrades as the number of distinct active adapters in a batch climbs. SGMV mitigates it; it doesn't repeal it.
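The rank-ceiling bullet can be put in numbers. The sketch below uses a deliberately simplified model in which the server reserves one full-size slot per co-resident adapter, sized for the declared maximum rank. The deployment shape (8 slots, hidden size 4096, 32 layers, four targeted projections) is an assumption, and real engines may lay memory out differently.

```python
def reserved_bytes(slots: int, max_rank: int, hidden: int, layers: int,
                   targets_per_layer: int = 4, bytes_per_param: int = 2) -> int:
    """GPU bytes reserved if every slot is sized for `max_rank`.

    Simplified model: each targeted projection is hidden x hidden, so one
    LoRA pair costs max_rank * 2 * hidden parameters.
    """
    per_slot = layers * targets_per_layer * max_rank * 2 * hidden * bytes_per_param
    return slots * per_slot


# Assumed deployment: 8 co-resident slots on a hidden=4096, 32-layer base.
tight = reserved_bytes(slots=8, max_rank=16, hidden=4096, layers=32)
loose = reserved_bytes(slots=8, max_rank=256, hidden=4096, layers=32)

print(f"ceiling 16:  {tight / 2**20:.0f} MiB")   # 256 MiB
print(f"ceiling 256: {loose / 2**20:.0f} MiB")   # 4096 MiB
```

In this model, declaring a ceiling of 256 when your adapters are rank 16 reserves sixteen times the memory for no benefit. That memory would otherwise hold KV cache.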
One more caveat worth stating plainly: nearly every benchmark above is self-reported by the project that ships it. I found no neutral, identical-hardware head-to-head of LoRAX vs vLLM vs SGLang vs TGI. The mechanism is sound and the wins are directionally real, but if you're choosing between them, benchmark on your adapters and your traffic before you believe anyone's multiplier.
The decision tree is short. Already on vLLM and want the feature? Flip the flag. Cold-start is your bottleneck? Look hard at SGLang's overlap-loading. Building a multi-tenant fine-tuning platform from scratch? LoRAX was designed for exactly your problem. And if you're serving on dedicated GPUs you've already paid for, the inference engine you picked probably already supports this. If you aren't using it, you're leaving the cheapest model-proliferation strategy in the box.



