A managed inference API hands you someone else's model behind an endpoint. The moment you fine-tune your own — or want to serve a base model nobody hosts for you — that stops being enough. You need a GPU you can put your own weights on, that wakes up when a request arrives and goes back to sleep when the traffic stops, that you aren't paying for at 3am. That is the serverless-GPU problem, and four platforms own the conversation: Modal, Replicate, RunPod, and Baseten.

They will all run your model on an autoscaling GPU and bill you for roughly the time it's working. Compare them on price and they blur together. The decision that actually follows you for years is quieter: the format you package the model in. Each platform makes you author your deployment in a different abstraction, and that abstraction — not the per-second rate — is the thing wired into your repo, your CI, and your team's muscle memory.

The Python-native one: Modal

Modal's bet is that deploying a model should feel like writing a Python function. You decorate a function with the GPU and dependencies it needs, run modal deploy, and there is no Dockerfile and no separate artifact to maintain — the infrastructure is declared inline in the code that uses it. That makes it the lowest-ceremony path for a team that already lives in Python and wants the GPU to disappear into the language.

The interesting part is what Modal is doing about cold starts. Scale-to-zero's tax is the boot: a 7B-plus model can take tens of seconds to load onto a cold GPU. Modal's answer is memory snapshotting — capturing the initialized process (and, experimentally, GPU memory) so a cold container restores from a snapshot instead of re-loading from scratch. Their published benchmark cut a small model's median cold start from roughly two minutes to about twelve seconds. Whether you hit those numbers depends on your model, but the strategic point stands: Modal is trying to dissolve the cold-start-vs-cost tradeoff rather than make you choose a side of it.
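To make the tradeoff concrete, here is a rough back-of-envelope sketch rather than anything Modal publishes. It assumes requests arrive as a Poisson process, a single container that shuts down after a fixed idle timeout, and negligible request service time; under those assumptions a request is cold when the gap since the previous request exceeds the timeout. The 300-second timeout and the traffic levels are hypothetical, while the 120-second and 12-second cold starts echo the benchmark figures quoted above.

```python
import math


def cold_fraction(requests_per_hour: float, idle_timeout_s: float) -> float:
    """Share of requests that find no warm container.

    With Poisson arrivals, the gap between requests is exponential, so
    P(gap > timeout) = exp(-rate * timeout).
    """
    rate_per_s = requests_per_hour / 3600.0
    return math.exp(-rate_per_s * idle_timeout_s)


def mean_cold_penalty_s(requests_per_hour: float,
                        idle_timeout_s: float,
                        cold_start_s: float) -> float:
    """Average extra latency per request caused by cold starts."""
    return cold_fraction(requests_per_hour, idle_timeout_s) * cold_start_s


IDLE_TIMEOUT_S = 300.0  # hypothetical idle window before scale-to-zero

for label, cold_start_s in (("full reload", 120.0), ("snapshot restore", 12.0)):
    for rph in (2, 12, 60):
        penalty = mean_cold_penalty_s(rph, IDLE_TIMEOUT_S, cold_start_s)
        share = cold_fraction(rph, IDLE_TIMEOUT_S)
        print(f"{label:>16} | {rph:>2} req/h | "
              f"{share:.0%} cold | +{penalty:.1f}s average")
```

The cold fraction depends only on traffic and the idle timeout, so a faster restore shrinks the penalty proportionally without touching the idle bill. That is the sense in which snapshotting attacks the tradeoff itself.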

The format wars: Replicate's Cog vs Baseten's Truss

Replicate and Baseten make the opposite, more explicit bet: your model should be packaged in a real, named format that produces a portable container.

Open-source format that packages an ML model into a production-ready Docker container with an auto-generated HTTP API, handling CUDA/PyTorch/Python versions for you
★ 9.4k · Go/Python · replicate/cog

Cog is the more widely adopted of the two by a wide margin. You write a config, run cog push, and Replicate builds an optimized Docker image, generates an HTTP API server, and deploys it on their GPU fleet — and because the output is a standard container, you can run it on your own infra too. It's the lowest-friction "push and get an API" workflow, backed by Replicate's marketplace of public models. One piece of 2026 context worth knowing: Replicate was acquired by Cloudflare in late 2025, which points its future at Cloudflare's edge network.

Open-source CLI that packages a model as a config.yaml plus an optional Model class (load/predict), targeting production serving on vLLM, SGLang, or TensorRT-LLM
★ 1.2k · Python · basetenlabs/truss

Truss is Baseten's equivalent format, and Baseten aims it upmarket: single-tenant dedicated deployments, compliance certifications, and an optimized inference stack for teams running mission-critical inference. The lower star count reflects a narrower, more production-focused audience rather than a weaker tool. Both Cog and Truss are open source and both emit portable containers, so the lock-in isn't a closed runtime; it's the authoring workflow and the platform features you build around it.
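As an illustration of what "authoring workflow" means in practice, here is a minimal sketch of the load/predict shape that the Truss description above refers to. The stand-in model (a lambda that upper-cases text) and the "prompt" and "output" keys are hypothetical placeholders; a real deployment would load weights in load() and sit next to a config.yaml.

```python
class Model:
    """Load-once, predict-per-request class in the style Truss expects."""

    def __init__(self, **kwargs):
        # The serving runtime may pass configuration as keyword arguments;
        # this sketch ignores them and only reserves a slot for the model.
        self._model = None

    def load(self):
        # Runs once when the container starts. Real code would load model
        # weights here; this stand-in just upper-cases the prompt.
        self._model = lambda prompt: prompt.upper()

    def predict(self, model_input):
        # Runs for every request; model_input is the parsed request body.
        if self._model is None:
            raise RuntimeError("load() must run before predict()")
        return {"output": self._model(model_input["prompt"])}


# Local smoke test of the class, with no serving runtime involved.
model = Model()
model.load()
print(model.predict({"prompt": "hello"}))  # prints {'output': 'HELLO'}
```

The class itself is plain Python with no vendor import, so the commitment lives in the method names, the config file, and the tooling that builds and deploys them.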

The no-format one: RunPod

Python SDK for RunPod serverless; you deploy any custom Docker image as a serverless worker, with no enforced packaging framework

RunPod's answer to "what format?" is "whatever Docker image you already have." Its serverless workers run your container directly, with no Cog, no Truss, no opinion — which makes it the most flexible and the cheapest of the four, and the one with the least lock-in, because a raw Docker image runs anywhere. On the cold-start axis it splits the choice cleanly: Flex workers scale to zero (you pay $0 idle, accept a cold start), while Active workers run 24/7 at a lower per-GPU rate (no cold start, continuous bill). Its FlashBoot feature targets sub-second cold starts on Flex to soften the penalty. RunPod is the platform you choose when you want the control of owning the container and don't want a vendor's abstraction between you and the GPU.
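The Flex-versus-Active choice reduces to a break-even calculation, sketched below. The per-second rates are hypothetical placeholders, not RunPod's published prices, and the sketch assumes Flex bills only for busy seconds while ignoring any billed cold-start time.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, for illustration


def flex_cost(busy_seconds: float, flex_rate_per_s: float) -> float:
    """Scale-to-zero worker: billed only while handling requests."""
    return busy_seconds * flex_rate_per_s


def active_cost(active_rate_per_s: float,
                period_seconds: float = SECONDS_PER_MONTH) -> float:
    """Always-on worker: billed for the whole period, busy or idle."""
    return period_seconds * active_rate_per_s


def break_even_utilization(flex_rate_per_s: float,
                           active_rate_per_s: float) -> float:
    """Busy fraction above which the always-on worker is cheaper."""
    return active_rate_per_s / flex_rate_per_s


# Hypothetical dollars-per-second rates; substitute real quotes before use.
FLEX_RATE = 0.00040
ACTIVE_RATE = 0.00030

threshold = break_even_utilization(FLEX_RATE, ACTIVE_RATE)
print(f"Active wins above {threshold:.0%} utilization")

for utilization in (0.10, 0.50, 0.90):
    busy = utilization * SECONDS_PER_MONTH
    flex = flex_cost(busy, FLEX_RATE)
    active = active_cost(ACTIVE_RATE)
    cheaper = "Flex" if flex < active else "Active"
    print(f"{utilization:.0%} busy: flex ${flex:,.2f} "
          f"vs active ${active:,.2f} -> {cheaper}")
```

The threshold is just the ratio of the two rates, so the smaller the always-on discount, the busier the GPU has to be before an Active worker pays for itself.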

How to actually choose

All four scale to zero, bill near the second (Baseten bills by the minute), and will serve your fine-tune. Decide on two axes, in this order: the packaging format first, then the cold-start behavior you can tolerate.

Pick the price last. The model and the cold-start behavior change with every deploy; the format you committed to is still there in three years. That's the choice to make on purpose. If you haven't fine-tuned anything yet, the toolchain that produces the weights is the decision upstream of this one — and if you're weighing a managed API against hosting your own at all, that comparison comes first.