The four sizing questions to answer before you ship an agent — capacity, price, speed, and context. Each runs in the browser on editable, sourced defaults; the reasoning behind every formula links out to the articles.
How much GPU memory it takes to serve a model — weights + GQA-aware KV cache + overhead, and how many accelerators you need.
What a feature will cost per month — modelling the two levers that move the invoice: prompt caching and the input/output price split.
How fast an agent will feel — time-to-first-token vs throughput across the sequential turns that dominate multi-step agents.
How much context your agent actually gets — after system prompt, tool schemas, memory, and output reserve — and how many turns before it must compact.