A 70B-parameter model in FP16 wants about 140 GB just for weights, more than an 80 GB A100 or H100 can hold, and that is before any KV cache. So the model gets cut into pieces and spread across cards. There are two classic ways to make that cut, and developers reach for them as if they were interchangeable knobs labeled "more parallelism." They are not. Tensor parallelism and pipeline parallelism make opposite bets, and the right one is decided almost entirely by a single physical fact: how fast the wires between your GPUs are.
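The 140 GB figure is plain arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming decimal gigabytes and counting weights only (no KV cache, activations, or framework overhead); the function name is illustrative:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9


fp16 = weight_memory_gb(70, 2)  # FP16/BF16 store 2 bytes per parameter
int8 = weight_memory_gb(70, 1)  # 8-bit quantization stores 1 byte per parameter
print(f"FP16: {fp16:.0f} GB, INT8: {int8:.0f} GB")
# prints "FP16: 140 GB, INT8: 70 GB"
```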

Two ways to cut a model

Tensor parallelism (TP) slices across every layer. Each weight matrix is partitioned column- or row-wise, and every GPU holds a slice of every layer, so they all work on the same layer at the same time, then combine their partial results. The technique comes from Megatron-LM, and the combining step is the catch: TP issues two all-reduce operations per transformer layer in the forward pass, one after attention and one after the feed-forward block. For a model with L layers that is 2L synchronization points per forward pass, and during decoding a forward pass runs for every single token.
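To see why the partial results have to be combined, here is a toy version of the feed-forward split, run on plain Python lists rather than GPUs. It is a sketch under stated assumptions, not Megatron-LM's code: the first matrix is split by columns, the second by rows, ReLU stands in for the real activation function, and an elementwise sum stands in for the all-reduce. All names and values are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for the all-reduce: elementwise sum of every shard's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1.0, 2.0]]                                            # one token, hidden size 2
w1 = [[1.0, -1.0, 2.0, 0.5], [0.5, 1.0, -2.0, 1.0]]         # 2 x 4 up-projection
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]      # 4 x 2 down-projection

# Unsharded reference result.
reference = matmul(relu(matmul(x, w1)), w2)

# Two "GPUs": each holds a column slice of w1 and the matching row slice of w2.
tp = 2
partials = [
    matmul(relu(matmul(x, a)), b)
    for a, b in zip(split_columns(w1, tp), split_rows(w2, tp))
]
sharded = all_reduce_sum(partials)

print(reference, sharded)
# prints "[[7.0, -1.5]] [[7.0, -1.5]]"
```

Each shard's output on its own is wrong; only the sum matches the unsharded result. That sum is the communication step, and the attention block needs one of its own.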

Pipeline parallelism (PP) slices the other way, by depth. GPU 0 holds the first chunk of layers, GPU 1 the next, and so on. A request flows through them like an assembly line, and the only cross-GPU traffic is a single activation hand-off at each stage boundary. Dramatically less communication. The cost shows up as the pipeline bubble: while the first request is still in stage 0, stages 1 and 2 sit idle, and at the end the early stages sit idle in turn while the last request drains through the later ones.
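The bubble is easy to count. The sketch below lays out an idealized forward-only schedule, assuming every stage takes exactly one time step per request (or micro-batch) and that work enters the pipeline back to back. Real schedulers are more elaborate; the function names are illustrative.

```python
def pipeline_schedule(stages, microbatches):
    """Return grid[stage][t]: the micro-batch running at step t, or None if idle."""
    steps = stages + microbatches - 1
    grid = [[None] * steps for _ in range(stages)]
    for s in range(stages):
        for mb in range(microbatches):
            grid[s][s + mb] = mb  # micro-batch mb reaches stage s at step s + mb
    return grid


def bubble_fraction(stages, microbatches):
    """Share of all (stage, step) slots in which a stage sits idle."""
    slots = [cell for row in pipeline_schedule(stages, microbatches) for cell in row]
    return slots.count(None) / len(slots)


for row in pipeline_schedule(3, 4):
    print(" ".join("." if cell is None else str(cell) for cell in row))
# prints:
# 0 1 2 3 . .
# . 0 1 2 3 .
# . . 0 1 2 3

print(bubble_fraction(3, 6))
# prints "0.25"
```

With a single request and three stages, two thirds of the slots are idle. Keeping more requests in flight shrinks that share, which matches the closed form (p - 1) / (m + p - 1) for p stages and m micro-batches under these assumptions.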

Tensor parallelism spends bandwidth to buy lower latency. Pipeline parallelism spends latency to save bandwidth. Your interconnect decides which currency you can afford.

The interconnect is the decision

Put the two costs next to your hardware and the choice makes itself.

TP's per-layer all-reduces are only cheap over a fast fabric — NVLink or NVSwitch, the high-bandwidth mesh that joins GPUs inside a single server. Run TP across the slow link between two nodes (ordinary Ethernet, or even InfiniBand) and that synchronization traffic becomes the bottleneck; throughput falls off a cliff. PP, needing just one hand-off per stage, barely notices a slow link — which is exactly what you want when you have to cross the boundary between machines.
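A rough count shows the size of the gap. The sketch below estimates how many bytes one GPU sends per generated token at batch size 1. It assumes a ring all-reduce, in which each GPU sends 2 * (n - 1) / n times the message size, and a model shape of 80 layers with hidden size 8192, roughly that of a 70B Llama-style model. Those figures and the function names are illustrative assumptions, not measurements.

```python
def tp_bytes_sent_per_token(layers, hidden, bytes_per_value, tp):
    """Bytes one GPU sends per token under tensor parallelism.

    Two all-reduces per layer; a ring all-reduce over tp GPUs makes each GPU
    send 2 * (tp - 1) / tp times the message size.
    """
    message = hidden * bytes_per_value
    return 2 * layers * message * 2 * (tp - 1) / tp


def pp_bytes_sent_per_token(hidden, bytes_per_value):
    """Bytes one non-final stage sends per token: a single activation hand-off."""
    return hidden * bytes_per_value


tp_bytes = tp_bytes_sent_per_token(layers=80, hidden=8192, bytes_per_value=2, tp=8)
pp_bytes = pp_bytes_sent_per_token(hidden=8192, bytes_per_value=2)
print(tp_bytes, pp_bytes, tp_bytes / pp_bytes)
# prints "4587520.0 16384 280.0"
```

Bytes are only half the story. Under these assumptions TP also has 160 synchronization points per token where each pipeline stage has one hand-off, and small all-reduces tend to be bound by link latency rather than bandwidth, so a slow link hurts more than the byte count alone suggests.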

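That's why the canonical production recipe, the one vLLM's own docs recommend, reads the way it does: set the tensor-parallel size to the number of GPUs in each node, and the pipeline-parallel size to the number of nodes.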

Read that twice and you'll notice it isn't a tuning heuristic at all. It's a literal description of the hardware hierarchy: fast inside a node, slow between nodes — so use the high-communication strategy inside and the low-communication strategy across. The same logic governs the GPU you pick in the first place.
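In vLLM the two strategies map to the `--tensor-parallel-size` and `--pipeline-parallel-size` options. A minimal sketch of deriving them from the cluster shape, with TP spanning the GPUs inside a node and PP spanning the nodes; the helper name and the model identifier are placeholders, and multi-node cluster setup is omitted:

```python
def vllm_parallel_flags(gpus_per_node: int, num_nodes: int) -> list[str]:
    """TP spans the GPUs inside a node; PP spans the nodes."""
    return [
        "--tensor-parallel-size", str(gpus_per_node),
        "--pipeline-parallel-size", str(num_nodes),
    ]


# "my-org/my-70b-model" is a placeholder, not a real checkpoint.
cmd = ["vllm", "serve", "my-org/my-70b-model", *vllm_parallel_flags(8, 2)]
print(" ".join(cmd))
# prints "vllm serve my-org/my-70b-model --tensor-parallel-size 8 --pipeline-parallel-size 2"
```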

The corollary that catches people

Here's the part that surprises teams: NVLink, not the node boundary, is the real dividing line. If the GPUs inside a single server have no NVLink — PCIe-only cards like the L40S — then tensor parallelism's per-layer chatter is expensive even within that one box, and pipeline parallelism can win there too. The right question was never "am I crossing nodes?" It was always "are these two GPUs joined by a fast fabric?" The node boundary is just where the answer usually flips.
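One way to answer that question on NVIDIA hardware is the topology matrix printed by `nvidia-smi topo -m`, where entries such as `NV4` or `NV18` mark NVLink connections and entries such as `PIX`, `PXB`, `PHB`, `NODE`, and `SYS` mark paths over PCIe or the CPU interconnect. A small sketch that classifies one matrix entry; the helper name and the sample pairs are hypothetical:

```python
def is_fast_fabric(link_code: str) -> bool:
    """True if an `nvidia-smi topo -m` matrix entry denotes an NVLink connection."""
    return link_code.strip().upper().startswith("NV")


# Hypothetical entries, as they might appear for two different servers.
sample_links = {
    ("GPU0", "GPU1", "NVLink server"): "NV18",
    ("GPU0", "GPU1", "PCIe-only server"): "PHB",
}
for (a, b, box), code in sample_links.items():
    verdict = "fast fabric" if is_fast_fabric(code) else "PCIe path"
    print(f"{box}: {a}-{b} {code} -> {verdict}")
```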

A working rulebook

A compact way to choose, drawn from how practitioners actually tune serving stacks like vLLM and SGLang:
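One way to write such a rulebook down is as a small decision function. The sketch below condenses the argument of the sections above. It is not taken from vLLM or SGLang, and the "fits on one GPU" shortcut is an added assumption. Every name in it is illustrative, and the output is a starting point for benchmarking, not a guarantee.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    tensor_parallel: int
    pipeline_parallel: int
    note: str


def choose_parallelism(fits_on_one_gpu, gpus_per_node, num_nodes, fast_fabric_in_node):
    """Pick a starting TP/PP layout from the interconnect, following the article's logic."""
    if fits_on_one_gpu:
        # Assumption: no sharding is needed, so extra GPUs become replicas.
        return Plan(1, 1, "no sharding needed; scale out with replicas")
    if fast_fabric_in_node:
        # NVLink or NVSwitch inside the node: TP inside, PP across nodes.
        return Plan(gpus_per_node, num_nodes, "TP inside each node, PP across nodes")
    # PCIe-only node: per-layer all-reduces are expensive even inside one box.
    return Plan(1, gpus_per_node * num_nodes, "no fast fabric: start with PP, benchmark TP")


print(choose_parallelism(False, 8, 2, True))
print(choose_parallelism(False, 4, 1, False))
```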

In practice large deployments run a hybrid: TP inside each node, PP across nodes, DP for replicas on top — and for mixture-of-experts models, a fourth axis, expert parallelism, that scatters experts across GPUs so only the few each token needs ever fire. The bubble and the latency-throughput tension don't vanish; serving research like Sarathi-Serve keeps chipping at them with tricks like chunked prefill. But the first decision — TP or PP — isn't something you should be guessing. Go look at your interconnect. The wires already wrote the answer.