Testing an MCP server feels, at first, reassuringly ordinary. It speaks JSON-RPC. It has a schema. It lists tools and resources and prompts, takes a request, returns a response. You write some assertions, they go green, you ship. And then a user's agent calls delete_record when they asked it to find one, and you discover that everything you tested was the half that was never going to break.

Let me lay out the three tiers, because they are genuinely different jobs and most people stop after the first.

Tier one: the Inspector, for "does it even work"

The official MCP Inspector is the first thing to reach for, and it needs no install:

npx @modelcontextprotocol/inspector node build/index.js

That launches a small proxy plus a React UI at http://localhost:6274, connected to your server over stdio, SSE, or Streamable HTTP. You get a live panel to list tools, resources, and prompts and to call them with hand-typed arguments — the MCP equivalent of curling your own endpoint before you write a single test. It's the fastest possible answer to "did I wire this up right," and for a lot of small servers it's 80% of the debugging you'll ever do.

The part that matters for a real project is the CLI mode, which exists specifically for automation:

npx @modelcontextprotocol/inspector --cli node build/index.js --method tools/list
npx @modelcontextprotocol/inspector --cli node build/index.js \
  --method tools/call --tool-name search_orders --tool-arg query=shoes

The repo's own docs call CLI mode "ideal for scripting, automation, and integration." It runs non-interactively and exits with a status code, which means you can drop it into CI as a protocol-conformance gate: does the server list the tools you expect, do they accept the arguments you expect, do they return without erroring. That's tier one — necessary, deterministic, and not nearly enough.

Tier two: in-memory transports, for "does the logic hold"

Spawning a subprocess and parsing its stdout for every unit test is slow and, worse, racy — a class of flaky test that has nothing to do with your server's logic. The SDKs give you a way out: an in-memory transport that wires a client to a server inside a single process, no I/O, no spawning.

In TypeScript:

const [clientTransport, serverTransport] = InMemoryTransport.createLinkedPair();
await server.connect(serverTransport);
await client.connect(clientTransport);
const result = await client.callTool({ name: "search_orders", arguments: { query: "shoes" } });

InMemoryTransport.createLinkedPair() returns two linked transports; connect your Server to one and a Client to the other and you can exercise a full request/response cycle — argument validation, error paths, structured output — as a plain unit test.

Python has the exact mirror. create_client_server_memory_streams() in mcp/shared/memory.py yields linked client and server stream pairs for in-process testing; at the FastMCP layer you can skip even that and pass a FastMCP server instance directly to a Client, which the docs recommend precisely to avoid subprocess race conditions. This is where your tool logic gets covered: edge-case inputs, failure modes, idempotency, the shape of what you return.

So far, so normal. Both tiers are deterministic, and both are testing the server as an API. Which is the trap, because an MCP server isn't only an API.

Tier three: the eval you're actually missing

Here is the thing nobody tells you when you start. A tool's name and description are not metadata. They are prompt text — the literal words the host LLM reads when it decides, mid-reasoning, whether and when to call your tool and how to fill its arguments. Which means a server can pass every protocol test and every unit test and still fail in the only way that matters: the model reads two of your descriptions, finds them ambiguous, and calls the wrong one.

Protocol tests prove your server works. Only an eval proves an LLM will use it correctly. Those are different claims, and the gap between them is where real users live.

This failure is non-deterministic and invisible to everything in tiers one and two, because both of them call your tools directly. There's no model in the loop making a choice. To test the choice, you need a model in the loop — an LLM-as-judge / tool-selection eval.

The tooling for this is young but real. mcp-evals (a Node package and GitHub Action) evaluates tool implementations with LLM-based scoring across accuracy, completeness, relevance, and clarity, and can post results as PR comments. lastmile-ai/mcp-eval is an MCP-native eval framework built on mcp-agent, with both programmatic and LLM-as-judge evaluators. The point of either is the same: feed the server realistic user prompts, let a model pick tools, and score whether it picked yours, correctly.

The payoff is not theoretical. Neon built evals to test whether a model picks the right database tool out of 20-plus, scoring with Claude as judge, and through description and prompt iteration alone — no code changes — took tool-selection success from 60% to 100%. That's the entire argument in one number: the thing that was broken wasn't the code, and no amount of code-level testing would have found it. The descriptions were the product, and the eval was the only instrument that could see them.

And one more: test it like an attacker

There's a fourth class that isn't about correctness at all. Because tool descriptions are prompt text injected into a model's context, they're an attack surface: a malicious server can hide instructions in a description (tool poisoning), or ship benign and mutate after you trust it (a rug pull), or have its tool output carry a prompt injection back into the agent.

The reference tool is mcp-scan: uvx mcp-scan@latest reads your MCP client's config files, connects to the listed servers, pulls their tool descriptions, and scans them for injection, poisoning, and tool shadowing; mcp-scan inspect shows you the raw descriptions, and pinning a tool to a known hash via the whitelist defends against post-install rug pulls. It shares only names and descriptions, not call contents. (Worth knowing for provenance: Invariant Labs, the team that named tool poisoning and MCP rug pulls, was acquired by Snyk in 2025; the original PyPI package still resolves.) If your server is going anywhere near other people's agents, this isn't optional hygiene — it's part of the test suite.


So: the Inspector to prove it answers, in-memory transports to prove the logic, an LLM-in-the-loop eval to prove the model will actually reach for the right tool, and a scanner to prove nobody can weaponize your descriptions. Skip the third tier and you'll ship a server that passes every test you wrote and fails the one your users run for you — silently, in production, the first time an agent has to choose.