Everyone selling you LLM observability wants you to believe it's a logging problem with a nicer chart. It isn't. Logging answers "what did the model say." Observability for agents has to answer "why did the agent take that path through six tool calls, two retries, and a sub-agent that quietly hallucinated a customer ID." Those are different questions, and the tools that pretend they're the same are the ones you'll rip out in eight months.

So before you compare feature grids, compare the thing nobody on the pricing page mentions: what unit of work you actually trace.

The LLM call is not the agent

The first generation of these tools traced the LLM call. Prompt in, completion out, token count, latency, cost. Beautiful for a chatbot. Useless for an agent, because an agent's failures live between the calls — in the tool it chose, the retry it triggered, the sub-agent it spawned, the loop it never exited.

The unit that matters now is the trace: a tree of spans where one root request fans out into tool calls, model calls, and nested agent runs. When a deployment goes sideways, you don't want the bad completion. You want the span tree showing the agent called the search tool, got nothing, called it again with the same query, and then confidently fabricated an answer rather than admitting the gap.

If you can't see the shape of the agent's reasoning as a span tree, you don't have observability — you have receipts.

This is why OpenTelemetry quietly became the spine of the whole category. Spans are an OTel primitive. Phoenix layers a semantic convention called OpenInference on top; LangSmith shipped end-to-end OTel ingest in 2026; Langfuse takes OTel-native traces from any SDK. The frameworks differ, but they've all converged on the same skeleton.

Axis one: who owns the trace data

This is the question that actually constrains your choice, and it's a governance question, not a technical one. Your traces contain prompts, retrieved documents, tool arguments — frequently your most sensitive data. The split:

Langfuse is the open-source, self-hostable end of the spectrum. MIT-licensed, runs on Postgres and ClickHouse, framework-agnostic via OTel. In January 2026 ClickHouse acquired it (as part of a $400M Series D, at a reported $15B valuation) and both parties publicly committed to keeping it MIT and self-hostable. If "the trace data never leaves our VPC" is a hard requirement, this is the default.

Open-source LLM engineering platform: traces, evals, metrics, prompt management, OTel-native
★ 29kTypeScriptlangfuse/langfuse

LangSmith sits at the managed end. It is not open source — it's a hosted product from the LangChain team, with cloud, bring-your-own-cloud, and self-hosted enterprise tiers. The reflex assumption is that it only works with LangChain; that hasn't been true for a while. It's framework-agnostic and ingests OTel from the OpenAI SDK, the Anthropic SDK, Vercel AI SDK, LlamaIndex, or raw instrumentation. The honest pitch: if you're already on LangGraph, its node-by-node trace fidelity is unmatched, and you pay for someone else to run the backend.

Axis two: tracing-first or eval-first

The second-best-kept secret in this space: the trace is the eval dataset. Once you've captured real production span trees, your evaluation set isn't a hand-written fixture file — it's a filtered slice of what actually happened. Tools approach this from opposite ends.

Arize Phoenix is eval-first. It's open-source, OTel/OpenInference-native, and its center of gravity is a serious evaluation library — LLM-as-judge, retrieval relevance, hallucination scoring — with the tracing built to feed it. If your problem is "is this RAG pipeline actually correct," start here. The replay-against-a-new-model workflow alone justifies the install when you're migrating model versions.

Open-source AI observability and evaluation, built on OpenTelemetry/OpenInference
★ 10kPythonArize-ai/phoenix

Langfuse is tracing-first with evals bolted on competently; Phoenix is eval-first with tracing as the input layer. Both are valid. Knowing which problem keeps you up at night — is it broken vs. is it correct — picks the tool.

Axis three: the cheap insurance policy

Then there's the layer below all of this. If you're not ready to commit, instrument with OpenLLMetry — a set of OpenTelemetry extensions for GenAI that emits standard OTel data to wherever you point it: Datadog, Honeycomb, or any of the above. You write OTel once and keep your backend swappable. It's the closest thing to vendor insurance in the category.

OpenTelemetry-based instrumentation for LLM apps; export to any OTel backend

And if your actual problem is cost and rate-limit chaos across many models rather than reasoning traces, Helicone is the pragmatic outlier — an Apache-2.0 gateway you drop in front of your calls. One line, you proxy through it, you get spend and latency telemetry without restructuring your code. Less reasoning depth, far less friction.

Open-source LLM observability via a drop-in AI gateway/proxy
★ 5.8kTypeScriptHelicone/helicone

So which one

There is no winner, which is the point. Run the three questions:

  1. Must the trace data stay in your infrastructure? Self-host Langfuse or Phoenix. If managed is fine and you want zero ops, LangSmith.
  2. Are you tied to a framework? Deep on LangGraph, LangSmith pays off. Everything else, go OTel-agnostic — Langfuse, Phoenix, or raw OpenLLMetry.
  3. **Is your pain broken or incorrect?** Broken-in-production leans Langfuse. Incorrect-on-eval leans Phoenix.

The world-weary version: pick the tool whose default unit of work matches the bug you'll actually be chasing at 2am. Everyone supports OTel now, everyone has a trace view, everyone has LLM-as-judge. The differences that survive contact with a real incident are the three axes above — not the screenshots.