Every team shipping LLM features eventually hits the same wall: the model did something weird in production and nobody can explain it, or, worse, quality has slipped and nobody can explain that either. Langfuse, LangSmith, and Braintrust all promise to fix that, and the lazy take is that they're interchangeable. They are not: each optimizes for a different primary job, and the lock-in you inherit depends entirely on which job you thought you were buying.

Langfuse: the open escape hatch

Langfuse is the open-source one, and that single fact reorganizes the whole comparison. It's a YC W23 project with roughly 30k GitHub stars, an MIT-licensed core (only the ee enterprise folders are held back), and a self-hosting story that goes well beyond a token gesture: Docker Compose for a five-minute local stack, plus Helm charts and Terraform templates for AWS, GCP, and Azure.

The product itself is observability-first: tracing, prompt management, datasets, a playground, and LLM-as-judge evals. Crucially, it's OpenTelemetry-native, so you instrument once with an open standard and your traces are yours. If you ever want to leave, you can take your deployment and your data with you. That's not a feature most teams use; it's an option whose mere existence changes the negotiation.
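To make "instrument once" concrete, here is a minimal sketch, assuming your app exports traces through a standard OTLP exporter that is configured by environment variables. `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` are the standard OpenTelemetry variable names; the helper `otlp_env`, the URLs, and the header names are hypothetical placeholders, not any vendor's documented values.

```python
from urllib.parse import urlparse

# Standard OpenTelemetry exporter environment variable names.
ENDPOINT_VAR = "OTEL_EXPORTER_OTLP_ENDPOINT"
HEADERS_VAR = "OTEL_EXPORTER_OTLP_HEADERS"


def otlp_env(endpoint: str, headers: dict[str, str]) -> dict[str, str]:
    """Build the env vars an OTLP exporter reads (hypothetical helper).

    The instrumentation in the application never changes; only these
    two values do when the trace backend changes.
    """
    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a valid OTLP endpoint: {endpoint!r}")
    # Headers are passed as comma-separated key=value pairs.
    header_str = ",".join(f"{key}={value}" for key, value in headers.items())
    return {ENDPOINT_VAR: endpoint, HEADERS_VAR: header_str}


# Placeholder values: a self-hosted collector today, another backend tomorrow.
self_hosted = otlp_env("http://localhost:4318", {"x-api-key": "<local-key>"})
elsewhere = otlp_env("https://collector.example.com", {"x-api-key": "<key>"})
```

Under these assumptions, switching backends is a deployment change rather than a code change. Getting historical data out of the old backend is a separate question, and that is where owning the deployment matters.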

Langfuse's real selling point isn't any single feature. It's that you can walk away.

The trade-off is the usual open-source one: you operate it, or you pay for their cloud. The platform is unopinionated, which is freedom if you have a strong workflow and a vacuum if you don't.

LangSmith: deepest inside the gravity well

LangSmith is LangChain's commercial platform, and it inherits both the strengths and the gravity of that lineage. It's framework-agnostic on paper, and it genuinely accepts OpenTelemetry traces via an OTLP endpoint in OpenLLMetry format. But the honest pitch is this: if your application already runs on LangChain or LangGraph, tracing is close to automatic, and nothing else will feel as frictionless.

It's proprietary SaaS. The free Developer tier covers individual prototyping (around 5,000 traces a month), the Plus tier runs about $39 per seat per month with a trace allowance plus per-1,000-trace overage, and self-hosting is reserved for the Enterprise tier via BYOC or your own Kubernetes cluster. Observability and evals are both mature here. The catch is that the same ecosystem that makes onboarding effortless is the one you're now tethered to. That's a fine trade if you've committed to LangGraph anyway, and a quiet tax if you haven't.
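Seat-plus-overage pricing is easier to reason about with numbers in front of you. The sketch below is a rough estimator, not the vendor's billing logic: only the roughly $39 seat price comes from the text above. The included allowance, the overage rate, and the round-up-per-1,000 behavior are all assumptions you should replace with figures from the current pricing page.

```python
import math


def monthly_cost(
    seats: int,
    traces: int,
    *,
    included_traces: int,      # assumption: look this up, it is not stated here
    overage_per_1k: float,     # assumption: look this up, it is not stated here
    seat_price: float = 39.0,  # "about $39 per seat" from the text above
) -> float:
    """Estimate a monthly bill for a seat-plus-overage plan."""
    extra = max(0, traces - included_traces)
    # Assumption: overage is billed per started block of 1,000 traces.
    overage_blocks = math.ceil(extra / 1000)
    return seats * seat_price + overage_blocks * overage_per_1k


# Hypothetical inputs: 5 seats, 25,000 traces, 10,000 included, $0.50 per 1,000 extra.
estimate = monthly_cost(5, 25_000, included_traces=10_000, overage_per_1k=0.50)
```

With those made-up inputs the seats account for 195.00 and the overage for 7.50, which shows why trace volume, not headcount, tends to be the number worth watching as traffic grows.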

Braintrust: the eval loop as the product

Braintrust is the odd one out, and deliberately so. It's eval-first, built for teams whose core loop is: change a prompt or model, measure the result, ship the better one. Its playground loads an expensive production trace and runs alternative prompts and models against it, returning scored, side-by-side results from real requests rather than toy inputs. That's a different mental model from "debug this trace."
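The change, measure, ship loop is simple enough to sketch without any platform at all. The following is a toy illustration in plain Python, not Braintrust's SDK: `run_experiment`, `pick_winner`, the exact-match scorer, and the two "prompt variants" (stand-ins for real model calls) are all hypothetical.

```python
from statistics import mean
from typing import Callable

Variant = Callable[[str], str]


def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer: 1.0 on an exact match, else 0.0."""
    return 1.0 if output == expected else 0.0


def run_experiment(variant: Variant, cases: list[dict]) -> float:
    """Run one variant over every case and return its mean score."""
    return mean(exact_match(variant(c["input"]), c["expected"]) for c in cases)


def pick_winner(variants: dict[str, Variant], cases: list[dict]) -> tuple[str, dict[str, float]]:
    """Score every variant on the same cases and return the best one."""
    scores = {name: run_experiment(fn, cases) for name, fn in variants.items()}
    return max(scores, key=scores.get), scores


# Stand-ins for "call the model with prompt A" and "call it with prompt B".
def prompt_v1(text: str) -> str:
    return text.upper()


def prompt_v2(text: str) -> str:
    return text.title()


cases = [
    {"input": "paris", "expected": "Paris"},
    {"input": "new york", "expected": "New York"},
]
winner, scores = pick_winner({"v1": prompt_v1, "v2": prompt_v2}, cases)
```

What an eval platform adds on top of a loop like this is the part that is tedious to build yourself: real production inputs as cases, shared scorers, and a history of results that many engineers can compare against.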

It's proprietary SaaS, framework-agnostic, and it ingests OpenTelemetry spans across dozens of frameworks, so it plays well with non-LangChain stacks. The free tier is generous: unlimited users, with metering on data processed and scores. On-prem and hybrid deployment is an enterprise arrangement. Its reputation comes from where it's used: teams like Notion, Stripe, and Vercel lean on it precisely because their work is eval-grade. They align many engineers on one regression suite so they can adopt a new frontier model within hours of release.

How to choose

Stop comparing feature checklists; they've all converged. All three trace, all three eval, all three speak OpenTelemetry. Ask two sharper questions instead.

First: what is your primary daily loop? If it's staring at a trace to figure out what broke, that's observability, and Langfuse or LangSmith fit. If it's running structured experiments to decide what ships, that's Braintrust's home turf, and anything else will feel underbuilt for that job. Either way, the platform only scores what you feed it. The harder, prior problem is building an eval dataset that actually looks like production.

Second: how much do you value owning your data? This is where the OTel symmetry becomes a trap. Yes, LangSmith and Braintrust both accept OpenTelemetry, but they are proprietary backends: you can pipe traces in, but you cannot pick up the platform and leave. Langfuse is the only one that is itself open source. If portability and free self-hosting are non-negotiable, the field narrows to one.

A useful tiebreaker, then. Already all-in on LangGraph? LangSmith, and don't overthink it. Evals are your product loop and you're not on LangChain? Braintrust. You want the escape hatch, or you want to self-host without a sales call? Langfuse. The mistake isn't picking the wrong one; it's not noticing they were answering different questions.
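For readers who think in code, the tiebreaker can be written as a small decision function. This is only a restatement of the rules above under one assumption: when several conditions hold at once, the ordering (self-hosting first, then LangGraph, then evals) is a judgment call of this sketch, not something the comparison dictates. The function and parameter names are hypothetical.

```python
from typing import Optional


def pick_platform(
    needs_escape_hatch: bool,  # must self-host or keep the option to leave
    all_in_on_langgraph: bool,
    evals_are_core_loop: bool,
) -> Optional[str]:
    """Apply the tiebreaker rules; return None when none of them fires."""
    # Assumed priority: a hard self-hosting requirement narrows the field to one.
    if needs_escape_hatch:
        return "Langfuse"
    if all_in_on_langgraph:
        return "LangSmith"
    if evals_are_core_loop:
        return "Braintrust"
    # No rule applies: go back to the two questions rather than guess.
    return None


choice = pick_platform(
    needs_escape_hatch=False,
    all_in_on_langgraph=False,
    evals_are_core_loop=True,
)
```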