Every team that ships an agent eventually hits the same wall: the thing is a black box. A request comes in, eleven tool calls and four model invocations later an answer comes out, and when it's wrong you have no idea which step lied. The fix is tracing — turn each model call, retrieval, and tool invocation into a span you can inspect. Two open-source libraries dominate that job for LLM apps, and the way people choose between them is almost always wrong.

The wrong way is to compare their feature lists and star counts. The right way is to notice they're not really competing on the axis you think.

They both speak OpenTelemetry

OpenLLMetry: OpenTelemetry-based auto-instrumentation for GenAI apps. One Traceloop.init() wires up 16+ model providers, vector DBs, and frameworks and exports OTel spans to any OTLP backend.
OpenInference: OpenTelemetry instrumentation for AI observability plus its own semantic-convention spec; ~40 instrumentors across Python, TypeScript, Java, Go; powers Arize Phoenix but exports to any OTLP backend.

Start with what's identical, because it's most of the picture. Both are instrumentation layers, not backends. You add them to your app, they monkey-patch the SDKs you already call — OpenAI, Anthropic, LangChain, LlamaIndex, CrewAI, your vector DB — and every call becomes an OpenTelemetry span. Those spans leave your process over OTLP, the standard OpenTelemetry wire protocol, and land in whatever you point them at. Neither one is a place you look at traces; both are the thing that produces the traces.
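To make the monkey-patching idea concrete, here is a minimal sketch in plain Python. It is not how either library is implemented: `FakeChatClient`, `instrument`, and `FINISHED_SPANS` are hypothetical names invented for illustration, and the "span" is just a dict rather than a real OpenTelemetry span. The point is only the shape: wrap an existing SDK method once, and every later call yields a span record without touching the call sites.

```python
import functools
import time


class FakeChatClient:
    """Hypothetical stand-in for a provider SDK client; not a real library."""

    def complete(self, model, prompt):
        return {"text": prompt.upper(), "input_tokens": len(prompt.split())}


# Finished span records collect here. A real setup would hand them to an
# OTLP exporter instead of keeping them in a list.
FINISHED_SPANS = []


def instrument(cls, method_name, span_name):
    """Replace cls.method_name with a wrapper that records one span per call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        span = {
            "name": span_name,
            "attributes": {"gen_ai.request.model": kwargs.get("model", "unknown")},
            "status": "OK",
        }
        start = time.time_ns()
        try:
            return original(self, *args, **kwargs)
        except Exception:
            span["status"] = "ERROR"
            raise
        finally:
            # Runs on both success and failure, so every call leaves a span.
            span["duration_ns"] = time.time_ns() - start
            FINISHED_SPANS.append(span)

    setattr(cls, method_name, wrapper)


# Patch once at startup; application code below is unchanged.
instrument(FakeChatClient, "complete", "chat")

client = FakeChatClient()
reply = client.complete(model="example-model", prompt="hello world")
```

After the call, `FINISHED_SPANS` holds one record for the `complete` invocation, which is the role the real instrumentors play for the SDKs they patch.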

That matters because it kills the framing most blog posts reach for. This is not "agent framework A vs agent framework B," where the choice locks your whole codebase. It's a thin layer at the boundary, and the spans it emits are portable by construction. You could rip one out and drop the other in over a weekend. So if switching cost is low, what are you actually choosing?

The fork is the schema, not the library

You're choosing a vocabulary. OpenTelemetry has been standardizing a set of GenAI semantic conventions, the gen_ai.* attributes: gen_ai.system for the provider, gen_ai.request.model for the model, gen_ai.usage.input_tokens for the token counts that drive cost. OpenLLMetry aligns to that vocabulary. Its bet is that the LLM world should converge on the same OTel conventions everything else already uses, so your agent traces sit in the same Grafana or Datadog or Honeycomb as your HTTP and database spans, named the same way.

OpenInference made a different bet. It defines its own spec, built around a required openinference.span.kind attribute with ten kinds — LLM, EMBEDDING, CHAIN, RETRIEVER, RERANKER, TOOL, AGENT, GUARDRAIL, EVALUATOR, PROMPT — plus namespaces like retrieval.documents and reranker.* and input.value/output.value. It is, deliberately, a richer taxonomy than OTel's, because it was designed to feed an evaluation product, not just a dashboard.
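The difference is easiest to see by describing one call in both vocabularies. The sketch below is illustrative only: the `call` record and both helper functions are hypothetical, and the attribute names not quoted in the text above (`gen_ai.usage.output_tokens`, `llm.model_name`, `llm.token_count.prompt`, `llm.token_count.completion`) are given as I recall them from the two specs, so check them against the current versions before relying on them.

```python
# One hypothetical chat completion, as plain data.
call = {
    "provider": "openai",
    "model": "example-model",
    "prompt": "What is OTLP?",
    "answer": "A wire protocol.",
    "prompt_tokens": 12,
    "completion_tokens": 5,
}


def as_gen_ai(call):
    """Span attributes in the OTel gen_ai.* style."""
    return {
        "gen_ai.system": call["provider"],
        "gen_ai.request.model": call["model"],
        "gen_ai.usage.input_tokens": call["prompt_tokens"],
        "gen_ai.usage.output_tokens": call["completion_tokens"],
    }


def as_openinference(call):
    """Span attributes in the OpenInference style; the span kind is required."""
    return {
        "openinference.span.kind": "LLM",
        "llm.model_name": call["model"],
        "llm.token_count.prompt": call["prompt_tokens"],
        "llm.token_count.completion": call["completion_tokens"],
        "input.value": call["prompt"],
        "output.value": call["answer"],
    }


gen_ai_attrs = as_gen_ai(call)
openinference_attrs = as_openinference(call)
```

Both dicts describe the same event, but they share no keys, so a backend built to read one set will not find what it expects in the other.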

Phoenix (Arize-ai/phoenix, Python, ★ 10k): open-source AI observability and evaluation platform; reads OpenInference spans natively to trace, evaluate, and debug LLM and agent runs.

That product is Phoenix, and it explains everything about OpenInference's design. Phoenix doesn't just want to draw your spans; it wants to evaluate them — run an LLM-as-judge over a RETRIEVER span's documents, score a RERANKER, attach a pass/fail to an EVALUATOR span. The OTel gen_ai.* conventions, still sitting in experimental/development status as of mid-2026, don't yet model "this span was a reranker and here's what it scored." OpenInference does, because it had to.

Both libraries emit OpenTelemetry. The actual question is which convention your backend reads — and that's a property of where you're sending traces, not of the library you import.

So choose your backend first

This inverts the decision. Don't pick the instrumentation and then find a viewer. Pick where traces live, then let that dictate the convention.

If your traces are going into a general-purpose observability stack (Grafana Tempo, Datadog, Honeycomb, New Relic) or into Langfuse, whose OTLP endpoint ingests gen_ai.* traces and maps known LLM-instrumentor attributes into its model, then OpenLLMetry's gen_ai.* alignment is the path of least resistance. Your agent shows up next to the rest of your services in a vocabulary the platform already understands.

If your traces are going into Phoenix or Arize because you want the evaluation loop — judging retrievals, scoring rerankers, regression-testing prompts — then OpenInference is the native citizen and you should use it rather than fighting Phoenix to understand gen_ai.*.
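The backend-first rule can be written down as a lookup. This is a sketch that only encodes the two paragraphs above: the table, the labels, and `convention_for` are hypothetical, and the mapping reflects this article's recommendations rather than anything the backends publish.

```python
GEN_AI = "gen_ai.* via OpenLLMetry"
OPENINFERENCE = "OpenInference"

# Destination -> convention, as recommended in the text above.
BACKEND_CONVENTION = {
    "grafana tempo": GEN_AI,
    "datadog": GEN_AI,
    "honeycomb": GEN_AI,
    "new relic": GEN_AI,
    "langfuse": GEN_AI,
    "phoenix": OPENINFERENCE,
    "arize": OPENINFERENCE,
}


def convention_for(backend):
    """Return the suggested convention, or None if the backend is not listed."""
    return BACKEND_CONVENTION.get(backend.strip().lower())


choice = convention_for("Phoenix")
unknown = convention_for("some-other-backend")
```

A `None` result means the text gives no recommendation, so the next step is to check which attributes that destination's own documentation says it reads.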

The convergence nobody mentions

Here's the non-obvious part. These two are framed as rivals, but the ecosystem is quietly stitching them into a stack. Because both ride OTLP, you can sit a span processor in the export path that rewrites attributes: there are already OpenLLMetry-to-OpenInference mappers that take gen_ai.* spans and relabel them into openinference.span.kind form on the way out. Instrument once with OpenLLMetry's broad auto-instrumentation; translate at the edge; feed a gen_ai.* backend and a Phoenix-shaped backend from the same trace.
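The translation step amounts to a key-renaming pass over each span's attributes. The sketch below is not one of the existing mappers and does not use the OpenTelemetry SDK: it works on plain dicts, the `GEN_AI_TO_OPENINFERENCE` table and `to_openinference` function are hypothetical, and the OpenInference key names are given as I recall them from the spec. A real processor would apply the same idea to spans in the export path.

```python
# Partial, illustrative mapping; a real mapper covers far more attributes.
GEN_AI_TO_OPENINFERENCE = {
    "gen_ai.request.model": "llm.model_name",
    "gen_ai.usage.input_tokens": "llm.token_count.prompt",
    "gen_ai.usage.output_tokens": "llm.token_count.completion",
}


def to_openinference(attributes, span_kind="LLM"):
    """Return a relabeled copy of the attributes; the input is not modified."""
    # OpenInference requires a span kind, which gen_ai.* spans do not carry.
    relabeled = {"openinference.span.kind": span_kind}
    for key, value in attributes.items():
        # Unmapped keys pass through unchanged so no information is dropped.
        relabeled[GEN_AI_TO_OPENINFERENCE.get(key, key)] = value
    return relabeled


gen_ai_span = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 12,
}
translated = to_openinference(gen_ai_span)
```

Because the function returns a copy, the original gen_ai.* attributes can still go to one backend while the relabeled set goes to another, which is the dual-export arrangement described above.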

Which tells you where the real lock-in lives. It isn't the import statement — that's swappable, and increasingly translatable. It's the attribute schema your downstream tooling was built to read. The library you pick today is a detail. The convention your backend speaks is the thing you'll still be living with in a year, and it's the only part of this decision worth agonizing over.

So stop reading the star counts. Phoenix's 10k against OpenInference's 1k measures platform adoption versus instrumentation-library adoption — two different things, not a winner. Decide where your spans need to land, learn which vocabulary that destination reads, and let the instrumentation follow. The trace is portable. The schema is the commitment.