feat: 1st class OpenTelemetry support #310
Merged
Adds a new chatlas/_otel.py module that emits OpenTelemetry spans for the chat lifecycle: invoke_agent (top-level), chat (per model call), and execute_tool (per tool invocation). Spans follow the GenAI semantic conventions with attributes like gen_ai.usage.input_tokens, gen_ai.response.model, and optional message content capture controlled by the OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT env var. Also adds an `otel` optional dependency extra (`pip install chatlas[otel]`).
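As a rough illustration of how an environment-variable gate for content capture can work, here is a minimal stdlib-only sketch. The helper names `capture_content_enabled` and `span_attributes`, the rule that only a case-insensitive "true" enables capture, and the `gen_ai.input.messages` attribute key are assumptions made for illustration; they are not taken from chatlas/_otel.py.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"


def capture_content_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the env var is set to "true" (assumed parsing rule)."""
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "").strip().lower() == "true"


def span_attributes(
    model: str, prompt: str, environ: Optional[Mapping[str, str]] = None
) -> dict:
    """Build span attributes; message content is attached only when opted in."""
    attrs = {"gen_ai.response.model": model}
    if capture_content_enabled(environ):
        # Hypothetical attribute key, used here only to show the gating.
        attrs["gen_ai.input.messages"] = prompt
    return attrs
```

With the variable unset, the returned attributes carry the model name but no message content, so prompts stay out of the telemetry backend unless someone opts in.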
Wires the _otel span functions into the six core Chat methods: _chat_impl/_chat_impl_async (agent spans), _submit_turns/_submit_turns_async (chat spans), and _invoke_tool/_invoke_tool_async (tool spans). Parent context is passed explicitly via _otel_parent to avoid async context hazards.
Adds 7 tests covering span hierarchy, token usage, content capture (on/off), tool error recording, streaming lifecycle, and no-op behavior. Includes VCR cassettes for replay without live API keys. Updates docs/get-started/monitor.qmd with a new framework-level tracing section (console quickstart, Logfire production path, config-module pattern) before the existing provider-specific content.
The OTel API is ~212KB with no heavy transitive deps, and its default ProxyTracer already no-ops when no SDK is configured. Making it a hard dep lets us drop the lazy initialization (cache_tracer/initialized/is_tracing guards) and always create spans, relying on the no-op machinery for zero overhead when nobody is collecting. Removes the `chatlas[otel]` extra — the API is now always available. Users still opt in to collection by installing opentelemetry-sdk and configuring a TracerProvider.
…amework-spans # Conflicts: # pyproject.toml
Activate chatlas's chat and execute_tool spans in the OTel context around the bounded provider call and tool invocation. This lets spans created by others nest under ours via ambient context: a provider HTTP instrumentor's span becomes a child of the chat span, and work done inside a tool becomes a child of its execute_tool span. Previously those started disconnected root traces. The agent span is deliberately left unactivated -- it brackets the whole streaming loop, and holding context active across a yield would leak it into the consumer's scope (matching ellmer's approach). Adds nesting tests for the provider-HTTP and tool-internal cases (sync + async).
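The yield hazard described above can be reproduced without OpenTelemetry. The sketch below models the "current span" with a plain `contextvars.ContextVar`, the same mechanism OpenTelemetry's default Python context implementation uses. The names `current_span`, `activate`, `leaky_stream`, and `bounded_stream` are hypothetical stand-ins, not chatlas functions.

```python
import contextvars
from contextlib import contextmanager

# Stand-in for the ambient "current span" slot (hypothetical name).
current_span = contextvars.ContextVar("current_span", default=None)


@contextmanager
def activate(name):
    """Make `name` the ambient span for the duration of the with-block."""
    token = current_span.set(name)
    try:
        yield
    finally:
        current_span.reset(token)


def leaky_stream():
    # Activation brackets the whole loop, so it is still in effect
    # every time control returns to the consumer at `yield`.
    with activate("invoke_agent"):
        for chunk in ("a", "b"):
            yield chunk


def bounded_stream():
    for chunk in ("a", "b"):
        # Activation covers only the bounded work, then ends before the yield.
        with activate("chat"):
            fetched = chunk.upper()  # stand-in for the provider call
        yield fetched


# What the consumer observes as "current span" between chunks.
seen_by_leaky_consumer = [current_span.get() for _ in leaky_stream()]
seen_by_bounded_consumer = [current_span.get() for _ in bounded_stream()]
```

In the leaky variant the consumer sees `invoke_agent` as its ambient span, so any span the consumer starts would be parented under it. In the bounded variant the consumer sees no active span.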
The context activation also makes spans emitted inside a tool nest under its execute_tool span, not just provider HTTP spans under chat spans. Spell out both in the monitor guide.
The per-function 'from ._otel import ...' statements were load-bearing only while opentelemetry was an optional dependency; once it became a hard dep, they were vestigial. _otel.py imports no chatlas modules at top level, so there's no circular-import risk in importing it eagerly.
Provider failures now mark the chat and invoke_agent spans as errored with the GenAI error.type attribute and an exception event, mirroring how tool failures are already recorded. Previously a failed LLM call ended with UNSET status, indistinguishable from one that never finished (and streaming-iteration errors were missed entirely, since use_span's auto-recording only covered the bounded chat_perform call). Generalize record_tool_error to record_error (identical body, now used for chat, tool, and agent spans) and disable use_span's exception recording so record_error is the single source of error truth -- this also fixes a pre-existing double exception event on tool spans.
Type the OTel span parent honestly: _otel_parent is now Optional[Span] instead of Any throughout _chat.py, and start_chat_span/start_tool_span take Optional[Span] instead of Span (they legitimately receive None from the parallel path and direct tool-invocation tests). The parent context is now built only when a parent exists, so a None parent uses the ambient context (root if none active) rather than relying on set_span_in_context accepting None -- which its type signature disallows. Hoist _otel.py's function-body imports (orjson and the _content/_turn types used in isinstance checks) to module top level; neither _content nor _turn imports _otel, so there is no circular-import risk.
Generator and async-generator tools run their body during iteration, not when the tool function is first called. Since iteration happened outside the activate_span scope, any spans such tools emitted between yields did not nest under their execute_tool span. Wrap each next()/__anext__() step in activate_span so the generator body executes while the tool span is active. Use res.__anext__() rather than the anext() builtin to preserve Python 3.9 support (anext is 3.10+).
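A minimal sketch of the per-step activation pattern, again using a `contextvars.ContextVar` as a stand-in for the active span. `activate`, `generator_tool`, and `drain` are hypothetical names; the real helper is `activate_span`, whose implementation is not shown here. The driver calls `__anext__()` directly, which works on Python 3.9.

```python
import asyncio
import contextvars
from contextlib import contextmanager

# Stand-in for the ambient "current span" slot (hypothetical name).
current_span = contextvars.ContextVar("current_span", default=None)
observed = []  # what the tool body saw as the active span at each step


@contextmanager
def activate(name):
    """Make `name` the ambient span for the duration of the with-block."""
    token = current_span.set(name)
    try:
        yield
    finally:
        current_span.reset(token)


async def generator_tool():
    # The body runs during iteration, not when generator_tool() is called.
    observed.append(current_span.get())
    yield "partial"
    observed.append(current_span.get())
    yield "final"


async def drain(agen, span_name):
    """Advance the async generator one step at a time with the span active."""
    results = []
    while True:
        with activate(span_name):
            try:
                item = await agen.__anext__()  # not anext(): that is 3.10+
            except StopAsyncIteration:
                break
        results.append(item)
    return results


results = asyncio.run(drain(generator_tool(), "execute_tool"))
```

Each segment of the generator body executes inside a `with activate(...)` block, so it observes `execute_tool` as the active span, while the code between steps runs with nothing activated.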
Add sync and async tests asserting that a span emitted inside a generator tool nests under its execute_tool span. These invoke the tool directly (no VCR needed) and fail against the pre-fix iteration that ran outside the activate_span scope, locking in the df4b0de behavior.
The system-turn/history split feeding start_chat_span was duplicated verbatim in _submit_turns and _submit_turns_async. Hoist it into a single helper.
- Fix the instrumentation scope name to co.posit (Posit's domain is posit.co), matching the reverse-DNS convention.
- Only set gen_ai.usage.* token attributes when there are tokens to report, mirroring ellmer.
- Keep structured (non-string) tool-result values structured via a json_safe helper, so they nest inside the gen_ai.*.messages JSON rather than being embedded as a double-encoded JSON string.
- Add tests for the structured/unserializable tool-response paths and a CHANGELOG entry for the framework-level OpenTelemetry feature.
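To make the double-encoding point concrete, here is a small sketch using the standard `json` module. This `json_safe` is an illustrative stand-in: the fallback to `str()` for unserializable values is an assumption, and the message shape is simplified.

```python
import json


def json_safe(value):
    """Return `value` unchanged if it is JSON-serializable, else its str() form.

    Illustrative stand-in; the fallback behavior is an assumption.
    """
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return str(value)
    return value


tool_result = {"city": "Paris", "temp_c": 21}

# Serializing the result first embeds it as a string inside the outer JSON.
double_encoded = json.dumps(
    [{"role": "tool", "content": json.dumps(tool_result)}]
)

# Keeping the value structured lets it nest as a real JSON object.
nested = json.dumps([{"role": "tool", "content": json_safe(tool_result)}])
```

Parsing `double_encoded` yields a string for `content` that a backend would need to decode a second time, whereas `nested` parses directly back into the original dictionary.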
The system-turn/history split was duplicated in _submit_turns and _submit_turns_async only to feed start_chat_span. Move it into start_chat_span (an OTel-semconv concern) so both call sites just pass the full turn list, removing the duplication and the Chat helper.
Address PR review feedback:
- Add a `chatlas[otel]` optional-dependency group (opentelemetry-sdk), so the documented install path actually works (the API is a hard dep; the SDK is what users need to configure an exporter).
- Point the monitoring guide at `pip install "chatlas[otel]"` instead of installing opentelemetry-sdk directly.
- Use monkeypatch.setattr for _otel.capture_content so the global flag is always restored, even if an assertion fails mid-test.
The OTel spec says instrumentation libraries SHOULD NOT set status to OK and SHOULD leave it UNSET unless there is an error. Setting OK also makes it final, which would mask a later error recorded on the same span. Drop the explicit OK on chat spans (errors still go through record_error) and add a regression test. Note this intentionally diverges from ellmer, which sets "ok".
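The masking effect can be seen in a toy model of the status rules. `ToySpan` is a hypothetical class written for this illustration, not the OpenTelemetry SDK span; it encodes only two rules: once the status is OK, later changes are ignored, and attempts to set UNSET are ignored.

```python
from enum import Enum


class StatusCode(Enum):
    UNSET = 0
    OK = 1
    ERROR = 2


class ToySpan:
    """Toy model of span status precedence (not the real SDK class)."""

    def __init__(self):
        self.status = StatusCode.UNSET

    def set_status(self, code):
        # OK is final, and setting UNSET is a no-op.
        if self.status is StatusCode.OK or code is StatusCode.UNSET:
            return
        self.status = code


# Eagerly marking success hides a failure recorded later on the same span.
eager_ok = ToySpan()
eager_ok.set_status(StatusCode.OK)
eager_ok.set_status(StatusCode.ERROR)

# Leaving the status UNSET lets the later error come through.
left_unset = ToySpan()
left_unset.set_status(StatusCode.ERROR)
```

Under these rules `eager_ok` ends with status OK even though an error was reported afterwards, while `left_unset` ends with ERROR.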
For streaming, chat_perform only returns the response iterator; the actual network I/O happens while iterating it, which was outside the activate_span(chat_span) scope. So a provider HTTP instrumentor's per-chunk spans started a disconnected trace instead of nesting under the chat span. Activate the chat span around each next()/__anext__() (the bounded fetch) -- not across the yield, so context still can't leak to the consumer. Mirrors the generator-tool span fix; uses __anext__() to stay Python 3.9-compatible.
The streaming loop now activates the chat span around each chunk fetch (not just the initial request), so the old "must not run while the span is active" wording was inaccurate. Clarify that activation is scoped to the bounded provider calls and never spans a yield.
Make the OpenTelemetry guide approachable to newcomers while staying useful to experienced users:
- Lead with the production-visibility problem and a short "what is a trace?" primer (span/trace), plus a note on the api-vs-sdk split.
- Add a worked travel-assistant example and embed a real trace waterfall (docs/images/otel-trace.svg) captured from an actual run, showing the httpx and tool-internal spans nesting under chatlas's own spans.
- Unify the lower-level instrumentation story into one ladder (httpx -> OpenLLMetry -> official per-provider libs) and connect it to the worked example, instead of an orphaned "combining" note far from it.
- Document the GenAI semantic-convention attributes accurately and keep the content-capture env var documented in a single place.
The figure is regenerated by scripts/gen_otel_trace_svg.py.
Summary
Chatlas now emits OpenTelemetry spans that capture the full structure of multi-turn conversations and tool execution, without requiring any provider-specific instrumentor libraries. When a TracerProvider is configured, every `chat()`/`stream()` call automatically produces a 3-level span hierarchy: invoke_agent (top-level), chat (per model call), and execute_tool (per tool invocation).
Users opt in with `pip install "chatlas[otel]"` and a standard TracerProvider setup (console exporter, Logfire, or any OTLP-compatible backend). The approach is consistent with Shiny for Python's OTel story: same `[otel]` extra pattern, same recommended tools, same config-module pattern.
Spans follow the GenAI semantic conventions and record token usage, response model/ID, and optionally full message content (gated by `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`). These framework spans complement (not replace) provider-specific SDK instrumentors like `opentelemetry-instrumentation-openai-v2`.
- `chatlas/_otel.py` module with span lifecycle functions
- `Chat` methods (sync + async for agent/chat/tool spans)
- `docs/get-started/monitor.qmd` with framework-level tracing docs
Test plan
- `pytest tests/test_otel.py`: 7/7 passing
- `pyright chatlas/_otel.py`: 0 errors
- `ruff check` and `ruff format`: clean