tracing(csharp): introduce Connection/Statement lifetime activities for hierarchical traces#501

Draft
eric-wang-1990 wants to merge 2 commits into main from tracing/477-connection-statement-lifetime

@eric-wang-1990

Collaborator

Motivation

Issue #477 shows every top-level driver call creates a fresh root Activity
with a new TraceId. The MemoryStress baseline measured 12,316 of 12,958
TraceIds (95%) as single-span throwaway ReadNextRecordBatchAsync roots,
breaking trace-tree visualization on fleet dashboards. The
ConcurrentDisposeTests evidence confirms every
DatabricksCompositeReader.Dispose and HiveServer2Connection.DisposeClient
span has ParentSpanId=0, so even within one connection Dispose cannot be
linked back to the Execute that spawned the cleanup.

Design

Mirror the existing object hierarchy (Connection owns Statement owns calls)
in the Activity tree by introducing two long-lived "lifetime" Activities:

DatabricksConnection.Lifetime (root)
├─ HiveServer2Connection.OpenAsync
├─ ApplyServerSidePropertiesAsync
├─ DatabricksStatement.Lifetime
│   ├─ HiveServer2Statement.ExecuteStatementAsync
│   ├─ ReadNextRecordBatchAsync × N
│   └─ DatabricksCompositeReader.Dispose
├─ DatabricksStatement.Lifetime  (next statement, same connection)
└─ HiveServer2Connection.DisposeClient

Existing per-call activities need no code change: StartActivity already
uses Activity.Current as the parent, and the lifetime Activities are on
the stack throughout the synchronous Database.Connect flow and all
subsequent operations.
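
As a rough illustration of why the per-call code needs no change, the sketch below starts a long-lived activity and then a per-call activity with no parent argument. The source name and operation names are hypothetical and not taken from the driver, and the sketch assumes .NET 5 or later, where W3C is the default Activity ID format.

```csharp
#nullable enable
using System;
using System.Diagnostics;

public static class ImplicitParentSketch
{
    // Hypothetical source name, used only for this illustration.
    private static readonly ActivitySource Source =
        new ActivitySource("Sketch.ImplicitParent");

    // Returns true when a per-call activity started without an explicit
    // parent ends up under the lifetime activity, in the same trace.
    public static bool PerCallChainsUnderLifetime()
    {
        // Without a sampling listener, StartActivity returns null.
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == Source.Name,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
        };
        ActivitySource.AddActivityListener(listener);

        // Starting the long-lived activity makes it Activity.Current.
        Activity? lifetime = Source.StartActivity("Connection.Lifetime");
        if (lifetime == null)
        {
            throw new InvalidOperationException("No listener sampled the activity.");
        }

        bool chained;
        // Per-call code is unchanged: no parent argument is passed.
        using (Activity? call = Source.StartActivity("OpenAsync"))
        {
            chained = call != null
                && call.ParentSpanId == lifetime.SpanId
                && call.TraceId == lifetime.TraceId;
        }

        lifetime.Stop(); // hands Activity.Current back to the lifetime's parent
        return chained;
    }
}
```

Calling `ImplicitParentSketch.PerCallChainsUnderLifetime()` exercises the same mechanism the per-call activities rely on, as long as the lifetime activity is still current when the call starts.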

Implementation choices

Static ActivitySource for the lifetime spans. The per-instance
ActivitySource held by ActivityTrace is disposed inside
TracingConnection.Dispose before DatabricksConnection.Dispose returns;
listeners stop receiving notifications once the source is disposed. A
process-wide static readonly source survives connection disposal and keeps
notifying listeners. (The failing test caught this: the first attempt used
the per-instance source, and the ConnectionLifetime Stop() notification was
silently dropped.)
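
The difference between the two kinds of source can be sketched in isolation. The example below is an assumption-laden stand-in, not the driver's code: both source names are hypothetical, and disposing the local source merely mimics what TracingConnection.Dispose is described as doing to the per-instance source.

```csharp
#nullable enable
using System;
using System.Diagnostics;

public static class SourceLifetimeSketch
{
    // Process-wide source: never disposed, so it keeps notifying listeners.
    private static readonly ActivitySource StaticSource =
        new ActivitySource("Sketch.StaticSource");

    // Returns how many Stop() notifications the listener saw per source.
    public static (int PerInstanceStops, int StaticStops) Run()
    {
        const string perInstanceName = "Sketch.PerInstanceSource";
        int perInstanceStops = 0;
        int staticStops = 0;

        // Stand-in for the per-connection source owned by the tracing layer.
        var perInstanceSource = new ActivitySource(perInstanceName);

        using var listener = new ActivityListener
        {
            ShouldListenTo = source =>
                source.Name == perInstanceName || source.Name == StaticSource.Name,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
            ActivityStopped = activity =>
            {
                if (activity.Source.Name == perInstanceName) perInstanceStops++;
                else staticStops++;
            },
        };
        ActivitySource.AddActivityListener(listener);

        Activity? onPerInstance = perInstanceSource.StartActivity("Lifetime");
        Activity? onStatic = StaticSource.StartActivity("Lifetime");

        // Mimics the per-instance source being disposed mid-Dispose.
        perInstanceSource.Dispose();

        onStatic?.Stop();      // static source is alive: listener is notified
        onPerInstance?.Stop(); // source already disposed: no notification

        return (perInstanceStops, staticStops);
    }
}
```

The trade-off is that a static source is shared by every connection in the process, so anything connection-specific has to travel on the Activity itself (tags, for example) rather than on the source.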

Explicit parent context for Statement → Connection.
Activity.Current is AsyncLocal-scoped. If CreateStatement runs on a
different async continuation than where the connection was opened, naïve
reliance on Activity.Current would not see the connection's Lifetime.
DatabricksStatement therefore passes connection.LifetimeActivity?.Context
explicitly to StartActivity, making parenting robust regardless of async
context.
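
A minimal sketch of the explicit-parent approach, under stated assumptions: the source and operation names are hypothetical, and clearing Activity.Current only simulates a continuation that never observed the connection's activity. The `StartActivity(name, kind, parentContext)` overload is the standard ActivitySource API.

```csharp
#nullable enable
using System;
using System.Diagnostics;

public static class ExplicitParentSketch
{
    // Hypothetical source name, used only for this illustration.
    private static readonly ActivitySource Source =
        new ActivitySource("Sketch.ExplicitParent");

    // Returns true when the statement activity is parented to the
    // connection activity even though Activity.Current does not see it.
    public static bool StatementChainsUnderConnection()
    {
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == Source.Name,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
        };
        ActivitySource.AddActivityListener(listener);

        Activity? connectionLifetime = Source.StartActivity("Connection.Lifetime");
        if (connectionLifetime == null)
        {
            throw new InvalidOperationException("No listener sampled the activity.");
        }

        // Simulate an async context that never observed the connection's
        // activity: implicit parenting would find nothing here.
        Activity.Current = null;

        // Pass the parent explicitly instead of relying on Activity.Current.
        Activity? statementLifetime = Source.StartActivity(
            "Statement.Lifetime",
            ActivityKind.Internal,
            connectionLifetime.Context);

        bool chained = statementLifetime != null
            && statementLifetime.ParentSpanId == connectionLifetime.SpanId
            && statementLifetime.TraceId == connectionLifetime.TraceId;

        statementLifetime?.Stop();
        connectionLifetime.Stop();
        return chained;
    }
}
```

If the connection's lifetime activity is null (no listener attached), passing `default` as the parent context makes StartActivity fall back to Activity.Current, which is presumably why the description uses `LifetimeActivity?.Context`.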

Stop order matters. ConnectionLifetime is stopped after base.Dispose
runs DisposeClient and emits DELETE_SESSION telemetry, so those tail spans
chain underneath the lifetime root. Statement lifetime is stopped after
base.Dispose emits the CLOSE_STATEMENT telemetry. Both Stop() calls are
idempotent via a _lifetimeActivityStopped guard.

Lifetime opens after ValidateProperties. If the parser throws (e.g.
ArgumentException from a bad adbc.databricks.fetch_heartbeat_interval
value), no half-open Activity is leaked — the constructor never reaches the
StartActivity call.
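
The ordering rules from the last two paragraphs can be sketched together. Everything below is illustrative rather than the driver's actual code: the class names, the source name, and the `heartbeatSeconds` check are hypothetical stand-ins for DatabricksConnection, its base class, and ValidateProperties.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;

public class BaseConnectionSketch : IDisposable
{
    // Static source shared by all instances (hypothetical name).
    protected static readonly ActivitySource Source =
        new ActivitySource("Sketch.StopOrder");

    public virtual void Dispose()
    {
        // Stand-in for the tail span emitted during base disposal.
        using Activity? tail = Source.StartActivity("DisposeClient");
    }
}

public sealed class ConnectionSketch : BaseConnectionSketch
{
    private readonly Activity? _lifetime;
    private bool _lifetimeActivityStopped;

    public ConnectionSketch(int heartbeatSeconds)
    {
        // Validate first: if this throws, no activity has been started yet.
        if (heartbeatSeconds <= 0)
        {
            throw new ArgumentException(
                "Heartbeat interval must be positive.", nameof(heartbeatSeconds));
        }
        _lifetime = Source.StartActivity("Connection.Lifetime");
    }

    public override void Dispose()
    {
        // Tail spans start while the lifetime activity is still current.
        base.Dispose();

        // Guard so that a repeated Dispose does not stop the lifetime twice.
        if (_lifetimeActivityStopped) return;
        _lifetimeActivityStopped = true;
        _lifetime?.Stop();
    }

    // Runs one construct/dispose cycle and returns "name <- parent name"
    // for each stopped activity, in stop order.
    public static List<string> RunOnce()
    {
        var stopped = new List<string>();
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == Source.Name,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
            ActivityStopped = activity => stopped.Add(
                activity.OperationName + " <- "
                + (activity.Parent?.OperationName ?? "<none>")),
        };
        ActivitySource.AddActivityListener(listener);

        var connection = new ConnectionSketch(heartbeatSeconds: 30);
        connection.Dispose();
        return stopped;
    }
}
```

In this sketch the tail span stops first and reports the lifetime activity as its parent, and the lifetime activity stops last. Swapping the two statements in `Dispose` would leave the tail span without that parent.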

Tests

  • ConnectionStatementLifetimeTraceTests.ConnectionLifetime_ParentsStatementAndCalls_Issue477
    drives a SELECT 1 workflow through a Thrift connection, captures every
    Activity emitted on AdbcDrivers.*, and asserts:

    1. DatabricksConnection.Lifetime exists and is a root span
    2. DatabricksStatement.Lifetime is a child of ConnectionLifetime,
      same TraceId
    3. At least one per-call activity is a child of ConnectionLifetime
      (proves the lifetime is on the Activity.Current stack during sync
      Database.Connect)
    4. At least one per-call activity is a child of StatementLifetime
      (proves the explicit parent context works)
    5. A single TraceId covers every activity on the
      AdbcDrivers.Databricks source

    Before the fix all 5 assertions fail: no Lifetime spans exist at all,
    and the 24 captured activities split across 9 distinct TraceIds, which
    matches the #477 baseline.

    After the fix every span shares one TraceId and the parent relationships
    match the diagram above.
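
The capture-and-assert approach described above might look roughly like the following. This is a sketch, not the contents of ConnectionStatementLifetimeTraceTests.cs: the helper names are hypothetical, and the real test drives a live Thrift connection rather than an arbitrary workload delegate.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class TraceShapeSketch
{
    // Runs the workload and returns every activity that stopped on a
    // source whose name starts with the given prefix.
    public static List<Activity> Capture(string sourcePrefix, Action workload)
    {
        var captured = new List<Activity>();
        using var listener = new ActivityListener
        {
            ShouldListenTo = source =>
                source.Name.StartsWith(sourcePrefix, StringComparison.Ordinal),
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
            ActivityStopped = activity =>
            {
                lock (captured) { captured.Add(activity); }
            },
        };
        ActivitySource.AddActivityListener(listener);

        workload();
        return captured;
    }

    // A root span has no parent object and an all-zero ParentSpanId.
    public static bool IsRoot(Activity activity) =>
        activity.Parent == null && activity.ParentSpanId == default;

    public static bool IsChildOf(Activity child, Activity parent) =>
        child.TraceId == parent.TraceId && child.ParentSpanId == parent.SpanId;

    public static int DistinctTraceCount(IEnumerable<Activity> activities) =>
        activities.Select(a => a.TraceId).Distinct().Count();
}
```

With helpers like these, the five assertions reduce to finding the two Lifetime spans by OperationName, checking `IsRoot` and `IsChildOf` on them, and requiring `DistinctTraceCount` to be 1.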

Regression sweep

All targeted regressions pass against a live pecotesting Thrift connection:

Filter                     Result
DatabricksConnectionTest   93 passed (0 failed)
StatementTests             94 passed (3 skipped)
CloseOperationE2ETest      3 passed
TelemetryBaselineTests     10 passed
ConcurrentDisposeTests     3 passed
Telemetry (broader sweep)  414 passed (7 skipped)

No existing test required modification — the lifetime spans don't disturb
any assertion that depended on per-call activity names, status codes,
events, or tags. Tests that checked "is this span a root" continue to
hold because they only checked specific operation names; the new Lifetime
spans are net-additive.

Files touched

  • csharp/src/DatabricksConnection.cs (+45 net)
  • csharp/src/DatabricksStatement.cs (+27 net)
  • csharp/test/E2E/Telemetry/ConnectionStatementLifetimeTraceTests.cs (+255, new file)

No submodule changes. No new public API surface: the only new member is an
internal LifetimeActivity on the connection, consumed by the statement
constructor.

Closes #477

This pull request and its description were written by Isaac.

… for #477)

Captures every Activity emitted on AdbcDrivers.* during a SELECT 1 workflow
and asserts the trace-tree shape required by #477:

- DatabricksConnection.Lifetime span exists and is a root
- DatabricksStatement.Lifetime is a child of ConnectionLifetime, same TraceId
- At least one per-call activity is a child of each Lifetime
- A single TraceId covers the entire connection lifetime

Before the fix the first assertion fails with "no DatabricksConnection.Lifetime
span found" — the captured 24 activities split across 9 distinct TraceIds,
confirming the #477 baseline (12,958 TraceIds / 37,950 spans in MemoryStress,
95% throwaway single-span roots).

Co-authored-by: Isaac
…ities

Mirrors the object hierarchy (Connection owns Statement owns calls) in the
Activity tree so every per-call activity emitted on AdbcDrivers.Databricks
during one connection shares a single TraceId — restoring trace-tree
visualization on fleet dashboards.

Implementation:
- DatabricksConnection opens a long-lived "DatabricksConnection.Lifetime"
  Activity at the end of its constructor (after ValidateProperties succeeds).
  The sync call site DatabricksDatabase.Connect then runs OpenAsync().Wait()
  and ApplyServerSidePropertiesAsync().Wait() in the same async context, so
  those per-call activities chain naturally via Activity.Current — no extra
  plumbing required for them.
- DatabricksStatement opens a "DatabricksStatement.Lifetime" Activity in its
  constructor, parented EXPLICITLY to the connection's lifetime context via
  the ActivityContext overload of StartActivity. The explicit parent guards
  against cross-async-context construction where Activity.Current on the
  thread executing the Statement constructor would not observe the
  connection-open context.
- Both lifetime Activities live on a STATIC ActivitySource so they survive
  TracingConnection.Dispose, which disposes the per-instance source mid-Dispose
  (and would otherwise silently drop the Stop() call to listeners).
- DatabricksConnection.Dispose stops its lifetime Activity AFTER base.Dispose
  has run DisposeClient + DELETE_SESSION telemetry, so those tail spans
  chain underneath the root. Idempotent stop via _lifetimeActivityStopped.
- DatabricksStatement.Dispose stops its lifetime Activity AFTER base.Dispose
  has issued CLOSE_STATEMENT telemetry. Idempotent.

Constraints:
- No submodule changes. The per-call activities in HiveServer2 already use
  Activity.Current as their parent, so they pick up the lifetime parent
  automatically without code change.
- Existing OperationName values are unchanged; only parent relationships
  and TraceIds change.

Test passes against a live Thrift connection: every span in a SELECT 1
workflow now shares one TraceId, ConnectionLifetime is the only root span
on AdbcDrivers.Databricks, and DisposeClient / Dispose / ExecuteStatement
all correctly chain under their respective lifetime activities.

Closes #477

Co-authored-by: Isaac

Development

Successfully merging this pull request may close these issues.

tracing(csharp): no driver-session root span — every top-level call creates a fresh TraceId
