Skip to content

fix(csharp): tag download_slot_acquired with wait_duration_ms#495

Merged
eric-wang-1990 merged 3 commits into
mainfrom
tracing/483-slot-wait-duration
Jun 8, 2026
Merged

fix(csharp): tag download_slot_acquired with wait_duration_ms#495
eric-wang-1990 merged 3 commits into
mainfrom
tracing/483-slot-wait-duration

Conversation

@eric-wang-1990

Copy link
Copy Markdown
Collaborator

Summary

  • Closes tracing(csharp): cloudfetch.download_slot_acquired carries no wait_duration — slot contention invisible #483
  • cloudfetch.download_slot_acquired previously fired with only chunk_index and is_sequential_mode, so slot contention was invisible in traces. Under CloudFetchStressTests the event fired 252 times with no way to distinguish "acquired instantly" from "queued behind another download."
  • This PR wraps the SemaphoreSlim.WaitAsync in a Stopwatch and attaches wait_duration_ms (long, milliseconds) plus queue_depth (count of pending downloads at acquisition) to the event.

Why both tags?

The issue calls for wait_duration_ms as the primary signal and lists queue_depth as an optional follow-up. I included both because _downloadQueue.Count was already used in the surrounding code (see cloudfetch.download_loop_start at line 326) — capturing it under the stopwatch snapshot is a single extra field load with no new locking, so leaving it out would be more work to justify than to keep.

Test (RED -> GREEN)

  • New test CloudFetchE2ETest.DownloadSlotAcquired_EmitsWaitDurationMs_Issue483 runs a 24-file CloudFetch download (SELECT * FROM main.tpcds_sf100_delta.store_sales LIMIT 1000000 with MaxBytesPerFile=10MB) and asserts that every captured cloudfetch.download_slot_acquired event carries a non-negative integer wait_duration_ms tag.
  • RED: pre-fix run captures 24 events, all with only chunk_index/is_sequential_mode, no wait_duration_ms.
  • GREEN: post-fix run passes. Actual values from the verified run show real backpressure (chunks 0-2 acquire immediately at 0ms, then later chunks wait 50-605ms behind the default _maxParallelDownloads=3 cap):
    wait_duration_ms = 0    queue_depth = 0
    wait_duration_ms = 0    queue_depth = 4
    wait_duration_ms = 605  queue_depth = 3
    wait_duration_ms = 94   queue_depth = 3
    ...
    
  • Full CloudFetchE2ETest suite (21 tests) passes after the fix.

Files changed

  • csharp/src/Reader/CloudFetch/CloudFetchDownloader.cs (+23/-2)
  • csharp/test/E2E/CloudFetchE2ETest.cs (+139)

This pull request and its description were written by Isaac.

…nt (failing for #483)

The cloudfetch.download_slot_acquired event currently carries only
chunk_index and is_sequential_mode. Under CloudFetchStressTests it
fires 252 times with no timing information, so an operator looking at
the trace cannot tell whether each slot was acquired immediately or
whether the call blocked waiting for another download to finish.

This test fails RED today (24 events emitted, none has wait_duration_ms)
and will pass once the slot-wait is wrapped in a Stopwatch and the
elapsed ms is attached to the event (issue #483).
…ration_ms

Previously the event fired with no timing information, so slot
contention was invisible to operators. Under CloudFetchStressTests the
event fired 252 times with only chunk_index and is_sequential_mode, so
operators had no way to distinguish "acquired instantly" from "queued
behind 3 other downloads."

Now wrap the SemaphoreSlim.WaitAsync in a Stopwatch and attach
wait_duration_ms (long, milliseconds) to the event so dashboards can
surface "downloads are slow because they're queued." Also include
queue_depth (count of items still pending in the download queue at the
moment of acquisition) — it's free to compute alongside the stopwatch
snapshot and gives the event a second contention signal.

Verified end-to-end against a 24-file CloudFetch download
(SELECT * FROM main.tpcds_sf100_delta.store_sales LIMIT 1000000): the
first chunks acquire slots in 0ms while later chunks wait 50-600ms
behind the 3-parallel-download cap, matching expected backpressure.

Closes #483
@eric-wang-1990 eric-wang-1990 changed the title tracing(csharp): tag download_slot_acquired with wait_duration_ms fix(csharp): tag download_slot_acquired with wait_duration_ms May 29, 2026
@eric-wang-1990 eric-wang-1990 marked this pull request as ready for review May 29, 2026 06:25

@msrathore-db msrathore-db left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified live: test passes; wait_duration_ms and queue_depth are emitted on the download_slot_acquired event. Clean change. LGTM. (Tiny nit, non-blocking: queue_depth is snapshotted just before WaitAsync rather than at acquisition — cosmetic.)

@eric-wang-1990 eric-wang-1990 added this pull request to the merge queue Jun 1, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 1, 2026
@eric-wang-1990 eric-wang-1990 added this pull request to the merge queue Jun 8, 2026
Merged via the queue into main with commit bd46672 Jun 8, 2026
20 checks passed
@eric-wang-1990 eric-wang-1990 deleted the tracing/483-slot-wait-duration branch June 8, 2026 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tracing(csharp): cloudfetch.download_slot_acquired carries no wait_duration — slot contention invisible

2 participants