Drain polled tasks on shutdown #2261
select {
case bw.taskQueueCh <- &polledTask{task: task, permit: slotPermit}:
	didSendTask = true
case <-bw.stopCh:
Just checking that the stop channel is still used elsewhere, since we removed it in two spots.
yep! still used in plenty of other places
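For context on why the `stopCh` case was removed from this send, here is a minimal sketch of the old pattern. It assumes a simplified `polledTask` with only an `id` field (the real type also carries `task` and `permit`), and `sendOrStop` is a hypothetical helper, not SDK code. Once `stopCh` is closed and the dispatcher is not ready to receive, `select` can only take the stop case, so a task that was already polled is discarded.

```go
package main

import "fmt"

// polledTask is a simplified stand-in; the real type also holds task and permit.
type polledTask struct{ id int }

// sendOrStop mirrors the removed pattern: if stopCh is closed and nobody is
// receiving on taskQueueCh, the stop case wins and the task is dropped.
// It reports whether the task was handed off.
func sendOrStop(taskQueueCh chan<- *polledTask, stopCh <-chan struct{}, t *polledTask) bool {
	select {
	case taskQueueCh <- t:
		return true
	case <-stopCh:
		return false // the already-polled task is silently dropped
	}
}

func main() {
	taskQueueCh := make(chan *polledTask) // unbuffered, dispatcher not receiving
	stopCh := make(chan struct{})
	close(stopCh) // shutdown has started

	sent := sendOrStop(taskQueueCh, stopCh, &polledTask{id: 1})
	fmt.Println("task handed to dispatcher:", sent) // prints "task handed to dispatcher: false"
}
```

Replacing the `select` with a plain blocking send removes the drop, but it then depends on the dispatcher continuing to receive until the channel is closed.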
recheck
Force-pushed from b930672 to 29b355a.
bw.limiterContextCancel()

// Wait for pollers to finish. (pollTaskServiceTimeOut) bounds this if the connection is broken.
bw.pollerWG.Wait()
Unbounded pollerWG.Wait() bypasses user-configured stopTimeout
Medium Severity
bw.pollerWG.Wait() in Stop() blocks without any timeout, and runs before awaitWaitGroup(&bw.stopWG, bw.options.stopTimeout). Combined with doPoll now waiting unconditionally on <-doneC (bounded only by pollTaskServiceTimeOut = 70s), Stop() can block for up to 70 seconds before the user's stopTimeout even begins counting. Previously, a 5-second fallback cancellation bounded this. In failure scenarios (broken gRPC connection, unresponsive server), total Stop() duration becomes ~70s + stopTimeout instead of just stopTimeout.
}
fmt.Println("DEBUG: doPoll graceful path: stopC fired, waiting for poll to complete")
<-doneC
fmt.Println("DEBUG: doPoll graceful path: poll completed after stopC")
Debug print statements left in production code
High Severity
Two fmt.Println("DEBUG: ...") statements were left in the graceful shutdown path of doPoll. These will print to stdout in production whenever a worker shuts down with workerPollCompleteOnShutdown enabled. No other production code in this file uses fmt.Println — the existing usages are in test files and CLI tools.


What was changed
Refactored worker shutdown to use a two-stage approach: pollers shut down first, then the task dispatcher drains any remaining tasks before exiting. This ensures tasks polled during shutdown are processed rather than silently dropped.
Key changes:
Why?
PR #2199 changed shutdown to let the server complete in-flight polls instead of cancelling them. This exposed a pre-existing race: when a poller received a task during shutdown, the Go SDK could silently drop the task. The dispatcher had the same issue: it could exit on stopCh before reading pending tasks from the channel.
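The dispatcher half of the race can be illustrated with a sketch that assumes a buffered channel of plain ints in place of real tasks. `oldDispatcher`, `newDispatcher`, and `fill` are hypothetical. When several `select` cases are ready, Go picks one pseudo-randomly, so once `stopCh` is closed each iteration of the old loop may exit while tasks are still buffered. Ranging over the channel has no such exit.

```go
package main

import "fmt"

// oldDispatcher returns as soon as stopCh is selected, even if tasks are
// still sitting in the channel buffer.
func oldDispatcher(taskQueueCh <-chan int, stopCh <-chan struct{}) int {
	handled := 0
	for {
		select {
		case <-stopCh:
			return handled
		case <-taskQueueCh:
			handled++
		}
	}
}

// newDispatcher returns only after taskQueueCh is closed and drained.
func newDispatcher(taskQueueCh <-chan int) int {
	handled := 0
	for range taskQueueCh {
		handled++
	}
	return handled
}

// fill returns a channel pre-loaded with n tasks, optionally closed.
func fill(n int, closeIt bool) chan int {
	ch := make(chan int, n)
	for i := 0; i < n; i++ {
		ch <- i
	}
	if closeIt {
		close(ch)
	}
	return ch
}

func main() {
	stopCh := make(chan struct{})
	close(stopCh)
	early := 0
	for i := 0; i < 1000; i++ {
		if oldDispatcher(fill(3, false), stopCh) < 3 {
			early++
		}
	}
	fmt.Println("old dispatcher left tasks behind in", early, "of 1000 runs")
	fmt.Println("new dispatcher handled", newDispatcher(fill(3, true)), "of 3")
}
```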
This aligns the Go SDK's shutdown with how Core SDK handles it:
Checklist
Closes Drain polled tasks on shutdown #1197
How was this tested:
Any docs updates needed?
No
Note
Medium Risk
Touches worker shutdown concurrency (poller lifecycle, channel closing, dispatcher draining), which can affect graceful termination and task processing ordering. Adds blocking sends and new wait conditions that could deadlock if assumptions break, so needs careful review/testing.
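One of the assumptions this design relies on can be shown directly: in Go, a send on a closed channel always panics. That is why `taskQueueCh` may be closed only after every poller has returned. The other failure mode, a blocking send that never completes because the dispatcher stopped receiving, shows up as a hang rather than a panic. `trySend` is a hypothetical helper for this sketch.

```go
package main

import "fmt"

// trySend reports whether sending v on ch panicked. A send on a closed
// channel panics, so closing taskQueueCh while any poller can still send
// would crash the worker.
func trySend(ch chan int, v int) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	ch <- v
	return false
}

func main() {
	ch := make(chan int, 1)
	close(ch) // closed too early, while a "poller" still wants to send
	fmt.Println("send after close panicked:", trySend(ch, 1)) // prints "send after close panicked: true"
}
```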
Overview
Implements a two-stage worker shutdown so tasks polled during shutdown are still dispatched and processed instead of being dropped: pollers exit first, then the task dispatcher drains remaining items from taskQueueCh until it is closed.
This adds a dedicated pollerWG to track poller goroutines, closes taskQueueCh only after all pollers finish, updates runTaskDispatcher to range over the channel (no early exit on stopCh), and makes pollTask always send polled tasks to the dispatcher. It also removes the prior 5s shutdown timeout hack in doPoll (now waiting for poll completion) and adds a unit test TestTaskNotDroppedDuringShutdown to cover the shutdown race.
Written by Cursor Bugbot for commit be5c0e4.