fix(bloomplanner): Skip queue_length metrics cleanup while tenant queue is non-empty#21758
Open
o6ivp wants to merge 2 commits intografana:mainfrom
Open
fix(bloomplanner): Skip queue_length metrics cleanup while tenant queue is non-empty#21758o6ivp wants to merge 2 commits intografana:mainfrom
o6ivp wants to merge 2 commits intografana:mainfrom
Conversation
…o keep queue_length non-negative The active-users cleanup ticker only saw `Enqueue` traffic, so a tenant whose backlog was being drained for longer than the inactivity timeout without any new enqueues had its `loki_bloomplanner_queue_length` series deleted mid-drain. Subsequent `Dequeue` calls then lazily recreated the series and decremented it into negative values. Refresh the active user timestamp from `Dequeue` and `Release` as well so the series is preserved while builders are still working through the backlog. Fixes grafana#19490 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chaudum
reviewed
May 9, 2026
| // Both Enqueue and Dequeue/Release refresh the user's timestamp so that an in-flight backlog being drained | ||
| // by builders keeps the gauge series alive even when no new tasks are enqueued for a while. | ||
| activeUsers := util.NewActiveUsersCleanupService(activeUsersCleanupInterval, activeUsersInactiveTimeout, func(user string) { | ||
| metrics.Cleanup(user) |
Contributor
There was a problem hiding this comment.
I wonder whether it would make more sense to check the queue length of an inactive user before cleaning up the their metrics.
It could still be the case that there haven't been any dequeue operations for more than the inactive period.
…queue Per review feedback on grafana#21758, switch from refreshing the active-user timestamp on Dequeue/Release to checking the per-tenant queue length directly before purging metrics. The length check is the necessary-and-sufficient condition for safe cleanup and also covers the case where Dequeue stops happening for longer than the inactivity timeout (e.g. all builders busy on other tenants), which the timestamp-only fix could not protect against. Adds RequestQueue.GetUserQueueLength so the bloomplanner cleanup callback can read the live count under the queue lock. When the queue is non-empty the callback re-registers the user so a future tick can purge the metrics once the backlog drains. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does / why we need it:
The active-users cleanup ticker only saw
Enqueuetraffic, so a tenant whose backlog was being drained for longer than the inactivity timeout (1h) without any new enqueues had itsloki_bloomplanner_queue_lengthseries deleted mid-drain. SubsequentDequeuecalls then lazily recreated the series and decremented it into negative values.Gate the per-tenant metrics cleanup on the queue actually being empty — if
RequestQueue.GetUserQueueLength(tenant)is non-zero we skipmetrics.Cleanupand re-register the user so a future tick can purge the metrics once the backlog drains. The empty-queue check is the necessary-and-sufficient condition: it covers backlogs being slowly drained (the original report) as well as the case whereDequeueitself stops happening for longer than the inactivity timeout (e.g. all builders busy on other tenants), which a timestamp-refresh approach could not protect against.Which issue(s) this PR fixes:
Fixes #19490
Special notes for your reviewer:
The bug only surfaces on the bloom planner because its tasks are long-lived relative to the 1h inactivity timeout; other consumers of
pkg/queue(query frontend, scheduler, bloom gateway) drain in seconds and never reach this state.NewQueueis split into a public default-timing wrapper and an internalnewQueuethat takes the cleanup interval and inactive timeout, so the regression test can exercise the cleanup path in milliseconds instead of hours.RequestQueue.GetUserQueueLengthis added onpkg/queueso the bloomplanner cleanup callback can read the live per-tenant count under the queue's mutex.Checklist
CONTRIBUTING.mdguide (required)docs/sources/setup/upgrade/_index.mdNote
Medium Risk
Touches bloom planner queue cleanup behavior and adds a new
RequestQueuemethod used under lock; incorrect interactions could cause stale metrics or cleanup not firing, but scope is limited to metrics/housekeeping logic.Overview
Prevents
loki_bloomplanner_queue_length(and related per-tenant metrics) from being cleaned up while a tenant still has pending work, avoiding series recreation and negative decrements during long drains.NewQueueis split into a default wrapper plus an internalnewQueuethat accepts cleanup timings (used by tests). The active-user cleanup callback now checksRequestQueue.GetUserQueueLength()and re-registers the user instead of cleaning metrics when the queue is still non-empty.Adds
RequestQueue.GetUserQueueLength()and a regression test that simulates millisecond-scale inactivity timeouts to verify metrics persist during draining and are removed after the queue empties.Reviewed by Cursor Bugbot for commit 22c0eb4. Bugbot is set up for automated code reviews on this repo. Configure here.