Skip to content

fix(bloomplanner): Skip queue_length metrics cleanup while tenant queue is non-empty#21758

Open
o6ivp wants to merge 2 commits intografana:mainfrom
o6ivp:fix/bloomplanner-queue-length-active-users
Open

fix(bloomplanner): Skip queue_length metrics cleanup while tenant queue is non-empty#21758
o6ivp wants to merge 2 commits intografana:mainfrom
o6ivp:fix/bloomplanner-queue-length-active-users

Conversation

@o6ivp
Copy link
Copy Markdown

@o6ivp o6ivp commented May 4, 2026

What this PR does / why we need it:

The active-users cleanup ticker only saw Enqueue traffic, so a tenant whose backlog was being drained for longer than the inactivity timeout (1h) without any new enqueues had its loki_bloomplanner_queue_length series deleted mid-drain. Subsequent Dequeue calls then lazily recreated the series and decremented it into negative values.

Gate the per-tenant metrics cleanup on the queue actually being empty — if RequestQueue.GetUserQueueLength(tenant) is non-zero we skip metrics.Cleanup and re-register the user so a future tick can purge the metrics once the backlog drains. The empty-queue check is the necessary-and-sufficient condition: it covers backlogs being slowly drained (the original report) as well as the case where Dequeue itself stops happening for longer than the inactivity timeout (e.g. all builders busy on other tenants), which a timestamp-refresh approach could not protect against.

Which issue(s) this PR fixes:
Fixes #19490

Special notes for your reviewer:

The bug only surfaces on the bloom planner because its tasks are long-lived relative to the 1h inactivity timeout; other consumers of pkg/queue (query frontend, scheduler, bloom gateway) drain in seconds and never reach this state.

NewQueue is split into a public default-timing wrapper and an internal newQueue that takes the cleanup interval and inactive timeout, so the regression test can exercise the cleanup path in milliseconds instead of hours.

RequestQueue.GetUserQueueLength is added on pkg/queue so the bloomplanner cleanup callback can read the live per-tenant count under the queue's mutex.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • NA Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
  • NA Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md

Note

Medium Risk
Touches bloom planner queue cleanup behavior and adds a new RequestQueue method used under lock; incorrect interactions could cause stale metrics or cleanup not firing, but scope is limited to metrics/housekeeping logic.

Overview
Prevents loki_bloomplanner_queue_length (and related per-tenant metrics) from being cleaned up while a tenant still has pending work, avoiding series recreation and negative decrements during long drains.

NewQueue is split into a default wrapper plus an internal newQueue that accepts cleanup timings (used by tests). The active-user cleanup callback now checks RequestQueue.GetUserQueueLength() and re-registers the user instead of cleaning metrics when the queue is still non-empty.

Adds RequestQueue.GetUserQueueLength() and a regression test that simulates millisecond-scale inactivity timeouts to verify metrics persist during draining and are removed after the queue empties.

Reviewed by Cursor Bugbot for commit 22c0eb4. Bugbot is set up for automated code reviews on this repo. Configure here.

…o keep queue_length non-negative

The active-users cleanup ticker only saw `Enqueue` traffic, so a tenant
whose backlog was being drained for longer than the inactivity timeout
without any new enqueues had its `loki_bloomplanner_queue_length` series
deleted mid-drain. Subsequent `Dequeue` calls then lazily recreated the
series and decremented it into negative values.

Refresh the active user timestamp from `Dequeue` and `Release` as well
so the series is preserved while builders are still working through the
backlog.

Fixes grafana#19490

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@o6ivp o6ivp requested a review from a team as a code owner May 4, 2026 00:08
// Both Enqueue and Dequeue/Release refresh the user's timestamp so that an in-flight backlog being drained
// by builders keeps the gauge series alive even when no new tasks are enqueued for a while.
activeUsers := util.NewActiveUsersCleanupService(activeUsersCleanupInterval, activeUsersInactiveTimeout, func(user string) {
metrics.Cleanup(user)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether it would make more sense to check the queue length of an inactive user before cleaning up the their metrics.
It could still be the case that there haven't been any dequeue operations for more than the inactive period.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, took the suggestion in 22c0eb4.

…queue

Per review feedback on grafana#21758, switch from refreshing the active-user timestamp
on Dequeue/Release to checking the per-tenant queue length directly before
purging metrics. The length check is the necessary-and-sufficient condition for
safe cleanup and also covers the case where Dequeue stops happening for longer
than the inactivity timeout (e.g. all builders busy on other tenants), which
the timestamp-only fix could not protect against.

Adds RequestQueue.GetUserQueueLength so the bloomplanner cleanup callback can
read the live count under the queue lock. When the queue is non-empty the
callback re-registers the user so a future tick can purge the metrics once
the backlog drains.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@o6ivp o6ivp changed the title fix(bloomplanner): Refresh active user timestamp on Dequeue/Release to keep queue_length non-negative fix(bloomplanner): Skip queue_length metrics cleanup while tenant queue is non-empty May 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

loki_bloomplanner_queue_length can fall into negative

2 participants