Optimize queued_jobs_by_label query #8092
Open
huydhn wants to merge 1 commit into
Summary
`/api/clickhouse/queued_jobs_by_label` is the second HUD endpoint to hit ClickHouse's 32 GiB total-memory limit (`MEMORY_LIMIT_EXCEEDED ... While executing ReplacingSorted`). Same root cause as #8088: `FINAL` on `workflow_job` over a multi-day window, repeated across two CTEs.

This applies the same optimization pattern as #8088 to `queued_jobs_by_label`:

- Drop `FINAL` on `workflow_job`; replace it with a manual ReplacingMergeTree dedup via `LIMIT 1 BY id ORDER BY _inserted_at DESC`, scoped to the candidate id set.
- Filter `workflow_run` (status, repo, run_id in candidates) before `FINAL` so the JOIN's right side is tiny.
- Use a single `latest_jobs` + `latest_workflows` pair so the EC2 and ARC paths share one scan instead of two.

Measurements (production ClickHouse, current state)
≈3-4× peak-memory reduction per query; result rows identical (numeric `queue_s` drift between runs is just `CURRENT_TIMESTAMP()` advancing, not a correctness change).

Test plan

- `hud.pytorch.org/api/clickhouse/queued_jobs_by_label` after rollout: error rate should drop from non-zero to ~0%.
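
For reviewers less familiar with the dedup pattern named in the Summary, here is a minimal in-memory sketch of what "newest row per `id`, scoped to a candidate id set" means. It is a Python model of the semantics, not the query in this PR: the function name `latest_by_id`, the `status` column, and the sample rows are all hypothetical, and it assumes `_inserted_at` ordering identifies the newest version of a row.

```python
from typing import Iterable

# Hypothetical row shape: only `id` and `_inserted_at` are named in the PR
# description; `status` is an illustrative payload column.
Row = dict


def latest_by_id(rows: Iterable[Row], candidate_ids: set[int]) -> list[Row]:
    """Keep only the newest version of each candidate id.

    Models: filter to the candidate ids first, order by _inserted_at
    descending, then keep the first row seen for each id.
    """
    # Scope to candidates before any sorting, so the work is proportional
    # to the candidate set rather than the whole table.
    scoped = [r for r in rows if r["id"] in candidate_ids]
    scoped.sort(key=lambda r: r["_inserted_at"], reverse=True)

    seen: set[int] = set()
    newest: list[Row] = []
    for row in scoped:
        if row["id"] not in seen:
            seen.add(row["id"])
            newest.append(row)
    return newest


# Made-up sample data: id 1 has two versions, id 3 is not a candidate.
rows = [
    {"id": 1, "_inserted_at": 10, "status": "queued"},
    {"id": 1, "_inserted_at": 20, "status": "in_progress"},
    {"id": 2, "_inserted_at": 15, "status": "queued"},
    {"id": 3, "_inserted_at": 5, "status": "queued"},
]
result = latest_by_id(rows, candidate_ids={1, 2})
```

In this model the filter runs before the sort, which mirrors the idea in the Summary of scoping the dedup to the candidate ids instead of deduplicating the whole multi-day window.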