Optimize queued_jobs_by_label query #8092

Open
huydhn wants to merge 1 commit into main from optimize-queued-jobs-by-label


@huydhn huydhn commented May 16, 2026

Summary

/api/clickhouse/queued_jobs_by_label is the second HUD endpoint to hit ClickHouse's 32 GiB total-memory limit (MEMORY_LIMIT_EXCEEDED ... While executing ReplacingSorted). Same root cause as #8088: FINAL on workflow_job over a multi-day window, repeated across two CTEs.

This applies the same optimization pattern as #8088 to queued_jobs_by_label:

  • Drop FINAL on workflow_job; replace with a manual ReplacingMergeTree dedup via LIMIT 1 BY id ORDER BY _inserted_at DESC, scoped to the candidate id set.
  • Pre-filter workflow_run (status, repo, run_id in candidates) before FINAL so the JOIN's right side is tiny.
  • Consolidate the per-CTE FINAL JOINs into a single latest_jobs + latest_workflows pair so the EC2 and ARC paths share one scan instead of two.
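
As an illustration of what the manual dedup in the first bullet computes, here is a minimal in-memory model in Python. It is not the PR's query: only `id` and `_inserted_at` come from the description above, while the `status` column, the sample rows, and the `latest_by_id` helper are assumptions made for the sketch.

```python
from typing import Iterable

# Hypothetical row shape: only `id` and `_inserted_at` are named in the PR text;
# `status` is an illustrative payload column.
Row = dict


def latest_by_id(rows: Iterable[Row], candidate_ids: set) -> list:
    """Keep the most recently inserted version of each candidate id."""
    latest = {}
    for row in rows:
        if row["id"] not in candidate_ids:
            continue  # scope the dedup to the candidate id set
        seen = latest.get(row["id"])
        if seen is None or row["_inserted_at"] > seen["_inserted_at"]:
            latest[row["id"]] = row
    # Sort by id only to make the output deterministic for inspection.
    return sorted(latest.values(), key=lambda r: r["id"])


rows = [
    {"id": 1, "_inserted_at": 10, "status": "queued"},
    {"id": 1, "_inserted_at": 20, "status": "in_progress"},  # newer version of id 1
    {"id": 2, "_inserted_at": 15, "status": "queued"},
    {"id": 3, "_inserted_at": 5, "status": "queued"},  # not in the candidate set
]
deduped = latest_by_id(rows, candidate_ids={1, 2})
```

Restricting the pass to the candidate ids is what keeps the working set small in this model; a table-wide FINAL would instead have to merge every row version in the scanned window.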

Measurements (production ClickHouse, current state)

| Scenario | wall | mem | read_rows | read_bytes |
| --- | --- | --- | --- | --- |
| OLD | 3.3-3.7 s | 2.72 GB | 7.09 M | 4.79 GB |
| NEW | 3.5-3.8 s | 0.61-1.00 GB | 6.84 M | 4.11 GB |

Peak memory per query drops by roughly 2.7-4.5× (2.72 GB down to 0.61-1.00 GB), while wall time is essentially unchanged. Result rows are identical; the numeric queue_s drift between runs is just CURRENT_TIMESTAMP() advancing, not a correctness change.
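
The reduction factor can be checked directly from the mem figures in the measurements above. This is a small arithmetic sketch; the variable names are illustrative.

```python
# Peak memory figures taken from the measurements above (GB).
old_peak_gb = 2.72
new_peak_gb_low, new_peak_gb_high = 0.61, 1.00

# Reduction factor at both ends of the NEW range.
reduction_min = old_peak_gb / new_peak_gb_high  # NEW at its highest peak
reduction_max = old_peak_gb / new_peak_gb_low   # NEW at its lowest peak

print(f"peak-memory reduction: {reduction_min:.2f}x to {reduction_max:.2f}x")
```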

Test plan

  • OLD vs NEW return the same (count, machine_type) tuples on prod.
  • Peak memory drops from ~2.7 GB to ~1 GB on prod.
  • Monitor hud.pytorch.org/api/clickhouse/queued_jobs_by_label after rollout; the error rate should drop from non-zero to ~0%.
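
One way to run the first check is to compare the two result sets as multisets of (count, machine_type) tuples, ignoring row order and the time-dependent queue_s column. The sketch below assumes each result has already been fetched as a list of dicts; the `same_result_rows` helper and the sample rows (including the machine type names) are hypothetical.

```python
from collections import Counter


def same_result_rows(old_rows, new_rows):
    """Compare results as multisets of (count, machine_type), ignoring order and queue_s."""
    def key(row):
        return (row["count"], row["machine_type"])

    return Counter(map(key, old_rows)) == Counter(map(key, new_rows))


# Hypothetical sample rows; queue_s differs because the runs happen at different times.
old_rows = [
    {"count": 12, "machine_type": "runner-a", "queue_s": 340},
    {"count": 3, "machine_type": "runner-b", "queue_s": 95},
]
new_rows = [
    {"count": 3, "machine_type": "runner-b", "queue_s": 101},
    {"count": 12, "machine_type": "runner-a", "queue_s": 346},
]
matches = same_result_rows(old_rows, new_rows)
```

Using a multiset rather than a set means duplicated tuples are also compared by multiplicity, so a query that accidentally double-counts a row would not pass.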

vercel Bot commented May 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| torchci | Ready | Preview | May 16, 2026 6:22am |


@meta-cla meta-cla Bot added the CLA Signed label on May 16, 2026. (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)