Optimize queued_jobs_by_label query by huydhn · Pull Request #8092 · pytorch/test-infra

huydhn · 2026-05-16T06:20:06Z

Summary

/api/clickhouse/queued_jobs_by_label is the second HUD endpoint to hit ClickHouse's 32 GiB total-memory limit (MEMORY_LIMIT_EXCEEDED ... While executing ReplacingSorted). Same root cause as #8088: FINAL on workflow_job over a multi-day window, repeated across two CTEs.

This applies the same optimization pattern as #8088 to queued_jobs_by_label:

Drop FINAL on workflow_job; replace with a manual ReplacingMergeTree dedup via LIMIT 1 BY id ORDER BY _inserted_at DESC, scoped to the candidate id set.
Pre-filter workflow_run (status, repo, run_id in candidates) before FINAL so the JOIN's right side is tiny.
Consolidate the per-CTE FINAL JOINs into a single latest_jobs + latest_workflows pair so the EC2 and ARC paths share one scan instead of two.
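The dedup in the first item can be modelled outside ClickHouse to show what "latest row per id, scoped to the candidate id set" means. The sketch below is a plain-Python illustration, not the SQL from this PR: the helper name `latest_rows_by_id`, the sample rows, and the integer timestamps are all hypothetical, and it assumes `_inserted_at` is what decides which version of a row wins.

```python
def latest_rows_by_id(rows, candidate_ids):
    """Keep only the newest version of each candidate id (by _inserted_at).

    Hypothetical model of the manual dedup described above: roughly what
    ordering by _inserted_at DESC and keeping one row per id would return.
    """
    latest = {}
    for row in rows:
        if row["id"] not in candidate_ids:
            continue  # scope the dedup to the candidate id set
        current = latest.get(row["id"])
        if current is None or row["_inserted_at"] > current["_inserted_at"]:
            latest[row["id"]] = row
    return latest


# Hypothetical sample data: id 1 was inserted twice (queued, then completed).
rows = [
    {"id": 1, "status": "queued", "_inserted_at": 10},
    {"id": 1, "status": "completed", "_inserted_at": 20},
    {"id": 2, "status": "queued", "_inserted_at": 15},
    {"id": 3, "status": "queued", "_inserted_at": 5},
]
latest = latest_rows_by_id(rows, candidate_ids={1, 2})
```

Only ids 1 and 2 are considered, and id 1 resolves to its newer "completed" version, so the work is bounded by the candidate set rather than by every row in the time window.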

Measurements (production ClickHouse, current state)

Scenario	Wall time	Peak memory	read_rows	read_bytes
OLD	3.3-3.7 s	2.72 GB	7.09 M	4.79 GB
NEW	3.5-3.8 s	0.61-1.00 GB	6.84 M	4.11 GB

≈3-4× peak-memory reduction per query; result rows identical (numeric queue_s drift between runs is just CURRENT_TIMESTAMP() advancing, not a correctness change).
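As a quick check of the reduction factor, the ratio can be recomputed from the table values. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Peak memory figures taken from the measurements table above (GB).
old_peak_gb = 2.72
new_peak_gb_range = (0.61, 1.00)

# Reduction factor at each end of the NEW range.
reductions = [old_peak_gb / new for new in new_peak_gb_range]
low, high = min(reductions), max(reductions)

print(f"{low:.1f}x to {high:.1f}x")  # prints "2.7x to 4.5x"
```

That range rounds to the ≈3-4× quoted above.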

Test plan

OLD vs NEW return the same (count, machine_type) tuples on prod.
Peak memory drops from ~2.7 GB to ~1 GB on prod.
Monitor hud.pytorch.org/api/clickhouse/queued_jobs_by_label after rollout; the error rate should drop from non-zero to ~0%.
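The first check in the test plan could be automated along these lines. This is a minimal sketch that assumes each query result is available as a list of dicts with `count`, `machine_type`, and `queue_s` keys; the helper names and the sample machine types are hypothetical. `queue_s` is deliberately ignored because it drifts with CURRENT_TIMESTAMP() between runs.

```python
from collections import Counter


def comparable_key(rows):
    """Reduce rows to a multiset of (count, machine_type), ignoring queue_s."""
    return Counter((r["count"], r["machine_type"]) for r in rows)


def same_results(old_rows, new_rows):
    """True when OLD and NEW return the same (count, machine_type) tuples."""
    return comparable_key(old_rows) == comparable_key(new_rows)


# Hypothetical sample results; only queue_s and row order differ.
old_rows = [
    {"count": 12, "machine_type": "linux.2xlarge", "queue_s": 340},
    {"count": 3, "machine_type": "linux.g5.4xlarge.nvidia.gpu", "queue_s": 95},
]
new_rows = [
    {"count": 3, "machine_type": "linux.g5.4xlarge.nvidia.gpu", "queue_s": 97},
    {"count": 12, "machine_type": "linux.2xlarge", "queue_s": 342},
]
```

Using a `Counter` rather than a `set` keeps duplicate tuples significant, so a query that dropped or duplicated a row would still be caught.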

Optimize queued_jobs_by_label query

aa7dd5a

meta-cla Bot added the CLA Signed label May 16, 2026

huydhn mentioned this pull request May 16, 2026

Optimize ttrs_percentiles query #8093

Open

4 tasks

huydhn wants to merge 1 commit into main from optimize-queued-jobs-by-label
