feat: add gas-benchmark skill for automated repricing benchmarks #11526
kamilchodola wants to merge 7 commits into master
Conversation
Interactive pipeline: discover release → resolve gas-benchmarks branch → build diag Docker image → trigger repricing workflow → wait → analyze results (logs, timings, sstore/sload counts, dotTrace XML). Invokable via /gas-benchmark or programmatically with user confirmation.
- Narrow gh permissions to run/workflow/release/api subcommands
- Narrow git permissions to read-only + push
- Remove --no-diag from argument-hint (replaced by --dottrace logic)
- Add dotTrace interactive question
- Extract gas-benchmarks branch from release notes
Claude finished @kamilchodola's task in 3m 24s — Review
**Summary**

This PR adds a well-structured skill with a clear multi-phase pipeline (discover → build → trigger → wait → analyze). The mandatory exception-scan requirement and the dotTrace guard (…) are noted. Two High issues and two Medium issues need to be addressed before merge.

**Findings**

High

Medium

Low

Verdict: not yet mergeable — H1 (…)
```
- Bash(git branch *)
- Bash(git log *)
- Bash(git status *)
- Bash(git push *)
```
High — unnecessary permission in allowed-tools
`Bash(git push *)` does not appear anywhere in the skill's instructions — the Docker build and repricing are triggered via `gh workflow run`, not by pushing commits. This entry grants Claude the ability to push to any remote/branch without user approval, which is unnecessary.
```
## Phase 3 — Wait for completion

1. Get the run ID from `gh run list --repo NethermindEth/gas-benchmarks --workflow=repricing-nethermind.yml --limit 1`.
2. Poll: `gh run view <run-id> --repo NethermindEth/gas-benchmarks --json status,conclusion` every 30 seconds.
```
High — race condition: wrong run may be analyzed
`gh run list --limit 1` picks up the most-recently-created run on that workflow, but there is no synchronization point between `gh workflow run` (Phase 2) and this listing. If another engineer (or a scheduled job) triggers repricing-nethermind.yml in the same window, this step will poll the wrong run and produce a completely incorrect analysis.
A safer pattern is to capture the creation timestamp just before triggering, then filter:

```
BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
gh workflow run repricing-nethermind.yml ...
# wait a few seconds for GitHub to register the run
sleep 10
gh run list --repo NethermindEth/gas-benchmarks \
  --workflow=repricing-nethermind.yml \
  --created ">$BEFORE" \
  --limit 1 \
  --json databaseId --jq '.[0].databaseId'
```

Or at minimum add `--actor @me` to narrow to runs triggered by the current token.
```
Scan for ALL of these patterns. Report every match with the full log line:

grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" | grep -v "docker\|pip\|node-exporter\|apt\|dotnet\|nuget\|npm\|orphan"
```
Medium — grep -v "dotnet" may suppress real Nethermind exceptions
The exclusion list includes `dotnet`, which appears in fully-qualified .NET type names that Nethermind itself logs. For example, a stack frame or exception message like:
Nethermind.Core.Exceptions.InvalidBlockException ... at dotnet.runtime/...
would be silently dropped. The intent seems to be filtering out setup noise (pip, apt, npm), not .NET runtime frames. Consider tightening the exclusion to only match lines where the keyword is at the start of the line or part of a known setup-step prefix, rather than anywhere on the line:

```
grep -v "^\s*\(docker\|pip\|node-exporter\|apt\|nuget\|npm\|orphan\)"
```

Or exclude only specific known-noisy prefixes rather than any line containing "dotnet".
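A quick way to sanity-check the two filters is to run them against a few synthetic log lines (the sample below is illustrative only, not real Nethermind output):

```shell
# Synthetic sample log (illustrative only, not real Nethermind output).
cat > /tmp/bench-sample.log <<'EOF'
docker pull nethermind/nethermind:diag
pip install requests
Nethermind.Core.Exceptions.InvalidBlockException ... at dotnet.runtime/...
EOF

# Original filter: silently drops the exception line because it contains "dotnet".
grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" /tmp/bench-sample.log \
  | grep -v "docker\|pip\|node-exporter\|apt\|dotnet\|nuget\|npm\|orphan" || true

# Anchored filter: keeps it, since "dotnet" is not at the start of the line.
grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" /tmp/bench-sample.log \
  | grep -v "^\s*\(docker\|pip\|node-exporter\|apt\|nuget\|npm\|orphan\)"
```

The first pipeline prints nothing; the second prints the exception line, which is the behavior the fix is after.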
| 2. Poll: `gh run view <run-id> --repo NethermindEth/gas-benchmarks --json status,conclusion` every 30 seconds. | ||
| 3. Report to the user when the run completes with success or failure. | ||
|
|
||
| ## Phase 4 — Analyze results |
Medium — no upper bound on the polling loop
Phase 3 polls every 30 seconds with no documented maximum wait time. A workflow that gets stuck in a GitHub Actions queue or hangs mid-run will cause the skill to loop indefinitely, consuming the entire conversation context window.
Add an explicit timeout (e.g., 2 hours = 240 iterations at 30 s each) and break out with a clear message if exceeded:
Max wait: 2 hours. After 240 polls with no terminal status, report "timed out" and provide the run URL for manual inspection.
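The bounded loop could be sketched as a small shell function; `check_status` below is a hypothetical stand-in for the real `gh run view … --json status --jq '.status'` call:

```shell
# check_status is a placeholder for:
#   gh run view "$RUN_ID" --repo NethermindEth/gas-benchmarks --json status --jq '.status'
check_status() {
  gh run view "$RUN_ID" --repo NethermindEth/gas-benchmarks --json status --jq '.status'
}

# Poll until the run reports "completed", or give up after max_polls iterations.
poll_until_complete() {
  max_polls=$1
  interval=$2
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    if [ "$(check_status)" = "completed" ]; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out after $max_polls polls; inspect the run manually" >&2
  return 1
}

# Usage in the skill: poll_until_complete 240 30   # 240 polls at 30 s each, about 2 hours
```

Returning a non-zero status on timeout lets the caller report the run URL instead of looping forever.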
```
gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
  --jq '.content' | base64 -d | head -60
```
Low — base64 -d is not portable to macOS
`base64 -d` is the Linux/GNU form; macOS uses `base64 -D`. For cross-platform support, use:

```
base64 -d 2>/dev/null || base64 -D
```

Or note in the skill that this step assumes a Linux runner (which may be fine given the CI context).
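If cross-platform behavior matters, a probe-based wrapper avoids running the wrong flag mid-pipeline. A sketch (`b64decode` is a hypothetical helper name, not part of the skill):

```shell
# Decode stdin with whichever flag the local base64 supports:
# GNU coreutils uses -d, older BSD/macOS builds use -D.
b64decode() {
  if printf '' | base64 -d >/dev/null 2>&1; then
    base64 -d
  else
    base64 -D
  fi
}

# Example: printf 'aGVsbG8=' | b64decode   # decodes to "hello"
```

Probing once up front is safer than `-d || -D` inline, because a failed `-d` inside a pipeline has already consumed part of the input.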
```
  -f tag=<tag> \
  -f dockerfile=<dockerfile> \
  -f build-config=release
```
Low — run ID for Docker build not captured explicitly
After `gh workflow run publish-docker.yml ...` fires, Step 5 says "Poll until complete: `gh run view <run-id>`" but no step shows how `<run-id>` is obtained. The same race condition present in Phase 3 applies here. Consider adding an explicit step (with the same `--created ">$BEFORE"` approach) immediately after the trigger.
- H1: remove unused `Bash(git push *)` from allowed-tools
- H2: use timestamp-based run ID lookup to avoid race conditions in both Docker build (Phase 1) and benchmark (Phase 3) polling
- M1: tighten grep exclusions to not suppress real .NET exceptions
- M2: add 2-hour polling timeout to prevent infinite loops
@claude re-review
Claude finished @kamilchodola's task in 1m 56s — Re-review
All four previously-flagged findings have been properly addressed.

**Verified fixes**

**Remaining low-severity item (not blocking):** L1 — …

No new issues were introduced by the fixes. Verdict: mergeable. All High and Medium findings are resolved.
@claude can you do a full review again, not only focusing on previous points?
Claude finished @kamilchodola's task in 3m 24s — Full PR Review
**Summary**

This is a full independent review of the PR, not limited to the previously flagged findings. The H1/H2/M1/M2 fixes from the prior round are confirmed correct. Two new Medium and four new Low issues were found.

**Previously Fixed (confirmed ✅)**

**New Findings**

Medium

Low

**Not an Issue (reviewed, no finding)**

Verdict: mergeable with caveats. M3 (…)
```
Read the workflow YAML on the chosen branch to learn which inputs it supports:

gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
  --jq '.content' | base64 -d | head -60
```
Medium — head -60 may truncate workflow input discovery
The `head -60` cap is dangerous: GitHub Actions workflow files typically define their `on.workflow_dispatch.inputs` block after frontmatter, comments, and the `on:` key — often past line 60 for moderately complex workflows. If the inputs block is truncated, the skill discovers zero inputs and Phase 2 silently omits required flags (like release_tag, genesis_file, or diagnostics_mode), causing the triggered run to fail with confusing errors.
Fix: remove the head limit entirely, or use a larger cap like head -120. The file is a small YAML blob fetched via API — there's no cost to reading it fully.
Suggested change:

```
  gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
-   --jq '.content' | base64 -d | head -60
+   --jq '.content' | base64 -d
```
```
2. Determine Dockerfile based on dotTrace:
   - dotTrace enabled → `Dockerfile.diag`, tag suffix `-diag`
   - dotTrace disabled → regular `Dockerfile`, no suffix
3. Compute tag: `<branch-name>-diag` (if diag) or `<branch-name>` (if regular).
```
Medium — branch names containing / produce invalid Docker image tags
Docker image tags cannot contain /. A branch like feature/my-branch or devnets/bal-devnet-6 would generate a tag like feature/my-branch, which Docker rejects with invalid reference format.
The skill should sanitize the branch name when computing the tag:

```
TAG=$(echo "<branch-name>" | tr '/' '-')
# e.g., feature/my-branch → feature-my-branch
```

Add this normalization step before constructing the tag (Step 3), and surface the final tag to the user in the Phase 0e confirmation so they can verify it before the build starts.
```
Report sload/sstore/create counts for the heaviest test blocks.

### 4d. Opcode tracing comparison
```
Low — no download command for opcode tracing JSON
Phase 4d says to "download opcodes_tracing-stateful-<network>.json from the relevant release(s)" but doesn't give a concrete command. An agent seeing this will have to guess — likely reaching for `gh release download`, which requires the asset name exactly. Provide the command:

```
gh release download <tag> --repo NethermindEth/gas-benchmarks \
  --pattern "opcodes_tracing-stateful-<network>.json" \
  --output opcodes_tracing.json
```

Also clarify when this step is triggered: is it always run, or only when explicitly comparing two runs?
```
### 4b. Timing analysis

Extract all `Processed` lines. For each test block (blocks with `Gas gwei` in the log line):
```
Low — vague log extraction instructions for timing and block stats
Phases 4b and 4c say "extract all Processed lines" and "extract Block stats lines" but give no grep pattern. An agent interpreting these against real Nethermind logs may get confused or extract the wrong lines, especially across log versions.
Add concrete patterns, for example:

```
# 4b: timing lines
grep "Processed" logs.txt | grep "Gas gwei"
# 4c: block stats
grep -E "\bsload\b|\bsstore\b|\bcreate\b" logs.txt
```

Even approximate patterns are better than none — the agent can adapt if the log format shifts, but it needs a starting point.
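As a concrete illustration of what the 4b extraction could look like end-to-end (the sample log lines are made up; real Nethermind `Processed` lines vary by version, so the field positions are an assumption):

```shell
# Made-up sample log; the field positions below are an assumption, not the real format.
cat > /tmp/timing-sample.log <<'EOF'
Processed 21000001 45.1 ms Gas gwei ...
Processed 21000002 120.7 ms Gas gwei ...
Processed 21000003 8.3 ms
EOF

# Keep only test blocks (lines with "Gas gwei"), then print block number and ms.
grep "Processed" /tmp/timing-sample.log | grep "Gas gwei" \
  | awk '{print $2, $3}'
```

Here the third line is correctly excluded because it lacks the `Gas gwei` marker; the awk field indices would need adjusting to whatever the real log layout turns out to be.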
- M3: remove `head -60` cap on workflow input discovery (truncated inputs)
- M4: sanitize branch names with `/` for Docker tags (`tr '/' '-'`)
- L4: add concrete `gh release download` command for opcode tracing
- L5: add grep patterns for timing and block stats extraction
… comparison guide
Summary
`/gas-benchmark` — Claude Code skill that automates the full gas-benchmarks pipeline: build Docker image → trigger repricing workflow → wait → analyze results (logs, timings, sstore/sload counts, dotTrace XML)

Changes
- `.agents/skills/gas-benchmark/SKILL.md` — canonical skill definition
- `.claude/skills/gas-benchmark` — symlink per repo convention

Testing
- `allowed-tools` scoped to specific `gh`/`git` subcommands
- `HeaderGasUsedMismatch` and other block validation errors