feat: add gas-benchmark skill for automated repricing benchmarks by kamilchodola · Pull Request #11526 · NethermindEth/nethermind

kamilchodola · 2026-05-08T12:31:13Z

Summary

Add /gas-benchmark Claude Code skill that automates the full gas-benchmarks pipeline: build Docker image → trigger repricing workflow → wait → analyze results (logs, timings, sstore/sload counts, dotTrace XML)
Interactive mode when called without arguments — shows available releases, asks for image/network/filter/dotTrace preferences
Discovers the correct gas-benchmarks branch from release notes automatically
Mandatory exception scan that catches gas mismatches, BAL corruption, and invalid blocks

Changes

.agents/skills/gas-benchmark/SKILL.md — canonical skill definition
.claude/skills/gas-benchmark — symlink per repo convention

Testing

Tested end-to-end: interactive flow → Docker build → repricing trigger → result analysis
Security reviewed: allowed-tools scoped to specific gh/git subcommands
Verified exception scan catches HeaderGasUsedMismatch and other block validation errors

Interactive pipeline: discover release → resolve gas-benchmarks branch → build diag Docker image → trigger repricing workflow → wait → analyze results (logs, timings, sstore/sload counts, dotTrace XML). Invokable via /gas-benchmark or programmatically with user confirmation.

- Narrow gh permissions to run/workflow/release/api subcommands
- Narrow git permissions to read-only + push
- Remove --no-diag from argument-hint (replaced by --dottrace logic)
- Add dotTrace interactive question
- Extract gas-benchmarks branch from release notes


claude · 2026-05-08T12:31:38Z

Claude finished @kamilchodola's task in 3m 24s

Review

Read changed files (SKILL.md, symlink, agent-skills.md rules)
Review SKILL.md for correctness, security, and robustness
Check symlink convention against agent-skills.md
Post inline comments for specific issues
Post top-level summary

Summary

This PR adds a well-structured skill with a clear multi-phase pipeline (discover → build → trigger → wait → analyze). The mandatory exception-scan requirement and the dotTrace guard (do NOT pass diagnostics_mode=dottrace if the image was not built with Dockerfile.diag) are good defensive design choices. The symlink (../../.agents/skills/gas-benchmark) is correct per the repo convention.

Two High issues and two Medium issues need to be addressed before merge.

Findings

High

#	Location	Issue
H1	`allowed-tools` line 12	`Bash(git push *)` is not used anywhere in the skill but grants Claude the ability to push to arbitrary remotes without user approval. Remove it.
H2	Phase 3, Step 1 (line 193)	`gh run list --limit 1` creates a race condition: if any other actor triggers `repricing-nethermind.yml` between the `gh workflow run` call and the list query, the wrong run is polled and the analysis is corrupted. Fix: capture a timestamp before triggering, then filter with `--created ">$BEFORE"` (and optionally `--user <login>`, the `gh run list` filter for the triggering user). The same issue applies to Phase 1 Step 5 (Docker build run ID).

Medium

#	Location	Issue
M1	Phase 4a, line 208	`grep -v "dotnet"` may silently drop real Nethermind exceptions whose stack frames include `.NET`/`dotnet` runtime references. The intent is to skip setup-phase noise (pip, apt), not .NET frames. Tighten the exclusion to anchor on line-start or known setup prefixes.
M2	Phase 3, line 196	No polling timeout. A queued or hung workflow will cause the skill to loop indefinitely, exhausting the conversation context window. Define a maximum wait (e.g., 2 hours / 240 iterations) with a graceful break.

Low

#	Location	Issue
L1	Phase 0c, line 119	`base64 -d` is Linux/GNU only; macOS requires `base64 -D`. A fallback like `base64 -d 2>/dev/null || base64 -D` or a note that a Linux runner is assumed would avoid surprises.
L2	Phase 1, Step 5 (line 159)	Docker build run ID retrieval is not shown explicitly — same race as H2.
L3	`agent-skills.md` convention	The rule says to also create `.cursor/skills/<name>` (see `.agents/rules/agent-skills.md`). No existing skills have it either, so this may be aspirational, but it's worth noting.

Verdict: not yet mergeable — H1 (git push * in allowed-tools) and H2 (run-ID race condition) should be fixed. M1 and M2 are straightforward to address and would meaningfully improve reliability.

claude · 2026-05-08T12:34:09Z

+  - Bash(git branch *)
+  - Bash(git log *)
+  - Bash(git status *)
+  - Bash(git push *)


High — unnecessary permission in allowed-tools

Bash(git push *) does not appear anywhere in the skill's instructions — the Docker build and repricing are triggered via gh workflow run, not by pushing commits. This entry grants Claude the ability to push to any remote/branch without user approval, which is unnecessary.


claude · 2026-05-08T12:34:18Z

+## Phase 3 — Wait for completion
+
+1. Get the run ID from `gh run list --repo NethermindEth/gas-benchmarks --workflow=repricing-nethermind.yml --limit 1`.
+2. Poll: `gh run view <run-id> --repo NethermindEth/gas-benchmarks --json status,conclusion` every 30 seconds.


High — race condition: wrong run may be analyzed

gh run list --limit 1 picks up the most-recently-created run on that workflow, but there is no synchronization point between gh workflow run (Phase 2) and this listing. If another engineer (or a scheduled job) triggers repricing-nethermind.yml in the same window, this step will poll the wrong run and produce a completely incorrect analysis.

A safer pattern is to capture the creation timestamp just before triggering, then filter:

BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
gh workflow run repricing-nethermind.yml ...
# wait a few seconds for GitHub to register the run
sleep 10
gh run list --repo NethermindEth/gas-benchmarks \
  --workflow=repricing-nethermind.yml \
  --created ">$BEFORE" \
  --limit 1 \
  --json databaseId --jq '.[0].databaseId'

Or at minimum add --user <login> to narrow to runs triggered by a specific account (gh run list exposes this filter as --user; it has no --actor flag).
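
The timestamp filter can also be applied client-side. The sketch below is one possible approach, not the skill's actual implementation: `pick_run_after` is a hypothetical helper that reads tab-separated `createdAt` and `databaseId` pairs and prints a run ID only when exactly one run was created after the cutoff, so an ambiguous window fails loudly instead of silently selecting a run. It assumes both timestamps are UTC and share the same ISO-8601 layout, which is what makes plain string comparison valid.

```shell
# Hypothetical helper (not part of the skill): choose the run created after a cutoff.
# stdin: one "<createdAt><TAB><databaseId>" pair per line
# $1:    cutoff timestamp, e.g. the output of: date -u +%Y-%m-%dT%H:%M:%SZ
# exit:  0 = exactly one match (ID printed), 1 = no match, 2 = ambiguous
pick_run_after() {
  awk -F '\t' -v cutoff="$1" '
    $1 > cutoff { n++; id = $2 }
    END {
      if (n == 1) { print id; exit 0 }
      exit (n == 0 ? 1 : 2)
    }'
}

# Intended use (requires an authenticated gh; not executed here):
#   BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
#   gh workflow run repricing-nethermind.yml ...
#   sleep 10
#   gh run list --repo NethermindEth/gas-benchmarks \
#     --workflow=repricing-nethermind.yml --json createdAt,databaseId \
#     --jq '.[] | [.createdAt, .databaseId] | @tsv' | pick_run_after "$BEFORE"
```

Exit code 2 signals that more than one run appeared after the cutoff, in which case the caller probably needs a further discriminator (for example the triggering user) before polling.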

claude · 2026-05-08T12:34:27Z

+
+Scan for ALL of these patterns. Report every match with the full log line:
+```
+grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" | grep -v "docker\|pip\|node-exporter\|apt\|dotnet\|nuget\|npm\|orphan"


Medium — grep -v "dotnet" may suppress real Nethermind exceptions

The exclusion list includes dotnet, which appears in fully-qualified .NET type names that Nethermind itself logs. For example, a stack frame or exception message like:

Nethermind.Core.Exceptions.InvalidBlockException ... at dotnet.runtime/...

would be silently dropped. The intent seems to be filtering out setup noise (pip, apt, npm), not .NET runtime frames. Consider tightening the exclusion to only match lines where the keyword is at the start of the line or part of a known setup-step prefix, rather than anywhere on the line:

grep -v "^\s*\(docker\|pip\|node-exporter\|apt\|nuget\|npm\|orphan\)"

Or exclude only specific known-noisy prefixes rather than any line containing "dotnet".
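
A line-start anchor is easier to reason about when wrapped in a small filter function. The following is a sketch under assumptions: `scan_exceptions` is a hypothetical name, and it assumes setup-phase noise lines begin with the tool name (optionally indented). If the real workflow logs carry a timestamp or job prefix on every line, the anchor would need to be adapted.

```shell
# Hypothetical filter (illustrative only): read a log on stdin and print
# candidate exception lines, dropping only lines that START with a setup tool name.
scan_exceptions() {
  grep -iE 'Exception|Invalid Block|InvalidBlock|Rejected invalid' \
    | grep -vE '^[[:space:]]*(docker|pip|node-exporter|apt|nuget|npm|orphan)'
}

# Intended use: scan_exceptions < logs.txt
```

Because `dotnet` is not in the exclusion list and the match is anchored, a Nethermind exception line that merely mentions a .NET runtime frame is kept.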

claude · 2026-05-08T12:34:34Z

+2. Poll: `gh run view <run-id> --repo NethermindEth/gas-benchmarks --json status,conclusion` every 30 seconds.
+3. Report to the user when the run completes with success or failure.
+
+## Phase 4 — Analyze results


Medium — no upper bound on the polling loop

Phase 3 polls every 30 seconds with no documented maximum wait time. A workflow that gets stuck in a GitHub Actions queue or hangs mid-run will cause the skill to loop indefinitely, consuming the entire conversation context window.

Add an explicit timeout (e.g., 2 hours = 240 iterations at 30 s each) and break out with a clear message if exceeded:

Max wait: 2 hours. After 240 polls with no terminal status, report "timed out" and provide the run URL for manual inspection.
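
The bounded loop described above can be sketched as a generic helper. This is an illustration, not the skill's wording: `poll_with_timeout` and `run_status` are hypothetical names, and the only assumption about the status command is that it prints `completed` once the run reaches a terminal state.

```shell
# Hypothetical helper: run a status command repeatedly until it prints
# "completed" or the poll budget is exhausted.
# Usage: poll_with_timeout <max_polls> <interval_seconds> <command> [args...]
poll_with_timeout() {
  local max_polls="$1" interval="$2" i=1 status
  shift 2
  while [ "$i" -le "$max_polls" ]; do
    status=$("$@") || status="unknown"   # treat a failed query as non-terminal
    if [ "$status" = "completed" ]; then
      echo "completed after $i poll(s)"
      return 0
    fi
    if [ "$i" -lt "$max_polls" ]; then
      sleep "$interval"                  # no sleep after the final poll
    fi
    i=$((i + 1))
  done
  echo "timed out after $max_polls poll(s)"
  return 1
}

# Intended use (RUN_ID is a placeholder; requires an authenticated gh):
#   run_status() { gh run view "$RUN_ID" --repo NethermindEth/gas-benchmarks --json status --jq '.status'; }
#   poll_with_timeout 240 30 run_status   # 240 polls x 30 s = 2 hours
```

On a non-zero return the caller can print the run URL for manual inspection rather than continuing to poll.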

claude · 2026-05-08T12:34:39Z

+```
+gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
+  --jq '.content' | base64 -d | head -60
+```


Low — base64 -d is not portable to macOS

base64 -d is the Linux/GNU form; macOS uses base64 -D. For cross-platform support, use:

base64 -d 2>/dev/null || base64 -D

Or note in the skill that this step assumes a Linux runner (which may be fine given the CI context).
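
One caveat when a `||` fallback is appended to a pipeline such as `... | base64 -d 2>/dev/null || base64 -D`: the shell groups it as `(... | base64 -d) || base64 -D`, so the fallback does not receive the piped content. A minimal sketch that probes the supported flag first and then decodes once is shown below; `b64decode` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: decode base64 from stdin with whichever flag the
# local base64 accepts. The probe uses its own input, so stdin stays untouched.
b64decode() {
  if printf 'QQ==' | base64 -d >/dev/null 2>&1; then
    base64 -d    # GNU coreutils form
  else
    base64 -D    # macOS/BSD form
  fi
}

# Intended use (requires an authenticated gh; <branch> is a placeholder):
#   gh api "repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch>" \
#     --jq '.content' | b64decode
```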

claude · 2026-05-08T12:34:45Z

+     -f tag=<tag> \
+     -f dockerfile=<dockerfile> \
+     -f build-config=release
+   ```


Low — run ID for Docker build not captured explicitly

After gh workflow run publish-docker.yml ... fires, Step 5 says "Poll until complete: gh run view <run-id>" but no step shows how <run-id> is obtained. The same race condition present in Phase 3 applies here. Consider adding an explicit step (with the same --created >$BEFORE approach) immediately after the trigger.

- H1: remove unused `Bash(git push *)` from allowed-tools
- H2: use timestamp-based run ID lookup to avoid race conditions in both Docker build (Phase 1) and benchmark (Phase 3) polling
- M1: tighten grep exclusions to not suppress real .NET exceptions
- M2: add 2-hour polling timeout to prevent infinite loops

kamilchodola · 2026-05-08T12:48:29Z

@claude re-review

claude · 2026-05-08T12:48:56Z

Claude finished @kamilchodola's task in 1m 56s

Re-review

Read current state of SKILL.md after fixes
Verify H1: git push * removed from allowed-tools
Verify H2: race condition fixed for run ID capture
Verify M1: grep -v "dotnet" exception-filter tightened
Verify M2: polling timeout added
Check for any new issues introduced by the fixes
Post final verdict

All four previously-flagged findings have been properly addressed.

Verified fixes

Finding	Status	How fixed
H1 — `Bash(git push *)` in allowed-tools	✅ Fixed	Broad `Bash(gh *)` and `Bash(git *)` replaced with fine-grained subcommand entries (`Bash(gh run *)`, `Bash(gh workflow run *)`, etc.). `git push` is gone entirely.
H2 — run ID race condition	✅ Fixed	Phase 1 (line 152) and Phase 2 (line 171) now capture `BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)` before triggering, and both run-ID queries filter with `select(.createdAt > "<BEFORE>")`.
M1 — `grep -v "dotnet"` silently dropping .NET frames	✅ Fixed	`dotnet` removed from the exclusion list. An explicit note at line 223 ("do NOT exclude `dotnet`") prevents regressing this in future edits.
M2 — no polling timeout	✅ Fixed	Line 206: "Timeout after 2 hours (240 polls)" with a break-and-report instruction.

Remaining low-severity item (not blocking)

L1 — base64 -d (line 117): Still Linux/GNU only; macOS requires base64 -D. Since this skill runs in a CI/Linux context, it's unlikely to matter in practice, but a comment like # Linux only; macOS: base64 -D would document the assumption.

No new issues were introduced by the fixes.

Verdict: mergeable. All High and Medium findings are resolved.

kamilchodola · 2026-05-08T12:53:25Z

@claude can you do a full review again, not only focusing on previous points?

claude · 2026-05-08T12:53:41Z

Claude finished @kamilchodola's task in 3m 24s

Full PR Review

Read SKILL.md and rules files
Full analysis: security, correctness, robustness, UX
Inline comments for specific issues
Final verdict

Summary

This is a full independent review of the PR, not limited to the previously flagged findings. The H1/H2/M1/M2 fixes from the prior round are confirmed correct. Two new Medium and four new Low issues were found.

Previously Fixed (confirmed ✅)

Finding	Status
H1 — `Bash(git push *)` in allowed-tools	✅ Removed
H2 — run ID race condition	✅ Fixed with `BEFORE` timestamp filter
M1 — `grep -v "dotnet"` dropping .NET frames	✅ Fixed
M2 — no polling timeout	✅ 2-hour / 240-poll limit added

New Findings

Medium

#	Location	Issue
M3	Phase 0c, line 117	`head -60` may truncate workflow input discovery. GitHub Actions `on.workflow_dispatch.inputs` is commonly defined past line 60 in complex workflows. If the inputs block is cut off, the skill discovers zero inputs and Phase 2 silently omits required flags (`release_tag`, `genesis_file`, `diagnostics_mode`), causing the triggered run to fail with confusing errors. Fix: remove the `head` cap entirely — this is a small YAML blob, not a large file.
M4	Phase 1, Step 3, line 149	Branch names with `/` produce invalid Docker tags. Docker tags cannot contain `/`. A branch like `feature/my-branch` or `devnets/bal-devnet-6` passed to tag computation would cause the Docker build to reject the tag outright (`invalid reference format`). The skill should sanitize with `tr '/' '-'` and surface the resolved tag in the Phase 0e confirmation.

Low

#	Location	Issue
L1	Phase 4b/4c, lines 238–246	No concrete grep patterns for timing and block-stats extraction. Both phases describe extracting `Processed` lines and `Block` stats lines by keyword but give no command. Agents will guess and may extract wrong or incomplete data. Add at minimum approximate grep patterns as a starting point.
L2	Phase 4d, line 250	No download command for opcode tracing JSON. Phase 4d says "download `opcodes_tracing-stateful-<network>.json`" but provides no `gh release download` command. Also unclear whether this step runs always or only when comparing two runs explicitly.
L3	Phase 0d, line 123	Unknown networks silently produce no genesis file. The mapping covers `perf-devnet-3`, `jochemnet`, and `mainnet`. Any other network value silently omits the `genesis_file` flag with no warning or error, which may cause the repricing workflow to misfire. Should add an explicit warning or error for unmapped networks.
L4	`.agents/rules/agent-skills.md`	`.cursor/skills/gas-benchmark` symlink missing. The rule requires both `.claude/skills/<name>` and `.cursor/skills/<name>` symlinks. No existing skills in the repo have the Cursor symlink either, so this appears aspirational — but worth noting for consistency.
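
For the unmapped-network case (L3 in the table above), an explicit guard could look like the sketch below. `require_known_network` is a hypothetical helper. It only validates the network name against the three values the review lists and deliberately does not guess which genesis file each one maps to.

```shell
# Hypothetical guard: fail loudly when a network has no known genesis mapping.
require_known_network() {
  case "$1" in
    perf-devnet-3|jochemnet|mainnet)
      return 0
      ;;
    *)
      echo "error: no genesis mapping for network '$1'" >&2
      return 1
      ;;
  esac
}

# Intended use: require_known_network "$NETWORK" || exit 1
```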

Not an Issue (reviewed, no finding)

allowed-tools scope: Fine. Bash(gh api repos/NethermindEth/*) is correctly scoped to NethermindEth repos. No write git operations. Bash(gh run download *) is permitted via Bash(gh run *) and is necessary for dotTrace artifacts.
Bash(cd *) and Bash(until *) in allowed-tools: Shell built-ins that can't be meaningfully scoped, but harmless.
Phase 3 timestamp race: Fixed correctly — select(.createdAt > "<BEFORE>") guards both Docker build and repricing run lookups.
Exception exclusion list: The nuget.org exclusion is reasonable — repricing runs don't do NuGet restore.
dotTrace guard (do NOT pass diagnostics_mode=dottrace if image not built with Dockerfile.diag): Clear and correctly placed.
Shutdown confirmation (Nethermind is shut down grep): Good defensive check.
MSYS_NO_PATHCONV=1: Correctly included for Windows Git Bash path conversion.
Phase 0e confirmation step: Good UX, though no explicit "user says No → abort" instruction.

Verdict: mergeable with caveats. M3 (head -60) and M4 (branch / in Docker tag) are real failure modes worth fixing before the skill is used in production. The Low findings are minor polish items. No new High or Critical issues.

claude · 2026-05-08T12:56:00Z

+Read the workflow YAML on the chosen branch to learn which inputs it supports:
+```
+gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
+  --jq '.content' | base64 -d | head -60


Medium — head -60 may truncate workflow input discovery

The head -60 cap is dangerous: GitHub Actions workflow files typically define their on.workflow_dispatch.inputs block after frontmatter, comments, and the on: key — often past line 60 for moderately complex workflows. If the inputs block is truncated, the skill discovers zero inputs and Phase 2 silently omits required flags (like release_tag, genesis_file, or diagnostics_mode), causing the triggered run to fail with confusing errors.

Fix: remove the head limit entirely, or use a larger cap like head -120. The file is a small YAML blob fetched via API — there's no cost to reading it fully.

Suggested change

-  --jq '.content' | base64 -d | head -60
+gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
+  --jq '.content' | base64 -d
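
Once the full file is read, the input names can be pulled out by position in the YAML rather than by line count. The sketch below is a rough, indentation-based approach and not a real YAML parser: `list_dispatch_inputs` is a hypothetical helper, and it assumes block-style YAML with an `inputs:` key nested under `workflow_dispatch:`.

```shell
# Hypothetical helper: print the input names declared under
# on.workflow_dispatch.inputs of a workflow file read from stdin.
list_dispatch_inputs() {
  awk '
    /^[ \t]*workflow_dispatch:/ { in_wd = 1; next }
    in_wd && !in_inputs && /^[ \t]*inputs:[ \t]*$/ {
      match($0, /^[ \t]*/); base = RLENGTH   # indentation of "inputs:"
      in_inputs = 1; child = -1; next
    }
    in_inputs {
      if ($0 ~ /^[ \t]*(#|$)/) next          # skip blank and comment lines
      match($0, /^[ \t]*/); ind = RLENGTH
      if (ind <= base) exit                   # dedent: left the inputs block
      if (child < 0) child = ind              # first key fixes the name level
      if (ind == child) { sub(/^[ \t]*/, ""); sub(/:.*/, ""); print }
    }'
}

# Intended use: <decoded workflow YAML> | list_dispatch_inputs
```

An empty result then becomes a visible signal that input discovery failed, instead of Phase 2 silently omitting flags.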

claude · 2026-05-08T12:56:10Z

+2. Determine Dockerfile based on dotTrace:
+   - dotTrace enabled → `Dockerfile.diag`, tag suffix `-diag`
+   - dotTrace disabled → regular `Dockerfile`, no suffix
+3. Compute tag: `<branch-name>-diag` (if diag) or `<branch-name>` (if regular).


Medium — branch names containing / produce invalid Docker image tags

Docker image tags cannot contain /. A branch like feature/my-branch or devnets/bal-devnet-6 would generate a tag like feature/my-branch, which Docker rejects with invalid reference format.

The skill should sanitize the branch name when computing the tag:

TAG=$(echo "<branch-name>" | tr '/' '-') # e.g., feature/my-branch → feature-my-branch

Add this normalization step before constructing the tag (Step 3), and surface the final tag to the user in the Phase 0e confirmation so they can verify it before the build starts.
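
A slightly broader version of the same normalization is sketched below. `docker_tag_for_branch` is a hypothetical helper; it maps every character outside the set Docker allows in tags (letters, digits, underscore, period, hyphen) to `-`, and appends the `-diag` suffix on request. It does not handle the other tag rules (no leading period or hyphen, 128-character limit), which would need separate checks.

```shell
# Hypothetical helper: derive a Docker tag from a git branch name.
# $1: branch name   $2: "diag" to append the -diag suffix (optional)
docker_tag_for_branch() {
  local branch="$1" mode="${2:-}" tag
  # Map every character outside [A-Za-z0-9_.-], including "/", to "-".
  tag=$(printf '%s' "$branch" | tr -c 'A-Za-z0-9_.-' '-')
  if [ "$mode" = "diag" ]; then
    tag="${tag}-diag"
  fi
  printf '%s\n' "$tag"
}

# Intended use: TAG=$(docker_tag_for_branch "devnets/bal-devnet-6" diag)
```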

claude · 2026-05-08T12:56:20Z

+Report sload/sstore/create counts for the heaviest test blocks.
+
+### 4d. Opcode tracing comparison
+


Low — no download command for opcode tracing JSON

Phase 4d says to "download opcodes_tracing-stateful-<network>.json from the relevant release(s)" but doesn't give a concrete command. An agent seeing this will have to guess — likely reaching for gh release download, which requires the asset name exactly. Provide the command:

gh release download <tag> --repo NethermindEth/gas-benchmarks \
  --pattern "opcodes_tracing-stateful-<network>.json" \
  --output opcodes_tracing.json

Also clarify when this step is triggered: is it always run, or only when explicitly comparing two runs?

claude · 2026-05-08T12:56:28Z

+
+### 4b. Timing analysis
+
+Extract all `Processed` lines. For each test block (blocks with `Gas gwei` in the log line):


Low — vague log extraction instructions for timing and block stats

Phases 4b and 4c say "extract all Processed lines" and "extract Block stats lines" but give no grep pattern. An agent interpreting these against real Nethermind logs may get confused or extract the wrong lines, especially across log versions.

Add concrete patterns, for example:

# 4b: timing lines
grep "Processed" logs.txt | grep "Gas gwei"
# 4c: block stats
grep -E "\bsload\b|\bsstore\b|\bcreate\b" logs.txt

Even approximate patterns are better than none — the agent can adapt if the log format shifts, but it needs a starting point.

- M3: remove head -60 cap on workflow input discovery (truncated inputs)
- M4: sanitize branch names with / for Docker tags (tr '/' '-')
- L4: add concrete gh release download command for opcode tracing
- L5: add grep patterns for timing and block stats extraction


kamilchodola added 4 commits May 8, 2026 12:36

fix: use symlink for .claude/skills/gas-benchmark instead of copy

359429e

fix: make exception scan mandatory and more thorough in gas-benchmark…

b6fed01

… skill

kamilchodola requested review from LukaszRozmej, MarekM25 and rubo as code owners May 8, 2026 12:31

github-actions Bot added the new feature label May 8, 2026

claude Bot reviewed May 8, 2026


kamilchodola added 2 commits May 8, 2026 15:03

feat: expand dotTrace analysis section with download, interpretation,…

35bb336

… comparison guide

