feat: add gas-benchmark skill for automated repricing benchmarks #11526

Open
kamilchodola wants to merge 7 commits into master from feat/gas-benchmark-skill

@kamilchodola

Summary

  • Add /gas-benchmark Claude Code skill that automates the full gas-benchmarks pipeline: build Docker image → trigger repricing workflow → wait → analyze results (logs, timings, sstore/sload counts, dotTrace XML)
  • Interactive mode when called without arguments — shows available releases, asks for image/network/filter/dotTrace preferences
  • Discovers the correct gas-benchmarks branch from release notes automatically
  • Mandatory exception scan that catches gas mismatches, BAL corruption, and invalid blocks

Changes

  • .agents/skills/gas-benchmark/SKILL.md — canonical skill definition
  • .claude/skills/gas-benchmark — symlink per repo convention

Testing

  • Tested end-to-end: interactive flow → Docker build → repricing trigger → result analysis
  • Security reviewed: allowed-tools scoped to specific gh/git subcommands
  • Verified exception scan catches HeaderGasUsedMismatch and other block validation errors

Interactive pipeline: discover release → resolve gas-benchmarks branch →
build diag Docker image → trigger repricing workflow → wait → analyze
results (logs, timings, sstore/sload counts, dotTrace XML).

Invokable via /gas-benchmark or programmatically with user confirmation.
- Narrow gh permissions to run/workflow/release/api subcommands
- Narrow git permissions to read-only + push
- Remove --no-diag from argument-hint (replaced by --dottrace logic)
- Add dotTrace interactive question
- Extract gas-benchmarks branch from release notes
@claude

claude Bot commented May 8, 2026

Claude finished @kamilchodola's task in 3m 24s


Review

  • Read changed files (SKILL.md, symlink, agent-skills.md rules)
  • Review SKILL.md for correctness, security, and robustness
  • Check symlink convention against agent-skills.md
  • Post inline comments for specific issues
  • Post top-level summary

Summary

This PR adds a well-structured skill with a clear multi-phase pipeline (discover → build → trigger → wait → analyze). The mandatory exception-scan requirement and the dotTrace guard (do NOT pass diagnostics_mode=dottrace if the image was not built with Dockerfile.diag) are good defensive design choices. The symlink (../../.agents/skills/gas-benchmark) is correct per the repo convention.

Two High issues and two Medium issues need to be addressed before merge.


Findings

High

| # | Location | Issue |
|---|---|---|
| H1 | allowed-tools line 12 | `Bash(git push *)` is not used anywhere in the skill but grants Claude the ability to push to arbitrary remotes without user approval. Remove it. |
| H2 | Phase 3, Step 1 (line 193) | `gh run list --limit 1` is a race condition: if any other actor triggers `repricing-nethermind.yml` between the `gh workflow run` call and the list query, the wrong run is polled and the analysis is corrupted. Fix: capture a timestamp before triggering, then filter with `--created ">$BEFORE"` (and optionally narrow to your own runs with `--user <login>`). The same issue applies to Phase 1 Step 5 (Docker build run ID). |

Medium

| # | Location | Issue |
|---|---|---|
| M1 | Phase 4a, line 208 | `grep -v "dotnet"` may silently drop real Nethermind exceptions whose stack frames include .NET/dotnet runtime references. The intent is to skip setup-phase noise (pip, apt), not .NET frames. Tighten the exclusion to anchor on line-start or known setup prefixes. |
| M2 | Phase 3, line 196 | No polling timeout. A queued or hung workflow will cause the skill to loop indefinitely, exhausting the conversation context window. Define a maximum wait (e.g., 2 hours / 240 iterations) with a graceful break. |

Low

| # | Location | Issue |
|---|---|---|
| L1 | Phase 0c, line 119 | `base64 -d` is Linux/GNU only; macOS requires `base64 -D`. A fallback like `base64 -d 2>/dev/null \|\| base64 -D` or a note that a Linux runner is assumed would avoid surprises. |
| L2 | Phase 1, Step 5 (line 159) | Docker build run ID retrieval is not shown explicitly: same race as H2. |
| L3 | agent-skills.md convention | The rule says to also create `.cursor/skills/<name>` (see `.agents/rules/agent-skills.md`). No existing skills have it either, so this may be aspirational, but it's worth noting. |

Verdict: not yet mergeable — H1 (git push * in allowed-tools) and H2 (run-ID race condition) should be fixed. M1 and M2 are straightforward to address and would meaningfully improve reliability.

Comment thread .agents/skills/gas-benchmark/SKILL.md Outdated
- Bash(git branch *)
- Bash(git log *)
- Bash(git status *)
- Bash(git push *)

High — unnecessary permission in allowed-tools

Bash(git push *) does not appear anywhere in the skill's instructions — the Docker build and repricing are triggered via gh workflow run, not by pushing commits. This entry grants Claude the ability to push to any remote/branch without user approval, which is unnecessary.


## Phase 3 — Wait for completion

1. Get the run ID from `gh run list --repo NethermindEth/gas-benchmarks --workflow=repricing-nethermind.yml --limit 1`.
2. Poll: `gh run view <run-id> --repo NethermindEth/gas-benchmarks --json status,conclusion` every 30 seconds.

High — race condition: wrong run may be analyzed

gh run list --limit 1 picks up the most-recently-created run on that workflow, but there is no synchronization point between gh workflow run (Phase 2) and this listing. If another engineer (or a scheduled job) triggers repricing-nethermind.yml in the same window, this step will poll the wrong run and produce a completely incorrect analysis.

A safer pattern is to capture the creation timestamp just before triggering, then filter:

```
BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
gh workflow run repricing-nethermind.yml ...
# wait a few seconds for GitHub to register the run
sleep 10
gh run list --repo NethermindEth/gas-benchmarks \
  --workflow=repricing-nethermind.yml \
  --created ">$BEFORE" \
  --limit 1 \
  --json databaseId --jq '.[0].databaseId'
```

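Or, at minimum, narrow the listing to runs triggered by your own account with `--user <login>` (that is the `gh run list` filter; `actor` is the name of the underlying REST API parameter).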
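The re-review later notes that the fix filters on `createdAt` rather than relying on `--limit 1` alone. A minimal sketch of that selection step follows, assuming `gh run list` is asked for `createdAt` and `databaseId` and that both timestamps are ISO-8601 UTC at second precision. The function names `pick_run_after` and `find_run_id` are hypothetical, not taken from SKILL.md.

```shell
# Pick the newest run created strictly after a cutoff.
# stdin: one "<createdAt> <databaseId>" pair per line.
pick_run_after() {
  # ISO-8601 UTC timestamps of equal precision sort lexicographically,
  # so awk string comparison is sufficient here.
  awk -v before="$1" '
    $1 > before && $1 > best { best = $1; id = $2 }
    END { if (id != "") print id }
  '
}

# Hypothetical wrapper: list recent runs of one workflow, keep only ours.
# Usage:
#   BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
#   gh workflow run repricing-nethermind.yml ...
#   sleep 10
#   find_run_id NethermindEth/gas-benchmarks repricing-nethermind.yml "$BEFORE"
find_run_id() {
  repo=$1 workflow=$2 before=$3
  gh run list --repo "$repo" --workflow "$workflow" --limit 20 \
    --json createdAt,databaseId \
    --jq '.[] | "\(.createdAt) \(.databaseId)"' | pick_run_after "$before"
}
```

An empty result means no run has registered yet, so the caller can retry instead of polling a stale ID. The timestamp filter narrows the window but does not remove the ambiguity if someone else triggers the same workflow after `BEFORE`.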

Comment thread .agents/skills/gas-benchmark/SKILL.md Outdated

Scan for ALL of these patterns. Report every match with the full log line:
```
grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" | grep -v "docker\|pip\|node-exporter\|apt\|dotnet\|nuget\|npm\|orphan"
```

Medium — grep -v "dotnet" may suppress real Nethermind exceptions

The exclusion list includes dotnet, which appears in fully-qualified .NET type names that Nethermind itself logs. For example, a stack frame or exception message like:

Nethermind.Core.Exceptions.InvalidBlockException ... at dotnet.runtime/...

would be silently dropped. The intent seems to be filtering out setup noise (pip, apt, npm), not .NET runtime frames. Consider tightening the exclusion to only match lines where the keyword is at the start of the line or part of a known setup-step prefix, rather than anywhere on the line:

grep -v "^\s*\(docker\|pip\|node-exporter\|apt\|nuget\|npm\|orphan\)"

Or exclude only specific known-noisy prefixes rather than any line containing "dotnet".
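To make the difference concrete, here is a small sketch of the anchored exclusion written with POSIX character classes instead of GNU-only `\s` and `\|`. The function name `scan_exceptions` is hypothetical and the pip line in the usage note is made up for illustration.

```shell
# Keep lines that look like exceptions or invalid-block reports, then drop
# only those that START with a known setup-tool name.
scan_exceptions() {
  grep -iE 'Exception|Invalid Block|InvalidBlock|Rejected invalid' \
    | grep -vE '^[[:space:]]*(docker|pip|node-exporter|apt|nuget|npm|orphan)'
}
```

Fed a line starting with `pip install failed: Exception ...` and the `InvalidBlockException ... at dotnet.runtime/...` line from above, this keeps only the latter. One caveat: if the log lines carry a job, step, or timestamp prefix, the `^` anchor would need to skip that prefix first, otherwise nothing is excluded.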

2. Poll: `gh run view <run-id> --repo NethermindEth/gas-benchmarks --json status,conclusion` every 30 seconds.
3. Report to the user when the run completes with success or failure.

## Phase 4 — Analyze results

Medium — no upper bound on the polling loop

Phase 3 polls every 30 seconds with no documented maximum wait time. A workflow that gets stuck in a GitHub Actions queue or hangs mid-run will cause the skill to loop indefinitely, consuming the entire conversation context window.

Add an explicit timeout (e.g., 2 hours = 240 iterations at 30 s each) and break out with a clear message if exceeded:

Max wait: 2 hours. After 240 polls with no terminal status, report "timed out" and provide the run URL for manual inspection.
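A bounded version of the polling loop could look like the sketch below. It assumes `gh run view --json status,conclusion` as quoted from the skill; the function name `wait_for_run`, its argument order, and the exit code 124 for a timeout are illustrative choices, not part of SKILL.md.

```shell
# Poll a run until it reports "completed" or the poll budget is spent.
# Usage: wait_for_run <run-id> [max-polls] [interval-seconds]
# Prints the conclusion and returns 0 on completion; returns 124 on timeout.
wait_for_run() {
  run_id=$1
  max_polls=${2:-240}   # 240 polls x 30 s = 2 hours
  interval=${3:-30}
  repo=NethermindEth/gas-benchmarks
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    status=$(gh run view "$run_id" --repo "$repo" --json status --jq '.status')
    if [ "$status" = "completed" ]; then
      gh run view "$run_id" --repo "$repo" --json conclusion --jq '.conclusion'
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out after $max_polls polls; inspect run $run_id manually" >&2
  return 124
}
```

Counting polls rather than comparing wall-clock time keeps the loop simple and makes the budget easy to shrink when testing.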

```
gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
--jq '.content' | base64 -d | head -60
```

Low — base64 -d is not portable to macOS

base64 -d is the Linux/GNU form; macOS uses base64 -D. For cross-platform support, use:

base64 -d 2>/dev/null || base64 -D

Or note in the skill that this step assumes a Linux runner (which may be fine given the CI context).
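One detail to watch when the fallback is appended to a pipeline: the shell parses `a | base64 -d || base64 -D` as `(a | base64 -d) || base64 -D`, so the second command would not receive the piped data. A small sketch that probes once for the supported flag avoids this; the helper name `b64decode` is hypothetical.

```shell
# Decode base64 from stdin using whichever decode flag this platform accepts.
b64decode() {
  # Probe with a tiny known-good input ("QQ==" decodes to "A").
  if printf 'QQ==' | base64 -d >/dev/null 2>&1; then
    base64 -d
  else
    base64 -D
  fi
}
```

With this helper the discovery step would end in `... --jq '.content' | b64decode`.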

```
-f tag=<tag> \
-f dockerfile=<dockerfile> \
-f build-config=release
```

Low — run ID for Docker build not captured explicitly

After gh workflow run publish-docker.yml ... fires, Step 5 says "Poll until complete: gh run view <run-id>" but no step shows how <run-id> is obtained. The same race condition present in Phase 3 applies here. Consider adding an explicit step (with the same --created >$BEFORE approach) immediately after the trigger.

- H1: remove unused `Bash(git push *)` from allowed-tools
- H2: use timestamp-based run ID lookup to avoid race conditions
  in both Docker build (Phase 1) and benchmark (Phase 3) polling
- M1: tighten grep exclusions to not suppress real .NET exceptions
- M2: add 2-hour polling timeout to prevent infinite loops
@kamilchodola

@claude re-review

@claude

claude Bot commented May 8, 2026

Claude finished @kamilchodola's task in 1m 56s


Re-review

  • Read current state of SKILL.md after fixes
  • Verify H1: git push * removed from allowed-tools
  • Verify H2: race condition fixed for run ID capture
  • Verify M1: grep -v "dotnet" exception-filter tightened
  • Verify M2: polling timeout added
  • Check for any new issues introduced by the fixes
  • Post final verdict

All four previously-flagged findings have been properly addressed.

Verified fixes

| Finding | Status | How fixed |
|---|---|---|
| H1: `Bash(git push *)` in allowed-tools | ✅ Fixed | Broad `Bash(gh *)` and `Bash(git *)` replaced with fine-grained subcommand entries (`Bash(gh run *)`, `Bash(gh workflow run *)`, etc.). `git push` is gone entirely. |
| H2: run ID race condition | ✅ Fixed | Phase 1 (line 152) and Phase 2 (line 171) now capture `BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)` before triggering, and both run-ID queries filter with `select(.createdAt > "<BEFORE>")`. |
| M1: `grep -v "dotnet"` silently dropping .NET frames | ✅ Fixed | `dotnet` removed from the exclusion list. An explicit note at line 223 ("do NOT exclude dotnet") prevents regressing this in future edits. |
| M2: no polling timeout | ✅ Fixed | Line 206: "Timeout after 2 hours (240 polls)" with a break-and-report instruction. |

Remaining low-severity item (not blocking)

L1 — base64 -d (line 117): Still Linux/GNU only; macOS requires base64 -D. Since this skill runs in a CI/Linux context, it's unlikely to matter in practice, but a comment like # Linux only; macOS: base64 -D would document the assumption.

No new issues were introduced by the fixes.


Verdict: mergeable. All High and Medium findings are resolved.

@kamilchodola

@claude can you do a full review again, not only focusing on previous points?

@claude

claude Bot commented May 8, 2026

Claude finished @kamilchodola's task in 3m 24s


Full PR Review

  • Read SKILL.md and rules files
  • Full analysis: security, correctness, robustness, UX
  • Inline comments for specific issues
  • Final verdict

Summary

This is a full independent review of the PR, not limited to the previously flagged findings. The H1/H2/M1/M2 fixes from the prior round are confirmed correct. Two new Medium and four new Low issues were found.


Previously Fixed (confirmed ✅)

| Finding | Status |
|---|---|
| H1: `Bash(git push *)` in allowed-tools | ✅ Removed |
| H2: run ID race condition | ✅ Fixed with BEFORE timestamp filter |
| M1: `grep -v "dotnet"` dropping .NET frames | ✅ Fixed |
| M2: no polling timeout | ✅ 2-hour / 240-poll limit added |

New Findings

Medium

| # | Location | Issue |
|---|---|---|
| M3 | Phase 0c, line 117 | `head -60` may truncate workflow input discovery. GitHub Actions `on.workflow_dispatch.inputs` is commonly defined past line 60 in complex workflows. If the inputs block is cut off, the skill discovers zero inputs and Phase 2 silently omits required flags (`release_tag`, `genesis_file`, `diagnostics_mode`), causing the triggered run to fail with confusing errors. Fix: remove the `head` cap entirely; this is a small YAML blob, not a large file. |
| M4 | Phase 1, Step 3, line 149 | Branch names with `/` produce invalid Docker tags. Docker tags cannot contain `/`. A branch like `feature/my-branch` or `devnets/bal-devnet-6` passed to tag computation would cause the Docker build to reject the tag outright (invalid reference format). The skill should sanitize with `tr '/' '-'` and surface the resolved tag in the Phase 0e confirmation. |

Low

| # | Location | Issue |
|---|---|---|
| L1 | Phase 4b/4c, lines 238–246 | No concrete grep patterns for timing and block-stats extraction. Both phases describe extracting `Processed` lines and `Block stats` lines by keyword but give no command. Agents will guess and may extract wrong or incomplete data. Add at minimum approximate grep patterns as a starting point. |
| L2 | Phase 4d, line 250 | No download command for opcode tracing JSON. Phase 4d says "download `opcodes_tracing-stateful-<network>.json`" but provides no `gh release download` command. Also unclear whether this step runs always or only when comparing two runs explicitly. |
| L3 | Phase 0d, line 123 | Unknown networks silently produce no genesis file. The mapping covers `perf-devnet-3`, `jochemnet`, and `mainnet`. Any other network value silently omits the `genesis_file` flag with no warning or error, which may cause the repricing workflow to misfire. Should add an explicit warning or error for unmapped networks. |
| L4 | `.agents/rules/agent-skills.md` | `.cursor/skills/gas-benchmark` symlink missing. The rule requires both `.claude/skills/<name>` and `.cursor/skills/<name>` symlinks. No existing skills in the repo have the Cursor symlink either, so this appears aspirational, but it is worth noting for consistency. |
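For L3, a guard along these lines would turn the silent omission into an explicit failure. It is only a sketch: the three network names come from the finding above, while the function name `require_known_network` and the error wording are hypothetical, and the actual genesis file values are deliberately left out because they are not shown in this review.

```shell
# Fail loudly when a network has no genesis mapping, instead of silently
# omitting the genesis_file flag.
require_known_network() {
  case "$1" in
    perf-devnet-3|jochemnet|mainnet)
      return 0
      ;;
    *)
      echo "error: no genesis mapping for network '$1'; not triggering the workflow" >&2
      return 1
      ;;
  esac
}
```

Calling `require_known_network "$NETWORK" || exit 1` before Phase 2 would stop the pipeline before a misconfigured run is triggered.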

Not an Issue (reviewed, no finding)

  • allowed-tools scope: Fine. Bash(gh api repos/NethermindEth/*) is correctly scoped to NethermindEth repos. No write git operations. Bash(gh run download *) is permitted via Bash(gh run *) and is necessary for dotTrace artifacts.
  • Bash(cd *) and Bash(until *) in allowed-tools: Shell built-ins that can't be meaningfully scoped, but harmless.
  • Phase 3 timestamp race: Fixed correctly — select(.createdAt > "<BEFORE>") guards both Docker build and repricing run lookups.
  • Exception exclusion list: The nuget.org exclusion is reasonable — repricing runs don't do NuGet restore.
  • dotTrace guard (do NOT pass diagnostics_mode=dottrace if image not built with Dockerfile.diag): Clear and correctly placed.
  • Shutdown confirmation (Nethermind is shut down grep): Good defensive check.
  • MSYS_NO_PATHCONV=1: Correctly included for Windows Git Bash path conversion.
  • Phase 0e confirmation step: Good UX, though no explicit "user says No → abort" instruction.

Verdict: mergeable with caveats. M3 (head -60) and M4 (branch / in Docker tag) are real failure modes worth fixing before the skill is used in production. The Low findings are minor polish items. No new High or Critical issues.

Comment thread .agents/skills/gas-benchmark/SKILL.md Outdated
Read the workflow YAML on the chosen branch to learn which inputs it supports:
```
gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
--jq '.content' | base64 -d | head -60
```

Medium — head -60 may truncate workflow input discovery

The head -60 cap is dangerous: GitHub Actions workflow files typically define their on.workflow_dispatch.inputs block after frontmatter, comments, and the on: key — often past line 60 for moderately complex workflows. If the inputs block is truncated, the skill discovers zero inputs and Phase 2 silently omits required flags (like release_tag, genesis_file, or diagnostics_mode), causing the triggered run to fail with confusing errors.

Fix: remove the head limit entirely, or use a larger cap like head -120. The file is a small YAML blob fetched via API — there's no cost to reading it fully.

Suggested change
```
- --jq '.content' | base64 -d | head -60
+ gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
+ --jq '.content' | base64 -d
```
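Once the whole file is read, the input names can be pulled out mechanically rather than by eye. The following is a rough sketch, assuming a conventionally indented workflow file in which `inputs:` sits directly under `workflow_dispatch:`. It is not a YAML parser, and the function name `list_dispatch_inputs` is hypothetical.

```shell
# Print the names declared under on.workflow_dispatch.inputs.
# Reads the decoded workflow YAML on stdin, e.g.
#   gh api ... --jq '.content' | base64 -d | list_dispatch_inputs
list_dispatch_inputs() {
  awk '
    # Number of leading spaces or tabs on a line.
    function indent_of(s,   t) { t = s; sub(/^[ \t]*/, "", t); return length(s) - length(t) }

    /^[ \t]*(#.*)?$/ { next }                       # skip blank and comment-only lines

    in_inputs {
      ind = indent_of($0)
      if (ind <= base) {                            # dedent: the inputs block has ended
        in_inputs = 0; in_wd = 0
      } else {
        if (child == 0) child = ind                 # indentation of the first input name
        if (ind == child) { name = $1; sub(/:.*$/, "", name); print name }
        next
      }
    }

    /^[ \t]*workflow_dispatch:/ { in_wd = 1; next }
    in_wd && /^[ \t]*inputs:[ \t]*$/ { base = indent_of($0); in_inputs = 1; child = 0 }
  '
}
```

An empty result is then a signal worth surfacing to the user, since it means either the workflow takes no inputs or the layout differs from what the sketch assumes.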

Comment thread .agents/skills/gas-benchmark/SKILL.md Outdated
2. Determine Dockerfile based on dotTrace:
- dotTrace enabled → `Dockerfile.diag`, tag suffix `-diag`
- dotTrace disabled → regular `Dockerfile`, no suffix
3. Compute tag: `<branch-name>-diag` (if diag) or `<branch-name>` (if regular).

Medium — branch names containing / produce invalid Docker image tags

Docker image tags cannot contain /. A branch like feature/my-branch or devnets/bal-devnet-6 would generate a tag like feature/my-branch, which Docker rejects with invalid reference format.

The skill should sanitize the branch name when computing the tag:

```
TAG=$(echo "<branch-name>" | tr '/' '-')
# e.g., feature/my-branch → feature-my-branch
```

Add this normalization step before constructing the tag (Step 3), and surface the final tag to the user in the Phase 0e confirmation so they can verify it before the build starts.
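A slightly more general sketch of the tag computation is shown below. It goes one step beyond the suggestion above by replacing every character outside the set Docker accepts in tags (letters, digits, `_`, `.`, `-`), not only `/`. The function name `docker_tag_for_branch` and the `true`/`false` second argument are hypothetical.

```shell
# Compute a Docker-safe tag from a branch name.
# Usage: docker_tag_for_branch <branch> [true|false]   (second arg: diag build)
docker_tag_for_branch() {
  branch=$1
  diag=${2:-false}
  # Replace anything outside [A-Za-z0-9_.-] with a hyphen.
  tag=$(printf '%s' "$branch" | tr -c 'A-Za-z0-9_.-' '-')
  if [ "$diag" = "true" ]; then
    tag="${tag}-diag"
  fi
  printf '%s\n' "$tag"
}
```

For example, `docker_tag_for_branch devnets/bal-devnet-6 true` yields `devnets-bal-devnet-6-diag`. The sketch does not handle the remaining tag rules (a tag may not start with `.` or `-`, and is limited to 128 characters), so unusual branch names would still need a check.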

Report sload/sstore/create counts for the heaviest test blocks.

### 4d. Opcode tracing comparison


Low — no download command for opcode tracing JSON

Phase 4d says to "download opcodes_tracing-stateful-<network>.json from the relevant release(s)" but doesn't give a concrete command. An agent seeing this will have to guess — likely reaching for gh release download, which requires the asset name exactly. Provide the command:

```
gh release download <tag> --repo NethermindEth/gas-benchmarks \
  --pattern "opcodes_tracing-stateful-<network>.json" \
  --output opcodes_tracing.json
```

Also clarify when this step is triggered: is it always run, or only when explicitly comparing two runs?

Comment thread .agents/skills/gas-benchmark/SKILL.md Outdated

### 4b. Timing analysis

Extract all `Processed` lines. For each test block (blocks with `Gas gwei` in the log line):

Low — vague log extraction instructions for timing and block stats

Phases 4b and 4c say "extract all Processed lines" and "extract Block stats lines" but give no grep pattern. An agent interpreting these against real Nethermind logs may get confused or extract the wrong lines, especially across log versions.

Add concrete patterns, for example:

```
# 4b: timing lines
grep "Processed" logs.txt | grep "Gas gwei"

# 4c: block stats
grep -E "\bsload\b|\bsstore\b|\bcreate\b" logs.txt
```

Even approximate patterns are better than none — the agent can adapt if the log format shifts, but it needs a starting point.

- M3: remove head -60 cap on workflow input discovery (truncated inputs)
- M4: sanitize branch names with / for Docker tags (tr '/' '-')
- L4: add concrete gh release download command for opcode tracing
- L5: add grep patterns for timing and block stats extraction