---
name: gas-benchmark
description: Build a diag Docker image, run gas-benchmarks repricing workflow, and analyze results including dotTrace XML reports. Use when asked to "run benchmarks", "trigger gas benchmarks", "benchmark this branch", or "profile block processing".
allowed-tools:
- Bash(gh run *)
- Bash(gh workflow run *)
- Bash(gh release *)
- Bash(gh api repos/NethermindEth/*)
- Bash(git branch *)
- Bash(git log *)
- Bash(git status *)
- Bash(cd *)
- Bash(ls *)
- Bash(mkdir *)
- Bash(find *)
- Bash(unzip *)
- Bash(cat *)
- Bash(wc *)
- Bash(sleep *)
- Bash(until *)
- Bash(date *)
- Read
- Grep
- Glob
argument-hint: "[--branch NAME] [--image NAME] [--filter PATTERN] [--network NETWORK] [--fork FORK] [--dottrace]"
---

# Gas Benchmark Pipeline

End-to-end pipeline: build diag Docker image, trigger gas-benchmarks repricing workflow, wait for completion, and analyze results (logs, timings, dotTrace XML).

## Interactive mode (no arguments)

When called without arguments (`/gas-benchmark`), do NOT proceed with defaults. Instead, interactively gather the required information:

1. **Show available releases** and ask the user to pick one:
```
gh api repos/NethermindEth/gas-benchmarks/releases?per_page=15 \
--jq '.[] | "- `" + .tag_name + "` " + (if .draft then "(draft)" else "" end) + " — " + .name'
```
Ask: "Which release should I use for test data?"

2. **Ask for the image**: "Which Nethermind Docker image? (e.g., `nethermindeth/nethermind:bal-devnet-6`) Or should I build one from a branch?"

3. **Ask for the network**: "Which network? (`perf-devnet-3`, `jochemnet`, `mainnet`)"

4. **Ask for filter**: "Any test filter? (e.g., `bloated`, or leave empty for all tests)"

5. **Ask about dotTrace**: "Do you want dotTrace profiling? (requires building a diag image, adds ~2min to build)"

Then proceed with the resolved values.

When called WITH arguments, parse them and proceed directly — only ask if something essential is missing or ambiguous.

## Argument parsing

Parse `$ARGUMENTS` for these flags:

| Flag | Default | Description |
|------|---------|-------------|
| `--branch` | current git branch | Nethermind branch to build the Docker image from |
| `--image` | (built from branch) | Skip Docker build; use this pre-built image directly |
| `--filter` | (none) | Test filter pattern passed to repricing workflow |
| `--network` | `perf-devnet-3` | Network name (perf-devnet-3, jochemnet, mainnet) |
| `--fork` | `amsterdam` | Fork name (amsterdam, osaka) |
| `--dottrace` | (ask user) | Enable dotTrace profiling — builds diag image, passes diagnostics flags |
| `--release` | (discovered) | Override release tag — skips interactive selection |
| `--gas-benchmarks-ref` | (discovered) | Override gas-benchmarks branch — skips discovery |

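A minimal bash sketch of how this flag parsing might look; the variable names and the unknown-flag handling are assumptions, while the defaults come from the table above:

```shell
#!/usr/bin/env bash
# Defaults from the table above; empty means "resolve later or ask the user".
NETWORK="perf-devnet-3"; FORK="amsterdam"
BRANCH=""; IMAGE=""; FILTER=""; DOTTRACE=""

parse_args() {
  while [ $# -gt 0 ]; do
    case "$1" in
      --branch)   BRANCH="$2";  shift 2 ;;
      --image)    IMAGE="$2";   shift 2 ;;
      --filter)   FILTER="$2";  shift 2 ;;
      --network)  NETWORK="$2"; shift 2 ;;
      --fork)     FORK="$2";    shift 2 ;;
      --dottrace) DOTTRACE=1;   shift ;;
      *) echo "unknown flag: $1" >&2; shift ;;
    esac
  done
}

parse_args --branch feature/my-branch --dottrace --network jochemnet
echo "$BRANCH $NETWORK $FORK ${DOTTRACE:-0}"
```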
## Phase 0 — Discover gas-benchmarks branch and release

### Step 0a: Resolve the release

If `--release` was provided, use it. Otherwise (in non-interactive mode), discover the latest release for the fork:
```
gh api repos/NethermindEth/gas-benchmarks/releases?per_page=15 \
--jq '[.[] | select(.tag_name | startswith("<fork>"))] | first | .tag_name'
```
Verify the release has data for the requested network:
```
gh release view <tag> --repo NethermindEth/gas-benchmarks --json assets --jq '.assets[].name'
```
Look for `generated-tests-stateful-<network>.tar.gz`.

### Step 0b: Find the gas-benchmarks branch

If `--gas-benchmarks-ref` was provided, use it. Otherwise, **extract from the release notes**:

1. The release body contains a `**Branch:**` field that records which gas-benchmarks branch generated the test data. Parse it:
```
gh release view <tag> --repo NethermindEth/gas-benchmarks --json body --jq '.body' \
| grep -oP '(?<=\*\*Branch:\*\* ).*' | tr -d '`' | xargs
```
On Windows/Git Bash where `grep -P` may not work:
```
gh release view <tag> --repo NethermindEth/gas-benchmarks --json body --jq '.body' \
| grep "Branch:" | sed 's/.*Branch:\*\* *//; s/`//g' | xargs
```

2. If the release notes don't contain a branch field, fall back to listing branches:
```
gh api repos/NethermindEth/gas-benchmarks/branches?per_page=100 \
--jq '.[].name' | grep -E "devnets/bal|stateful-generator"
```
Ask the user which branch to use.

3. Verify the workflow exists on the chosen branch:
```
gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> --jq '.name' 2>/dev/null
```
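The `Branch:` extraction in step 1 can be exercised against a sample body; the body text below is illustrative, not a real release:

```shell
# Sample release body; real bodies come from `gh release view ... --jq '.body'`
body='Generated stateful tests
**Branch:** `devnets/bal-stateful-generator`
**Commit:** abc1234'

# Same sed-based fallback as above (works where grep -P is unavailable)
branch=$(printf '%s\n' "$body" \
  | grep "Branch:" | sed 's/.*Branch:\*\* *//; s/`//g' | xargs)
echo "$branch"
```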

### Step 0c: Discover workflow inputs

Read the workflow YAML on the chosen branch to learn which inputs it supports:
```
gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
  --jq '.content' | base64 -d
```
Read the file in full rather than capping it with `head`; the `inputs:` block often sits past line 60, and truncating it would cause Phase 2 to silently omit required flags. On macOS, `base64 -D` may be needed in place of `-d`.

Note which of these inputs exist: `release_tag`, `genesis_file`, `runner`, `diagnostics_mode`, `diagnostics_xml`. Only pass flags the workflow declares.
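A sketch of honoring that rule when assembling the trigger command; the `declared` list is a placeholder for whatever Step 0c actually found:

```shell
# Inputs discovered in Step 0c (placeholder values)
declared="test fork release_tag genesis_file runner"

args=""
add_input() {
  # Append "-f name=value" only when the workflow declares the input
  case " $declared " in
    *" $1 "*) args="$args -f $1=$2" ;;
    *) echo "skipping undeclared input: $1" >&2 ;;
  esac
}

add_input release_tag "amsterdam-v3"
add_input diagnostics_mode "dottrace"   # not declared, so skipped
echo "$args"
```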

### Step 0d: Determine genesis file

Map network to genesis filename:
- `perf-devnet-3` → `generator-amsterdam-perf-devnet-3.json`
- `jochemnet` → `generator-amsterdam-jochemnet.json`
- `mainnet` → (no genesis_file flag)

### Step 0e: Confirm with user

Before proceeding, show the resolved configuration:
```
Release: <tag>
Gas-benchmarks: <branch>
Network: <network>
Image: <image or "will build from <branch>">
Filter: <filter or "none (all tests)">
dotTrace: <yes/no>
```
Ask: "Proceed?"

## Phase 1 — Docker image

Skip if `--image` is provided.

1. Determine the Nethermind branch (from `--branch` or `git branch --show-current`).
2. Determine Dockerfile based on dotTrace:
- dotTrace enabled → `Dockerfile.diag`, tag suffix `-diag`
- dotTrace disabled → regular `Dockerfile`, no suffix
3. Compute tag: sanitize the branch name first, since Docker tags cannot contain `/` (e.g. `TAG=$(echo "<branch-name>" | tr '/' '-')`), then use `<tag>-diag` (if diag) or `<tag>` (if regular). Surface the final tag to the user before the build starts so they can verify it.
4. Capture timestamp, then trigger the docker build:
```
BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
MSYS_NO_PATHCONV=1 gh workflow run publish-docker.yml \
--ref <branch> \
-f image-name=nethermind \
-f tag=<tag> \
-f dockerfile=<dockerfile> \
-f build-config=release
```
5. Wait ~10s, then find the run ID using the timestamp to avoid race conditions:
```
gh run list --workflow=publish-docker.yml --limit 5 --json databaseId,createdAt \
--jq '[.[] | select(.createdAt > "<BEFORE>")] | first | .databaseId'
```
6. Poll until complete: `gh run view <run-id> --json status,conclusion`
7. If build fails, fetch logs and report the error. Stop.
8. Final image: `nethermindeth/nethermind:<tag>`
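The timestamp filter in step 5 works because ISO-8601 UTC timestamps compare correctly as plain strings. A runnable sketch, with awk and sample data standing in for the `jq` filter over real run JSON:

```shell
BEFORE="2025-01-15T10:00:00Z"

# Two candidate runs: one created before the trigger, one after
runs="2025-01-15T09:58:10Z 101
2025-01-15T10:00:42Z 102"

# Keep the first run created strictly after BEFORE; plain string comparison
# is safe because ISO-8601 UTC timestamps sort lexically
run_id=$(printf '%s\n' "$runs" | awk -v b="$BEFORE" '$1 > b { print $2; exit }')
echo "$run_id"
```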

## Phase 2 — Trigger repricing workflow

Capture timestamp before triggering: `BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)`

Build the workflow trigger using only the inputs the workflow accepts (from Step 0c):
```
MSYS_NO_PATHCONV=1 gh workflow run repricing-nethermind.yml \
--repo NethermindEth/gas-benchmarks \
--ref <gas-benchmarks-ref> \
-f test="repricings_stateful/<network>" \
-f fork="<fork>" \
-f release_tag="<release>" \
-f genesis_file="<genesis-file>" \
-f filter="<filter>" \
-f 'runner=["stateful-generator"]' \
-f 'images={"nethermind":"<image>"}'
```

Only add diagnostics flags when `--dottrace` is set AND image is a diag build:
```
-f diagnostics_mode="dottrace" \
-f diagnostics_xml="true"
```

**Critical:** Do NOT pass `diagnostics_mode=dottrace` if the image was not built with `Dockerfile.diag` — the container will crash with `exec: dottrace: not found`.

Report the run URL to the user immediately after triggering.

## Phase 3 — Wait for completion

1. Find the run ID using the timestamp captured before triggering (same approach as Phase 1 step 5):
```
gh run list --repo NethermindEth/gas-benchmarks --workflow=repricing-nethermind.yml \
--limit 5 --json databaseId,createdAt \
--jq '[.[] | select(.createdAt > "<BEFORE>")] | first | .databaseId'
```
2. Poll: `gh run view <run-id> --repo NethermindEth/gas-benchmarks --json status,conclusion` every 30 seconds.
3. **Timeout after 2 hours** (240 polls). If exceeded, report "timed out" and provide the run URL for manual inspection. Stop.
4. Report to the user when the run completes with success or failure.
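The poll loop in steps 2 and 3 might look like this sketch, with `check_status` standing in for the real `gh run view` call:

```shell
# Stand-in for: gh run view <run-id> --repo ... --json status --jq '.status'
check_status() { echo "completed"; }

max_polls=240   # 240 polls x 30 s = 2 h
interval=0      # would be 30 in the real loop; 0 keeps the sketch fast
polls=0
status=""
while [ "$polls" -lt "$max_polls" ]; do
  status=$(check_status)
  [ "$status" = "completed" ] && break
  sleep "$interval"
  polls=$((polls + 1))
done

if [ "$status" = "completed" ]; then
  echo "run finished"
else
  echo "timed out; inspect the run URL manually" >&2
fi
```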

## Phase 4 — Analyze results

**THIS PHASE IS MANDATORY. Always run it in full, even if the workflow reported success. Never skip or abbreviate it. A "success" workflow conclusion does NOT mean the blocks processed correctly — Nethermind exceptions can occur mid-run without failing the workflow.**

### 4a. Exception scan (NEVER SKIP)

Fetch job logs: `gh run view --job=<job-id> --repo NethermindEth/gas-benchmarks --log`

Strip ANSI escape codes: `sed 's/\x1b\[[0-9;]*m//g'`

Scan for ALL of these patterns. Report every match with the full log line:
```
grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" | grep -v "node-exporter\|pip install\|apt-get\|npm warn\|orphan process\|docker-compose\|nuget\.org"
```
Note: do NOT exclude `dotnet` — real Nethermind exceptions contain .NET runtime frames.
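Putting the strip-and-scan steps together on a two-line sample; the log text is illustrative, not a real Nethermind log:

```shell
# Illustrative log: one real exception line, one noise line mentioning "Exception"
esc=$(printf '\033')
log="${esc}[31m12:00:01|Nethermind.Consensus InvalidBlockLevelAccessListHash at block 42${esc}[0m
12:00:02|pip install foo raised no Exception here"

# Strip ANSI escape codes, then apply the scan-and-exclude pipeline
clean=$(printf '%s\n' "$log" | sed "s/${esc}\[[0-9;]*m//g")
matches=$(printf '%s\n' "$clean" \
  | grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" \
  | grep -v "node-exporter\|pip install\|apt-get\|npm warn\|orphan process\|docker-compose\|nuget\.org")
echo "$matches"
```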

**Any match means the run has issues.** Classify:
- `HeaderGasUsedMismatch` → gas schedule mismatch between image and test data (wrong branch/fork)
- `InvalidBlockLevelAccessListHash` → BAL pre-state corruption (code bug)
- `InvalidBlockLevelAccessListException` → address/slot not in BAL (missing BAL entries)
- `Rejected invalid block ... reason: block is a part of an invalid chain` → cascade from earlier failure
- Any other `Exception` → report verbatim

**Always report the exception summary in the final report, even when there are zero exceptions.** Write "Exceptions: none" explicitly.

**Confirm shutdown:** grep for `Nethermind is shut down` — if absent, the node crashed or was killed.

### 4b. Timing analysis

Extract all `Processed` lines, e.g. `grep "Processed" | grep "Gas gwei"` on the cleaned log. For each test block (blocks with `Gas gwei` in the log line):
- Report block number, processing time (ms), slot time (ms)
- Identify which test scenario each block belongs to (match against preceding `[TESTING]` log lines)

Sort by processing time descending. Report top 10 heaviest blocks.
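The sort can be sketched against sample `Processed` lines; the line format here is illustrative, so adapt the delimiter and field number to the real log:

```shell
# Illustrative timing lines: block | processing time | marker
lines="Processed block 101 | 812.4 ms | Gas gwei
Processed block 102 | 1543.9 ms | Gas gwei
Processed block 103 | 95.0 ms | Gas gwei"

# Numeric-sort on the second '|'-delimited field, descending, take the top 10
top=$(printf '%s\n' "$lines" | sort -t'|' -k2,2 -rn | head -10)
heaviest=$(printf '%s\n' "$top" | head -1)
echo "$heaviest"
```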

### 4c. Block stats

Extract `Block` stats lines, i.e. those reporting `sload`, `sstore`, and `create` counts (e.g. `grep -E "sload|sstore|create"` on the cleaned log).
Report sload/sstore/create counts for the heaviest test blocks.

### 4d. Opcode tracing comparison

When comparing runs, download `opcodes_tracing-stateful-<network>.json` from the relevant release(s), e.g. `gh release download <tag> --repo NethermindEth/gas-benchmarks --pattern "opcodes_tracing-stateful-<network>.json"`, and compare opcode counts for the specific test to confirm the workload is identical. This step is only needed when comparing two runs, not on every analysis.

### 4e. dotTrace analysis

If dotTrace XML artifacts exist:
1. Download: `gh run download <run-id> --repo NethermindEth/gas-benchmarks -n "repricing-nethermind-dottrace-xml-<run-id>"`
2. Analyze: `bash scripts/dottrace-report.sh top <report.xml> 20`
3. Compare: `bash scripts/dottrace-report.sh compare <baseline.xml> <new.xml> 20`
4. **Never load full XML into context** — files are 50-70MB.

## Phase 5 — Report

```
| Metric | Value |
|--------|-------|
| Branch | ... |
| Image | ... |
| Gas-benchmarks ref | ... |
| Release | ... |
| Run URL | ... |
| Status | success/failure |
| Test block | #N |
| Processing time | X ms |
| sstore count | N |
| sload count | N |
| Exceptions | none / list |
```

If comparing against a baseline, include both timings and the speedup ratio.

## Common filter formats

To discover the exact filter format, check a previous run's logs:
```
gh run view <run-id> --repo NethermindEth/gas-benchmarks --log 2>/dev/null | grep "EFFECTIVE_FILTER"
```
Or check the test fixtures in the release archive. Examples:
- `test_sstore_bloated[10GB-fork_Amsterdam-benchmark_test-cache_strategy_CacheStrategy.NO_CACHE-existing_slots_True-write_new_value_False-benchmark_300M`
- `bloated` (matches all bloated tests)
- (empty = run all tests)