# feat: add gas-benchmark skill for automated repricing benchmarks (#11526)
Base: `master`. Changes from commits: cb8969d, 359429e, 6cee76c, b6fed01, a17ce11, ce0ad9c, 35bb336.
**New file** (+290 lines):
```yaml
---
name: gas-benchmark
description: Build a diag Docker image, run gas-benchmarks repricing workflow, and analyze results including dotTrace XML reports. Use when asked to "run benchmarks", "trigger gas benchmarks", "benchmark this branch", or "profile block processing".
allowed-tools:
  - Bash(gh run *)
  - Bash(gh workflow run *)
  - Bash(gh release *)
  - Bash(gh api repos/NethermindEth/*)
  - Bash(git branch *)
  - Bash(git log *)
  - Bash(git status *)
  - Bash(cd *)
  - Bash(ls *)
  - Bash(mkdir *)
  - Bash(find *)
  - Bash(unzip *)
  - Bash(cat *)
  - Bash(wc *)
  - Bash(sleep *)
  - Bash(until *)
  - Bash(date *)
  - Read
  - Grep
  - Glob
argument-hint: "[--branch NAME] [--image NAME] [--filter PATTERN] [--network NETWORK] [--fork FORK] [--dottrace]"
---
```
# Gas Benchmark Pipeline

End-to-end pipeline: build a diag Docker image, trigger the gas-benchmarks repricing workflow, wait for completion, and analyze the results (logs, timings, dotTrace XML).
## Interactive mode (no arguments)

When called without arguments (`/gas-benchmark`), do NOT proceed with defaults. Instead, interactively gather the required information:
1. **Show available releases** and ask the user to pick one:

   ```
   gh api repos/NethermindEth/gas-benchmarks/releases?per_page=15 \
     --jq '.[] | "- `" + .tag_name + "` " + (if .draft then "(draft)" else "" end) + " — " + .name'
   ```

   Ask: "Which release should I use for test data?"

2. **Ask for the image**: "Which Nethermind Docker image? (e.g., `nethermindeth/nethermind:bal-devnet-6`) Or should I build one from a branch?"

3. **Ask for the network**: "Which network? (`perf-devnet-3`, `jochemnet`, `mainnet`)"

4. **Ask for filter**: "Any test filter? (e.g., `bloated`, or leave empty for all tests)"

5. **Ask about dotTrace**: "Do you want dotTrace profiling? (requires building a diag image, adds ~2 min to the build)"

Then proceed with the resolved values.
When called WITH arguments, parse them and proceed directly — only ask if something essential is missing or ambiguous.
## Argument parsing

Parse `$ARGUMENTS` for these flags:
| Flag | Default | Description |
|------|---------|-------------|
| `--branch` | current git branch | Nethermind branch to build the Docker image from |
| `--image` | (built from branch) | Skip Docker build; use this pre-built image directly |
| `--filter` | (none) | Test filter pattern passed to repricing workflow |
| `--network` | `perf-devnet-3` | Network name (perf-devnet-3, jochemnet, mainnet) |
| `--fork` | `amsterdam` | Fork name (amsterdam, osaka) |
| `--dottrace` | (ask user) | Enable dotTrace profiling — builds diag image, passes diagnostics flags |
| `--release` | (discovered) | Override release tag — skips interactive selection |
| `--gas-benchmarks-ref` | (discovered) | Override gas-benchmarks branch — skips discovery |
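The flag table above can be handled with a plain `case` loop. A minimal sketch, assuming Bash and whitespace-separated arguments (the `parse_args` helper and variable names are illustrative, not part of the skill):

```shell
#!/usr/bin/env bash
# Illustrative parser for the flags in the table above.
# Defaults mirror the table; empty means "discover or ask interactively".
NETWORK="perf-devnet-3"
FORK="amsterdam"
BRANCH="" IMAGE="" FILTER="" DOTTRACE="" RELEASE="" GB_REF=""

parse_args() {
  while [ $# -gt 0 ]; do
    case "$1" in
      --branch)             BRANCH="$2";  shift 2 ;;
      --image)              IMAGE="$2";   shift 2 ;;
      --filter)             FILTER="$2";  shift 2 ;;
      --network)            NETWORK="$2"; shift 2 ;;
      --fork)               FORK="$2";    shift 2 ;;
      --dottrace)           DOTTRACE=1;   shift ;;
      --release)            RELEASE="$2"; shift 2 ;;
      --gas-benchmarks-ref) GB_REF="$2";  shift 2 ;;
      *) echo "unknown flag: $1" >&2; return 1 ;;
    esac
  done
}

parse_args --branch feat/x --dottrace --network jochemnet
echo "branch=$BRANCH network=$NETWORK dottrace=$DOTTRACE"
# prints: branch=feat/x network=jochemnet dottrace=1
```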
## Phase 0 — Discover gas-benchmarks branch and release

### Step 0a: Resolve the release
If `--release` was provided, use it. Otherwise (in non-interactive mode), discover the latest release for the fork:

```
gh api repos/NethermindEth/gas-benchmarks/releases?per_page=15 \
  --jq '[.[] | select(.tag_name | startswith("<fork>"))] | first | .tag_name'
```

Verify the release has data for the requested network:

```
gh release view <tag> --repo NethermindEth/gas-benchmarks --json assets --jq '.assets[].name'
```

Look for `generated-tests-stateful-<network>.tar.gz`.
### Step 0b: Find the gas-benchmarks branch

If `--gas-benchmarks-ref` was provided, use it. Otherwise, **extract from the release notes**:

1. The release body contains a `**Branch:**` field that records which gas-benchmarks branch generated the test data. Parse it:

   ```
   gh release view <tag> --repo NethermindEth/gas-benchmarks --json body --jq '.body' \
     | grep -oP '(?<=\*\*Branch:\*\* ).*' | tr -d '`' | xargs
   ```

   On Windows/Git Bash where `grep -P` may not work:

   ```
   gh release view <tag> --repo NethermindEth/gas-benchmarks --json body --jq '.body' \
     | grep "Branch:" | sed 's/.*Branch:\*\* *//; s/`//g' | xargs
   ```
2. If the release notes don't contain a branch field, fall back to listing branches:

   ```
   gh api repos/NethermindEth/gas-benchmarks/branches?per_page=100 \
     --jq '.[].name' | grep -E "devnets/bal|stateful-generator"
   ```

   Ask the user which branch to use.

3. Verify the workflow exists on the chosen branch:

   ```
   gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> --jq '.name' 2>/dev/null
   ```
### Step 0c: Discover workflow inputs

Read the workflow YAML on the chosen branch to learn which inputs it supports:

```
gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
  --jq '.content' | base64 -d | head -60
```
> **Review comment (Low):** `base64 -d` portability: either use `base64 -d 2>/dev/null || base64 -D`, or note in the skill that this step assumes a Linux runner (which may be fine given the CI context).
Note which of these inputs exist: `release_tag`, `genesis_file`, `runner`, `diagnostics_mode`, `diagnostics_xml`. Only pass flags the workflow declares.
### Step 0d: Determine genesis file

Map network to genesis filename:

- `perf-devnet-3` → `generator-amsterdam-perf-devnet-3.json`
- `jochemnet` → `generator-amsterdam-jochemnet.json`
- `mainnet` → (no genesis_file flag)
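The mapping above can be captured in a small helper. A sketch (the `genesis_for_network` name is illustrative; the filenames are exactly those listed):

```shell
# Map a network name to its genesis file.
# Prints nothing for mainnet, which takes no genesis_file flag.
genesis_for_network() {
  case "$1" in
    perf-devnet-3) echo "generator-amsterdam-perf-devnet-3.json" ;;
    jochemnet)     echo "generator-amsterdam-jochemnet.json" ;;
    mainnet)       ;;  # no genesis_file flag
    *) echo "unknown network: $1" >&2; return 1 ;;
  esac
}

genesis_for_network perf-devnet-3
# prints: generator-amsterdam-perf-devnet-3.json
```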
### Step 0e: Confirm with user

Before proceeding, show the resolved configuration:

```
Release: <tag>
Gas-benchmarks: <branch>
Network: <network>
Image: <image or "will build from <branch>">
Filter: <filter or "none (all tests)">
dotTrace: <yes/no>
```

Ask: "Proceed?"
## Phase 1 — Docker image

Skip if `--image` is provided.

1. Determine the Nethermind branch (from `--branch` or `git branch --show-current`).
2. Determine the Dockerfile based on dotTrace:
   - dotTrace enabled → `Dockerfile.diag`, tag suffix `-diag`
   - dotTrace disabled → regular `Dockerfile`, no suffix
3. Compute the tag: `<branch-name>-diag` (if diag) or `<branch-name>` (if regular).
> **Review comment (Medium):** Docker image tags cannot contain `/`, but branch names often do. The skill should sanitize the branch name when computing the tag:
>
> ```
> TAG=$(echo "<branch-name>" | tr '/' '-')
> # e.g., feature/my-branch → feature-my-branch
> ```
>
> Add this normalization step before constructing the tag (step 3), and surface the final tag to the user in the Phase 0e confirmation so they can verify it before the build starts.
4. Capture a timestamp, then trigger the Docker build:

   ```
   BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
   MSYS_NO_PATHCONV=1 gh workflow run publish-docker.yml \
     --ref <branch> \
     -f image-name=nethermind \
     -f tag=<tag> \
     -f dockerfile=<dockerfile> \
     -f build-config=release
   ```
> **Review comment (Low):** the run ID for the Docker build is not captured explicitly after triggering.
5. Wait ~10 s, then find the run ID using the timestamp to avoid race conditions:

   ```
   gh run list --workflow=publish-docker.yml --limit 5 --json databaseId,createdAt \
     --jq '[.[] | select(.createdAt > "<BEFORE>")] | first | .databaseId'
   ```

6. Poll until complete: `gh run view <run-id> --json status,conclusion`
7. If the build fails, fetch the logs and report the error. Stop.
8. Final image: `nethermindeth/nethermind:<tag>`
## Phase 2 — Trigger repricing workflow

Capture a timestamp before triggering: `BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)`
Build the workflow trigger using only the inputs the workflow accepts (from Step 0c):

```
MSYS_NO_PATHCONV=1 gh workflow run repricing-nethermind.yml \
  --repo NethermindEth/gas-benchmarks \
  --ref <gas-benchmarks-ref> \
  -f test="repricings_stateful/<network>" \
  -f fork="<fork>" \
  -f release_tag="<release>" \
  -f genesis_file="<genesis-file>" \
  -f filter="<filter>" \
  -f 'runner=["stateful-generator"]' \
  -f 'images={"nethermind":"<image>"}'
```
Only add the diagnostics flags when `--dottrace` is set AND the image is a diag build:

```
  -f diagnostics_mode="dottrace" \
  -f diagnostics_xml="true"
```
**Critical:** Do NOT pass `diagnostics_mode=dottrace` if the image was not built with `Dockerfile.diag` — the container will crash with `exec: dottrace: not found`.
Report the run URL to the user immediately after triggering.
## Phase 3 — Wait for completion
1. Find the run ID using the timestamp captured before triggering (same approach as Phase 1 step 5):

   ```
   gh run list --repo NethermindEth/gas-benchmarks --workflow=repricing-nethermind.yml \
     --limit 5 --json databaseId,createdAt \
     --jq '[.[] | select(.createdAt > "<BEFORE>")] | first | .databaseId'
   ```

2. Poll: `gh run view <run-id> --repo NethermindEth/gas-benchmarks --json status,conclusion` every 30 seconds.
> **Review comment (High):** race condition — the wrong run may be analyzed. A safer pattern is to capture the creation timestamp just before triggering, then filter:
>
> ```
> BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
> gh workflow run repricing-nethermind.yml ...
> # wait a few seconds for GitHub to register the run
> sleep 10
> gh run list --repo NethermindEth/gas-benchmarks \
>   --workflow=repricing-nethermind.yml \
>   --created ">$BEFORE" \
>   --limit 1 \
>   --json databaseId --jq '.[0].databaseId'
> ```
>
> Or at minimum add …
3. **Timeout after 2 hours** (240 polls). If exceeded, report "timed out" and provide the run URL for manual inspection. Stop.
4. Report to the user when the run completes with success or failure.
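Steps 2–3 amount to a bounded polling loop. A generic sketch (the `poll_until` helper is illustrative; in the skill, the polled command would wrap `gh run view <run-id> --json status,conclusion` with a completion check):

```shell
# Poll a command until it succeeds, or give up after max*interval seconds.
# Phase 3 would use: poll_until 240 30 <completion-check>   (2 h at 30 s intervals)
poll_until() {
  max="$1"; interval="$2"; shift 2
  i=0
  while [ "$i" -lt "$max" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out after $((max * interval)) seconds" >&2
  return 1
}

# Demo with a trivially true command:
poll_until 3 0 true && echo "completed"
# prints: completed
```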
## Phase 4 — Analyze results
> **Review comment (Medium):** no upper bound on the polling loop. Phase 3 polls every 30 seconds with no documented maximum wait time. A workflow that gets stuck in a GitHub Actions queue or hangs mid-run will cause the skill to loop indefinitely, consuming the entire conversation context window. Add an explicit timeout (e.g., 2 hours = 240 iterations at 30 s each) and break out with a clear message if exceeded.
**THIS PHASE IS MANDATORY. Always run it in full, even if the workflow reported success. Never skip or abbreviate it. A "success" workflow conclusion does NOT mean the blocks processed correctly — Nethermind exceptions can occur mid-run without failing the workflow.**
### 4a. Exception scan (NEVER SKIP)

Fetch the job logs: `gh run view --job=<job-id> --repo NethermindEth/gas-benchmarks --log`

Strip ANSI escape codes: `sed 's/\x1b\[[0-9;]*m//g'`

Scan for ALL of these patterns. Report every match with the full log line:

```
grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" | grep -v "node-exporter\|pip install\|apt-get\|npm warn\|orphan process\|docker-compose\|nuget\.org"
```

Note: do NOT exclude `dotnet` — real Nethermind exceptions contain .NET runtime frames.
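Putting 4a together: a self-contained sketch of the strip-and-scan pipeline (the sample log lines are fabricated for illustration; the grep patterns are exactly those above, and GNU sed/grep are assumed):

```shell
# Strip ANSI codes, then scan for exception patterns while dropping known noise.
scan_exceptions() {
  sed 's/\x1b\[[0-9;]*m//g' \
    | grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" \
    | grep -v "node-exporter\|pip install\|apt-get\|npm warn\|orphan process\|docker-compose\|nuget\.org" \
    || true
}

# Fabricated sample log for illustration:
printf '%s\n' \
  "12:00:01 | Block 100 committed" \
  "12:00:02 | Nethermind.Evm.InvalidBlockException: HeaderGasUsedMismatch" \
  "12:00:03 | pip install step raised Exception, ignored" \
  | scan_exceptions
# prints only the InvalidBlockException line
```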
**Any match means the run has issues.** Classify:

- `HeaderGasUsedMismatch` → gas schedule mismatch between image and test data (wrong branch/fork)
- `InvalidBlockLevelAccessListHash` → BAL pre-state corruption (code bug)
- `InvalidBlockLevelAccessListException` → address/slot not in BAL (missing BAL entries)
- `Rejected invalid block ... reason: block is a part of an invalid chain` → cascade from an earlier failure
- Any other `Exception` → report verbatim

**Always report the exception summary in the final report, even when there are zero exceptions.** Write "Exceptions: none" explicitly.
**Confirm shutdown:** grep for `Nethermind is shut down` — if absent, the node crashed or was killed.
### 4b. Timing analysis

Extract all `Processed` lines. For each test block (blocks with `Gas gwei` in the log line):
> **Review comment (Low):** vague log extraction instructions for timing and block stats. Phases 4b and 4c say "extract all `Processed` lines" without giving concrete patterns. Add concrete patterns, for example:
>
> ```
> # 4b: timing lines
> grep "Processed" logs.txt | grep "Gas gwei"
> # 4c: block stats
> grep -E "\bsload\b|\bsstore\b|\bcreate\b" logs.txt
> ```
>
> Even approximate patterns are better than none — the agent can adapt if the log format shifts, but it needs a starting point.
- Report the block number, processing time (ms), and slot time (ms)
- Identify which test scenario each block belongs to (match against preceding `[TESTING]` log lines)

Sort by processing time descending. Report the top 10 heaviest blocks.
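A sketch of the extraction and sort. The `Processed ... ms ... Gas gwei` line format below is a fabricated approximation; adapt the `sed` capture groups to the actual Nethermind log format:

```shell
# Keep "Processed" lines that carry "Gas gwei", pull out "<ms> <block>",
# and sort by processing time, descending, keeping the top N.
top_heavy_blocks() {
  n="$1"
  grep "Processed" \
    | grep "Gas gwei" \
    | sed -E 's/.*Processed ([0-9]+).* ([0-9.]+) ms.*/\2 \1/' \
    | sort -rn | head -n "$n"
}

# Fabricated sample lines for illustration:
printf '%s\n' \
  "Processed 100 | 12.5 ms | Gas gwei 3" \
  "Processed 101 | 250.0 ms | Gas gwei 4" \
  "Processed 102 | 90.1 ms | Gas gwei 2" \
  | top_heavy_blocks 2
# prints: 250.0 101
#         90.1 102
```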
### 4c. Block stats

Extract `Block` stats lines (with `sload`, `sstore`, `create` counts).
Report sload/sstore/create counts for the heaviest test blocks.
### 4d. Opcode tracing comparison
> **Review comment (Low):** no download command for opcode tracing JSON. Phase 4d says to download `opcodes_tracing-stateful-<network>.json` but gives no command, for example:
>
> ```
> gh release download <tag> --repo NethermindEth/gas-benchmarks \
>   --pattern "opcodes_tracing-stateful-<network>.json" \
>   --output opcodes_tracing.json
> ```
>
> Also clarify when this step is triggered: is it always run, or only when explicitly comparing two runs?
When comparing runs, download `opcodes_tracing-stateful-<network>.json` from the relevant release(s) and compare opcode counts for the specific test to confirm the workload is identical.
### 4e. dotTrace analysis

If dotTrace XML artifacts exist:

1. Download: `gh run download <run-id> --repo NethermindEth/gas-benchmarks -n "repricing-nethermind-dottrace-xml-<run-id>"`
2. Analyze: `bash scripts/dottrace-report.sh top <report.xml> 20`
3. Compare: `bash scripts/dottrace-report.sh compare <baseline.xml> <new.xml> 20`
4. **Never load full XML into context** — the files are 50–70 MB.
## Phase 5 — Report
```
| Metric | Value |
|--------|-------|
| Branch | ... |
| Image | ... |
| Gas-benchmarks ref | ... |
| Release | ... |
| Run URL | ... |
| Status | success/failure |
| Test block | #N |
| Processing time | X ms |
| sstore count | N |
| sload count | N |
| Exceptions | none / list |
```
If comparing against a baseline, include both timings and the speedup ratio.
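The speedup ratio can be computed with `awk`. A sketch (the `speedup` helper is illustrative; it takes the baseline and new processing times in ms):

```shell
# Ratio of baseline time to new time, e.g. 2.00x means twice as fast.
speedup() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2fx\n", base / new }'
}

speedup 200 100   # baseline 200 ms, new 100 ms
# prints: 2.00x
```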
## Common filter formats

To discover the exact filter format, check a previous run's logs:

```
gh run view <run-id> --repo NethermindEth/gas-benchmarks --log 2>/dev/null | grep "EFFECTIVE_FILTER"
```

Or check the test fixtures in the release archive. Examples:

- `test_sstore_bloated[10GB-fork_Amsterdam-benchmark_test-cache_strategy_CacheStrategy.NO_CACHE-existing_slots_True-write_new_value_False-benchmark_300M`
- `bloated` (matches all bloated tests)
- (empty = run all tests)
**New file** (+1 line, symlink): `../../.agents/skills/gas-benchmark`
> **Review comment (Medium):** `head -60` may truncate workflow input discovery. The `head -60` cap is dangerous: GitHub Actions workflow files typically define their `on.workflow_dispatch.inputs` block after frontmatter, comments, and the `on:` key — often past line 60 for moderately complex workflows. If the inputs block is truncated, the skill discovers zero inputs and Phase 2 silently omits required flags (like `release_tag`, `genesis_file`, or `diagnostics_mode`), causing the triggered run to fail with confusing errors.
>
> Fix: remove the `head` limit entirely, or use a larger cap like `head -120`. The file is a small YAML blob fetched via API — there's no cost to reading it fully.