perf(gnovm): parallelize test suites and remove byte-access allocation hotspots #5800

Closed
thehowl wants to merge 10 commits into gnolang:master from thehowl:dev/morgan/gnovm-ci-speedup
@thehowl thehowl commented Jun 10, 2026

Member

Summary

ci / gnovm takes ~14 minutes wall time, gating every PR that touches gnovm/**, tm2/** or examples/**. This PR attacks it from two sides: it restructures the gnovm test suites for parallelism (no test content changes) with a new gno test -jobs flag, and — since the main / test job turned out to be bound by total CPU work, not scheduling — it profiles the actual workload and eliminates the dominant source of redundant work in the VM itself: heap-boxed byte-element access (a benefit on-chain as well, not just in CI).

CI results from this PR's own runs (2 to 3 samples per job): stdlibs / test 8m33s → 5m21s / 5m30s / 5m35s, examples gno-checks / test 3m44s → 3m11s / 3m09s, and main / test unchanged in median: pkg/gnolang 685.7s / 710.6s / 899.6s against a master baseline stable at 707.9–721.5s (4 runs). The main / test outcome is itself the key research finding: on a 4-vCPU runner that job is bound by total CPU work (~2700 instrumented-interpreter CPU-seconds), not by scheduling, because go test's package-level concurrency was already packing the cores. Saving minutes there requires doing less work; the measured levers are in the follow-ups. On many-core dev machines the restructure does cut wall time (pkg/gnolang 283.7s → 245.0s on 16 cores), since there the bottleneck was serialization, not CPU.

Where the time goes (research)

Job breakdown of a representative master run (27282103940):

| job | wall time |
| --- | --- |
| main / test | 13m43s |
| stdlibs / test | 8m33s |
| main / lint | 1m49s |
| everything else | ≤1m10s |

main / test (go test -covermode=set -coverpkg=gnovm/... ./...): pkg/gnolang alone is 707.9s; second place is cmd/gno at 88s. Within pkg/gnolang virtually all time is interpreted Gno execution in two test functions:

  • TestStdlibs — the heavy suites (bytes 172s, strconv 105s, regexp 61s, bufio 54s, sort 49s; local long-mode numbers) were already parallel with their own stores, but every other stdlib ran sequentially on a shared store in the parent test body — and Go only starts parallel subtests after the parent body returns. The sequential walk (including math/overflow at 63s) both ran single-threaded and delayed the start of the parallel heavy hitters.
  • TestFiles — 2344 filetests, of which 2339 ran sequentially on one goroutine; only the 5 *_long.gno files were parallel.

stdlibs / test: gno test ./... over gnovm/stdlibs is fully sequential (single-threaded VM, one package at a time): bytes 174s, strconv 86s, math/overflow 60s, regexp 35s, bufio 31s, … ≈455s of test time on one core of a 4-vCPU runner. Per gnovm/Makefile, this is not redundant with TestStdlibs (it also runs xxx_test integration packages), so it needs to get faster, not deleted.

Changes

  • TestFiles: non-long filetests now run as parallel subtests, drawing a TestOptions (store) from a GOMAXPROCS-sized pool, so stores — and the stdlib packages already loaded into them — are reused across tests, just split N ways. -update-golden-tests keeps the previous fully-sequential single-store behavior (deterministic walk-order writes).
  • TestStdlibs: every stdlib package runs as a parallel subtest with its own store, removing the sequential shared-store walk and the hardcoded special-case package lists. The small tests/stdlibs walk stays sequential on one store.
  • gotypecheck.go: gnoBuiltinsCache was lazily populated without synchronization — a latent data race that becomes load-bearing now that type-checks run concurrently from the start. The cache is now built eagerly at package init and is read-only afterwards (no lock needed).
  • gno test -jobs N (default 1 = behavior unchanged): tests up to N packages in parallel, each worker owning a store/typecheck-cache reused across the packages it runs. Per-package output is buffered and printed in package order as results complete (out/err kept on their original streams), so logs stay readable and deterministic. -failfast stops scheduling new packages after the first failure; in-flight packages finish. Incompatible with the interactive -debug mode.
  • _ci-gno.yml: pass -jobs 4 (runners have 4 vCPUs) to the gno test step — used by the gnovm stdlibs job and the examples job.

Measurements

All runs in long mode (no -short); pass/fail sets verified identical before/after at every step (2966 pass / 11 fail — the 11 are pre-existing golden typecheck mismatches against go1.26 locally; CI uses the go.mod toolchain and is green). 4-core runs are taskset-pinned to mirror the ubuntu-latest runners.

| benchmark | before | after |
| --- | --- | --- |
| CI: stdlibs / test job | 8m33s | 5m21s, 5m30s, 5m35s (3 runs) |
| CI: examples gno-checks / test job | 3m44s | 3m11s, 3m09s (2 runs) |
| CI: main / test pkg/gnolang package time | 707.9–721.5s (4 master runs) | 685.7s, 710.6s, 899.6s (3 runs, see below) |
| local: go test ./pkg/gnolang/, 16 cores, no coverage | 283.7s | 245.0s |
| local: go test -covermode=set -coverpkg=gnovm/... ./pkg/gnolang/, taskset 4 cores | 600.0s | 546.2s |
| local: gno test gnovm/stdlibs/..., taskset 4 cores | 455s (sequential) | 270s (-jobs 4) |
| local: gno test examples/..., taskset 4 cores | 129s (sequential) | 129s (-jobs 4, all 220 testable packages pass) |

Why main / test doesn't move on CI even though the same binary improves elsewhere: a 4-vCPU runner has to execute ~2700 CPU-seconds of coverage-instrumented interpreted Gno regardless of how it's scheduled (floor ≈ 11 min), and go test -p was already overlapping pkg/gnolang's underused phases with cmd/gno/pkg/parser. The restructure removes the serialization (which is what binds on many-core machines and is why stdlibs / test — single-threaded before — drops by 3m12s), but on 4 cores main / test sits at the CPU floor. Cutting it further requires reducing the work itself — see follow-ups.

A second measured caveat: GnoVM tests are allocation-heavy, so N concurrent VMs in one process slow each other down well beyond simple core-sharing (GC assist / memory pressure). With -jobs 4 on 4 cores, per-package wall time inflates 2–3x (bytes 131s solo → 270s; examples total CPU 129s → 400s). That's why stdlibs nets a large win (its sequential baseline wasted 3 of 4 cores) while the locally-fast examples suite nets none, and why -jobs deliberately defaults to 1. It is also the plausible mechanism behind main / test's one slow sample (899.6s, vs 685.7s/710.6s and a master baseline stable within ~2%): with 4 VM-heavy subtests running concurrently the job is more sensitive to the runner's memory subsystem. Review pushes will accumulate more samples; if the tail shows up again, capping go test -parallel for the gnovm job (keeping the structural fix while limiting concurrent VMs) is a one-line mitigation.

VM profiling: where the CPU actually goes

Since spreading the work can't cut main / test on a 4-vCPU runner, the next step was pprof on the dominant workloads. CPU profile of TestStdlibs/bytes: ~50% of all CPU is Go GC + malloc machinery; the opcode loop itself is ~30%. Allocation profile: 67% of all heap allocations (694M objects in that one suite) came from ArrayValue.GetPointerAtIndexInt2 materializing a *TypedValue + boxed DataByteValue per byte accessed:

  • the copy() builtin allocated two boxes per byte copied (554M objects — the code even had a TODO: consider an optimization if dstv.Data != nil);
  • for i, c := range byteslice allocated three objects per iteration (the index TypedValue escaping through GetPointerAtIndex(&iv), plus the view box) that Deref immediately threw away;
  • b[i] reads in doOpIndex1 did the same box-then-Deref dance.

The fix (last commit) adds TypedValue.GetValueAtIntIndex — a read-only fast path mirroring GetPointerAtIndex's checks and panics for strings and Data-backed arrays/slices — used by doOpIndex1 and the range loop, and gives copy() direct byte copies when both sides are Data-backed (or the source is a string). This allocation pressure is also what made concurrent VMs degrade each other (the GC-contention caveat above), so it compounds with the parallelism work.
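To make the box-then-Deref pattern concrete, here is a toy sketch with hypothetical types (`byteArray`, `byteRef`, `pointerAt`, `valueAt` are invented for this example and are not the GnoVM's actual API). It contrasts a read that first materializes a pointer-style view with a read-only path that performs the same bounds check and returns the byte directly. In this toy version the Go compiler may well optimize the box away; in the VM the view escapes to the heap, which is where the allocations came from.

```go
package main

import "fmt"

// byteArray is a hypothetical stand-in for a Data-backed array value.
type byteArray struct{ data []byte }

// byteRef is a boxed view of one element, the kind of object the slow
// path builds for every byte it touches.
type byteRef struct {
	base  *byteArray
	index int
}

// check panics on an out-of-range index; both paths share it so they
// fail identically.
func (a *byteArray) check(i int) {
	if i < 0 || i >= len(a.data) {
		panic(fmt.Sprintf("index out of range [%d] with length %d", i, len(a.data)))
	}
}

// pointerAt mimics the slow path: build a view, to be dereferenced later.
func (a *byteArray) pointerAt(i int) *byteRef {
	a.check(i)
	return &byteRef{base: a, index: i}
}

// deref reads the byte the view points at.
func (r *byteRef) deref() byte { return r.base.data[r.index] }

// valueAt mimics a read-only fast path: same check, no intermediate box.
func (a *byteArray) valueAt(i int) byte {
	a.check(i)
	return a.data[i]
}

func main() {
	arr := &byteArray{data: []byte("gno")}
	for i := range arr.data {
		fmt.Println(arr.pointerAt(i).deref() == arr.valueAt(i)) // true
	}
}
```

The point of keeping the checks identical is that callers cannot observe which path ran, other than through allocation counts.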

Gas is unchanged: the view boxes were raw Go allocations, never charged to the VM allocator, and CPU gas for copy() was already charged before the per-element loop. Verified empirically: all 2344 filetest goldens (including Gas: and MAXALLOC-sensitive alloc tests) pass byte-identical, plus the gno.land/pkg/sdk/vm gas tests and the txtar integration suite.

| measurement | before | after |
| --- | --- | --- |
| TestStdlibs/bytes solo | 151.5s | 105.2s (−31%) |
| full pkg/gnolang long mode, 16 cores | 245.0s | 184.6s (−25%) |
| heap objects allocated (bytes suite) | 1.03G | 0.30G (−71%) |
| BenchmarkOpIndex1_ByteArray | 185.7ns | 130.0ns (−30%) |
| 4-core + coverage CI simulation | 546.2s | 509.7s |
| CI: main / test pkg/gnolang | 707.9–721.5s (master, 4 runs) | 646.2s (1 run so far) |

(untouched paths — List arrays, maps, range-over-int-slices — benchmark flat: OpRangeIter_1000 30.9µs → 30.2µs, OpIndex1_MapHit_100 187.0ns → 187.7ns)

Block recycling: removing the next 45%

After the byte-access fixes, NewBlock was 45% of remaining allocations (135M objects in the bytes suite: 75M scope blocks from doOpExec — if/for/range/switch — and 60M call blocks from doOpCall). The key enabler is an invariant the heap-items design already established: closures capture HeapItemValues, not blocks (doOpFuncLit sets Parent: nil), pointers to locals are heap-promoted by the preprocessor, and stacktraces/frames store locations and indices — so a runtime block provably dies when discarded from the machine's block stack.

The last commit adds a small per-machine pool (acquireBlock/releaseBlock) on top of that invariant, routed through every block discard site. Three block populations that also travel the stack are excluded at release: node-owned static blocks (pushed by Eval/RunStatement flows — identified by static-block identity), file/package blocks (referenced by FuncValue.Parent), and defer-site blocks (Defer.Parent is visited by the VM's GC until the defer runs; doOpDefer marks them via a flag tucked into bodyStmt's trailing padding, keeping unsafe.Sizeof(Block{}) — and the _allocBlock gas constant — unchanged). Panic unwinding skips pooling entirely as cheap conservatism.
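The shape of such a pool can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's code: the `block` and `machine` types, the `pinned` flag (standing in for the static, file/package and defer-site exclusions) and the `maxPooled` cap are all hypothetical.

```go
package main

import "fmt"

// block is a hypothetical, much simplified runtime block.
type block struct {
	values []int
	parent *block
	pinned bool // assumption: set when something outside the stack still references the block
}

// machine owns the pool, so no locking is needed (one goroutine per machine).
type machine struct{ pool []*block }

const maxPooled = 64 // assumption: an arbitrary cap to bound retained memory

// acquireBlock reuses a pooled block when one is available.
func (m *machine) acquireBlock(nvalues int) *block {
	if n := len(m.pool); n > 0 {
		b := m.pool[n-1]
		m.pool[n-1] = nil
		m.pool = m.pool[:n-1]
		if cap(b.values) >= nvalues {
			b.values = b.values[:nvalues] // already zeroed on release
		} else {
			b.values = make([]int, nvalues)
		}
		return b
	}
	return &block{values: make([]int, nvalues)}
}

// releaseBlock zeroes the block so it retains no references, then pools
// it, unless the block may still be reachable from elsewhere.
func (m *machine) releaseBlock(b *block) {
	if b.pinned || len(m.pool) >= maxPooled {
		return
	}
	clear(b.values)
	*b = block{values: b.values[:0]}
	m.pool = append(m.pool, b)
}

func main() {
	m := &machine{}
	b1 := m.acquireBlock(2)
	b1.values[0] = 7
	m.releaseBlock(b1)
	b2 := m.acquireBlock(1)
	fmt.Println(b1 == b2, b2.values[0]) // true 0
}
```

Zeroing on release is what makes reuse safe here: a recycled block is indistinguishable from a freshly allocated one.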

Gas and VM-GC accounting are unchanged: acquireBlock charges AllocateBlock exactly like Allocator.NewBlock, and pooled (zeroed) blocks are unreachable from GC roots, just like dead blocks today. Same verification battery as before: all 2344 filetest goldens byte-identical (incl. Gas:/Realm:/Storage:/MAXALLOC tests), vm Gas tests, txtar, all 220 example packages, cmd/gno suite.

| measurement | before pooling | after |
| --- | --- | --- |
| heap objects, bytes suite | 300M (1.03G on master) | 165M (−84% vs master) |
| TestStdlibs/bytes solo | 105.2s (151.5s master) | 94.9s |
| full pkg/gnolang long mode, 16 cores | 184.6s (283.7s master) | 154.0s (−46% vs master) |
| 4-core + coverage CI simulation | 509.7s (600.0s master) | 424.9s, 436.9s (−28% vs master) |
| CI: stdlibs / test job | 5m21–5m35s (8m33s master) | 4m02s, 4m19s (−53% vs master) |
| CI: main / test pkg/gnolang | 646.2s (707.9–721.5s master) | 631.1s, 936.3s (2 runs) |

The main / test CI numbers stay tail-prone (the 936s sample sits next to a same-run stdlibs / test at its fastest ever and a normal cmd/gno, pointing at a slow/noisy runner amplified by 4-way VM contention — same pattern as the earlier 899.6s sample); the controlled 4-core simulations and the coverage-free stdlibs / test job show the real effect. With ~425s of instrumented work remaining in the simulation, the print-only coverage tax (≈1.24x) is now the largest single lever left on main / test.

Round 3: coverage off, call-arg pooling, linear interface-array copies

Three more changes, attacking the remaining main / test time:

  • Coverage instrumentation is now skipped for gnovm (new coverage input on _ci-go.yml, default true so gno.land/tm2/misc/contribs are untouched). Since chore(ci): remove Codecov from CI workflows #5795 the percentages were print-only; on gnovm they cost a measured ~1.24x on the tests plus an instrumented rebuild of everything.
  • popCopyArgs reuses a per-machine scratch buffer on the call path (30M arg-slice allocations in the bytes suite — the top site after block pooling). doOpDefer's args escape into the Defer and keep allocating.
  • Copying arrays with interface-kind element types now shares the held values instead of deep-copying them (ArrayValue.Copy), making x = [1]any{x} chains linear like Go (which copies the sealed boxes' headers). Soundness: interface-slot contents are sealed — every write into an interface slot copies on entry, interface-held values are not addressable, and type-assertion extraction copies again on assignment. This one is intentionally gas-visible for the affected patterns: exactly one golden changes across the 2344 filetests — gas/nested_alloc.gno (built to measure this exact pattern) drops from 8,559,690,088 to 17,013,825 gas; recurse1.gno runs in 0.01s instead of ~23s with its output golden unchanged, and every realm/alloc/interrealm golden stays byte-identical.
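The scratch-buffer idea from the second bullet can be sketched in a few lines. The names below (`machine`, `popArgs`, `call`, `scratch`) are hypothetical, and plain `int` values stand in for VM values; the only point is that the popped args live in a machine-owned buffer that is reused because the caller copies them out immediately.

```go
package main

import "fmt"

// machine is a hypothetical, simplified VM with a value stack.
type machine struct {
	stack   []int
	scratch []int // reused across calls instead of allocating per call
}

// popArgs pops n values into the machine-owned scratch buffer. The
// returned slice is only valid until the next popArgs call.
func (m *machine) popArgs(n int) []int {
	top := len(m.stack) - n
	m.scratch = append(m.scratch[:0], m.stack[top:]...)
	m.stack = m.stack[:top]
	return m.scratch
}

// call copies the args into the callee's own storage before running
// anything else, so handing out the scratch buffer is safe.
func (m *machine) call(n int) []int {
	args := m.popArgs(n)
	frame := make([]int, len(args)) // stands in for the call block's values
	copy(frame, args)
	clear(m.scratch) // drop stale contents after use
	return frame
}

func main() {
	m := &machine{stack: []int{1, 2, 3, 4}}
	f1 := m.call(2)
	f2 := m.call(2)
	fmt.Println(f1, f2) // [3 4] [1 2]
}
```

A path whose args outlive the call, as with deferred calls, cannot use this trick and has to keep allocating.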

Verification battery repeated in full: filetest suite (canonical pass/fail set), vm Gas tests, txtar, all 220 example packages, cmd/gno.

| measurement | round 2 | round 3 |
| --- | --- | --- |
| full pkg/gnolang long mode, 16 cores | 154.0s | 134.7s (283.7s master, −53%) |
| 4-core CI simulation, CI config (with coverage then, without now) | 424.9s | 273.4s (600.0s master config, −54%) |

Round-3 CI results

Two post-push runs of ci / gnovm:

run 1 (failed on an env-var bug in the new no-coverage branch, since fixed) run 2 (green)
main / test pkg/gnolang 264.5s 389.7s
stdlibs / test job 4m20s
whole ci / gnovm workflow 8m39s

pkg/gnolang now sits at 264–390s on CI (708–722s on master, median ≈ −55%), with the familiar runner-quality spread; the local 4-core simulation (273.4s) matches the good-runner samples. cmd/gno also dropped 88–97s → 72s without instrumentation. Whole-workflow wall time lands at ~6–8.5 minutes depending on runner luck, vs a stable ~14 minutes on master.

Round 4: bigint sharing, per-op escape fixes, type re-boxing

Fresh profiles after round 3 pointed at the next allocation tier; four more gas-neutral changes (all goldens byte-identical, full battery repeated):

  • BigintValue/BigdecValue.Copy share the underlying value — they are immutable at runtime (arithmetic writes fresh receivers); 24M allocations, mostly untyped-const operands copied at declaration sites.
  • doOpValueDecl escape fix — its working TypedValue heap-escaped once per declaration via ConvertUntypedTo(&tv); at runtime only untyped bools reach that path, now retyped directly (16.6M).
  • doOpConvert escape fix — same pattern via ConvertTo/IsReadonly; now uses a machine-owned scratch slot whose address is free (11.7M).
  • constTypeExpr evaluation re-boxed the type per eval (every conversion evaluates one); the boxed form is now cached on the node at preprocess time, with a read-only fallback for store-loaded nodes (13.5M).

One negative result worth recording: carrying the block pool through Machine.Release/machinePool measured 10–25% slower on parallel workloads (extra live heap across GC cycles, lost cache locality) with no benefit for machine-churn workloads — Machine.Release now documents why the pools are deliberately dropped.

Same-session A/B (laptop thermals make cross-session walls incomparable; this is back-to-back): full pkg/gnolang long mode 173.9s → 157.1s (−10%), bytes suite solo 94.9s → 83.5s.

Round-4 CI result

| | master | this run |
| --- | --- | --- |
| pkg/gnolang package time | 708–722s | 256.1s (−64%) |
| cmd/gno package time | 88–97s | 53.8s |
| main / test job | 13m43s | 6m00s |
| stdlibs / test job | 8m33s | 3m54s |
| whole ci / gnovm workflow | ~14m | 6m18s |

(single good-runner sample; the runner-quality band observed across this PR suggests ~6–8m for the workflow)

Follow-ups (measured, but out of scope here)

  • The next allocation tier after this round: GetPointerToFromTV (26M objects), doOpValueDecl (16.5M) and math/big.NewInt (16M) in the bytes suite.
  • StructValue.Copy could get the same interface-field sharing as arrays (per-field static types; it was not the profiled quadratic case, so left out).
  • transcribe (preprocessing) is 29% of the TestFiles short-walk CPU — per-filetest typecheck+preprocess of imports; a Store-level preprocessed-stdlib cache would help every consumer (filetests, gno test, keeper cold paths).
  • Byte writes (b[i] = x) still box (17M objects via PopAsPointer2) — the pointer protocol spans multiple ops, so a write-side fast path needs more design.

TestFiles ran its 2339 non-long filetests sequentially on a single
goroutine; they now run as parallel subtests, drawing a TestOptions
(with its store) from a GOMAXPROCS-sized pool so loaded packages are
still reused across tests. -update-golden-tests keeps the fully
sequential single-store behavior.

TestStdlibs ran most stdlib suites sequentially on a shared store in
the parent test body, which also delayed the parallel heavy suites
(bytes, strconv, ...) until the walk finished, since parallel subtests
only start once the parent body returns. Every stdlib package now runs
as a parallel subtest with its own store.

gnoBuiltinsCache was lazily populated from TypeCheckMemPackage without
synchronization; with type-checks now running concurrently from the
start, that latent race becomes load-bearing. The cache is now built
eagerly at package init and read-only afterwards.
@Gno2D2

Gno2D2 commented Jun 10, 2026

Collaborator

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):
  • IGNORE the bot requirements for this PR (force green CI check)

🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers.

✅ Automated Checks (for Contributors):

🟢 Maintainers must be able to edit this pull request (more info)

☑️ Contributor Actions:
  1. Fix any issues flagged by automated checks.
  2. Follow the Contributor Checklist to ensure your PR is ready for review.
    • Add new tests, or document why they are unnecessary.
    • Provide clear examples/screenshots, if necessary.
    • Update documentation, if required.
    • Ensure no breaking changes, or include BREAKING CHANGE notes.
    • Link related issues/PRs, where applicable.
☑️ Reviewer Actions:
  1. Complete manual checks for the PR, including the guidelines and additional checks if applicable.
📚 Resources:
Debug
Automated Checks
Maintainers must be able to edit this pull request (more info)

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 The pull request was created from a fork (head branch repo: thehowl/gno)

Then

🟢 Requirement satisfied
└── 🟢 Maintainer can modify this pull request

Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)

If

🟢 Condition met
└── 🟢 On every pull request

Can be checked by

  • Any user with comment edit permission

gno test runs packages sequentially on a single-threaded VM; on the
gnovm stdlibs CI job this is ~455s of test time on one core of a
4-vCPU runner (bytes 174s, strconv 86s, math/overflow 60s, ...).

-jobs N (default 1: behavior unchanged) tests up to N packages in
parallel. Each worker owns a TestOptions/store, reused across the
packages it runs; per-package output is buffered and printed in
package order as results complete, so runs remain readable and
deterministic. Incompatible with the interactive -debug mode.

The CI gno-test step (gnovm stdlibs and examples jobs) now passes
-jobs 4 to match the runner's 4 vCPUs.
@thehowl thehowl force-pushed the dev/morgan/gnovm-ci-speedup branch from b4235ac to 4cdc7de Compare June 10, 2026 17:51
@thehowl thehowl marked this pull request as draft June 10, 2026 19:06
Profiling the gnovm test suites (dominated by interpreted Gno) showed
~50% of CPU in Go GC/malloc, with 67% of all heap allocations (694M
objects in the bytes stdlib suite alone) coming from
ArrayValue.GetPointerAtIndexInt2 materializing a *TypedValue +
DataByteValue box per byte accessed:

- the copy() builtin allocated two boxes per byte copied;
- range over a byte slice allocated three objects per iteration (the
  index TypedValue escaping through GetPointerAtIndex, plus the view
  box) that Deref immediately discarded;
- b[i] reads in doOpIndex1 did the same box-then-Deref dance.

Add TypedValue.GetValueAtIntIndex, a read-only fast path mirroring
GetPointerAtIndex's checks and panics for strings and Data-backed
arrays/slices, and use it in doOpIndex1 and the range loop. Give the
copy() builtin direct byte copies when both sides are Data-backed (or
the source is a string); bounds, readonly checks, DidUpdate and CPU gas
are unchanged (charged before the loop, as before), and Go's copy is
overlap-safe so the backward-copy setup only remains for the List
fallback.
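The overlap remark can be checked with plain Go slices. The sketch below compares the built-in `copy`, which behaves correctly on overlapping slices, with a naive low-to-high element loop (`copyForward`, a name invented here) that corrupts the data when the destination starts after the source in the same backing array. That corruption is the case a backward copy exists to avoid.

```go
package main

import "fmt"

// copyBytes uses the built-in copy, which handles overlapping slices.
func copyBytes(dst, src []byte) int { return copy(dst, src) }

// copyForward copies element by element from low to high index. When dst
// starts after src in the same backing array, it reads bytes it has
// already overwritten.
func copyForward(dst, src []byte) int {
	n := min(len(dst), len(src))
	for i := 0; i < n; i++ {
		dst[i] = src[i]
	}
	return n
}

func main() {
	a := []byte("abcdef")
	copyBytes(a[2:], a[:4]) // shift the first four bytes right by two
	fmt.Println(string(a))  // ababcd

	b := []byte("abcdef")
	copyForward(b[2:], b[:4])
	fmt.Println(string(b)) // ababab
}
```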

Gas is unchanged: the view boxes were raw Go allocations, never charged
to the VM allocator. All 2344 filetest goldens (including Gas: and
MAXALLOC-sensitive alloc tests) pass unmodified, as do the gno.land vm
Gas tests and the txtar integration suite.

bytes stdlib suite: 151.5s -> 105.2s; full pkg/gnolang long mode:
245.0s -> 184.6s; allocated objects in the bytes suite: 1.03G -> 0.30G;
BenchmarkOpIndex1_ByteArray: 185.7ns -> 130.0ns.
@thehowl thehowl changed the title perf(gnovm): parallelize test suites to cut CI time perf(gnovm): parallelize test suites and remove byte-access allocation hotspots Jun 10, 2026

@davd-gzl davd-gzl left a comment

Member

Solid design and a real CI win. Two blockers, both reproduced on the current head (13064df):

  • go test -race ./gnovm/pkg/gnolang: 0 races on master, 27 on this branch. Culprits: the global enabled debug bool and an *Allocator shared across pool stores. Repro below, breakdown in the full review.
  • gno test -jobs -1 hangs forever with no output. Inline comment with fix and repro.

Minor, inline: the -jobs parallel path has zero test coverage; in-order output buffering nit. Also -update-golden-tests isn't forced sequential with -jobs >1, unlike TestFiles (poolSize=1 under *withSync); flagging in case unintentional.

race repro
# from a local clone of gnolang/gno:
gh pr checkout 5800 -R gnolang/gno
GNOROOT=$PWD go test -race -short -run 'TestFiles/a' ./gnovm/pkg/gnolang/ 2>&1 | grep -c 'DATA RACE'
27

Same invocation on master:

0

Full review: https://github.com/samouraiworld/gno-agent-workspace/blob/main/reviews/pr/5xxx/5800-parallelize-test-suites/1-4cdc7de8e/claude-opus-4-8_davd-gzl.md

(AI Agent)

(I know it's still in draft, I wanted to test a new workflow! I hope that can help!)

Comment thread gnovm/cmd/gno/test.go
Comment on lines +281 to +285
jobs := cmd.jobs
if jobs == 0 {
	jobs = runtime.GOMAXPROCS(0)
}
jobs = min(jobs, len(pkgs))

Member

This guard only catches jobs == 0: a negative value slips through, spawns zero workers, and the collector waits forever. Reject jobs < 0 here.

repro
# from a local clone of gnolang/gno:
gh pr checkout 5800 -R gnolang/gno
GNOROOT=$PWD timeout 25 go run ./gnovm/cmd/gno test -C examples -jobs -1 \
  ./gno.land/p/moul/once/ ./gno.land/p/moul/fifo/
echo "exit=$?"  # 124 = hung (timed out)
exit=124
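A guard along the lines suggested above might look like the following. This is a hedged sketch, not the fix that was applied: the helper name `resolveJobs` and the error wording are assumptions, and only the `0 means GOMAXPROCS` and `min(jobs, len(pkgs))` behavior is taken from the snippet under review.

```go
package main

import (
	"fmt"
	"runtime"
)

// resolveJobs validates the requested worker count and caps it at the
// number of packages. A negative value is rejected up front instead of
// silently producing zero workers.
func resolveJobs(requested, npkgs int) (int, error) {
	if requested < 0 {
		return 0, fmt.Errorf("invalid -jobs value %d: must be >= 0", requested)
	}
	if requested == 0 {
		requested = runtime.GOMAXPROCS(0)
	}
	return min(requested, npkgs), nil
}

func main() {
	if _, err := resolveJobs(-1, 2); err != nil {
		fmt.Println("error:", err) // error: invalid -jobs value -1: must be >= 0
	}
	n, _ := resolveJobs(4, 2)
	fmt.Println(n) // 2
}
```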

(AI Agent)

Comment thread gnovm/cmd/gno/test.go
Comment on lines +277 to +285
} else {
	// Parallel run: cmd.jobs workers, each with its own store. The
	// output of each package is buffered, and printed in package order
	// as results come in.
	jobs := cmd.jobs
	if jobs == 0 {
		jobs = runtime.GOMAXPROCS(0)
	}
	jobs = min(jobs, len(pkgs))

Member

The parallel path has no test (existing txtars all run the default -jobs 1). Two txtar cases in testdata/test/ would cover it: run a few fixture packages with -jobs 4 and assert the same package-ordered output a sequential run prints; run -jobs -1 and assert an immediate error instead of the current hang.

(AI Agent)

Comment thread gnovm/cmd/gno/test.go
Comment on lines +338 to 349
for i := range results {
	res := &results[i]
	<-res.done
	if res.out.Len() > 0 {
		_, _ = io.Out().Write(res.out.Bytes())
	}
	if res.errOut.Len() > 0 {
		_, _ = io.Err().Write(res.errOut.Bytes())
	}
	buildErrCount += res.buildErrs
	testErrCount += res.testErrs
}

Member

Nit: output drains in index order, so a slow first package buffers everything behind it in memory. Fine tradeoff for determinism.

(AI Agent)

thehowl added 7 commits June 11, 2026 00:48
After the byte-access fixes, NewBlock was the dominant allocation site
(45% of remaining heap objects in the bytes suite: 75M per-scope blocks
from doOpExec — if/for/range/switch/block — and 60M call blocks from
doOpCall, each also allocating a Values slice).

With closures capturing heap items rather than blocks (doOpFuncLit sets
Parent=nil and copies Captures), a runtime block provably dies when it
is discarded from the machine's block stack. acquireBlock/releaseBlock
implement a small per-machine pool on top of that invariant; all block
discard sites (OpPopBlock, GotoJump, PopFrameAndReset/Return,
PeekFrameAndContinueFor/Range) route through it. Blocks are zeroed on
release so they retain no references.

Skipped from pooling, in releaseBlock:
- node-owned static blocks and file/package blocks, which also travel
  the block stack (Eval/RunStatement flows push static blocks; file
  blocks are referenced by FuncValue.Parent) — identified by Source
  type and static-block identity;
- defer-site blocks: Defer.Parent is visited by the garbage collector
  until the defer runs, so doOpDefer marks them via a flag stored in
  bodyStmt's trailing padding, keeping unsafe.Sizeof(Block{}) — and the
  _allocBlock gas constant — unchanged;
- anything while a panic is unwinding, as cheap conservatism.

Gas and VM-GC accounting are unchanged: acquireBlock charges
AllocateBlock exactly like Allocator.NewBlock, and pooled blocks are
unreachable from GC roots just like dead blocks today. Verified: all
2344 filetest goldens (Gas:, Realm:, Storage:, MAXALLOC alloc tests)
byte-identical, vm Gas tests, txtar suite, examples (220 packages),
cmd/gno suite.

bytes suite heap objects: 165M (from 300M; 1.03G before the byte-access
fixes); bytes suite solo: 105.2s -> 94.9s; full pkg/gnolang long mode:
184.6s -> 154.0s; 4-core+coverage CI simulation: 509.7s -> 424.9s
(600.0s on master).
Since gnolang#5795 the per-module coverage percentages are only printed to the
job log; nothing uploads or tracks them. The instrumentation costs a
measured ~1.24x on the gnovm interpreter-heavy tests (TestStdlibs/sort:
22.1s -> 27.5s) plus an instrumented rebuild of every package.

Add a 'coverage' input to the reusable Go CI workflow (default true, so
gno.land/tm2/misc/contribs keep their current behavior) and opt gnovm
out of it.
popCopyArgs allocated an args slice per function call (30M objects in
the bytes stdlib suite, the top allocation site after block pooling).
doOpCall consumes the args immediately — they are copied into the call
block before any further ops — so it now hands popCopyArgs a reusable
per-machine scratch buffer, cleared after each use. doOpDefer's args
escape into the Defer and keep allocating fresh slices.
Copying an array whose element type is an interface deep-copied each
element's held value, making chains like x = [1]any{x} quadratic where
Go is linear (Go copies the sealed boxes' headers). ArrayValue.Copy now
shares the element values for interface-kind element types.

This is sound because interface-slot contents are sealed: every write
into an interface slot copies on entry (TypedValue.Assign), interface-
held values are not addressable (no lvalue chain can root in one, per
Go's rules which gno follows), and extraction via type assertion copies
again on assignment. Sharing therefore can never expose a mutable
alias.

Unlike the previous optimizations this is intentionally gas-visible for
the affected patterns: the copies no longer happen, so allocation gas
drops accordingly. Exactly one golden changes across the 2344 filetests
— gas/nested_alloc.gno (built to measure this exact pattern) drops from
8,559,690,088 to 17,013,825 gas. recurse1.gno runs in 0.01s instead of
~23s with its output golden unchanged; every realm, alloc and
interrealm golden is byte-identical. vm Gas tests, txtar suite,
examples (220 packages) and cmd/gno all pass.
The coverage dir env vars (TXTARCOVERDIR, GOCOVERDIR, COVERDIR) also
steer testscript-based suites: cmd/gno's Test_Scripts fails when
TXTARCOVERDIR points at a directory that was never created. With
coverage off, leave them empty so those suites skip coverage
collection, matching a plain local run.
BigintValue and BigdecValue are immutable at runtime: all arithmetic
writes into fresh receivers and conversions only read. Copying them
allocated a fresh big.Int/apd.Decimal per copy — 24M allocations in the
bytes stdlib suite, mostly from untyped-const operands copied at
declaration sites. Share the underlying value instead. Neither Copy
ever charged the allocator, so this is gas-neutral.
Three more allocation sites found by profiling the bytes stdlib suite,
together ~42M objects (of 135M total):

- doOpValueDecl let its working TypedValue escape to the heap once per
  declaration executed, because its address went into ConvertUntypedTo.
  At runtime only untyped bools (from comparisons) reach that path, so
  retype directly; the preprocess-stage conversion moves to a by-value
  helper. (16.6M)

- doOpConvert's working value escaped the same way via ConvertTo and
  IsReadonly. Use a machine-owned scratch slot: a field's address is
  free, and the op is single-threaded and self-contained. (11.7M)

- Evaluating a constTypeExpr re-boxed the type into a TypeValue
  interface per evaluation (every conversion evaluates one). Cache the
  boxed form on the node at preprocess time; nodes loaded from the
  store fall back to boxing per eval (the cache is not persisted and is
  never lazily filled at runtime, since nodes can be shared across
  machines). (13.5M)

Also documents on Machine.Release why blockPool/callArgsScratch are
deliberately not carried through the machine pool: measured to hurt
parallel workloads via extra live heap and lost cache locality, without
helping machine-churn workloads (sync.Pool eviction discards them).

Full verification battery: filetest suite canonical (all goldens
byte-identical), vm Gas tests, txtar, 220 example packages, cmd/gno.
Same-session A/B: full pkg/gnolang long mode 173.9s -> 157.1s (-10%);
bytes suite solo 94.9s -> 83.5s.
@thehowl

thehowl commented Jun 11, 2026

Member Author

Superseded by the stacked split (merge in order): #5811, #5812, #5813, #5814, #5815, #5816. This branch served as the performance testbed; final measured result of the full stack: ci / gnovm ~14m → 6m18s, pkg/gnolang test time −64%, VM heap allocations −84% on the heaviest suite. All research, profiling data and per-round measurements remain in this PR's description and commits.

Labels

🚀 ci 📦 🤖 gnovm Issues or PRs gnovm related

3 participants