perf(gnovm): parallelize test suites and remove byte-access allocation hotspots by thehowl · Pull Request #5800 · gnolang/gno

thehowl · 2026-06-10T17:37:14Z

Summary

ci / gnovm takes ~14 minutes wall time, gating every PR that touches gnovm/**, tm2/** or examples/**. This PR attacks it from two sides: it restructures the gnovm test suites for parallelism (no test content changes) with a new gno test -jobs flag, and — since the main / test job turned out to be bound by total CPU work, not scheduling — it profiles the actual workload and eliminates the dominant source of redundant work in the VM itself: heap-boxed byte-element access (a benefit on-chain as well, not just in CI).

CI results from this PR's own runs (2–3 samples per job): stdlibs / test 8m33s → 5m21s / 5m30s / 5m35s, examples gno-checks / test 3m44s → 3m11s / 3m09s, and main / test unchanged in median — pkg/gnolang 685.7s / 710.6s / 899.6s against a master baseline stable at 707.9–721.5s (4 runs). The main / test outcome is itself the key research finding: on a 4-vCPU runner that job is bound by total CPU work (~2700 instrumented-interpreter CPU-seconds), not by scheduling — go test's package-level concurrency was already packing the cores. Its minutes can only come from doing less work; the measured levers are in the follow-ups. On many-core dev machines the restructure does cut wall time (pkg/gnolang 283.7s → 245.0s on 16 cores) since there the bottleneck was serialization, not CPU.

Where the time goes (research)

Job breakdown of a representative master run (27282103940):

| job | wall time |
| --- | --- |
| main / test | 13m43s |
| stdlibs / test | 8m33s |
| main / lint | 1m49s |
| everything else | ≤1m10s |

main / test (go test -covermode=set -coverpkg=gnovm/... ./...): pkg/gnolang alone is 707.9s; second place is cmd/gno at 88s. Within pkg/gnolang virtually all time is interpreted Gno execution in two test functions:

- TestStdlibs — the heavy suites (bytes 172s, strconv 105s, regexp 61s, bufio 54s, sort 49s; local long-mode numbers) were already parallel with their own stores, but every other stdlib ran sequentially on a shared store in the parent test body — and Go only starts parallel subtests after the parent body returns. The sequential walk (including math/overflow at 63s) both ran single-threaded and delayed the start of the parallel heavy hitters.
- TestFiles — 2344 filetests, of which 2339 ran sequentially on one goroutine; only the 5 *_long.gno files were parallel.

stdlibs / test: gno test ./... over gnovm/stdlibs is fully sequential (single-threaded VM, one package at a time): bytes 174s, strconv 86s, math/overflow 60s, regexp 35s, bufio 31s, … ≈455s of test time on one core of a 4-vCPU runner. Per gnovm/Makefile, this is not redundant with TestStdlibs (it also runs xxx_test integration packages), so it needs to get faster, not deleted.

Changes

- TestFiles: non-long filetests now run as parallel subtests, drawing a TestOptions (store) from a GOMAXPROCS-sized pool, so stores — and the stdlib packages already loaded into them — are reused across tests, just split N ways. -update-golden-tests keeps the previous fully-sequential single-store behavior (deterministic walk-order writes).
- TestStdlibs: every stdlib package runs as a parallel subtest with its own store, removing the sequential shared-store walk and the hardcoded special-case package lists. The small tests/stdlibs walk stays sequential on one store.
- gotypecheck.go: gnoBuiltinsCache was lazily populated without synchronization — a latent data race that becomes load-bearing now that type-checks run concurrently from the start. The cache is now built eagerly at package init and is read-only afterwards (no lock needed).
- gno test -jobs N (default 1 = behavior unchanged): tests up to N packages in parallel, each worker owning a store/typecheck-cache reused across the packages it runs. Per-package output is buffered and printed in package order as results complete (out/err kept on their original streams), so logs stay readable and deterministic. -failfast stops scheduling new packages after the first failure; in-flight packages finish. Incompatible with the interactive -debug mode.
- _ci-gno.yml: pass -jobs 4 (runners have 4 vCPUs) to the gno test step — used by the gnovm stdlibs job and the examples job.
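
The store pool behind the parallel TestFiles can be pictured with a buffered channel used as a free list. This is a minimal sketch, not the PR's code: `Store`, `StorePool` and `RunAll` are hypothetical names, and the real TestOptions carries far more state. It only illustrates why a pool of N stores still reuses loaded packages across tests.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Store is a hypothetical stand-in for a test store that caches loaded packages.
type Store struct {
	ID     int
	loaded map[string]bool
}

// Load marks pkg as loaded and reports whether it was already cached.
func (s *Store) Load(pkg string) (cached bool) {
	cached = s.loaded[pkg]
	s.loaded[pkg] = true
	return cached
}

// StorePool hands stores to concurrent tests; the buffered channel is the free list.
type StorePool struct{ free chan *Store }

// NewStorePool creates a pool holding n fresh stores.
func NewStorePool(n int) *StorePool {
	p := &StorePool{free: make(chan *Store, n)}
	for i := 0; i < n; i++ {
		p.free <- &Store{ID: i, loaded: map[string]bool{}}
	}
	return p
}

// Acquire blocks until a store is free.
func (p *StorePool) Acquire() *Store { return <-p.free }

// Release returns a store to the pool for the next test.
func (p *StorePool) Release(s *Store) { p.free <- s }

// RunAll runs one goroutine per test, each borrowing a store and loading pkg.
// It returns how many tests found pkg already loaded in their store.
func RunAll(p *StorePool, tests []string, pkg string) (hits int) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for range tests {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s := p.Acquire()
			defer p.Release(s)
			if s.Load(pkg) {
				mu.Lock()
				hits++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return hits
}

func main() {
	pool := NewStorePool(runtime.GOMAXPROCS(0))
	tests := make([]string, 100)
	hits := RunAll(pool, tests, "strings")
	fmt.Printf("%d of %d tests reused an already-loaded package\n", hits, len(tests))
}
```

Each store is only ever held by one goroutine at a time, so the store itself needs no lock; the channel operations order the accesses.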

Measurements

All runs in long mode (no -short); pass/fail sets verified identical before/after at every step (2966 pass / 11 fail — the 11 are pre-existing golden typecheck mismatches against go1.26 locally; CI uses the go.mod toolchain and is green). 4-core runs are taskset-pinned to mirror the ubuntu-latest runners.

| benchmark | before | after |
| --- | --- | --- |
| CI: `stdlibs / test` job | 8m33s | 5m21s, 5m30s, 5m35s (3 runs) |
| CI: examples `gno-checks / test` job | 3m44s | 3m11s, 3m09s (2 runs) |
| CI: `main / test` `pkg/gnolang` package time | 707.9–721.5s (4 master runs) | 685.7s, 710.6s, 899.6s (3 runs — see below) |
| local: `go test ./pkg/gnolang/`, 16 cores, no coverage | 283.7s | 245.0s |
| local: `go test -covermode=set -coverpkg=gnovm/... ./pkg/gnolang/`, `taskset` 4 cores | 600.0s | 546.2s |
| local: `gno test gnovm/stdlibs/...`, `taskset` 4 cores | 455s (sequential) | 270s (`-jobs 4`) |
| local: `gno test examples/...`, `taskset` 4 cores | 129s (sequential) | 129s (`-jobs 4`, all 220 testable packages pass) |

Why main / test doesn't move on CI even though the same binary improves elsewhere: a 4-vCPU runner has to execute ~2700 CPU-seconds of coverage-instrumented interpreted Gno regardless of how it's scheduled (floor ≈ 11 min), and go test -p was already overlapping pkg/gnolang's underused phases with cmd/gno/pkg/parser. The restructure removes the serialization (which is what binds on many-core machines and is why stdlibs / test — single-threaded before — drops by 3m12s), but on 4 cores main / test sits at the CPU floor. Cutting it further requires reducing the work itself — see follow-ups.

A second measured caveat: GnoVM tests are allocation-heavy, so N concurrent VMs in one process slow each other down well beyond simple core-sharing (GC assist / memory pressure). With -jobs 4 on 4 cores, per-package wall time inflates 2–3x (bytes 131s solo → 270s; examples total CPU 129s → 400s). That's why stdlibs nets a large win (its sequential baseline wasted 3 of 4 cores) while the locally-fast examples suite nets none, and why -jobs deliberately defaults to 1. It is also the plausible mechanism behind main / test's one slow sample (899.6s, vs 685.7s/710.6s and a master baseline stable within ~2%): with 4 VM-heavy subtests running concurrently the job is more sensitive to the runner's memory subsystem. Review pushes will accumulate more samples; if the tail shows up again, capping go test -parallel for the gnovm job (keeping the structural fix while limiting concurrent VMs) is a one-line mitigation.

VM profiling: where the CPU actually goes

Since spreading the work can't cut main / test on a 4-vCPU runner, the next step was pprof on the dominant workloads. CPU profile of TestStdlibs/bytes: ~50% of all CPU is Go GC + malloc machinery; the opcode loop itself is ~30%. Allocation profile: 67% of all heap allocations (694M objects in that one suite) came from ArrayValue.GetPointerAtIndexInt2 materializing a *TypedValue + boxed DataByteValue per byte accessed:

- the copy() builtin allocated two boxes per byte copied (554M objects — the code even had a TODO: consider an optimization if dstv.Data != nil);
- for i, c := range byteslice allocated three objects per iteration (the index TypedValue escaping through GetPointerAtIndex(&iv), plus the view box) that Deref immediately threw away;
- b[i] reads in doOpIndex1 did the same box-then-Deref dance.

The fix (last commit) adds TypedValue.GetValueAtIntIndex — a read-only fast path mirroring GetPointerAtIndex's checks and panics for strings and Data-backed arrays/slices — used by doOpIndex1 and the range loop, and gives copy() direct byte copies when both sides are Data-backed (or the source is a string). This allocation pressure is also what made concurrent VMs degrade each other (the GC-contention caveat above), so it compounds with the parallelism work.
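
To make the box-then-Deref pattern concrete, here is a toy model of the two read paths. The types below are simplified stand-ins invented for this sketch (they are not the gnovm definitions); the point is only that the slow path builds two heap objects per byte read while the fast path performs the same bounds check and returns the byte directly.

```go
package main

import "fmt"

// Value and TypedValue are simplified, hypothetical stand-ins for VM values.
type Value interface{}

type TypedValue struct{ V Value }

// byteBox models a per-element view box pointing into the backing array.
type byteBox struct {
	base  []byte
	index int
}

// ByteArray models a Data-backed array.
type ByteArray struct{ Data []byte }

// PointerAt is the slow path: it builds a view box and a TypedValue per access.
func (a *ByteArray) PointerAt(i int) *TypedValue {
	if i < 0 || i >= len(a.Data) {
		panic("index out of range")
	}
	return &TypedValue{V: &byteBox{base: a.Data, index: i}}
}

// Deref reads the byte through the box; afterwards both objects are garbage.
func (tv *TypedValue) Deref() byte {
	b := tv.V.(*byteBox)
	return b.base[b.index]
}

// ValueAt is the read-only fast path: same bounds check, no boxes.
func (a *ByteArray) ValueAt(i int) byte {
	if i < 0 || i >= len(a.Data) {
		panic("index out of range")
	}
	return a.Data[i]
}

// SumSlow reads every byte through the boxed path.
func SumSlow(a *ByteArray) (sum int) {
	for i := range a.Data {
		sum += int(a.PointerAt(i).Deref())
	}
	return sum
}

// SumFast reads every byte through the direct path.
func SumFast(a *ByteArray) (sum int) {
	for i := range a.Data {
		sum += int(a.ValueAt(i))
	}
	return sum
}

func main() {
	a := &ByteArray{Data: []byte("hello, gno")}
	fmt.Println(SumSlow(a) == SumFast(a))
}
```

Both paths must agree on results and on panics, which is why the fast path mirrors the checks instead of skipping them.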

Gas is unchanged: the view boxes were raw Go allocations, never charged to the VM allocator, and CPU gas for copy() was already charged before the per-element loop. Verified empirically: all 2344 filetest goldens (including Gas: and MAXALLOC-sensitive alloc tests) pass byte-identical, plus the gno.land/pkg/sdk/vm gas tests and the txtar integration suite.

| measurement | before | after |
| --- | --- | --- |
| `TestStdlibs/bytes` solo | 151.5s | 105.2s (−31%) |
| full `pkg/gnolang` long mode, 16 cores | 245.0s | 184.6s (−25%) |
| heap objects allocated (bytes suite) | 1.03G | 0.30G (−71%) |
| `BenchmarkOpIndex1_ByteArray` | 185.7ns | 130.0ns (−30%) |
| 4-core + coverage CI simulation | 546.2s | 509.7s |
| CI: `main / test` `pkg/gnolang` | 707.9–721.5s (master, 4 runs) | 646.2s (1 run so far) |

(untouched paths — List arrays, maps, range-over-int-slices — benchmark flat: OpRangeIter_1000 30.9µs → 30.2µs, OpIndex1_MapHit_100 187.0ns → 187.7ns)

Block recycling: removing the next 45%

After the byte-access fixes, NewBlock was 45% of remaining allocations (135M objects in the bytes suite: 75M scope blocks from doOpExec — if/for/range/switch — and 60M call blocks from doOpCall). The key enabler is an invariant the heap-items design already established: closures capture HeapItemValues, not blocks (doOpFuncLit sets Parent: nil), pointers to locals are heap-promoted by the preprocessor, and stacktraces/frames store locations and indices — so a runtime block provably dies when discarded from the machine's block stack.

The last commit adds a small per-machine pool (acquireBlock/releaseBlock) on top of that invariant, routed through every block discard site. Three block populations that also travel the stack are excluded at release: node-owned static blocks (pushed by Eval/RunStatement flows — identified by static-block identity), file/package blocks (referenced by FuncValue.Parent), and defer-site blocks (Defer.Parent is visited by the VM's GC until the defer runs; doOpDefer marks them via a flag tucked into bodyStmt's trailing padding, keeping unsafe.Sizeof(Block{}) — and the _allocBlock gas constant — unchanged). Panic unwinding skips pooling entirely as cheap conservatism.
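
A minimal sketch of such a pool, under stated assumptions: the `Block` fields, the `Static`/`Deferred` flags, the `Unwinding` field and the `maxPooled` cap are all invented for illustration and do not reflect the real struct layout (which, per the text, hides the defer mark in padding). It shows the three ideas from the paragraph above: reuse on acquire, zeroing on release, and exclusion of blocks that may still be referenced.

```go
package main

import "fmt"

// Block is a toy runtime block. Static and Deferred are hypothetical flags
// standing in for "node-owned or file/package block" and "defer-site block".
type Block struct {
	Values   []int
	Static   bool
	Deferred bool
}

// Machine owns a small free list of dead blocks.
type Machine struct {
	pool      []*Block
	Unwinding bool // set while a panic is unwinding
}

const maxPooled = 64 // arbitrary cap chosen for this sketch

// acquireBlock reuses a pooled block when one is available.
func (m *Machine) acquireBlock(n int) *Block {
	if k := len(m.pool); k > 0 {
		b := m.pool[k-1]
		m.pool[k-1] = nil
		m.pool = m.pool[:k-1]
		if cap(b.Values) >= n {
			b.Values = b.Values[:n] // already zeroed at release
		} else {
			b.Values = make([]int, n)
		}
		return b
	}
	return &Block{Values: make([]int, n)}
}

// releaseBlock zeroes and pools a block unless something may still reference it.
func (m *Machine) releaseBlock(b *Block) {
	if b.Static || b.Deferred || m.Unwinding || len(m.pool) >= maxPooled {
		return // left to the Go garbage collector
	}
	vals := b.Values[:cap(b.Values)]
	for i := range vals {
		vals[i] = 0
	}
	*b = Block{Values: vals[:0]} // retain capacity, drop everything else
	m.pool = append(m.pool, b)
}

func main() {
	m := &Machine{}
	b := m.acquireBlock(2)
	b.Values[0] = 7
	m.releaseBlock(b)
	again := m.acquireBlock(2)
	fmt.Println(again == b, again.Values)
}
```

The safety of this scheme rests entirely on the invariant stated above: nothing outside the block stack holds a pointer to a pooled block.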

Gas and VM-GC accounting are unchanged: acquireBlock charges AllocateBlock exactly like Allocator.NewBlock, and pooled (zeroed) blocks are unreachable from GC roots, just like dead blocks today. Same verification battery as before: all 2344 filetest goldens byte-identical (incl. Gas:/Realm:/Storage:/MAXALLOC tests), vm Gas tests, txtar, all 220 example packages, cmd/gno suite.

| measurement | before pooling | after |
| --- | --- | --- |
| heap objects, bytes suite | 300M (1.03G on master) | 165M (−84% vs master) |
| `TestStdlibs/bytes` solo | 105.2s (151.5s master) | 94.9s |
| full `pkg/gnolang` long mode, 16 cores | 184.6s (283.7s master) | 154.0s (−46% vs master) |
| 4-core + coverage CI simulation | 509.7s (600.0s master) | 424.9s, 436.9s (−28% vs master) |
| CI: `stdlibs / test` job | 5m21–5m35s (8m33s master) | 4m02s, 4m19s (−53% vs master) |
| CI: `main / test` `pkg/gnolang` | 646.2s (707.9–721.5s master) | 631.1s, 936.3s (2 runs) |

The main / test CI numbers stay tail-prone (the 936s sample sits next to a same-run stdlibs / test at its fastest ever and a normal cmd/gno, pointing at a slow/noisy runner amplified by 4-way VM contention — same pattern as the earlier 899.6s sample); the controlled 4-core simulations and the coverage-free stdlibs / test job show the real effect. With ~425s of instrumented work remaining in the simulation, the print-only coverage tax (≈1.24x) is now the largest single lever left on main / test.

Round 3: coverage off, call-arg pooling, linear interface-array copies

Three more changes, attacking the remaining main / test time:

- Coverage instrumentation is now skipped for gnovm (new coverage input on _ci-go.yml, default true so gno.land/tm2/misc/contribs are untouched). Since chore(ci): remove Codecov from CI workflows #5795 the percentages were print-only; on gnovm they cost a measured ~1.24x on the tests plus an instrumented rebuild of everything.
- popCopyArgs reuses a per-machine scratch buffer on the call path (30M arg-slice allocations in the bytes suite — the top site after block pooling). doOpDefer's args escape into the Defer and keep allocating.
- Copying arrays with interface-kind element types now shares the held values instead of deep-copying them (ArrayValue.Copy), making x = [1]any{x} chains linear like Go (which copies the sealed boxes' headers). Soundness: interface-slot contents are sealed — every write into an interface slot copies on entry, interface-held values are not addressable, and type-assertion extraction copies again on assignment. This one is intentionally gas-visible for the affected patterns: exactly one golden changes across the 2344 filetests — gas/nested_alloc.gno (built to measure this exact pattern) drops from 8,559,690,088 to 17,013,825 gas; recurse1.gno runs in 0.01s instead of ~23s with its output golden unchanged, and every realm/alloc/interrealm golden stays byte-identical.
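
The quadratic-versus-linear difference in the x = [1]any{x} chain can be reproduced with a toy model. Everything below is an assumption made for illustration (`Array`, `DeepCopy`, `ShareCopy`, `Chain` are not gnovm names): each loop iteration copies x into a fresh one-element array, and a counter records how many array objects the copy strategy creates in total.

```go
package main

import "fmt"

// Array is a toy array value whose interface slots may hold other arrays.
type Array struct{ Elems []any }

// copies counts the Array objects created by copying.
var copies int

// DeepCopy copies the array and, recursively, every array held in its slots.
func DeepCopy(a *Array) *Array {
	copies++
	out := &Array{Elems: make([]any, len(a.Elems))}
	for i, e := range a.Elems {
		if inner, ok := e.(*Array); ok {
			out.Elems[i] = DeepCopy(inner)
		} else {
			out.Elems[i] = e
		}
	}
	return out
}

// ShareCopy copies only the slots; the values held in them are shared.
func ShareCopy(a *Array) *Array {
	copies++
	out := &Array{Elems: make([]any, len(a.Elems))}
	copy(out.Elems, a.Elems)
	return out
}

// Chain models n iterations of x = [1]any{x} and returns the copy count.
func Chain(n int, cp func(*Array) *Array) int {
	copies = 0
	x := &Array{Elems: []any{0}}
	for i := 0; i < n; i++ {
		x = &Array{Elems: []any{cp(x)}}
	}
	return copies
}

func main() {
	const n = 100
	fmt.Println("deep copy:  ", Chain(n, DeepCopy))  // n*(n+1)/2 = 5050
	fmt.Println("shared copy:", Chain(n, ShareCopy)) // n = 100
}
```

Sharing is only safe in this model because nothing mutates a held array in place, which mirrors the sealing argument given above.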

Verification battery repeated in full: filetest suite (canonical pass/fail set), vm Gas tests, txtar, all 220 example packages, cmd/gno.

| measurement | round 2 | round 3 |
| --- | --- | --- |
| full `pkg/gnolang` long mode, 16 cores | 154.0s | 134.7s (283.7s master, −53%) |
| 4-core CI simulation, CI config (with coverage then, without now) | 424.9s | 273.4s (600.0s master config, −54%) |

Round-3 CI results

Two post-push runs of ci / gnovm:

| | run 1 (failed on an env-var bug in the new no-coverage branch, since fixed) | run 2 (green) |
| --- | --- | --- |
| `main / test` `pkg/gnolang` | 264.5s | 389.7s |
| `stdlibs / test` job | — | 4m20s |
| whole `ci / gnovm` workflow | — | 8m39s |

pkg/gnolang now sits at 264–390s on CI (708–722s on master, median ≈ −55%), with the familiar runner-quality spread; the local 4-core simulation (273.4s) matches the good-runner samples. cmd/gno also dropped 88–97s → 72s without instrumentation. Whole-workflow wall time lands at ~6–8.5 minutes depending on runner luck, vs a stable ~14 minutes on master.

Round 4: bigint sharing, per-op escape fixes, type re-boxing

Fresh profiles after round 3 pointed at the next allocation tier; four more gas-neutral changes (all goldens byte-identical, full battery repeated):

- BigintValue/BigdecValue.Copy share the underlying value — they are immutable at runtime (arithmetic writes fresh receivers); 24M allocations, mostly untyped-const operands copied at declaration sites.
- doOpValueDecl escape fix — its working TypedValue heap-escaped once per declaration via ConvertUntypedTo(&tv); at runtime only untyped bools reach that path, now retyped directly (16.6M).
- doOpConvert escape fix — same pattern via ConvertTo/IsReadonly; now uses a machine-owned scratch slot whose address is free (11.7M).
- constTypeExpr evaluation re-boxed the type per eval (every conversion evaluates one); the boxed form is now cached on the node at preprocess time, with a read-only fallback for store-loaded nodes (13.5M).

One negative result worth recording: carrying the block pool through Machine.Release/machinePool measured 10–25% slower on parallel workloads (extra live heap across GC cycles, lost cache locality) with no benefit for machine-churn workloads — Machine.Release now documents why the pools are deliberately dropped.

Same-session A/B (laptop thermals make cross-session walls incomparable; this is back-to-back): full pkg/gnolang long mode 173.9s → 157.1s (−10%), bytes suite solo 94.9s → 83.5s.

Round-4 CI result

| | master | this run |
| --- | --- | --- |
| `pkg/gnolang` package time | 708–722s | 256.1s (−64%) |
| `cmd/gno` package time | 88–97s | 53.8s |
| `main / test` job | 13m43s | 6m00s |
| `stdlibs / test` job | 8m33s | 3m54s |
| whole `ci / gnovm` workflow | ~14m | 6m18s |

(single good-runner sample; the runner-quality band observed across this PR suggests ~6–8m for the workflow)

Follow-ups (measured, but out of scope here)

- The next allocation tier after this round: GetPointerToFromTV (26M objects), doOpValueDecl (16.5M) and math/big.NewInt (16M) in the bytes suite.
- StructValue.Copy could get the same interface-field sharing as arrays (per-field static types; it was not the profiled quadratic case, so left out).
- transcribe (preprocessing) is 29% of the TestFiles short-walk CPU — per-filetest typecheck+preprocess of imports; a Store-level preprocessed-stdlib cache would help every consumer (filetests, gno test, keeper cold paths).
- Byte writes (b[i] = x) still box (17M objects via PopAsPointer2) — the pointer protocol spans multiple ops, so a write-side fast path needs more design.

TestFiles ran its 2339 non-long filetests sequentially on a single goroutine; they now run as parallel subtests, drawing a TestOptions (with its store) from a GOMAXPROCS-sized pool so loaded packages are still reused across tests. -update-golden-tests keeps the fully sequential single-store behavior.

TestStdlibs ran most stdlib suites sequentially on a shared store in the parent test body, which also delayed the parallel heavy suites (bytes, strconv, ...) until the walk finished, since parallel subtests only start once the parent body returns. Every stdlib package now runs as a parallel subtest with its own store.

gnoBuiltinsCache was lazily populated from TypeCheckMemPackage without synchronization; with type-checks now running concurrently from the start, that latent race becomes load-bearing. The cache is now built eagerly at package init and read-only afterwards.
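
The eager-init pattern for the builtins cache can be sketched as follows. The map contents and the `builtinsCache`, `buildBuiltins` and `Lookup` names are placeholders for illustration; the technique is simply that a package-level variable initialized before main and never written afterwards can be read from any number of goroutines without a lock.

```go
package main

import (
	"fmt"
	"sync"
)

// builtinsCache is built once during package initialization and never
// written afterwards, so concurrent readers need no synchronization.
var builtinsCache = buildBuiltins()

// buildBuiltins stands in for the expensive one-time construction.
func buildBuiltins() map[string]string {
	return map[string]string{
		"len":    "func",
		"append": "func",
		"string": "type",
	}
}

// Lookup only reads the cache.
func Lookup(name string) (kind string, ok bool) {
	kind, ok = builtinsCache[name]
	return kind, ok
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			Lookup("len") // concurrent reads of an immutable map are race-free
		}()
	}
	wg.Wait()
	fmt.Println(Lookup("string"))
}
```

A lazily filled map would instead need a sync.Once or a mutex; building at init trades a little startup work for lock-free reads.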

Gno2D2 · 2026-06-10T17:38:04Z

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):

IGNORE the bot requirements for this PR (force green CI check)

✅ Automated Checks (for Contributors):

🟢 Maintainers must be able to edit this pull request (more info)

☑️ Contributor Actions:

Fix any issues flagged by automated checks.
Follow the Contributor Checklist to ensure your PR is ready for review.
- Add new tests, or document why they are unnecessary.
- Provide clear examples/screenshots, if necessary.
- Update documentation, if required.
- Ensure no breaking changes, or include BREAKING CHANGE notes.
- Link related issues/PRs, where applicable.

☑️ Reviewer Actions:

Complete manual checks for the PR, including the guidelines and additional checks if applicable.

📚 Resources:

Debug

Automated Checks
Maintainers must be able to edit this pull request (more info)
If
🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 The pull request was created from a fork (head branch repo: thehowl/gno)
Then
🟢 Requirement satisfied
└── 🟢 Maintainer can modify this pull request
Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)
If
🟢 Condition met
└── 🟢 On every pull request
Can be checked by

Any user with comment edit permission

gno test runs packages sequentially on a single-threaded VM; on the gnovm stdlibs CI job this is ~455s of test time on one core of a 4-vCPU runner (bytes 174s, strconv 86s, math/overflow 60s, ...). -jobs N (default 1: behavior unchanged) tests up to N packages in parallel. Each worker owns a TestOptions/store, reused across the packages it runs; per-package output is buffered and printed in package order as results complete, so runs remain readable and deterministic. Incompatible with the interactive -debug mode. The CI gno-test step (gnovm stdlibs and examples jobs) now passes -jobs 4 to match the runner's 4 vCPUs.

Profiling the gnovm test suites (dominated by interpreted Gno) showed ~50% of CPU in Go GC/malloc, with 67% of all heap allocations (694M objects in the bytes stdlib suite alone) coming from ArrayValue.GetPointerAtIndexInt2 materializing a *TypedValue + DataByteValue box per byte accessed:

- the copy() builtin allocated two boxes per byte copied;
- range over a byte slice allocated three objects per iteration (the index TypedValue escaping through GetPointerAtIndex, plus the view box) that Deref immediately discarded;
- b[i] reads in doOpIndex1 did the same box-then-Deref dance.

Add TypedValue.GetValueAtIntIndex, a read-only fast path mirroring GetPointerAtIndex's checks and panics for strings and Data-backed arrays/slices, and use it in doOpIndex1 and the range loop. Give the copy() builtin direct byte copies when both sides are Data-backed (or the source is a string); bounds, readonly checks, DidUpdate and CPU gas are unchanged (charged before the loop, as before), and Go's copy is overlap-safe so the backward-copy setup only remains for the List fallback.

Gas is unchanged: the view boxes were raw Go allocations, never charged to the VM allocator. All 2344 filetest goldens (including Gas: and MAXALLOC-sensitive alloc tests) pass unmodified, as do the gno.land vm Gas tests and the txtar integration suite.

bytes stdlib suite: 151.5s -> 105.2s; full pkg/gnolang long mode: 245.0s -> 184.6s; allocated objects in the bytes suite: 1.03G -> 0.30G; BenchmarkOpIndex1_ByteArray: 185.7ns -> 130.0ns.

davd-gzl

Solid design and a real CI win. Two blockers, both reproduced on the current head (13064df):

- go test -race ./gnovm/pkg/gnolang: 0 races on master, 27 on this branch. Culprits: the global enabled debug bool and an *Allocator shared across pool stores. Repro below, breakdown in the full review.
- gno test -jobs -1 hangs forever with no output. Inline comment with fix and repro.

Minor, inline: the -jobs parallel path has zero test coverage; in-order output buffering nit. Also -update-golden-tests isn't forced sequential with -jobs >1, unlike TestFiles (poolSize=1 under *withSync); flagging in case unintentional.

race repro

# from a local clone of gnolang/gno:
gh pr checkout 5800 -R gnolang/gno
GNOROOT=$PWD go test -race -short -run 'TestFiles/a' ./gnovm/pkg/gnolang/ 2>&1 | grep -c 'DATA RACE'

Same invocation on master:

Full review: https://github.com/samouraiworld/gno-agent-workspace/blob/main/reviews/pr/5xxx/5800-parallelize-test-suites/1-4cdc7de8e/claude-opus-4-8_davd-gzl.md

(AI Agent)

(I know it's still in draft, I wanted to test a new workflow! I hope that can help!)

davd-gzl · 2026-06-10T21:42:14Z

+		jobs := cmd.jobs
+		if jobs == 0 {
+			jobs = runtime.GOMAXPROCS(0)
+		}
+		jobs = min(jobs, len(pkgs))


This guard only catches jobs == 0: a negative value slips through, spawns zero workers, and the collector waits forever. Reject jobs < 0 here.
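
One possible shape for the suggested guard, as a sketch only: `ResolveJobs` is a hypothetical helper and the error text is made up, but the logic follows the diff above with the negative case rejected up front.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// ResolveJobs turns the raw -jobs value into a worker count.
// Negative values are rejected, 0 means "use GOMAXPROCS", and the result
// never exceeds the number of packages.
func ResolveJobs(jobs, numPkgs int) (int, error) {
	if jobs < 0 {
		return 0, errors.New("-jobs must not be negative")
	}
	if jobs == 0 {
		jobs = runtime.GOMAXPROCS(0)
	}
	if jobs > numPkgs {
		jobs = numPkgs
	}
	return jobs, nil
}

func main() {
	for _, j := range []int{-1, 1, 4} {
		n, err := ResolveJobs(j, 2)
		fmt.Println(j, n, err)
	}
}
```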

repro

# from a local clone of gnolang/gno:
gh pr checkout 5800 -R gnolang/gno
GNOROOT=$PWD timeout 25 go run ./gnovm/cmd/gno test -C examples -jobs -1 \
  ./gno.land/p/moul/once/ ./gno.land/p/moul/fifo/
echo "exit=$?"   # 124 = hung (timed out)

exit=124

(AI Agent)

davd-gzl · 2026-06-10T21:42:14Z

+	} else {
+		// Parallel run: cmd.jobs workers, each with its own store. The
+		// output of each package is buffered, and printed in package order
+		// as results come in.
+		jobs := cmd.jobs
+		if jobs == 0 {
+			jobs = runtime.GOMAXPROCS(0)
+		}
+		jobs = min(jobs, len(pkgs))


The parallel path has no test (existing txtars all run the default -jobs 1). Two txtar cases in testdata/test/ would cover it: run a few fixture packages with -jobs 4 and assert the same package-ordered output a sequential run prints; run -jobs -1 and assert an immediate error instead of the current hang.

(AI Agent)

davd-gzl · 2026-06-10T21:42:14Z

+		for i := range results {
+			res := &results[i]
+			<-res.done
+			if res.out.Len() > 0 {
+				_, _ = io.Out().Write(res.out.Bytes())
+			}
+			if res.errOut.Len() > 0 {
+				_, _ = io.Err().Write(res.errOut.Bytes())
+			}
+			buildErrCount += res.buildErrs
+			testErrCount += res.testErrs
 		}


Nit: output drains in index order, so a slow first package buffers everything behind it in memory. Fine tradeoff for determinism.

(AI Agent)

After the byte-access fixes, NewBlock was the dominant allocation site (45% of remaining heap objects in the bytes suite: 75M per-scope blocks from doOpExec — if/for/range/switch/block — and 60M call blocks from doOpCall, each also allocating a Values slice).

With closures capturing heap items rather than blocks (doOpFuncLit sets Parent=nil and copies Captures), a runtime block provably dies when it is discarded from the machine's block stack. acquireBlock/releaseBlock implement a small per-machine pool on top of that invariant; all block discard sites (OpPopBlock, GotoJump, PopFrameAndReset/Return, PeekFrameAndContinueFor/Range) route through it. Blocks are zeroed on release so they retain no references.

Skipped from pooling, in releaseBlock:

- node-owned static blocks and file/package blocks, which also travel the block stack (Eval/RunStatement flows push static blocks; file blocks are referenced by FuncValue.Parent) — identified by Source type and static-block identity;
- defer-site blocks: Defer.Parent is visited by the garbage collector until the defer runs, so doOpDefer marks them via a flag stored in bodyStmt's trailing padding, keeping unsafe.Sizeof(Block{}) — and the _allocBlock gas constant — unchanged;
- anything while a panic is unwinding, as cheap conservatism.

Gas and VM-GC accounting are unchanged: acquireBlock charges AllocateBlock exactly like Allocator.NewBlock, and pooled blocks are unreachable from GC roots just like dead blocks today. Verified: all 2344 filetest goldens (Gas:, Realm:, Storage:, MAXALLOC alloc tests) byte-identical, vm Gas tests, txtar suite, examples (220 packages), cmd/gno suite.

bytes suite heap objects: 165M (from 300M; 1.03G before the byte-access fixes); bytes suite solo: 105.2s -> 94.9s; full pkg/gnolang long mode: 184.6s -> 154.0s; 4-core+coverage CI simulation: 509.7s -> 424.9s (600.0s on master).

Since gnolang#5795 the per-module coverage percentages are only printed to the job log; nothing uploads or tracks them. The instrumentation costs a measured ~1.24x on the gnovm interpreter-heavy tests (TestStdlibs/sort: 22.1s -> 27.5s) plus an instrumented rebuild of every package. Add a 'coverage' input to the reusable Go CI workflow (default true, so gno.land/tm2/misc/contribs keep their current behavior) and opt gnovm out of it.

popCopyArgs allocated an args slice per function call (30M objects in the bytes stdlib suite, the top allocation site after block pooling). doOpCall consumes the args immediately — they are copied into the call block before any further ops — so it now hands popCopyArgs a reusable per-machine scratch buffer, cleared after each use. doOpDefer's args escape into the Defer and keep allocating fresh slices.
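
The scratch-buffer idea can be sketched with a toy machine whose values are plain ints. The `Machine` fields and the `popArgsScratch`/`call` names are assumptions for illustration. The key constraint is visible in `call`: the scratch slice is only valid until the next call, so the consumer must copy out of it before doing anything else.

```go
package main

import "fmt"

// Machine is a toy VM with an operand stack and a reusable argument buffer.
type Machine struct {
	stack   []int
	scratch []int // reused across calls; contents valid only until the next call
}

// popArgsScratch pops n operands into the scratch buffer, growing it only
// when a call needs more room than any previous one.
func (m *Machine) popArgsScratch(n int) []int {
	if cap(m.scratch) < n {
		m.scratch = make([]int, n)
	}
	args := m.scratch[:n]
	copy(args, m.stack[len(m.stack)-n:])
	m.stack = m.stack[:len(m.stack)-n]
	return args
}

// call models a call site: arguments are copied into the callee's own
// block right away, then the scratch buffer is cleared so it keeps no
// stale values alive.
func (m *Machine) call(n int) []int {
	args := m.popArgsScratch(n)
	block := make([]int, n) // storage owned by the call block
	copy(block, args)
	for i := range args {
		args[i] = 0
	}
	return block
}

func main() {
	m := &Machine{stack: []int{1, 2, 3}}
	first := m.call(2)
	m.stack = append(m.stack, 9)
	second := m.call(2)
	fmt.Println(first, second)
}
```

A defer-like consumer that stores the arguments for later cannot use this buffer, which matches the note that doOpDefer keeps allocating.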

Copying an array whose element type is an interface deep-copied each element's held value, making chains like x = [1]any{x} quadratic where Go is linear (Go copies the sealed boxes' headers). ArrayValue.Copy now shares the element values for interface-kind element types.

This is sound because interface-slot contents are sealed: every write into an interface slot copies on entry (TypedValue.Assign), interface-held values are not addressable (no lvalue chain can root in one, per Go's rules which gno follows), and extraction via type assertion copies again on assignment. Sharing therefore can never expose a mutable alias.

Unlike the previous optimizations this is intentionally gas-visible for the affected patterns: the copies no longer happen, so allocation gas drops accordingly. Exactly one golden changes across the 2344 filetests — gas/nested_alloc.gno (built to measure this exact pattern) drops from 8,559,690,088 to 17,013,825 gas. recurse1.gno runs in 0.01s instead of ~23s with its output golden unchanged; every realm, alloc and interrealm golden is byte-identical. vm Gas tests, txtar suite, examples (220 packages) and cmd/gno all pass.

The coverage dir env vars (TXTARCOVERDIR, GOCOVERDIR, COVERDIR) also steer testscript-based suites: cmd/gno's Test_Scripts fails when TXTARCOVERDIR points at a directory that was never created. With coverage off, leave them empty so those suites skip coverage collection, matching a plain local run.

BigintValue and BigdecValue are immutable at runtime: all arithmetic writes into fresh receivers and conversions only read. Copying them allocated a fresh big.Int/apd.Decimal per copy — 24M allocations in the bytes stdlib suite, mostly from untyped-const operands copied at declaration sites. Share the underlying value instead. Neither Copy ever charged the allocator, so this is gas-neutral.

Three more allocation sites found by profiling the bytes stdlib suite, together ~42M objects (of 135M total):

- doOpValueDecl let its working TypedValue escape to the heap once per declaration executed, because its address went into ConvertUntypedTo. At runtime only untyped bools (from comparisons) reach that path, so retype directly; the preprocess-stage conversion moves to a by-value helper. (16.6M)
- doOpConvert's working value escaped the same way via ConvertTo and IsReadonly. Use a machine-owned scratch slot: a field's address is free, and the op is single-threaded and self-contained. (11.7M)
- Evaluating a constTypeExpr re-boxed the type into a TypeValue interface per evaluation (every conversion evaluates one). Cache the boxed form on the node at preprocess time; nodes loaded from the store fall back to boxing per eval (the cache is not persisted and is never lazily filled at runtime, since nodes can be shared across machines). (13.5M)

Also documents on Machine.Release why blockPool/callArgsScratch are deliberately not carried through the machine pool: measured to hurt parallel workloads via extra live heap and lost cache locality, without helping machine-churn workloads (sync.Pool eviction discards them).

Full verification battery: filetest suite canonical (all goldens byte-identical), vm Gas tests, txtar, 220 example packages, cmd/gno. Same-session A/B: full pkg/gnolang long mode 173.9s -> 157.1s (-10%); bytes suite solo 94.9s -> 83.5s.

thehowl · 2026-06-11T16:39:59Z

Superseded by the stacked split (merge in order): #5811 → #5812 → #5813 → #5814 → #5815 → #5816. This branch served as the performance testbed; final measured result of the full stack: ci / gnovm ~14m → 6m18s, pkg/gnolang test time −64%, VM heap allocations −84% on the heaviest suite. All research, profiling data and per-round measurements remain in this PR's description and commits.

github-project-automation Bot added this to 🧙‍♂️Gno.land development Jun 10, 2026

github-project-automation Bot moved this to Triage in 🧙‍♂️Gno.land development Jun 10, 2026

github-actions Bot assigned thehowl Jun 10, 2026

github-actions Bot added the 📦 🤖 gnovm and 🚀 ci labels Jun 10, 2026

thehowl force-pushed the dev/morgan/gnovm-ci-speedup branch from b4235ac to 4cdc7de Compare June 10, 2026 17:51

thehowl marked this pull request as draft June 10, 2026 19:06

thehowl changed the title ~~perf(gnovm): parallelize test suites to cut CI time~~ perf(gnovm): parallelize test suites and remove byte-access allocation hotspots Jun 10, 2026

davd-gzl suggested changes Jun 10, 2026

View reviewed changes

thehowl added 7 commits June 11, 2026 00:48

thehowl closed this Jun 11, 2026

github-project-automation Bot moved this from Triage to Done in 🧙‍♂️Gno.land development Jun 11, 2026

thehowl wants to merge 10 commits into gnolang:master from thehowl:dev/morgan/gnovm-ci-speedup

3 participants
