perf(gnovm): parallelize test suites and remove byte-access allocation hotspots #5800
thehowl wants to merge 10 commits
TestFiles ran its 2339 non-long filetests sequentially on a single goroutine; they now run as parallel subtests, drawing a TestOptions (with its store) from a GOMAXPROCS-sized pool so loaded packages are still reused across tests. -update-golden-tests keeps the fully sequential single-store behavior.
TestStdlibs ran most stdlib suites sequentially on a shared store in the parent test body, which also delayed the parallel heavy suites (bytes, strconv, ...) until the walk finished, since parallel subtests only start once the parent body returns. Every stdlib package now runs as a parallel subtest with its own store.
gnoBuiltinsCache was lazily populated from TypeCheckMemPackage without synchronization; with type-checks now running concurrently from the start, that latent race becomes load-bearing. The cache is now built eagerly at package init and read-only afterwards.
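The store pool described above can be sketched with a buffered channel. This is a minimal illustration, not the PR's code: the store type, newStorePool and runAll are hypothetical stand-ins for TestOptions and the subtest runner.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// store is a hypothetical stand-in for a TestOptions and its package store.
type store struct {
	id     int
	loaded map[string]bool // packages already loaded into this store
}

// newStorePool returns a channel pre-filled with n stores. Receiving from
// the channel borrows a store; sending it back returns it to the pool.
func newStorePool(n int) chan *store {
	pool := make(chan *store, n)
	for i := 0; i < n; i++ {
		pool <- &store{id: i, loaded: map[string]bool{}}
	}
	return pool
}

// runAll starts one goroutine per job, but at most cap(pool) jobs make
// progress at a time because each must hold a store while it runs.
func runAll(pool chan *store, jobs []string) int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done int
	)
	for _, name := range jobs {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			s := <-pool                  // borrow; blocks while all stores are busy
			defer func() { pool <- s }() // hand the store back for reuse
			s.loaded[name] = true        // safe: only the borrower touches s
			mu.Lock()
			done++
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return done
}

func main() {
	pool := newStorePool(runtime.GOMAXPROCS(0))
	n := runAll(pool, []string{"a.gno", "b.gno", "c.gno", "d.gno", "e.gno"})
	fmt.Println("ran", n, "tests")
}
```

Because a store is only ever touched by the goroutine that currently holds it, the stores themselves need no locking; the channel hand-off provides the ordering.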
gno test runs packages sequentially on a single-threaded VM; on the gnovm stdlibs CI job this is ~455s of test time on one core of a 4-vCPU runner (bytes 174s, strconv 86s, math/overflow 60s, ...). -jobs N (default 1: behavior unchanged) tests up to N packages in parallel. Each worker owns a TestOptions/store, reused across the packages it runs; per-package output is buffered and printed in package order as results complete, so runs remain readable and deterministic. Incompatible with the interactive -debug mode. The CI gno-test step (gnovm stdlibs and examples jobs) now passes -jobs 4 to match the runner's 4 vCPUs.
Profiling the gnovm test suites (dominated by interpreted Gno) showed ~50% of CPU in Go GC/malloc, with 67% of all heap allocations (694M objects in the bytes stdlib suite alone) coming from ArrayValue.GetPointerAtIndexInt2 materializing a *TypedValue + DataByteValue box per byte accessed:
- the copy() builtin allocated two boxes per byte copied;
- range over a byte slice allocated three objects per iteration (the index TypedValue escaping through GetPointerAtIndex, plus the view box) that Deref immediately discarded;
- b[i] reads in doOpIndex1 did the same box-then-Deref dance.
Add TypedValue.GetValueAtIntIndex, a read-only fast path mirroring GetPointerAtIndex's checks and panics for strings and Data-backed arrays/slices, and use it in doOpIndex1 and the range loop. Give the copy() builtin direct byte copies when both sides are Data-backed (or the source is a string); bounds, readonly checks, DidUpdate and CPU gas are unchanged (charged before the loop, as before), and Go's copy is overlap-safe so the backward-copy setup only remains for the List fallback.
Gas is unchanged: the view boxes were raw Go allocations, never charged to the VM allocator. All 2344 filetest goldens (including Gas: and MAXALLOC-sensitive alloc tests) pass unmodified, as do the gno.land vm Gas tests and the txtar integration suite.
Results:
- bytes stdlib suite: 151.5s -> 105.2s
- full pkg/gnolang long mode: 245.0s -> 184.6s
- allocated objects in the bytes suite: 1.03G -> 0.30G
- BenchmarkOpIndex1_ByteArray: 185.7ns -> 130.0ns
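The difference between the boxed path and a read-only fast path can be reproduced in miniature. The types below (byteView, typedValue, array) are simplified, hypothetical stand-ins, not the GnoVM definitions; go:noinline is used only so the sketch's boxes escape to the heap, as they do across the VM's call boundaries.

```go
package main

import (
	"fmt"
	"testing"
)

// byteView mimics a boxed view onto one byte of a Data-backed array.
type byteView struct {
	base  []byte
	index int
}

// typedValue is a heavily simplified stand-in for a VM value.
type typedValue struct{ v any }

type array struct{ data []byte }

// pointerAt mirrors the slow path: two heap objects per byte accessed.
//
//go:noinline
func (a *array) pointerAt(i int) *typedValue {
	_ = a.data[i] // bounds check, panics like the fast path
	return &typedValue{v: &byteView{base: a.data, index: i}}
}

// deref reads the byte back out of the box, which is then garbage.
func (tv *typedValue) deref() byte {
	bv := tv.v.(*byteView)
	return bv.base[bv.index]
}

// valueAt is the read-only fast path: same bounds check, no allocation.
func (a *array) valueAt(i int) byte { return a.data[i] }

func sumBoxed(a *array) (s byte) {
	for i := range a.data {
		s += a.pointerAt(i).deref()
	}
	return s
}

func sumDirect(a *array) (s byte) {
	for i := range a.data {
		s += a.valueAt(i)
	}
	return s
}

var sink byte

func main() {
	a := &array{data: []byte("hello, gno")}
	boxed := testing.AllocsPerRun(100, func() { sink = sumBoxed(a) })
	direct := testing.AllocsPerRun(100, func() { sink = sumDirect(a) })
	fmt.Println("same result:", sumBoxed(a) == sumDirect(a))
	fmt.Println("allocs per pass, boxed vs direct:", boxed, direct)
}
```

Both paths return the same bytes and panic on the same out-of-range index; only the number of short-lived heap objects differs.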
Solid design and a real CI win. Two blockers, both reproduced on the current head (13064df):
- go test -race ./gnovm/pkg/gnolang: 0 races on master, 27 on this branch. Culprits: the global enabled debug bool and an *Allocator shared across pool stores. Repro below, breakdown in the full review.
- gno test -jobs -1 hangs forever with no output. Inline comment with fix and repro.
Minor, inline: the -jobs parallel path has zero test coverage; in-order output buffering nit. Also -update-golden-tests isn't forced sequential with -jobs >1, unlike TestFiles (poolSize=1 under *withSync); flagging in case unintentional.
race repro
# from a local clone of gnolang/gno:
gh pr checkout 5800 -R gnolang/gno
GNOROOT=$PWD go test -race -short -run 'TestFiles/a' ./gnovm/pkg/gnolang/ 2>&1 | grep -c 'DATA RACE'
27
Same invocation on master:
0
(AI Agent)
(I know it's still in draft, I wanted to test a new workflow! I hope that can help!)
jobs := cmd.jobs
if jobs == 0 {
	jobs = runtime.GOMAXPROCS(0)
}
jobs = min(jobs, len(pkgs))
This guard only catches jobs == 0: a negative value slips through, spawns zero workers, and the collector waits forever. Reject jobs < 0 here.
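One way to express the suggested guard is a small helper that rejects negative values before any worker is started. resolveJobs and its error text are hypothetical, written for illustration rather than taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// resolveJobs turns the raw -jobs value into a worker count.
// 0 means "use GOMAXPROCS"; negative values are rejected up front so the
// caller can never end up waiting on results that no worker will produce.
func resolveJobs(requested, numPkgs int) (int, error) {
	if requested < 0 {
		return 0, errors.New("-jobs must be >= 0")
	}
	if requested == 0 {
		requested = runtime.GOMAXPROCS(0)
	}
	if requested > numPkgs {
		requested = numPkgs // never more workers than packages
	}
	return requested, nil
}

func main() {
	for _, j := range []int{-1, 1, 8} {
		n, err := resolveJobs(j, 2)
		fmt.Printf("jobs=%d -> workers=%d err=%v\n", j, n, err)
	}
}
```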
repro
# from a local clone of gnolang/gno:
gh pr checkout 5800 -R gnolang/gno
GNOROOT=$PWD timeout 25 go run ./gnovm/cmd/gno test -C examples -jobs -1 \
./gno.land/p/moul/once/ ./gno.land/p/moul/fifo/
echo "exit=$?" # 124 = hung (timed out)
exit=124
(AI Agent)
} else {
	// Parallel run: cmd.jobs workers, each with its own store. The
	// output of each package is buffered, and printed in package order
	// as results come in.
	jobs := cmd.jobs
	if jobs == 0 {
		jobs = runtime.GOMAXPROCS(0)
	}
	jobs = min(jobs, len(pkgs))
The parallel path has no test (existing txtars all run the default -jobs 1). Two txtar cases in testdata/test/ would cover it: run a few fixture packages with -jobs 4 and assert the same package-ordered output a sequential run prints; run -jobs -1 and assert an immediate error instead of the current hang.
(AI Agent)
for i := range results {
	res := &results[i]
	<-res.done
	if res.out.Len() > 0 {
		_, _ = io.Out().Write(res.out.Bytes())
	}
	if res.errOut.Len() > 0 {
		_, _ = io.Err().Write(res.errOut.Bytes())
	}
	buildErrCount += res.buildErrs
	testErrCount += res.testErrs
}
Nit: output drains in index order, so a slow first package buffers everything behind it in memory. Fine tradeoff for determinism.
(AI Agent)
After the byte-access fixes, NewBlock was the dominant allocation site
(45% of remaining heap objects in the bytes suite: 75M per-scope blocks
from doOpExec — if/for/range/switch/block — and 60M call blocks from
doOpCall, each also allocating a Values slice).
With closures capturing heap items rather than blocks (doOpFuncLit sets
Parent=nil and copies Captures), a runtime block provably dies when it
is discarded from the machine's block stack. acquireBlock/releaseBlock
implement a small per-machine pool on top of that invariant; all block
discard sites (OpPopBlock, GotoJump, PopFrameAndReset/Return,
PeekFrameAndContinueFor/Range) route through it. Blocks are zeroed on
release so they retain no references.
Skipped from pooling, in releaseBlock:
- node-owned static blocks and file/package blocks, which also travel
the block stack (Eval/RunStatement flows push static blocks; file
blocks are referenced by FuncValue.Parent) — identified by Source
type and static-block identity;
- defer-site blocks: Defer.Parent is visited by the garbage collector
until the defer runs, so doOpDefer marks them via a flag stored in
bodyStmt's trailing padding, keeping unsafe.Sizeof(Block{}) — and the
_allocBlock gas constant — unchanged;
- anything while a panic is unwinding, as cheap conservatism.
Gas and VM-GC accounting are unchanged: acquireBlock charges
AllocateBlock exactly like Allocator.NewBlock, and pooled blocks are
unreachable from GC roots just like dead blocks today. Verified: all
2344 filetest goldens (Gas:, Realm:, Storage:, MAXALLOC alloc tests)
byte-identical, vm Gas tests, txtar suite, examples (220 packages),
cmd/gno suite.
bytes suite heap objects: 165M (from 300M; 1.03G before the byte-access
fixes); bytes suite solo: 105.2s -> 94.9s; full pkg/gnolang long mode:
184.6s -> 154.0s; 4-core+coverage CI simulation: 509.7s -> 424.9s
(600.0s on master).
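The recycling scheme in this commit message can be sketched as a per-machine free list. The block and machine types, the noPool flag and the maxPooled cap are hypothetical simplifications; the real exclusion rules (static, file/package, defer-site blocks, panics) are reduced here to a single flag.

```go
package main

import "fmt"

// block is a simplified stand-in for a runtime scope block.
type block struct {
	values []int
	parent *block
	noPool bool // set for blocks that may outlive their stack slot
}

// machine owns the pool; it is never shared between goroutines.
type machine struct {
	pool    []*block
	charged int // stands in for the allocator/gas charge
}

const maxPooled = 64 // hypothetical cap on retained blocks

// acquireBlock charges exactly as a fresh allocation would, then reuses a
// pooled block when one is available.
func (m *machine) acquireBlock(parent *block, n int) *block {
	m.charged++
	var b *block
	if k := len(m.pool); k > 0 {
		b = m.pool[k-1]
		m.pool[k-1] = nil
		m.pool = m.pool[:k-1]
	} else {
		b = &block{}
	}
	b.parent = parent
	if cap(b.values) >= n {
		b.values = b.values[:n] // backing array was zeroed on release
	} else {
		b.values = make([]int, n)
	}
	return b
}

// releaseBlock zeroes the block so it retains no references, then pools it.
func (m *machine) releaseBlock(b *block) {
	if b.noPool || len(m.pool) >= maxPooled {
		return // left to the Go garbage collector
	}
	vals := b.values[:cap(b.values)]
	for i := range vals {
		vals[i] = 0
	}
	*b = block{values: vals[:0]}
	m.pool = append(m.pool, b)
}

func main() {
	m := &machine{}
	b1 := m.acquireBlock(nil, 2)
	b1.values[0] = 7
	m.releaseBlock(b1)
	b2 := m.acquireBlock(nil, 2)
	fmt.Println("reused:", b1 == b2, "zeroed:", b2.values[0] == 0, "charged:", m.charged)
}
```

Charging on every acquire, whether or not the memory is recycled, is what keeps the accounting identical to the unpooled path.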
Since gnolang#5795 the per-module coverage percentages are only printed to the job log; nothing uploads or tracks them. The instrumentation costs a measured ~1.24x on the gnovm interpreter-heavy tests (TestStdlibs/sort: 22.1s -> 27.5s) plus an instrumented rebuild of every package. Add a 'coverage' input to the reusable Go CI workflow (default true, so gno.land/tm2/misc/contribs keep their current behavior) and opt gnovm out of it.
popCopyArgs allocated an args slice per function call (30M objects in the bytes stdlib suite, the top allocation site after block pooling). doOpCall consumes the args immediately — they are copied into the call block before any further ops — so it now hands popCopyArgs a reusable per-machine scratch buffer, cleared after each use. doOpDefer's args escape into the Defer and keep allocating fresh slices.
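A scratch buffer of this kind can be sketched as below. The machine type, popArgs and call are hypothetical; the sketch only shows why reuse is safe when the caller consumes the arguments before the next call.

```go
package main

import "fmt"

// machine is a simplified stand-in with a value stack and a scratch buffer.
type machine struct {
	stack   []int
	scratch []int // reused between calls; always left empty and cleared
}

// popArgs pops the top n stack values into buf, growing buf only if needed.
func (m *machine) popArgs(n int, buf []int) []int {
	if cap(buf) < n {
		buf = make([]int, n)
	}
	buf = buf[:n]
	copy(buf, m.stack[len(m.stack)-n:])
	m.stack = m.stack[:len(m.stack)-n]
	return buf
}

// call consumes its arguments immediately (here: sums them), so the buffer
// can be cleared and handed back before any other operation runs.
func (m *machine) call(n int) int {
	args := m.popArgs(n, m.scratch)
	sum := 0
	for _, a := range args {
		sum += a
	}
	for i := range args {
		args[i] = 0 // clear so the scratch buffer retains nothing
	}
	m.scratch = args[:0]
	return sum
}

func main() {
	m := &machine{stack: []int{1, 2, 3, 4}}
	fmt.Println(m.call(2), m.call(2))
}
```

A caller that stores the slice, as a defer does, cannot use the scratch buffer and has to keep allocating.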
Copying an array whose element type is an interface deep-copied each
element's held value, making chains like x = [1]any{x} quadratic where
Go is linear (Go copies the sealed boxes' headers). ArrayValue.Copy now
shares the element values for interface-kind element types.
This is sound because interface-slot contents are sealed: every write
into an interface slot copies on entry (TypedValue.Assign), interface-
held values are not addressable (no lvalue chain can root in one, per
Go's rules which gno follows), and extraction via type assertion copies
again on assignment. Sharing therefore can never expose a mutable
alias.
Unlike the previous optimizations this is intentionally gas-visible for
the affected patterns: the copies no longer happen, so allocation gas
drops accordingly. Exactly one golden changes across the 2344 filetests
— gas/nested_alloc.gno (built to measure this exact pattern) drops from
8,559,690,088 to 17,013,825 gas. recurse1.gno runs in 0.01s instead of
~23s with its output golden unchanged; every realm, alloc and
interrealm golden is byte-identical. vm Gas tests, txtar suite,
examples (220 packages) and cmd/gno all pass.
The coverage dir env vars (TXTARCOVERDIR, GOCOVERDIR, COVERDIR) also steer testscript-based suites: cmd/gno's Test_Scripts fails when TXTARCOVERDIR points at a directory that was never created. With coverage off, leave them empty so those suites skip coverage collection, matching a plain local run.
BigintValue and BigdecValue are immutable at runtime: all arithmetic writes into fresh receivers and conversions only read. Copying them allocated a fresh big.Int/apd.Decimal per copy — 24M allocations in the bytes stdlib suite, mostly from untyped-const operands copied at declaration sites. Share the underlying value instead. Neither Copy ever charged the allocator, so this is gas-neutral.
Three more allocation sites found by profiling the bytes stdlib suite, together ~42M objects (of 135M total):
- doOpValueDecl let its working TypedValue escape to the heap once per declaration executed, because its address went into ConvertUntypedTo. At runtime only untyped bools (from comparisons) reach that path, so retype directly; the preprocess-stage conversion moves to a by-value helper. (16.6M)
- doOpConvert's working value escaped the same way via ConvertTo and IsReadonly. Use a machine-owned scratch slot: a field's address is free, and the op is single-threaded and self-contained. (11.7M)
- Evaluating a constTypeExpr re-boxed the type into a TypeValue interface per evaluation (every conversion evaluates one). Cache the boxed form on the node at preprocess time; nodes loaded from the store fall back to boxing per eval (the cache is not persisted and is never lazily filled at runtime, since nodes can be shared across machines). (13.5M)
Also documents on Machine.Release why blockPool/callArgsScratch are deliberately not carried through the machine pool: measured to hurt parallel workloads via extra live heap and lost cache locality, without helping machine-churn workloads (sync.Pool eviction discards them).
Full verification battery: filetest suite canonical (all goldens byte-identical), vm Gas tests, txtar, 220 example packages, cmd/gno.
Same-session A/B: full pkg/gnolang long mode 173.9s -> 157.1s (-10%); bytes suite solo 94.9s -> 83.5s.
Superseded by the stacked split (merge in order): #5811 → #5812 → #5813 → #5814 → #5815 → #5816. This branch served as the performance testbed; final measured result of the full stack:
Summary
ci / gnovm takes ~14 minutes wall time, gating every PR that touches gnovm/**, tm2/** or examples/**. This PR attacks it from two sides: it restructures the gnovm test suites for parallelism (no test content changes) with a new gno test -jobs flag, and, since the main / test job turned out to be bound by total CPU work rather than scheduling, it profiles the actual workload and eliminates the dominant source of redundant work in the VM itself: heap-boxed byte-element access (a benefit on-chain as well, not just in CI).
CI results from this PR's own runs (3 samples):
- stdlibs / test: 8m33s → 5m21s / 5m30s / 5m35s
- examples gno-checks / test: 3m44s → 3m11s / 3m09s
- main / test: unchanged in median; pkg/gnolang 685.7s / 710.6s / 899.6s against a master baseline stable at 707.9–721.5s (4 runs)
The main / test outcome is itself the key research finding: on a 4-vCPU runner that job is bound by total CPU work (~2700 instrumented-interpreter CPU-seconds), not by scheduling; go test's package-level concurrency was already packing the cores. Its minutes can only come from doing less work; the measured levers are in the follow-ups. On many-core dev machines the restructure does cut wall time (pkg/gnolang 283.7s → 245.0s on 16 cores) since there the bottleneck was serialization, not CPU.
Where the time goes (research)
Job breakdown of a representative master run (27282103940):
main / test (go test -covermode=set -coverpkg=gnovm/... ./...): pkg/gnolang alone is 707.9s; second place is cmd/gno at 88s. Within pkg/gnolang virtually all time is interpreted Gno execution in two test functions:
- TestStdlibs: the heavy suites (bytes 172s, strconv 105s, regexp 61s, bufio 54s, sort 49s; local long-mode numbers) were already parallel with their own stores, but every other stdlib ran sequentially on a shared store in the parent test body, and Go only starts parallel subtests after the parent body returns. The sequential walk (including math/overflow at 63s) both ran single-threaded and delayed the start of the parallel heavy hitters.
- TestFiles: 2344 filetests, of which 2339 ran sequentially on one goroutine; only the 5 *_long.gno files were parallel.
stdlibs / test: gno test ./... over gnovm/stdlibs is fully sequential (single-threaded VM, one package at a time): bytes 174s, strconv 86s, math/overflow 60s, regexp 35s, bufio 31s, … ≈455s of test time on one core of a 4-vCPU runner. Per gnovm/Makefile, this is not redundant with TestStdlibs (it also runs xxx_test integration packages), so it needs to get faster, not deleted.
Changes
- TestFiles: non-long filetests now run as parallel subtests, drawing a TestOptions (store) from a GOMAXPROCS-sized pool, so stores (and the stdlib packages already loaded into them) are reused across tests, just split N ways. -update-golden-tests keeps the previous fully-sequential single-store behavior (deterministic walk-order writes).
- TestStdlibs: every stdlib package runs as a parallel subtest with its own store, removing the sequential shared-store walk and the hardcoded special-case package lists. The small tests/stdlibs walk stays sequential on one store.
- gotypecheck.go: gnoBuiltinsCache was lazily populated without synchronization, a latent data race that becomes load-bearing now that type-checks run concurrently from the start. The cache is now built eagerly at package init and is read-only afterwards (no lock needed).
- gno test -jobs N (default 1 = behavior unchanged): tests up to N packages in parallel, each worker owning a store/typecheck-cache reused across the packages it runs. Per-package output is buffered and printed in package order as results complete (out/err kept on their original streams), so logs stay readable and deterministic. -failfast stops scheduling new packages after the first failure; in-flight packages finish. Incompatible with the interactive -debug mode.
- _ci-gno.yml: pass -jobs 4 (runners have 4 vCPUs) to the gno test step, used by the gnovm stdlibs job and the examples job.
Measurements
All runs in long mode (no -short); pass/fail sets verified identical before/after at every step (2966 pass / 11 fail; the 11 are pre-existing golden typecheck mismatches against go1.26 locally; CI uses the go.mod toolchain and is green). 4-core runs are taskset-pinned to mirror the ubuntu-latest runners.
[Measurements table, cell values not recoverable. Rows: stdlibs / test job; gno-checks / test job; main / test pkg/gnolang package time; go test ./pkg/gnolang/, 16 cores, no coverage; go test -covermode=set -coverpkg=gnovm/... ./pkg/gnolang/, taskset 4 cores; gno test gnovm/stdlibs/..., taskset 4 cores (-jobs 4); gno test examples/..., taskset 4 cores (-jobs 4, all 220 testable packages pass)]
Why main / test doesn't move on CI even though the same binary improves elsewhere: a 4-vCPU runner has to execute ~2700 CPU-seconds of coverage-instrumented interpreted Gno regardless of how it's scheduled (floor ≈ 11 min), and go test -p was already overlapping pkg/gnolang's underused phases with cmd/gno / pkg/parser. The restructure removes the serialization (which is what binds on many-core machines and is why stdlibs / test, single-threaded before, drops by 3m12s), but on 4 cores main / test sits at the CPU floor. Cutting it further requires reducing the work itself; see follow-ups.
A second measured caveat: GnoVM tests are allocation-heavy, so N concurrent VMs in one process slow each other down well beyond simple core-sharing (GC assist / memory pressure). With -jobs 4 on 4 cores, per-package wall time inflates 2–3x (bytes 131s solo → 270s; examples total CPU 129s → 400s). That's why stdlibs nets a large win (its sequential baseline wasted 3 of 4 cores) while the locally-fast examples suite nets none, and why -jobs deliberately defaults to 1. It is also the plausible mechanism behind main / test's one slow sample (899.6s, vs 685.7s/710.6s and a master baseline stable within ~2%): with 4 VM-heavy subtests running concurrently the job is more sensitive to the runner's memory subsystem. Review pushes will accumulate more samples; if the tail shows up again, capping go test -parallel for the gnovm job (keeping the structural fix while limiting concurrent VMs) is a one-line mitigation.
VM profiling: where the CPU actually goes
Since spreading the work can't cut main / test on a 4-vCPU runner, the next step was pprof on the dominant workloads. CPU profile of TestStdlibs/bytes: ~50% of all CPU is Go GC + malloc machinery; the opcode loop itself is ~30%. Allocation profile: 67% of all heap allocations (694M objects in that one suite) came from ArrayValue.GetPointerAtIndexInt2 materializing a *TypedValue + boxed DataByteValue per byte accessed:
- the copy() builtin allocated two boxes per byte copied (554M objects; the code even had a "TODO: consider an optimization if dstv.Data != nil");
- for i, c := range byteslice allocated three objects per iteration (the index TypedValue escaping through GetPointerAtIndex(&iv), plus the view box) that Deref immediately threw away;
- b[i] reads in doOpIndex1 did the same box-then-Deref dance.
The fix (last commit) adds TypedValue.GetValueAtIntIndex, a read-only fast path mirroring GetPointerAtIndex's checks and panics for strings and Data-backed arrays/slices, used by doOpIndex1 and the range loop, and gives copy() direct byte copies when both sides are Data-backed (or the source is a string). This allocation pressure is also what made concurrent VMs degrade each other (the GC-contention caveat above), so it compounds with the parallelism work.
Gas is unchanged: the view boxes were raw Go allocations, never charged to the VM allocator, and CPU gas for copy() was already charged before the per-element loop. Verified empirically: all 2344 filetest goldens (including Gas: and MAXALLOC-sensitive alloc tests) pass byte-identical, plus the gno.land/pkg/sdk/vm gas tests and the txtar integration suite.
[Table, cell values not recoverable. Rows: TestStdlibs/bytes solo; pkg/gnolang long mode, 16 cores; BenchmarkOpIndex1_ByteArray; main / test pkg/gnolang]
(untouched paths, i.e. List arrays, maps and range-over-int-slices, benchmark flat: OpRangeIter_1000 30.9µs → 30.2µs, OpIndex1_MapHit_100 187.0ns → 187.7ns)
Block recycling: removing the next 45%
After the byte-access fixes, NewBlock was 45% of remaining allocations (135M objects in the bytes suite: 75M scope blocks from doOpExec (if/for/range/switch) and 60M call blocks from doOpCall). The key enabler is an invariant the heap-items design already established: closures capture HeapItemValues, not blocks (doOpFuncLit sets Parent: nil), pointers to locals are heap-promoted by the preprocessor, and stacktraces/frames store locations and indices, so a runtime block provably dies when discarded from the machine's block stack.
The last commit adds a small per-machine pool (acquireBlock/releaseBlock) on top of that invariant, routed through every block discard site. Three block populations that also travel the stack are excluded at release: node-owned static blocks (pushed by Eval/RunStatement flows, identified by static-block identity), file/package blocks (referenced by FuncValue.Parent), and defer-site blocks (Defer.Parent is visited by the VM's GC until the defer runs; doOpDefer marks them via a flag tucked into bodyStmt's trailing padding, keeping unsafe.Sizeof(Block{}), and with it the _allocBlock gas constant, unchanged). Panic unwinding skips pooling entirely as cheap conservatism.
Gas and VM-GC accounting are unchanged: acquireBlock charges AllocateBlock exactly like Allocator.NewBlock, and pooled (zeroed) blocks are unreachable from GC roots, just like dead blocks today. Same verification battery as before: all 2344 filetest goldens byte-identical (incl. Gas:/Realm:/Storage:/MAXALLOC tests), vm Gas tests, txtar, all 220 example packages, cmd/gno suite.
[Table, cell values not recoverable. Rows: TestStdlibs/bytes solo; pkg/gnolang long mode, 16 cores; stdlibs / test job; main / test pkg/gnolang]
The main / test CI numbers stay tail-prone (the 936s sample sits next to a same-run stdlibs / test at its fastest ever and a normal cmd/gno, pointing at a slow/noisy runner amplified by 4-way VM contention, the same pattern as the earlier 899.6s sample); the controlled 4-core simulations and the coverage-free stdlibs / test job show the real effect. With ~425s of instrumented work remaining in the simulation, the print-only coverage tax (≈1.24x) is now the largest single lever left on main / test.
Round 3: coverage off, call-arg pooling, linear interface-array copies
Three more changes, attacking the remaining main / test time:
- Coverage off for gnovm (coverage input on _ci-go.yml, default true so gno.land/tm2/misc/contribs are untouched). Since #5795 (chore(ci): remove Codecov from CI workflows) the percentages were print-only; on gnovm they cost a measured ~1.24x on the tests plus an instrumented rebuild of everything.
- popCopyArgs reuses a per-machine scratch buffer on the call path (30M arg-slice allocations in the bytes suite, the top site after block pooling). doOpDefer's args escape into the Defer and keep allocating.
- Arrays with interface-kind element types now share element values on copy (ArrayValue.Copy), making x = [1]any{x} chains linear like Go (which copies the sealed boxes' headers). Soundness: interface-slot contents are sealed: every write into an interface slot copies on entry, interface-held values are not addressable, and type-assertion extraction copies again on assignment. This one is intentionally gas-visible for the affected patterns: exactly one golden changes across the 2344 filetests: gas/nested_alloc.gno (built to measure this exact pattern) drops from 8,559,690,088 to 17,013,825 gas; recurse1.gno runs in 0.01s instead of ~23s with its output golden unchanged, and every realm/alloc/interrealm golden stays byte-identical.
Verification battery repeated in full: filetest suite (canonical pass/fail set), vm Gas tests, txtar, all 220 example packages, cmd/gno.
[Table, cell values not recoverable. Row: pkg/gnolang long mode, 16 cores]
Round-3 CI results
Two post-push runs of ci / gnovm:
[Table, cell values not recoverable. Rows: main / test pkg/gnolang; stdlibs / test job; ci / gnovm workflow]
pkg/gnolang now sits at 264–390s on CI (708–722s on master, median ≈ −55%), with the familiar runner-quality spread; the local 4-core simulation (273.4s) matches the good-runner samples. cmd/gno also dropped 88–97s → 72s without instrumentation. Whole-workflow wall time lands at ~6–8.5 minutes depending on runner luck, vs a stable ~14 minutes on master.
Round 4: bigint sharing, per-op escape fixes, type re-boxing
Fresh profiles after round 3 pointed at the next allocation tier; four more gas-neutral changes (all goldens byte-identical, full battery repeated):
- BigintValue/BigdecValue copies now share the underlying value instead of allocating a fresh big.Int/apd.Decimal per copy (24M).
- doOpValueDecl's working TypedValue escaped via ConvertUntypedTo(&tv); at runtime only untyped bools reach that path, now retyped directly (16.6M).
- doOpConvert's working value escaped via ConvertTo/IsReadonly; it now uses a machine-owned scratch slot whose address is free (11.7M).
- Evaluating a constTypeExpr re-boxed the type into a TypeValue per evaluation; the boxed form is now cached on the node at preprocess time (13.5M).
One negative result worth recording: carrying the block pool through Machine.Release/machinePool measured 10–25% slower on parallel workloads (extra live heap across GC cycles, lost cache locality) with no benefit for machine-churn workloads; Machine.Release now documents why the pools are deliberately dropped.
Same-session A/B (laptop thermals make cross-session walls incomparable; this is back-to-back): full pkg/gnolang long mode 173.9s → 157.1s (−10%), bytes suite solo 94.9s → 83.5s.
Round-4 CI result
[Table, cell values not recoverable. Rows: pkg/gnolang package time; cmd/gno package time; main / test job; stdlibs / test job; ci / gnovm workflow]
(single good-runner sample; the runner-quality band observed across this PR suggests ~6–8m for the workflow)
Follow-ups (measured, but out of scope here)
- Remaining allocation sites: GetPointerToFromTV (26M objects), doOpValueDecl (16.5M) and math/big.NewInt (16M) in the bytes suite.
- StructValue.Copy could get the same interface-field sharing as arrays (per-field static types; it was not the profiled quadratic case, so left out).
- transcribe (preprocessing) is 29% of the TestFiles short-walk CPU: per-filetest typecheck+preprocess of imports; a Store-level preprocessed-stdlib cache would help every consumer (filetests, gno test, keeper cold paths).
- Byte-element writes (b[i] = x) still box (17M objects via PopAsPointer2); the pointer protocol spans multiple ops, so a write-side fast path needs more design.