Replace per-block Mmap with pread, ~300x apply speedup on Darwin #790
Open
jverkoey wants to merge 1 commit into drolbr:master from
Profiling update_from_dir with macOS sample(1) for 30 seconds showed 350 of 361 on-CPU samples (97%) inside the __mmap syscall. The hot path is:

  File_Blocks::read_block_ -> Mmap::Mmap -> mmap

called once per compressed block read. On macOS each mmap syscall costs ~0.25 ms of kernel overhead (virtual range alloc, page-table setup, fault-in, teardown on munmap). Across thousands of block reads per minute-diff the syscall tax dominates wall time and prevents apply from keeping pace with the 1-diff-per-minute fetch rate. Linux mmap is cheaper so this is invisible on Linux.

For Overpass's access pattern there is no benefit to a memory mapping: each compressed block is read once, decompressed into a separate buffer, and never revisited. Replacing the mmap with pread into a heap buffer keeps the Mmap::ptr() interface pointer-compatible with every caller (Zlib and LZ4 Inflate) while eliminating the syscall tax. Linux performance is unaffected -- pread hits the same page cache that mmap would have.

Measured effect on a single-diff apply (7076518, 6550 ops) against a live 291 GB database on Apple Silicon (M-series):

  before: 9 min 01 s (100% CPU, 97% samples in __mmap)
  after:  1.79 s (67% CPU, now compute-bound)

~300x speedup.

The NO_COMPRESSION branch of File_Blocks::read_block_ already uses pread via data_file.read(); this brings the compressed path to parity.
This was referenced Apr 19, 2026
Profiling update_from_dir with macOS sample(1) for 30 seconds on an Apple Silicon M-series machine showed 350 of 361 on-CPU samples (97%) inside the __mmap syscall. The hot path is:

  File_Blocks::read_block_ -> Mmap::Mmap -> mmap

called once per compressed block read. On macOS each mmap syscall costs ~0.25 ms of kernel overhead (virtual range alloc, page-table setup, fault-in, teardown on munmap). Across thousands of block reads per minute-diff the syscall tax dominates wall time and prevents apply from keeping pace with the 1-diff-per-minute fetch rate. Linux mmap is cheaper so this is invisible on Linux.

Analysis

For Overpass's access pattern there's no benefit to a memory mapping: each compressed block is read once, decompressed into a separate buffer by Zlib/LZ4 Inflate, and never revisited. The Mmap class exists only to own the read buffer for decompression. Replacing the mmap/munmap pair with a pread into a heap buffer keeps Mmap::ptr() pointer-compatible with every caller and eliminates the syscall tax. Linux performance is unaffected: pread hits the same page cache that mmap would have.

Measured effect

Single-diff apply (diff 7076518, 6550 ops) against a live 291 GB database on Apple Silicon:

  before: 9 min 01 s (100% CPU, 97% of samples in __mmap)
  after:  1.79 s (67% CPU, now compute-bound)

~300x speedup.

The NO_COMPRESSION branch of File_Blocks::read_block_ already uses pread via data_file.read(); this brings the compressed path to parity.

Notes

- Interface (Mmap::ptr()) unchanged; no caller needs adjusting.
- Throws File_Error on I/O failure with the same arguments.

Companion PRs #788 (off64_t alias) and #789 (sun_len fix) address other Darwin-specific issues hit while bringing osm-3s up on Apple Silicon natively.
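To check the per-block cost on another machine, a micro-benchmark along the following lines might help. It is a rough sketch under stated assumptions and is not part of this PR: the function names sum_blocks_mmap, sum_blocks_pread and elapsed_ms are hypothetical, block_size is assumed to be a multiple of the page size (mmap offsets must be page-aligned), and the file is assumed to hold at least block_size * block_count bytes. Both variants read each block once and sum its bytes, mimicking the read-once access pattern described above.

```cpp
#include <cerrno>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: map each block, sum its bytes, unmap it again.
// One mmap/munmap pair per block, as in the slow path described above.
std::uint64_t sum_blocks_mmap(int fd, std::size_t block_size, std::size_t block_count)
{
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < block_count; ++i)
  {
    void* p = ::mmap(nullptr, block_size, PROT_READ, MAP_PRIVATE, fd,
                     static_cast<off_t>(i * block_size));
    if (p == MAP_FAILED)
      throw std::runtime_error("mmap failed");
    const std::uint8_t* bytes = static_cast<const std::uint8_t*>(p);
    for (std::size_t j = 0; j < block_size; ++j)
      sum += bytes[j];
    ::munmap(p, block_size);
  }
  return sum;
}

// Hypothetical helper: read each block with pread into one reusable heap buffer.
std::uint64_t sum_blocks_pread(int fd, std::size_t block_size, std::size_t block_count)
{
  std::vector<std::uint8_t> buf(block_size);
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < block_count; ++i)
  {
    std::size_t done = 0;
    while (done < block_size)  // tolerate short reads
    {
      ssize_t n = ::pread(fd, buf.data() + done, block_size - done,
                          static_cast<off_t>(i * block_size + done));
      if (n < 0 && errno == EINTR)
        continue;
      if (n <= 0)
        throw std::runtime_error("pread failed or hit end of file");
      done += static_cast<std::size_t>(n);
    }
    for (std::size_t j = 0; j < block_size; ++j)
      sum += buf[j];
  }
  return sum;
}

// Wall-clock time of an arbitrary callable, in milliseconds.
template <typename F>
double elapsed_ms(F&& work)
{
  const auto start = std::chrono::steady_clock::now();
  work();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping each call in elapsed_ms and comparing the two timings for the same file gives a rough per-block figure. Results depend heavily on the OS, the block size and whether the file is already in the page cache, so the numbers are only indicative.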