Replace per-block Mmap with pread, ~300x apply speedup on Darwin #790

Open
jverkoey wants to merge 1 commit into drolbr:master from ClutchEngineering:pr-mmap-pread

@jverkoey

Profiling update_from_dir with macOS sample(1) for 30 seconds on an Apple Silicon (M-series) machine showed 350 of 361 on-CPU samples (97%) inside the __mmap syscall. The hot path is:

File_Blocks::read_block_ -> Mmap::Mmap -> mmap

called once per compressed block read. On macOS each mmap syscall costs ~0.25 ms of kernel overhead (virtual range alloc, page-table setup, fault-in, teardown on munmap). Across thousands of block reads per minute-diff, the syscall tax dominates wall time and prevents apply from keeping pace with the 1-diff-per-minute fetch rate. Linux mmap is cheaper, so the cost is invisible there.

Analysis

For Overpass's access pattern there's no benefit to a memory mapping: each compressed block is read once, decompressed into a separate buffer by Zlib/LZ4 Inflate, and never revisited. The Mmap class exists only to own the read buffer for decompression. Replacing the mmap/munmap pair with a pread into a heap buffer keeps Mmap::ptr() pointer-compatible with every caller and eliminates the syscall tax. Linux performance should be unaffected, since pread hits the same page cache that mmap would have.
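The idea can be sketched as follows. This is a minimal illustration, not the code in this PR: Heap_Block is a hypothetical stand-in for Mmap, std::runtime_error stands in for File_Error, and the void* return type of ptr() is an assumption about what callers expect.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical stand-in for Mmap: owns a heap buffer filled by pread
// instead of a mapping created by mmap.
class Heap_Block
{
public:
  Heap_Block(int fd, off_t offset, size_t length, const std::string& file_name)
      : buf_(length)
  {
    assert(fd >= 0);
    size_t done = 0;
    // pread may return fewer bytes than requested, so loop until the block is complete.
    while (done < length)
    {
      ssize_t n = pread(fd, buf_.data() + done, length - done,
                        offset + static_cast<off_t>(done));
      if (n < 0 && errno == EINTR)
        continue;  // interrupted by a signal, retry
      if (n < 0)
        throw std::runtime_error(file_name + ": pread failed: " + std::strerror(errno));
      if (n == 0)
        throw std::runtime_error(file_name + ": unexpected end of file");
      done += static_cast<size_t>(n);
    }
  }

  // Same role as Mmap::ptr(): hand the raw block to the decompressor.
  void* ptr() { return buf_.data(); }
  size_t size() const { return buf_.size(); }

private:
  std::vector<unsigned char> buf_;  // freed automatically, no munmap needed
};

// Usage: write a small file, then read a slice of it back through Heap_Block.
bool roundtrip_demo(const char* path)
{
  int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
  if (fd < 0)
    return false;
  const char payload[] = "header|block-data|trailer";
  const ssize_t len = static_cast<ssize_t>(sizeof(payload) - 1);
  bool ok = write(fd, payload, sizeof(payload) - 1) == len;
  if (ok)
  {
    Heap_Block block(fd, 7, 10, path);  // "block-data" starts at byte offset 7
    ok = block.size() == 10 && std::memcmp(block.ptr(), "block-data", 10) == 0;
  }
  close(fd);
  unlink(path);
  return ok;
}
```

Because the buffer lives in a std::vector, teardown is an ordinary heap free rather than a munmap, which is where the per-block syscall saving comes from in this sketch.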

Measured effect

Single-diff apply (diff 7076518, 6550 ops) against a live 291 GB database on Apple Silicon:

          wall time    CPU%   notes
before    9 min 01 s   100%   97% of samples in __mmap
after     1.79 s       67%    now compute-bound

~300x speedup (541 s / 1.79 s ≈ 302). The NO_COMPRESSION branch of File_Blocks::read_block_ already uses pread via data_file.read(); this brings the compressed path to parity.

Notes

  • Public interface (Mmap::ptr()) unchanged: no caller needs adjusting.
  • Exception behavior preserved: throws File_Error on I/O failure with the same arguments.
  • No Linux regression expected (pread and mmap both use the unified page cache), but I would welcome Linux benchmarks from a reviewer with access.
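For reviewers who want a quick comparison on their own platform, a standalone microbenchmark along these lines may be a useful starting point. It is a rough sketch and not part of osm-3s: the 64 KiB block size, the file layout, and all function names are assumptions made for illustration, and it measures only the raw read path, not decompression.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical block size; a multiple of common page sizes (4 KiB, 16 KiB)
// because mmap offsets must be page-aligned.
static const size_t kBlockSize = 64 * 1024;
static const size_t kStride = 4096;  // touch one byte per 4 KiB in both variants

// Creates (or truncates) a file of `blocks` blocks, block i filled with byte (i & 0xff).
int make_test_file(const char* path, size_t blocks)
{
  int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
  if (fd < 0)
    return -1;
  std::vector<unsigned char> block;
  for (size_t i = 0; i < blocks; ++i)
  {
    block.assign(kBlockSize, static_cast<unsigned char>(i & 0xff));
    if (write(fd, block.data(), kBlockSize) != static_cast<ssize_t>(kBlockSize))
    {
      close(fd);
      return -1;
    }
  }
  return fd;
}

// One mmap/munmap pair per block. Returns elapsed seconds, negative on error.
double time_mmap_reads(int fd, size_t blocks, uint64_t& checksum)
{
  assert(kBlockSize % static_cast<size_t>(sysconf(_SC_PAGESIZE)) == 0);
  const auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < blocks; ++i)
  {
    void* p = mmap(nullptr, kBlockSize, PROT_READ, MAP_PRIVATE, fd,
                   static_cast<off_t>(i * kBlockSize));
    if (p == MAP_FAILED)
      return -1.0;
    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    for (size_t off = 0; off < kBlockSize; off += kStride)
      checksum += bytes[off];  // force the pages to be faulted in
    munmap(p, kBlockSize);
  }
  const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

// One pread per block into a reused heap buffer. Returns elapsed seconds, negative on error.
double time_pread_reads(int fd, size_t blocks, uint64_t& checksum)
{
  std::vector<unsigned char> buf(kBlockSize);
  const auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < blocks; ++i)
  {
    ssize_t n = pread(fd, buf.data(), kBlockSize, static_cast<off_t>(i * kBlockSize));
    if (n != static_cast<ssize_t>(kBlockSize))
      return -1.0;
    for (size_t off = 0; off < kBlockSize; off += kStride)
      checksum += buf[off];
  }
  const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

Calling make_test_file with a few thousand blocks and then comparing the two timings gives a per-block cost for each approach. The matching checksums confirm both variants read the same bytes. A freshly written file will be in the page cache, so the numbers reflect syscall overhead rather than disk I/O.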

Companion PRs #788 (off64_t alias) and #789 (sun_len fix) address other Darwin-specific issues hit while bringing osm-3s up on Apple Silicon natively.
