Optimize trie builder #15977
I got some numbers from building an index of 1M UUIDs. FINAL STATS (1 Million 16-byte UUIDs)
Can you explain the difference between |
romseygeek left a comment
So this is better on every measure? Nice! Thanks for picking this up. LGTM.
* GITHUB#15970: Reduce memory usage of fields with long terms during segment merges. (Alan Woodward)
* GITHUB#15977: Speed up TrieBuilder and reduce its memory footprint by replacing the in-memory object tree with a compact prefix-coded byte buffer. (Guo Feng)
Let's merge this with the previous entry? My one is superseded now!
@gf2121 is this ready to be merged?
This sounds awesome -- I'm not sure I'll have time for a close review so please don't wait for me. Does merging also use this same path (we don't have a merge-optimized Trie merging path or so)? This should be needlemoving in luceneutil -- we can watch nightly benchy after it goes in (hopefully no other massive change lands at the same time -- I swear it's better than chance how often this happens ;) ).
This PR speeds up `TrieBuilder` and reduces its memory footprint by replacing the in-memory object tree with a compact prefix-coded byte buffer during the building phase, and by using a frontier-based approach during the saving phase.

Previously, `TrieBuilder` constructed a large in-memory tree of `Node` objects. This approach was memory-intensive (O(total nodes * ~120 bytes per node)) and caused massive object allocation, which made it incredibly slow when dealing with large terms.

Main Changes

- Replaced the `Node` and `SaveFrame` classes with a sequential `ByteBuffersDataOutput` buffer. Entries are now prefix-encoded (storing prefix length, suffix length, and suffix bytes) and appended sequentially.
- The first key (`minKey`) is stored separately. This allows the `append()` method to re-encode only the first entry and bulk-copy the remaining bytes with zero per-entry overhead.
- The saving phase runs in `saveNodes()` using a `FrontierNode` array bounded by `maxKeyDepth`, rather than requiring the whole tree to exist in memory.
- Dropped `status` management, because nothing is destroyed after `append` or `save`.

Memory & Performance Impact
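As a rough illustration of the buffer layout described under Main Changes, here is a minimal sketch of prefix-coded entries (prefix length, suffix length, suffix bytes) with a separately stored first key, and an append that re-encodes only one entry. It is not the PR's code: `PrefixCodedSketch`, `Block`, `build`, `append`, and `decode` are hypothetical names, a plain `ByteArrayOutputStream` stands in for `ByteBuffersDataOutput`, and lengths are written as single bytes purely to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PrefixCodedSketch {
  /** Hypothetical holder: the first key is kept apart, the rest is prefix-coded. */
  static final class Block {
    final byte[] minKey;  // first key, stored uncompressed
    final byte[] lastKey; // last key, needed to encode whatever is appended next
    final byte[] rest;    // records for keys 2..n: prefixLen, suffixLen, suffix bytes

    Block(byte[] minKey, byte[] lastKey, byte[] rest) {
      this.minKey = minKey;
      this.lastKey = lastKey;
      this.rest = rest;
    }
  }

  static int commonPrefix(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) i++;
    return i;
  }

  /** Writes one record for key, encoded relative to the previous key. */
  static void writeEntry(ByteArrayOutputStream out, byte[] prev, byte[] key) {
    // Assumption for this sketch only: keys fit in one-byte lengths.
    if (key.length > 255) throw new IllegalArgumentException("key too long for sketch");
    int prefix = commonPrefix(prev, key);
    out.write(prefix);
    out.write(key.length - prefix);
    out.write(key, prefix, key.length - prefix);
  }

  /** Builds a block from a non-empty list of keys in sorted order. */
  static Block build(List<byte[]> sortedKeys) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] prev = sortedKeys.get(0);
    for (int i = 1; i < sortedKeys.size(); i++) {
      writeEntry(out, prev, sortedKeys.get(i));
      prev = sortedKeys.get(i);
    }
    return new Block(sortedKeys.get(0), prev, out.toByteArray());
  }

  /** Appends b after a; assumes every key in b sorts after every key in a. */
  static Block append(Block a, Block b) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(a.rest, 0, a.rest.length);
    writeEntry(out, a.lastKey, b.minKey); // the only entry that is re-encoded
    out.write(b.rest, 0, b.rest.length);  // bulk copy, no per-entry work
    return new Block(a.minKey, b.lastKey, out.toByteArray());
  }

  /** Replays the records to recover the keys (as UTF-8 strings, for display). */
  static List<String> decode(Block block) {
    List<String> keys = new ArrayList<>();
    byte[] prev = block.minKey;
    keys.add(new String(prev, StandardCharsets.UTF_8));
    int pos = 0;
    while (pos < block.rest.length) {
      int prefix = block.rest[pos++] & 0xFF;
      int suffix = block.rest[pos++] & 0xFF;
      byte[] key = new byte[prefix + suffix];
      System.arraycopy(prev, 0, key, 0, prefix);
      System.arraycopy(block.rest, pos, key, prefix, suffix);
      pos += suffix;
      keys.add(new String(key, StandardCharsets.UTF_8));
      prev = key;
    }
    return keys;
  }

  static List<byte[]> keys(String... values) {
    List<byte[]> out = new ArrayList<>();
    for (String v : values) out.add(v.getBytes(StandardCharsets.UTF_8));
    return out;
  }

  public static void main(String[] args) {
    Block merged = append(build(keys("apple", "apply", "apt")), build(keys("bat", "bath")));
    System.out.println(decode(merged)); // prints [apple, apply, apt, bat, bath]
  }
}
```

The bulk copy in `append` works because every record after the first in `b` is encoded relative to its predecessor inside `b`, and those predecessors do not change when the two buffers are joined. Only `b`'s first key gets a new predecessor (`a`'s last key), so only that one record has to be written again.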