feat(fts): thread DataFusion MemoryPool through inverted index build #7314
Draft
wjones127 wants to merge 1 commit into
Previously the FTS builder used a bespoke per-worker memory watermark (`LANCE_FTS_PARTITION_SIZE` env var, default 2 GiB) to decide when to flush a posting-list partition. Each worker tracked usage independently and flushed when it crossed its private limit.

This replaces that mechanism with a DataFusion `FairSpillPool` shared across all workers for a given build. Each worker holds a `MemoryReservation` and calls `try_grow` after each document; when the pool is exhausted, `try_grow` returns `Err`, which triggers a flush, keeping total build memory within the user-configured budget.

Fixes lance-format#7304.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
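To illustrate the flush-on-`Err` pattern described above, here is a minimal std-only sketch. `SharedPool`, `Reservation`, and `index_docs` are hypothetical stand-ins written for this example; they are not DataFusion's `FairSpillPool` or `MemoryReservation`, and they do not model the fair-share logic of the real pool. The sketch also assumes that the document which triggers the failure is written out with the flushed partition, which the PR text does not confirm.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a shared pool: one byte budget for all workers.
struct SharedPool {
    limit: usize,
    used: AtomicUsize,
}

impl SharedPool {
    fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self { limit, used: AtomicUsize::new(0) })
    }
}

/// Hypothetical per-worker reservation drawn from the shared pool.
struct Reservation {
    pool: Arc<SharedPool>,
    size: usize,
}

impl Reservation {
    fn new(pool: &Arc<SharedPool>) -> Self {
        Self { pool: Arc::clone(pool), size: 0 }
    }

    /// Reserve `bytes` more, or fail without side effects if the
    /// pool-wide total would exceed the limit.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        let limit = self.pool.limit;
        self.pool
            .used
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
                used.checked_add(bytes).filter(|total| *total <= limit)
            })
            .map_err(|used| format!("pool exhausted: {used} of {limit} bytes in use"))?;
        self.size += bytes;
        Ok(())
    }

    /// Return everything this worker holds to the pool.
    fn free(&mut self) -> usize {
        let freed = self.size;
        self.pool.used.fetch_sub(freed, Ordering::SeqCst);
        self.size = 0;
        freed
    }
}

/// Account for each document after it is added; returns how many
/// partitions were flushed because the pool ran out.
fn index_docs(reservation: &mut Reservation, doc_sizes: &[usize]) -> Result<usize, String> {
    let mut flushes = 0;
    for &bytes in doc_sizes {
        if let Err(e) = reservation.try_grow(bytes) {
            if reservation.size == 0 {
                // Nothing is reserved, so flushing cannot free any memory.
                return Err(e);
            }
            // Assumption: the current partition (including this document)
            // is written out here, then its memory is handed back.
            reservation.free();
            flushes += 1;
        }
    }
    Ok(flushes)
}
```

Because every `Reservation` draws on the same counter, one worker's growth reduces the headroom of all the others, which is what lets a single budget bound the whole build.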
Changes
- Replace `worker_memory_limit_bytes` in `IndexWorkerConfig`/`IndexWorker` with a `MemoryReservation` drawn from a shared `FairSpillPool`.
- Create one `FairSpillPool` per build, sized to the total memory budget (`memory_limit_mb` × 1 MiB, or `LANCE_FTS_PARTITION_SIZE` × number of workers as the default), and pass it into every `IndexWorker` via `IndexWorkerConfig`.
- Track memory with `try_grow`/`shrink`. When `try_grow` returns `Err`, the pool is exhausted: flush the current partition and free the reservation.
- Return an error when `try_grow` fails on an otherwise-empty builder.

Test plan
- `test_memory_pool_spills_on_tight_budget`: constructs an `IndexWorker` with a 72 KiB `FairSpillPool`, processes 20 docs each contributing 300 unique tokens, and asserts that at least one completed partition was written (proving that a pool-triggered spill occurred).
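
The pool sizing rule listed under Changes, and the arithmetic behind the tight-budget test, can be sketched as follows. The function and parameter names are hypothetical, and `default_per_worker_bytes` assumes the `LANCE_FTS_PARTITION_SIZE` value has already been converted to bytes; the actual accounting in the builder may differ.

```rust
const MIB: u64 = 1024 * 1024;

/// Total pool size in bytes: the explicit limit if one is configured,
/// otherwise the per-worker default multiplied by the worker count.
/// (Hypothetical helper; names are not taken from the PR.)
fn total_budget_bytes(
    memory_limit_mb: Option<u64>,
    default_per_worker_bytes: u64,
    num_workers: u64,
) -> u64 {
    match memory_limit_mb {
        Some(mb) => mb.saturating_mul(MIB),
        None => default_per_worker_bytes.saturating_mul(num_workers),
    }
}

/// Largest whole number of bytes per token posting that fits in `budget`
/// when `docs * tokens_per_doc` postings are held at once.
/// Returns `None` if there are no postings or the product overflows.
fn max_bytes_per_posting(budget: u64, docs: u64, tokens_per_doc: u64) -> Option<u64> {
    docs.checked_mul(tokens_per_doc)
        .and_then(|postings| budget.checked_div(postings))
}
```

With the default of 2 GiB per worker and 8 workers, `total_budget_bytes(None, 2048 * MIB, 8)` yields 16 GiB. For the test above, `max_bytes_per_posting(72 * 1024, 20, 300)` yields `Some(12)`: 73,728 bytes spread over 6,000 postings, so the budget cannot hold the whole input if the builder charges more than about 12 bytes per posting on average.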
Closes #7304.