Summary
Lance internally supports precomputed RowAddrMask as a first-class prefilter input via DatasetPreFilter and the existing SelectionVectorToPrefilter FilterLoader. The Arrow-binary serialization format is already defined for inter-process mask passing. However, there is no Python API to provide such a mask directly to a scanner — all prefilter inputs must go through expression evaluation (SQL string or pyarrow / Substrait Expression).
For workloads where row membership is precomputed externally — tag/labeling systems, business-logic-defined cohorts, materialized bitmaps maintained by separate services — this forces a round trip through expression construction and evaluation that becomes a bottleneck at scale.
This issue requests adding a parameter to LanceDataset.scanner (and to_table, ScannerBuilder) that accepts a serialized RowAddrMask and feeds it directly into the existing prefilter machinery, bypassing expression evaluation.
Use case
We operate a Lance dataset at ~3B rows with an IVF vector index. A separate tag service is the source of truth for "datasets" — named subsets of rows identified by UUID — which mutate frequently and are managed independently of the main table. Encoding dataset membership as a column on the main table is not viable because:
- Frequent membership changes would trigger fragment rewrites and invalidate scalar indexes.
- The tag service is owned by a different system; coupling its updates to main-table writes is undesirable.
- Membership is often computed dynamically (rule-based, user-curated) and not always representable as a stable scalar value.
A typical query is "vector search top-K, restricted to rows in the union of datasets {A, B, C}, optionally with additional scalar predicates like city = 'NYC' AND ips_value < 100." The dataset union ranges from a few hundred thousand rows to ~80M rows out of 3B.
The tag service can efficiently produce a serialized RowAddrMask (or its constituent RowAddrTreeMap bytes) describing allowed row IDs. But pylance has no way to consume one.
Current workarounds and why they fall short
1. SQL _rowid IN (...) — Falls over at ~100k–1M values due to SQL parser cost. (Issue #4115 proposes treating _rowid and _rowaddr as having an implicit scalar index, which would help but doesn't address the broader case of providing arbitrary bitmaps without going through expression evaluation.)
2. Substrait singular_or_list — Pushes the ceiling to a few million values, but at 50M+ values:
- Substrait deserialization: multiple seconds, several GB of memory.
- Building DataFusion's internal hash set from 50M literals: more seconds and memory.
- Scalar index probe of 50M lookups: 10–30s even with an index on UUID.
- Total: 15–40s and 10+ GB of memory before vector search begins.
3. dataset.take(row_ids) + brute-force scoring — Bypasses the IVF index entirely. For 50M candidates, materialization plus scoring takes minutes.
4. Postfilter with overshoot — Workable for moderately selective masks; recall becomes probabilistic and the overshoot factor is hard to tune across varying mask sizes.
5. DuckDB hash semi-join — Excellent for filtering rows but cannot use the IVF vector index; still degenerates to brute force for the vector search step.
6. Custom Rust + PyO3 binding — What we're currently planning. It calls Dataset::open, deserializes a mask into RowAddrMask, and feeds it through DatasetPreFilter to VectorIndex::search. Works, but:
- Depends on Lance internal Rust APIs that aren't covered by any stability contract.
- Maintenance cost across Lance version upgrades (e.g., the recent RowIdMask → RowAddrMask rename).
- Every team facing this problem ends up writing roughly the same 60–250 lines.
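To put rough numbers on workaround 4, here is a back-of-the-envelope sketch of how many candidates a postfilter would need to fetch so that about k survive the mask. It assumes mask membership is independent of vector similarity, which is rarely true in practice, so treat the result as an optimistic estimate. The helper name `overshoot_for_postfilter` is hypothetical and not part of Lance.

```python
def overshoot_for_postfilter(k: int, mask_rows: int, total_rows: int) -> int:
    """Estimate how many nearest neighbours to fetch so ~k pass the mask.

    Assumption (not from Lance): rows in the mask are spread uniformly
    among the nearest neighbours, so a fraction mask_rows / total_rows
    of any candidate list survives the postfilter.
    """
    if not 0 < mask_rows <= total_rows:
        raise ValueError("mask_rows must be in (0, total_rows]")
    # Integer ceiling of k * total_rows / mask_rows, avoiding float rounding.
    return (k * total_rows + mask_rows - 1) // mask_rows


TOTAL_ROWS = 3_000_000_000
small_union = overshoot_for_postfilter(10, 300_000, TOTAL_ROWS)
large_union = overshoot_for_postfilter(10, 80_000_000, TOTAL_ROWS)
```

Under that assumption, k = 10 needs about 375 candidates for the 80M-row union but about 100,000 for a 300k-row union, so no single overshoot factor fits both ends of our range.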
Why this should be easy to expose
The internals already do almost all the work:
- RowAddrMask (in rust/lance-core/src/utils/mask.rs) is the canonical filter representation, supporting both AllowList and BlockList semantics.
- RowAddrTreeMap is roaring-bitmap-backed and already serializes to a defined binary format (see mask.rs lines 112–179): a 2-element BinaryArray containing optional BlockList bytes and optional AllowList bytes.
- SelectionVectorToPrefilter (in rust/lance/src/io/exec/utils.rs) already implements FilterLoader for deserializing this exact format from a record batch stream. It's used for inter-process mask passing between execution nodes.
- DatasetPreFilter (in rust/lance/src/index/prefilter.rs) already accepts any FilterLoader and composes its result with deletion masks via the existing AND/OR truth table on RowAddrMask.
The ask is to expose this existing serialization format as a Python parameter and route it through the existing SelectionVectorToPrefilter path. No new internal machinery is required.
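For readers unfamiliar with allow-list/block-list composition, the AND case can be modelled with plain Python sets. This is a toy model only: `ToyMask` is a hypothetical name, it uses `frozenset` instead of roaring bitmaps, and it is not Lance's implementation. It just shows why an externally supplied allow list and a deletion block list combine cleanly.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ToyMask:
    """Toy stand-in for an allow-list / block-list row mask."""

    rows: FrozenSet[int]
    is_allow: bool  # True: only `rows` pass. False: everything except `rows` passes.

    def selected(self, row: int) -> bool:
        return (row in self.rows) == self.is_allow

    def __and__(self, other: "ToyMask") -> "ToyMask":
        if self.is_allow and other.is_allow:
            # allow AND allow: a row must appear in both lists
            return ToyMask(self.rows & other.rows, True)
        if self.is_allow:
            # allow AND block: allowed rows minus the blocked ones
            return ToyMask(self.rows - other.rows, True)
        if other.is_allow:
            return ToyMask(other.rows - self.rows, True)
        # block AND block: a row blocked by either mask stays blocked
        return ToyMask(self.rows | other.rows, False)


cohort = ToyMask(frozenset({1, 2, 3, 4}), is_allow=True)   # from the tag service
deleted = ToyMask(frozenset({2, 9}), is_allow=False)       # deletion vector
combined = cohort & deleted
```

Here `combined` is an allow list of rows 1, 3 and 4: the cohort minus the deleted row.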
Proposed API
A new keyword argument on LanceDataset.scanner (and corresponding methods like to_table, plus ScannerBuilder for the builder pattern):
ds.to_table(
    nearest={"column": "vector", "q": query_vec, "k": 10},
    row_addr_mask=mask_bytes,  # serialized RowAddrMask (existing binary format)
    filter="city = 'NYC'",     # optional, combined with row_addr_mask
    prefilter=True,
)
Semantics:
- row_addr_mask accepts the existing serialized format already used internally (BinaryArray with optional BlockList bytes at index 0 and optional AllowList bytes at index 1).
- When provided, it is loaded via SelectionVectorToPrefilter and combined with other filter sources (deletion vectors, expression-based filters) by the existing DatasetPreFilter logic.
- When prefilter=True, it gates IVF traversal. When prefilter=False, it acts as a postfilter.
- Supports both allow-list and block-list semantics, matching the existing internal type.
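The difference between the two prefilter settings can be illustrated with a toy exact search over 1-D values. This is a sketch of the intended semantics only, not of Lance's IVF code path; `top_k` and its parameters are hypothetical names.

```python
def top_k(query, vectors, k, allowed=None, prefilter=True):
    """Toy exact nearest-neighbour search over a {row: value} mapping."""
    by_distance = sorted(vectors, key=lambda row: abs(vectors[row] - query))
    if allowed is None:
        return by_distance[:k]
    if prefilter:
        # Prefilter: the mask restricts candidates before the top-k cut.
        return [row for row in by_distance if row in allowed][:k]
    # Postfilter: take the top-k first, then drop rows outside the mask.
    return [row for row in by_distance[:k] if row in allowed]


vectors = {row: float(row) for row in range(10)}
allowed = {0, 7, 8, 9}

pre = top_k(0.0, vectors, k=3, allowed=allowed, prefilter=True)
post = top_k(0.0, vectors, k=3, allowed=allowed, prefilter=False)
```

With the mask applied as a prefilter the query still returns three rows (0, 7, 8). As a postfilter only row 0 survives, which is the recall problem described under workaround 4.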