Expose precomputed RowAddrMask as scanner prefilter input from Python API #6852

@JulianYG

Summary

Lance internally supports precomputed RowAddrMask as a first-class prefilter input via DatasetPreFilter and the existing SelectionVectorToPrefilter FilterLoader. The Arrow-binary serialization format is already defined for inter-process mask passing. However, there is no Python API to provide such a mask directly to a scanner — all prefilter inputs must go through expression evaluation (SQL string or pyarrow / Substrait Expression).

For workloads where row membership is precomputed externally — tag/labeling systems, business-logic-defined cohorts, materialized bitmaps maintained by separate services — this forces a round trip through expression construction and evaluation that becomes a bottleneck at scale.

This issue requests adding a parameter to LanceDataset.scanner (and to_table, ScannerBuilder) that accepts a serialized RowAddrMask and feeds it directly into the existing prefilter machinery, bypassing expression evaluation.

Use case

We operate a Lance dataset at ~3B rows with an IVF vector index. A separate tag service is the source of truth for "datasets" — named subsets of rows identified by UUID — which mutate frequently and are managed independently of the main table. Encoding dataset membership as a column on the main table is not viable because:

  1. Frequent membership changes would trigger fragment rewrites and invalidate scalar indexes.
  2. The tag service is owned by a different system; coupling its updates to main-table writes is undesirable.
  3. Membership is often computed dynamically (rule-based, user-curated) and not always representable as a stable scalar value.

A typical query is "vector search top-K, restricted to rows in the union of datasets {A, B, C}, optionally with additional scalar predicates like city = 'NYC' AND ips_value < 100." The dataset union ranges from a few hundred thousand rows to ~80M rows out of 3B.

The tag service can efficiently produce a serialized RowAddrMask (or its constituent RowAddrTreeMap bytes) describing allowed row IDs. But pylance has no way to consume one.

Current workarounds and why they fall short

1. SQL _rowid IN (...) — Falls over at ~100k–1M values due to SQL parser cost. (Issue #4115 proposes treating _rowid and _rowaddr as having an implicit scalar index, which would help but doesn't address the broader case of providing arbitrary bitmaps without going through expression evaluation.)

2. Substrait singular_or_list — Pushes the ceiling to a few million values, but at 50M+ values:

  • Substrait deserialization: multiple seconds, several GB of memory.
  • Building DataFusion's internal hash set from 50M literals: more seconds and memory.
  • Scalar index probe of 50M lookups: 10–30s even with an index on UUID.
  • Total: 15–40s and 10+ GB of memory before vector search begins.

3. dataset.take(row_ids) + brute-force scoring — Bypasses the IVF index entirely. For 50M candidates, materialization plus scoring takes minutes.

4. Postfilter with overshoot — Workable for moderately selective masks; recall becomes probabilistic and the overshoot factor is hard to tune across varying mask sizes.

5. DuckDB hash semi-join — Excellent for filtering rows but cannot use the IVF vector index; still degenerates to brute force for the vector search step.

6. Custom Rust + PyO3 binding — What we're currently planning. It calls Dataset::open, deserializes a mask into RowAddrMask, and feeds it through DatasetPreFilter to VectorIndex::search. Works, but:

  • Depends on Lance internal Rust APIs that aren't covered by any stability contract.
  • Maintenance cost across Lance version upgrades (e.g., the recent RowIdMask → RowAddrMask rename).
  • Every team facing this problem ends up writing roughly the same 60–250 lines.
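To make the tuning problem in workaround 4 concrete, here is a rough back-of-the-envelope sketch. It assumes mask membership is statistically independent of vector proximity, which real cohorts often violate, so the numbers are only indicative. The function name required_overshoot is hypothetical and not part of any Lance API.

```python
def required_overshoot(k: int, mask_rows: int, total_rows: int) -> int:
    """Number of nearest-neighbor candidates to fetch so that roughly k of
    them survive a postfilter, in expectation.

    Assumption: rows in the mask are spread uniformly among the nearest
    neighbors, so a candidate survives with probability mask_rows / total_rows.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0 < mask_rows <= total_rows:
        raise ValueError("mask_rows must be in (0, total_rows]")
    # Integer ceiling of k * total_rows / mask_rows, avoiding float rounding.
    return -(-k * total_rows // mask_rows)


TOTAL_ROWS = 3_000_000_000  # table size from the use case above

# A small cohort of 300k rows needs a very large overshoot for k=10 ...
small_cohort = required_overshoot(10, 300_000, TOTAL_ROWS)     # 100000
# ... while an 80M-row cohort needs only a few hundred candidates.
large_cohort = required_overshoot(10, 80_000_000, TOTAL_ROWS)  # 375
```

The required overshoot spans more than two orders of magnitude across the mask sizes we see, and even then it only targets k results on average rather than guaranteeing them.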

Why this should be easy to expose

The internals already do almost all the work:

  • RowAddrMask (in rust/lance-core/src/utils/mask.rs) is the canonical filter representation, supporting both AllowList and BlockList semantics.
  • RowAddrTreeMap is roaring-bitmap-backed and already serializes to a defined binary format (see mask.rs lines 112–179) — a 2-element BinaryArray containing optional BlockList bytes and optional AllowList bytes.
  • SelectionVectorToPrefilter (in rust/lance/src/io/exec/utils.rs) already implements FilterLoader for deserializing this exact format from a record batch stream. It's used for inter-process mask passing between execution nodes.
  • DatasetPreFilter (in rust/lance/src/index/prefilter.rs) already accepts any FilterLoader and composes its result with deletion masks via the existing AND/OR truth table on RowAddrMask.

The ask is to expose this existing serialization format as a Python parameter and route it through the existing SelectionVectorToPrefilter path. No new internal machinery is required.
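As an illustration of the two-slot layout described above, the following sketch models the serialized mask as a plain Python list with two optional byte strings. The helper names and slot constants are hypothetical, and the payload bytes are placeholders: the actual contents would be RowAddrTreeMap bytes produced by Lance or a compatible writer.

```python
from typing import List, Optional, Tuple

# Slot positions as described in this issue: index 0 holds the optional
# BlockList bytes, index 1 holds the optional AllowList bytes.
BLOCK_LIST_SLOT = 0
ALLOW_LIST_SLOT = 1


def pack_mask_slots(
    block_list: Optional[bytes] = None,
    allow_list: Optional[bytes] = None,
) -> List[Optional[bytes]]:
    """Arrange the two optional payloads in the 2-element layout.

    With pyarrow installed, the result could be wrapped as a BinaryArray via
    pa.array(slots, type=pa.binary()); None becomes a null entry.
    """
    slots: List[Optional[bytes]] = [None, None]
    slots[BLOCK_LIST_SLOT] = block_list
    slots[ALLOW_LIST_SLOT] = allow_list
    return slots


def unpack_mask_slots(
    slots: List[Optional[bytes]],
) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Return (block_list, allow_list), rejecting any other array length."""
    if len(slots) != 2:
        raise ValueError(f"expected exactly 2 slots, got {len(slots)}")
    return slots[BLOCK_LIST_SLOT], slots[ALLOW_LIST_SLOT]


# Placeholder payload standing in for serialized RowAddrTreeMap bytes.
allow_only = pack_mask_slots(allow_list=b"<allow-list bytes>")
```

An allow-list-only mask, which is what our tag service would produce, leaves slot 0 null and fills slot 1.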

Proposed API

A new keyword argument on LanceDataset.scanner (and corresponding methods like to_table, plus ScannerBuilder for the builder pattern):

ds.to_table(
    nearest={"column": "vector", "q": query_vec, "k": 10},
    row_addr_mask=mask_bytes,       # serialized RowAddrMask (existing binary format)
    filter="city = 'NYC'",          # optional, combined with row_addr_mask
    prefilter=True,
)

Semantics:

  • row_addr_mask accepts the existing serialized format already used internally (BinaryArray with optional BlockList bytes at index 0 and optional AllowList bytes at index 1).
  • When provided, it is loaded via SelectionVectorToPrefilter and combined with other filter sources (deletion vectors, expression-based filters) by the existing DatasetPreFilter logic.
  • When prefilter=True, it gates IVF traversal. When prefilter=False, it acts as a postfilter.
  • Supports both allow-list and block-list semantics, matching the existing internal type.
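The intended composition can be sketched with a toy model that uses Python sets in place of roaring bitmaps. ToyMask and row_addr are hypothetical names for illustration only and do not mirror the real RowAddrMask type. The sketch also assumes the usual Lance row address layout: fragment ID in the upper 32 bits, row offset within the fragment in the lower 32 bits.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


def row_addr(fragment_id: int, offset: int) -> int:
    """Pack a fragment ID and in-fragment offset into one 64-bit address
    (assumed layout: fragment ID in the high 32 bits, offset in the low 32)."""
    return (fragment_id << 32) | offset


@dataclass(frozen=True)
class ToyMask:
    """Toy stand-in for a row mask: an optional allow list plus a block list.

    allow=None means "no allow list", i.e. every address is allowed unless
    it appears in the block list.
    """
    allow: Optional[FrozenSet[int]] = None
    block: FrozenSet[int] = frozenset()

    def selected(self, addr: int) -> bool:
        if addr in self.block:
            return False
        return self.allow is None or addr in self.allow

    def __and__(self, other: "ToyMask") -> "ToyMask":
        # AND: a row must pass both masks. Allow lists intersect (a missing
        # allow list imposes no restriction); block lists union.
        if self.allow is None:
            allow = other.allow
        elif other.allow is None:
            allow = self.allow
        else:
            allow = self.allow & other.allow
        return ToyMask(allow=allow, block=self.block | other.block)


# User-supplied allow list (what row_addr_mask would carry in this proposal).
user_mask = ToyMask(allow=frozenset({row_addr(0, 1), row_addr(0, 2), row_addr(3, 7)}))
# Deletion vector expressed as a block list.
deletions = ToyMask(block=frozenset({row_addr(0, 2)}))

combined = user_mask & deletions
```

In this model, combined selects rows (0, 1) and (3, 7), drops the deleted row (0, 2), and rejects anything outside the user's allow list. With prefilter=True this combined mask would be applied before the index search; with prefilter=False it would be applied to the search results.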
