Expose precomputed `RowAddrMask` as scanner prefilter input from Python API

## Summary

Lance internally supports precomputed `RowAddrMask` as a first-class prefilter input via `DatasetPreFilter` and the existing `SelectionVectorToPrefilter` `FilterLoader`. The Arrow-binary serialization format is already defined for inter-process mask passing. However, there is no Python API to provide such a mask directly to a scanner — all prefilter inputs must go through expression evaluation (SQL string or pyarrow / Substrait `Expression`).

For workloads where row membership is precomputed externally — tag/labeling systems, business-logic-defined cohorts, materialized bitmaps maintained by separate services — this forces a round trip through expression construction and evaluation that becomes a bottleneck at scale.

This issue requests adding a parameter to `LanceDataset.scanner` (and `to_table`, `ScannerBuilder`) that accepts a serialized `RowAddrMask` and feeds it directly into the existing prefilter machinery, bypassing expression evaluation.

## Use case

We operate a Lance dataset at ~3B rows with an IVF vector index. A separate tag service is the source of truth for "datasets" — named subsets of rows identified by UUID — which mutate frequently and are managed independently of the main table. Encoding `dataset` membership as a column on the main table is not viable because:

1. Frequent membership changes would trigger fragment rewrites and invalidate scalar indexes.
2. The tag service is owned by a different system; coupling its updates to main-table writes is undesirable.
3. Membership is often computed dynamically (rule-based, user-curated) and not always representable as a stable scalar value.

A typical query is "vector search top-K, restricted to rows in the union of datasets {A, B, C}, optionally with additional scalar predicates like `city = 'NYC' AND ips_value < 100`." The dataset union ranges from a few hundred thousand rows to ~80M rows out of 3B.

The tag service can efficiently produce a serialized `RowAddrMask` (or its constituent `RowAddrTreeMap` bytes) describing allowed row IDs. But pylance has no way to consume one.
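To make concrete what such a mask would carry, here is a minimal sketch of how a tag service might pack and group row addresses. It assumes the 64-bit row-address layout of fragment ID in the upper 32 bits and row offset in the lower 32 bits, and it assumes that grouping offsets per fragment roughly mirrors how `RowAddrTreeMap` organizes its bitmaps. The helper names are hypothetical and are not part of any Lance API; plain Python sets stand in for roaring bitmaps.

```python
from collections import defaultdict


def row_address(fragment_id: int, row_offset: int) -> int:
    """Pack (fragment, offset) into one 64-bit address (assumed layout)."""
    if not (0 <= fragment_id < 2**32 and 0 <= row_offset < 2**32):
        raise ValueError("fragment_id and row_offset must each fit in 32 bits")
    return (fragment_id << 32) | row_offset


def split_row_address(addr: int) -> tuple[int, int]:
    """Inverse of row_address: return (fragment_id, row_offset)."""
    return addr >> 32, addr & 0xFFFFFFFF


def group_by_fragment(addrs) -> dict[int, list[int]]:
    """Group addresses into {fragment_id: sorted offsets}.

    A real implementation would hold one roaring bitmap per fragment;
    sorted lists are used here only to keep the sketch dependency-free.
    """
    groups = defaultdict(set)
    for addr in addrs:
        fragment_id, offset = split_row_address(addr)
        groups[fragment_id].add(offset)
    return {frag: sorted(offsets) for frag, offsets in groups.items()}


# Three allowed rows spread over two fragments.
allowed = [row_address(1, 3), row_address(0, 9), row_address(1, 1)]
print(group_by_fragment(allowed))
```

Turning these groups into the serialized bytes is deliberately left out, since that step depends on the binary format defined in `mask.rs`.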

## Current workarounds and why they fall short

**1. SQL `_rowid IN (...)`** — Falls over at ~100k–1M values due to SQL parser cost. (Issue [#4115](https://github.com/lancedb/lance/issues/4115) proposes treating `_rowid` and `_rowaddr` as having an implicit scalar index, which would help but doesn't address the broader case of providing arbitrary bitmaps without going through expression evaluation.)

**2. Substrait `singular_or_list`** — Pushes the ceiling to a few million values, but at 50M+ values:
- Substrait deserialization: multiple seconds, several GB of memory.
- Building DataFusion's internal hash set from 50M literals: more seconds and memory.
- Scalar index probe of 50M lookups: 10–30s even with an index on UUID.
- Total: 15–40s and 10+ GB of memory before vector search begins.

**3. `dataset.take(row_ids)` + brute-force scoring** — Bypasses the IVF index entirely. For 50M candidates, materialization plus scoring takes minutes.

**4. Postfilter with overshoot** — Workable for moderately selective masks; recall becomes probabilistic and the overshoot factor is hard to tune across varying mask sizes.
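A back-of-the-envelope sketch of why the overshoot factor is hard to tune: assuming mask membership is independent of vector distance (an idealization that rarely holds exactly), the number of candidates to fetch so that `k` survive on average scales with the inverse of the mask's selectivity. The function name is hypothetical, and 300,000 is only an illustrative stand-in for "a few hundred thousand rows".

```python
def required_fetch(k: int, mask_rows: int, total_rows: int) -> int:
    """Candidates to fetch so that, on average, k of them pass the mask.

    Assumes mask membership is independent of vector distance, so each
    candidate survives with probability mask_rows / total_rows.
    """
    if not 0 < mask_rows <= total_rows:
        raise ValueError("mask_rows must be in (0, total_rows]")
    # Integer ceiling division avoids floating-point rounding.
    return (k * total_rows + mask_rows - 1) // mask_rows


TOTAL_ROWS = 3_000_000_000

# Largest mask from the use case: ~80M rows.
print(required_fetch(10, 80_000_000, TOTAL_ROWS))  # 375

# Illustrative small mask: 300k rows.
print(required_fetch(10, 300_000, TOTAL_ROWS))     # 100000
```

Even under this idealized model the required fetch size spans more than two orders of magnitude across our mask sizes, and hitting the average still leaves a substantial chance of returning fewer than `k` rows.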

**5. DuckDB hash semi-join** — Excellent for filtering rows but cannot use the IVF vector index; still degenerates to brute force for the vector search step.

**6. Custom Rust + PyO3 binding** — What we're currently planning. It calls `Dataset::open`, deserializes a mask into `RowAddrMask`, and feeds it through `DatasetPreFilter` to `VectorIndex::search`. Works, but:
- Depends on Lance internal Rust APIs that aren't covered by any stability contract.
- Maintenance cost across Lance version upgrades (e.g., the recent `RowIdMask` → `RowAddrMask` rename).
- Every team facing this problem ends up writing roughly the same 60–250 lines.

## Why this should be easy to expose

The internals already do almost all the work:

- `RowAddrMask` (in `rust/lance-core/src/utils/mask.rs`) is the canonical filter representation, supporting both AllowList and BlockList semantics.
- `RowAddrTreeMap` is roaring-bitmap-backed and already serializes to a defined binary format (see `mask.rs` lines 112–179) — a 2-element `BinaryArray` containing optional BlockList bytes and optional AllowList bytes.
- `SelectionVectorToPrefilter` (in `rust/lance/src/io/exec/utils.rs`) already implements `FilterLoader` for deserializing this exact format from a record batch stream. It's used for inter-process mask passing between execution nodes.
- `DatasetPreFilter` (in `rust/lance/src/index/prefilter.rs`) already accepts any `FilterLoader` and composes its result with deletion masks via the existing AND/OR truth table on `RowAddrMask`.
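As an illustration of the allow-list/block-list composition mentioned above, the following is a simplified set-based model, not Lance's actual implementation. The `Mask` class and its method names are hypothetical; it only shows how AND and OR can be computed without ever materializing the full universe of rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mask:
    """Toy stand-in for an allow-list or block-list row mask."""
    kind: str        # "allow" or "block"
    rows: frozenset  # row addresses named by the list

    def selects(self, addr: int) -> bool:
        in_rows = addr in self.rows
        return in_rows if self.kind == "allow" else not in_rows

    def __and__(self, other: "Mask") -> "Mask":
        if self.kind == "allow" and other.kind == "allow":
            return Mask("allow", self.rows & other.rows)
        if self.kind == "allow":   # allow AND block
            return Mask("allow", self.rows - other.rows)
        if other.kind == "allow":  # block AND allow
            return Mask("allow", other.rows - self.rows)
        return Mask("block", self.rows | other.rows)

    def __or__(self, other: "Mask") -> "Mask":
        if self.kind == "allow" and other.kind == "allow":
            return Mask("allow", self.rows | other.rows)
        if self.kind == "allow":   # allow OR block
            return Mask("block", other.rows - self.rows)
        if other.kind == "allow":  # block OR allow
            return Mask("block", self.rows - other.rows)
        return Mask("block", self.rows & other.rows)


# A user-supplied allow list combined with a deletion block list.
user_mask = Mask("allow", frozenset({1, 2, 3, 4}))
deletions = Mask("block", frozenset({2, 9}))
print(sorted((user_mask & deletions).rows))  # [1, 3, 4]
```

In this model, ANDing an externally supplied allow list with a deletion block list simply removes deleted rows from the allow list, which is the behavior a caller would expect from a precomputed mask.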

The ask is to expose this existing serialization format as a Python parameter and route it through the existing `SelectionVectorToPrefilter` path. No new internal machinery is required.

## Proposed API

A new keyword argument on `LanceDataset.scanner` (and corresponding methods like `to_table`, plus `ScannerBuilder` for the builder pattern):

```python
ds.to_table(
    nearest={"column": "vector", "q": query_vec, "k": 10},
    row_addr_mask=mask_bytes,       # serialized RowAddrMask (existing binary format)
    filter="city = 'NYC'",          # optional, combined with row_addr_mask
    prefilter=True,
)
```

Semantics:
- `row_addr_mask` accepts the existing serialized format already used internally (`BinaryArray` with optional BlockList bytes at index 0 and optional AllowList bytes at index 1).
- When provided, it is loaded via `SelectionVectorToPrefilter` and combined with other filter sources (deletion vectors, expression-based filters) by the existing `DatasetPreFilter` logic.
- When `prefilter=True`, it gates IVF traversal. When `prefilter=False`, it acts as a postfilter.
- Supports both allow-list and block-list semantics, matching the existing internal type.
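The difference between the two `prefilter` modes can be seen in a toy example. This sketch uses brute-force search over one-dimensional "vectors" rather than an IVF index, and `top_k` is a hypothetical helper, so it illustrates only the intended semantics, not Lance's execution path.

```python
def top_k(points, query, k, allowed=None):
    """Brute-force k nearest row ids; `allowed` restricts candidates first."""
    candidates = [
        (abs(value - query), row_id)
        for row_id, value in points.items()
        if allowed is None or row_id in allowed
    ]
    return [row_id for _, row_id in sorted(candidates)[:k]]


points = {row_id: float(row_id) for row_id in range(10)}  # toy 1-D vectors
allowed = {1, 8, 9}                                        # allow-list mask
query, k = 0.0, 3

# prefilter=True: the mask gates the search, so k allowed rows come back.
prefiltered = top_k(points, query, k, allowed)
print(prefiltered)   # [1, 8, 9]

# prefilter=False: search first, then drop rows outside the mask.
postfiltered = [r for r in top_k(points, query, k) if r in allowed]
print(postfiltered)  # [1]
```

With the mask applied as a postfilter, the unrestricted top-3 is `[0, 1, 2]` and only one row survives, which is why the prefilter path matters for selective masks.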

