fix(rust): fixed IVF_RQ indexes ignoring the frag-reuse index#7217
Conversation
ad10713 to
c9a48a9
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Katomoto
left a comment
There was a problem hiding this comment.
Logic looks correct. One suggestion: the early return on line prevents the cleanup function from running — worth adding a finally block.
|
@Katomoto Thanks, unfortunately I don't understand exactly which part of my code you are talking about so if possible make a comment on that section or refference it in some capacity. |
|
LGTM |
…, deletions, physical remap + trim) Adds KNN-stability coverage for the defer_index_remap (fragment-reuse) window, parameterized over index type. Two helpers build an id+vec dataset and assert no stale/deleted row is ever returned and top-k stays consistent with the pre-compaction answer: - check_vector_defer_compaction: FRI window, with/without materialized deletions. - check_vector_remap_and_trim: merge-only, then physical remap_column_index + cleanup_frag_reuse_index, asserting the reuse index trims to zero versions. Crucially these project + fetch row content (a `take`), unlike the existing `test_read_ivf_*_with_defer_index_remap` tests which only count row_ids and so cannot observe a stale address. Passing: IVF_FLAT / IVF_SQ / IVF_RQ (deletions + remap/trim), IVF_HNSW_SQ (merge-only window + remap/trim, recall-tolerant). Reveals open bugs (marked #[ignore] as reproducers), all in the FRI window with lance-format#7217 applied: - IVF_PQ and IVF_HNSW_PQ: serve a STALE (compacted-away) row address — the take fails with "fragment does not exist". NOT deletion-specific (reproduces merge-only); only surfaces when row content is fetched. IVF_SQ/RQ/FLAT handle the identical path, so the cause is PQ-specific. - IVF_HNSW_SQ + materialized deletions: out-of-bounds panic in SQStorage::chunk — the HNSW graph addresses storage positionally and is not realigned after dropped rows. Merge-only HNSW_SQ is fine. Stacked on the IVF_RQ frag-reuse fix (lance-format#7217).
`ProductQuantizationStorage::new` remaps its PQ codes through the fragment-reuse index but left the `row_ids` field bound to the pre-remap addresses. After a deferred compaction (`defer_index_remap`), an IVF_PQ (or IVF_HNSW_PQ) index then returned stale, compacted-away row addresses, and the scanner take failed with "The input to a take operation specified fragment id N but this fragment does not exist". This happened even for merge-only compaction and was only observable when the query fetched row content; the existing `test_read_ivf_pq_index_v3_with_defer_index_remap` projects no columns (it counts row ids without taking), so it missed the bug. Refresh `row_ids` from the remapped batch alongside `pq_code`. Also adds correctness coverage across the full deferred-compaction lifecycle (fragment-reuse window, materialized deletions, and physical remap + reuse-index trim) for IVF_FLAT/PQ/SQ/RQ and merge-only IVF_HNSW_*, plus a scalar bitmap deletion case. Unlike the existing `test_read_*_with_defer_index_remap` tests, these project and fetch row content (a take), which is what surfaces stale addresses. Stacked on lance-format#7217 (IVF_RQ fragment-reuse fix); the tests assume that fix is present.
|
Follow-up: #7315 builds on this PR — it fixes a related IVF_PQ / IVF_HNSW_PQ stale-row-address bug under deferred compaction (PQ storage kept its pre-remap |
`ProductQuantizationStorage::new` remaps its PQ codes through the fragment-reuse index but left the `row_ids` field bound to the pre-remap addresses. After a deferred compaction (`defer_index_remap`), an IVF_PQ (or IVF_HNSW_PQ) index then returned stale, compacted-away row addresses, and the scanner take failed with "The input to a take operation specified fragment id N but this fragment does not exist". This happened even for merge-only compaction and was only observable when the query fetched row content; the existing `test_read_ivf_pq_index_v3_with_defer_index_remap` projects no columns (it counts row ids without taking), so it missed the bug. Refresh `row_ids` from the remapped batch alongside `pq_code`. Also adds correctness coverage across the full deferred-compaction lifecycle (fragment-reuse window, materialized deletions, and physical remap + reuse-index trim) for IVF_FLAT/PQ/SQ/RQ and merge-only IVF_HNSW_*, plus a scalar bitmap deletion case. Unlike the existing `test_read_*_with_defer_index_remap` tests, these project and fetch row content (a take), which is what surfaces stale addresses. Stacked on lance-format#7217 (IVF_RQ fragment-reuse fix); the tests assume that fix is present.
Closes #7216
After compaction with defer_index_remap=true, covering vector indexes are expected to remap their stored row addresses through the __lance_frag_reuse index at load time. The IVF_PQ storage does this, but
RabitQuantizationStorage::try_from_batch received the frag-reuse index as a parameter named _fri and discarded it. As a result an IVF_RQ index loaded from a debt-carrying version kept its pre-compaction row addresses, and any ANN query that fetched row content failed at the scanner take stage with take operation specified fragment id N but this fragment does not exist.
The IVF state cache reconstruction path in reconstruct_typed passed a hard-coded None for the frag-reuse index, so even with the storage fixed the remap was skipped whenever the index was rebuilt from a cached state. The cache key already includes the frag-reuse uuid, so cached states with and without remap debt stay distinct when the index is threaded through.
This PR makes the following changes:
In rust/lance-index/src/vector/bq/storage.rs, try_from_batch now builds a row id mapping for the partition from FragReuseIndex::remap_row_id, mirroring the PQ derivation, and applies the storage's existing
remap method. Reusing remap instead of porting the PQ inline rewrite is deliberate, because RQ codes live in a packed and permuted SIMD layout and remap already performs the layout-correct filter and repack,
including the multi-bit split ex code and factor columns. It is already pinned by the existing identity-remap regression tests. When no frag-reuse index is present, when its row id maps are empty, or when
no row in the partition is affected, the path stays a no-op. The generic remap_row_ids_record_batch helper used by the flat and SQ storages was considered and does not fit here, both because it assumes a two
column batch and because a plain row-wise take is not safe on the packed code layout.
In rust/lance/src/index/vector/ivf/v2.rs and rust/lance/src/index.rs, IvfStateEntry::reconstruct gains a frag-reuse parameter and the caller passes open_frag_reuse_index, which is already cached in the index
cache, closing the warm cache hole without adding load overhead.
In rust/lance/src/dataset/optimize.rs, a regression test test_read_ivf_rq_index_v3_with_defer_index_remap is added next to the PQ analog. It queries with a non-empty projection because the take stage where
this bug manifests is only exercised when row content is fetched. The existing PQ test uses an empty projection and never reaches that stage, which is why this went undetected.
Verification: with the storage remap reverted, the new test fails with the exact defect signature from scanner.rs. With the fix it passes. cargo test -p lance defer_index_remap passes (the new RQ test plus
the existing PQ tests), cargo test -p lance-index bq passes, and cargo fmt and cargo clippy -- -D warnings are clean on the touched crates.