fix(index): refresh PQ storage row ids after fragment-reuse remap#7315

Open
xuanyu-z wants to merge 1 commit into
lance-format:mainfrom
xuanyu-z:xuanyuzhan/test-vector-fri-deferred-coverage

Conversation

@xuanyu-z

@xuanyu-z xuanyu-z commented Jun 17, 2026

Contributor

Summary

Fixes a correctness bug where IVF_PQ (and IVF_HNSW_PQ) return stale row addresses after a deferred compaction (defer_index_remap), and adds correctness coverage across the deferred-compaction lifecycle for the vector index types.

ProductQuantizationStorage::new remaps its PQ codes through the fragment-reuse index but leaves the row_ids field bound to the pre-remap addresses. As a result, once an index is loaded from a version carrying a fragment-reuse index, an IVF_PQ / IVF_HNSW_PQ query returns compacted-away row addresses and the scanner take fails with:

The input to a take operation specified fragment id N but this fragment does not exist in the dataset

This happens even for merge-only compaction and is only observable when the query fetches row content (a take). The existing test_read_ivf_pq_index_v3_with_defer_index_remap projects no columns (it counts row ids without taking), so it did not catch the bug. IVF_FLAT/SQ/RQ already refresh their row ids correctly. The fix refreshes row_ids from the remapped batch alongside pq_code.
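To illustrate the shape of the bug, here is a minimal, self-contained sketch. It is not the Lance implementation: `ToyPqStorage`, `ReuseMap`, `remap_batch`, `new_stale`, and `new_refreshed` are hypothetical names, and the row-address layout (fragment id in the upper 32 bits, row offset in the lower 32) is an assumption made for the example.

```rust
use std::collections::HashMap;

/// Toy row address: fragment id in the upper 32 bits, row offset in the lower 32.
/// (Assumed layout for this sketch.)
fn row_addr(fragment_id: u32, offset: u32) -> u64 {
    ((fragment_id as u64) << 32) | offset as u64
}

/// Hypothetical stand-in for a fragment-reuse mapping: old address -> new address,
/// or `None` when the row was dropped during compaction.
type ReuseMap = HashMap<u64, Option<u64>>;

/// Toy PQ storage: one code byte per row plus the row address it belongs to.
#[derive(Debug, PartialEq)]
struct ToyPqStorage {
    pq_code: Vec<u8>,
    row_ids: Vec<u64>,
}

/// Remap a (row_ids, codes) batch, dropping rows that no longer exist.
fn remap_batch(row_ids: &[u64], codes: &[u8], reuse: &ReuseMap) -> (Vec<u64>, Vec<u8>) {
    row_ids
        .iter()
        .zip(codes)
        .filter_map(|(&id, &code)| match reuse.get(&id) {
            Some(Some(new_id)) => Some((*new_id, code)), // row moved to a new fragment
            Some(None) => None,                          // row dropped
            None => Some((id, code)),                    // fragment untouched
        })
        .unzip()
}

/// Buggy shape: codes come from the remapped batch, row ids from the original one.
fn new_stale(row_ids: Vec<u64>, codes: Vec<u8>, reuse: &ReuseMap) -> ToyPqStorage {
    let (_remapped_ids, remapped_codes) = remap_batch(&row_ids, &codes, reuse);
    ToyPqStorage { pq_code: remapped_codes, row_ids }
}

/// Fixed shape: both fields are taken from the same remapped batch.
fn new_refreshed(row_ids: Vec<u64>, codes: Vec<u8>, reuse: &ReuseMap) -> ToyPqStorage {
    let (remapped_ids, remapped_codes) = remap_batch(&row_ids, &codes, reuse);
    ToyPqStorage { pq_code: remapped_codes, row_ids: remapped_ids }
}
```

In this sketch, `new_stale` keeps addresses that still point at the old fragments, while `new_refreshed` returns addresses in the compacted fragment, which is the property the fix restores.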

Test coverage added

Parameterized helpers that, unlike the existing test_read_*_with_defer_index_remap tests, project and fetch row content so stale addresses surface:

  • check_vector_defer_compaction — fragment-reuse window correctness, with and without materialized deletions.
  • check_vector_remap_and_trim — merge-only, then physical remap_column_index + cleanup_frag_reuse_index, asserting the reuse index trims to zero versions and results stay consistent.
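The difference between a count-only check and one that fetches row content can be sketched as follows. This is a toy model under stated assumptions, not the helpers added in this PR: `ToyDataset`, `count_only_check`, and `content_check` are hypothetical, and addresses again assume the fragment id sits in the upper 32 bits.

```rust
use std::collections::HashMap;

/// Toy dataset: fragment id -> rows stored in that fragment.
struct ToyDataset {
    fragments: HashMap<u32, Vec<&'static str>>,
}

impl ToyDataset {
    /// Fetch row content for one address (fragment id in the upper 32 bits).
    fn take(&self, addr: u64) -> Result<&'static str, String> {
        let fragment_id = (addr >> 32) as u32;
        let offset = (addr & 0xFFFF_FFFF) as usize;
        let rows = self
            .fragments
            .get(&fragment_id)
            .ok_or_else(|| format!("fragment id {fragment_id} does not exist"))?;
        rows.get(offset)
            .copied()
            .ok_or_else(|| format!("offset {offset} out of range"))
    }
}

/// Count-only check: never dereferences the addresses, so stale ones pass.
fn count_only_check(result_ids: &[u64], expected: usize) -> bool {
    result_ids.len() == expected
}

/// Content check: every address must resolve to a row, so stale ones fail.
fn content_check(ds: &ToyDataset, result_ids: &[u64]) -> Result<Vec<&'static str>, String> {
    result_ids.iter().map(|&id| ds.take(id)).collect()
}
```

With a dataset that only contains the compacted fragment, stale addresses pass `count_only_check` but make `content_check` return an error, which mirrors why projecting and fetching columns is what surfaces the bug.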

Covered (passing): IVF_FLAT, IVF_SQ, IVF_RQ, IVF_PQ across window/deletions/remap+trim; IVF_HNSW_SQ/PQ merge-only; a scalar bitmap deletion case.

Two known gaps are intentionally not turned into perpetually ignored tests (they are noted in code comments instead): IVF_HNSW_* and the inverted/FTS index both desync under materialized deletions, because their positional internal structures (HNSW graph node ids; FTS num_tokens[doc_id]) are not realigned when the fragment-reuse drop removes rows. The HNSW case is the deferred #3993; the FTS case is a separate follow-up.
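A rough sketch of how a positional structure can desync when rows are dropped, assuming a toy index in which `num_tokens[i]` describes the row at `row_ids[i]`. The type and method names are hypothetical and are not taken from the HNSW or FTS code.

```rust
/// Toy positional index: `num_tokens[i]` describes the row at `row_ids[i]`.
struct ToyPositionalIndex {
    row_ids: Vec<u64>,
    num_tokens: Vec<u32>,
}

impl ToyPositionalIndex {
    /// Drop deleted rows from `row_ids` only, leaving `num_tokens` untouched.
    /// Positions after the first dropped row no longer line up.
    fn drop_rows_unaligned(&mut self, deleted: &[u64]) {
        self.row_ids.retain(|id| !deleted.contains(id));
    }

    /// Drop deleted rows from both vectors in lockstep so positions stay aligned.
    fn drop_rows_aligned(&mut self, deleted: &[u64]) {
        let keep: Vec<bool> = self.row_ids.iter().map(|id| !deleted.contains(id)).collect();
        // `retain` visits elements in order, so the flags can be consumed sequentially.
        let mut flags = keep.iter();
        self.row_ids.retain(|_| *flags.next().unwrap());
        let mut flags = keep.iter();
        self.num_tokens.retain(|_| *flags.next().unwrap());
    }

    /// Look up the token count for a row through its position.
    fn tokens_for(&self, row_id: u64) -> Option<u32> {
        let pos = self.row_ids.iter().position(|&id| id == row_id)?;
        self.num_tokens.get(pos).copied()
    }
}
```

In the unaligned variant, a lookup for a surviving row reads the entry that belonged to a different row, which is the kind of misalignment the known gaps describe.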

Testing

cargo test -p lance --lib defer_compaction   # 6 passed
cargo test -p lance --lib remap_and_trim      # 6 passed

@github-actions github-actions Bot added the bug Something isn't working label Jun 17, 2026
@github-actions github-actions Bot added the A-index Vector index, linalg, tokenizer label Jun 17, 2026
@xuanyu-z xuanyu-z marked this pull request as ready for review June 17, 2026 04:14

@jackye1995 jackye1995 left a comment

Contributor


looks good to me, pending rebase

@codecov

codecov Bot commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 99.73190% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/optimize.rs | 99.73% | 1 Missing ⚠️ |


`ProductQuantizationStorage::new` remapped its PQ codes through the fragment-reuse
index but left the `row_ids` field bound to the pre-remap addresses. After a
deferred compaction (`defer_index_remap`), an IVF_PQ (or IVF_HNSW_PQ) index then
returned stale, compacted-away row addresses, and the scanner take failed with
"The input to a take operation specified fragment id N but this fragment does
not exist". This happened even for merge-only compaction and was only observable
when the query fetched row content; the existing
`test_read_ivf_pq_index_v3_with_defer_index_remap` projects no columns (it counts
row ids without taking), so it missed the bug. Refresh `row_ids` from the
remapped batch alongside `pq_code`.

Also adds correctness coverage across the full deferred-compaction lifecycle
(fragment-reuse window, materialized deletions, and physical remap + reuse-index
trim) for IVF_FLAT/PQ/SQ/RQ and merge-only IVF_HNSW_*, plus a scalar bitmap
deletion case. Unlike the existing `test_read_*_with_defer_index_remap` tests,
these project and fetch row content (a take), which is what surfaces stale
addresses.

Stacked on lance-format#7217 (IVF_RQ fragment-reuse fix); the tests assume that fix is
present.
@xuanyu-z xuanyu-z force-pushed the xuanyuzhan/test-vector-fri-deferred-coverage branch from 8eb2712 to 2df3470 Compare June 17, 2026 05:14

Labels

A-index Vector index, linalg, tokenizer bug Something isn't working


2 participants