fix: preserve rowid across data branch fallback changes#24174
fix: preserve rowid across data branch fallback changes#24174gouhongshen wants to merge 12 commits intomatrixorigin:mainfrom
Conversation
🔍 Multi-Agent Code Review (七人陪审团)
📊 概览
📝 总结PR 修复方向是对的 —— 跨 data branch GC 之后保持 rowid 端到端一致、并修正 fallback 的 snapshot TS 语义,确实命中了 #23751 的根因。但实现方式啃了至少 4 处既有边界:用 🧨 破坏性测试审判
🐶 补丁链分析诊断: 主体提交对影响面摸得不充分,后续 4 个补丁平均 47 分钟一个,节奏更接近"跑 CI 哪坏补一刀"而非回到设计层面修。其中 建议:
🔴 Must Fix (合并前必须修)1.
|
|
Thanks for the deep review. Pushed 16621ed addressing the actionable items; below is a per-item response. Adopted (this PR)
Declined / deferred (with reasoning)
Will pick up #2 in a follow-up once a clean |
🔁 Incremental Re-review —
|
| # | Original 🔴 | Status | Verification |
|---|---|---|---|
| 3 | snapshotTS vs currentTS 0 regression coverage |
✅ Fixed | tsFilteringTombstoner + TestMaterializedSnapshotDataSourcePersistedTombstoneFiltersBySnapshotTS — mutation-tested: reverting prod line 141 to &ds.currentTS makes the test fail with expected diff [0,1] → [0] |
| 4 | updateCN{Data,Tombstone}Batch NPE |
✅ Fixed | nil-batch / missing-pk / empty-vec guards + retainRowID && rowid==nil guard added; TestUpdateCN{Data,Tombstone}Batch_NilGuard covers all 4 paths; tests pass locally |
| 5 | validateLeadingRowID 0 单测 + hot-path alloc |
✅ Fixed | buildFields lazy closure (only runs on failure path → 0 alloc on green path); TestValidateLeadingRowID covers nil/empty/missing-rowid/wrong-type/length-mismatch/null/empty/valid (8 sub-tests); batchSampleRowsForLog OOB guard added |
| 1 | materialized OOM, no spill, no ctx cancel | ⚪ Deferred (acceptable) | Author: "preexisting design of materializedSnapshotDataSource, not a regression". Fair scope call for a bug-fix PR; multi-tenant DoS risk should still be tracked as follow-up issue |
| 2 | tae/blockio/read.go silent semantic change for non-appendable blocks |
⚪ Deferred (partial reasoning) | Author cites that early-return breaks rowid-only path on appendable blocks. ✔ But that doesn't address the original concern about non-appendable blocks silently gaining commit-ts filtering. Needs the BlockInfo.HasCommitTS opt-in plumbing as follow-up — confirm a tracking issue exists before merging |
| 6 | context.Value for retainRowID |
⚪ Deferred (justified) | Author cites repro_stale_read_3.sql: FileNotFound rebuild path drops original ctx, so a one-shot struct option silently loses policy. Counter-argument: a struct option threaded through the rebuild path would also work and would be type-safe. Acceptable as scope deferral, but the per-Next() ctx re-injection is still dead work today (NewChangesHandler only reads it once at construction). At minimum delete the dead reapply, or document the rebuild-path contract. |
| 7 | Reverse dep + unwrapTxnTable reflection |
⚪ Pre-existing | Confirmed: ScanSnapshotWithCurrentRanges was added in earlier commits. Acceptable scope deferral |
🧨 Destructive-test verdict — UPDATED
- Before:
不足 (Insufficient)— 0/9 hostile failure modes covered - After:
部分充分 (Partially Sufficient)— covers (1) hostile input, (2) boundary, (3) error injection, plus the most critical regression (snapshot vs current TS visibility) - Still missing (track as follow-up, not blocking):
getTupleColumnValue3-tuple-width direct unit testsBranchChangeHandle.Next()idempotency over multiple callsretBatchPool tombVecCnt mismatchpanic case- ctx cancel mid-stream / mpool leak
🐶 Patch-chain re-evaluation
16621ed is the 6th commit in 4h but unlike the previous 4, this one is purely additive test coverage + defensive guards on already-identified call sites — not another fix-on-fix patch. Pattern broken.
✅ Verification commands run locally
# All pass on PR head 16621ed
go test ./pkg/vm/engine/disttae/ -run TestMaterializedSnapshotDataSourcePersistedTombstoneFiltersBySnapshotTS -count=1
go test ./pkg/frontend/ -run TestValidateLeadingRowID -count=1
go test ./pkg/vm/engine/disttae/logtailreplay/ -run TestUpdateCN -count=1
# Mutation test on prod line 141 (snapshotTS → currentTS) → test correctly FAILS
# expected [0,1] vs actual [0] ← confirms genuine destructive coverage🎯 Final recommendation
🟢 Ready to merge with the following understanding:
- Blocking 🔴 items (3, 4, 5) are properly fixed and have real destructive test coverage — the mutation test on item 3 is particularly valuable.
- Deferred items (1, 2, 6, 7) need follow-up issues:
- Add README.md #1 materialized OOM / multi-tenant DoS hardening
- integrate mysql parser from pingcap #2
BlockInfo.HasCommitTSopt-in plumbing - (misc): parser demo #6 either delete the dead per-
Next()ctx reapply, or convert ctx-options to typedCollectChangesOptionsonce rebuild-path contract is documented - Translate branchless implementations of intersection/union to Go assembly #7 dedicated cleanup PR for
unwrapTxnTablereflection
- The 🟡 / 🟢 items from the original review remain as quality-of-life follow-ups.
Nice turn-around from author. The mutation-tested snapshot regression test is exactly the kind of destructive coverage we asked for.
|
I found one substantive correctness issue in In the new parallel snapshot path, That means the parallel snapshot scan can miss tombstone filtering and return rows that were already deleted. The serial path does not have this problem because it attaches tombstones before constructing the source. Suggested fix: attach tombstones before splitting, or explicitly re-attach tombstones to each shard after the split. |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Guard updateCNDataBatch / updateCNTombstoneBatch against nil/empty input that previously could NPE - Lazy-build validateLeadingRowID log fields so success-path no longer pre-allocates sample strings - Bound batchSampleRowsForLog by per-vector length to avoid out-of-bounds on length-mismatch failures - Add unit tests for validateLeadingRowID covering nil, empty, wrong type, length mismatch, null and empty rowid - Add unit tests for updateCN batch nil/empty guards - Strengthen materializedSnapshotDataSource fallback test with a TS-aware fake tombstoner so reverting from snapshotTS to currentTS would now fail Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add unit coverage for String / SetOrderBy / GetOrderBy / SetFilterZM on both remote-nil and remote-present branches. Lifts PR coverage above the 75% gate without behavioural change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…false Adds TestCDCSchema_NoRowIDWhenRetainRowIDFalse to guard the disttae NewChangesHandler/NewChangesHandlerWithPartitionStateRange path used by CDC and ISCP, which leaves retainRowID=false. The four subtests assert that updateTombstoneBatch, updateDataBatch, fillInDeleteBatch, and fillInInsertBatch never emit a T_Rowid vec or Row_ID attribute on that path, so position-indexed CDC sinks cannot regress silently if future changes flip the rowid plumbing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BlockListRelData.DataSlice copies the parent's tombstones pointer at split time. The previous order was: shards := splitSnapshotScanShards(relData, ...) // shards.tombstones = nil ... relData.AttachTombstones(tombstones) // only mutates parent scanSnapshotShardsParallel(shards, ...) // each shard sees nil so the parallel snapshot scan path silently skipped tombstone filtering and could surface rows that were already deleted. The serial path was unaffected because it attaches tombstones to relData before constructing the source. Move AttachTombstones above splitSnapshotScanShards so DataSlice propagates the pointer into every shard. Add TestSplitSnapshotScanShards_TombstoneInheritance to pin the ordering invariant. Reported in PR matrixorigin#24174 review comment 4302553554. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What type of PR is this?
Which issue(s) this PR fixes:
issue #23751
What this PR does / why we need it:
CollectChangesbatches so diff logic can match replayed rows precisely after GC and fallback readsBranchChangeHandle.Next()so handle rebuilds keep the same batch contractdiff_9GC repro and the new rowid propagation paths