feat(server): persist & backfill v1_hash on filters table #354
Open
mpecan wants to merge 3 commits into feat/350-canonical-v1 from
Adds the `v1_hash` column to the `filters` table (nullable; backfilled
operationally) and ships a service-token-protected backfill endpoint that
populates it for legacy rows by re-reading their TOML from R2 and applying
`tokf_common::canonical_v1::hash` (ADR-0002).

The publish path now computes v1 alongside `content_hash` and rejects new
submissions whose v1 collides with an existing row, returning the original
author and `is_new = false`. Stops new publishes from re-splitting
canonically equivalent filters across content_hash variants — the
going-forward fix for #350.

Out of scope (explicit follow-up PR):
- Dedup migration for existing duplicate rows
- UNIQUE(v1_hash) constraint
- v1_hash NOT NULL tightening

Also dedupes test boilerplate by extracting `post_json` into
filters/test_helpers.rs, which both backfill_tests and regenerate_tests
now use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original `publish_v1_collision_returns_existing_author` test was
broken: alice's and bob's TOMLs (whitespace-only differences) parsed to
identical `FilterConfig` values and therefore identical `content_hash`,
so bob's submission was caught by the existing byte-duplicate path
rather than the new v1-collision branch. The test passed for the wrong
reason, leaving the headline behaviour of this PR untested.
Fixed the fixture to use `command = "x"` vs `command = ["x"]`, which
parses to different `CommandPattern` enum variants (different
content_hash) but canonical_v1 collapses both into the single-string
form (same v1_hash). This actually triggers the v1-collision branch
in `upsert_filter_record`.
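For illustration, a minimal sketch of why that fixture works, assuming
`CommandPattern` is a serde-untagged enum; the real definition and
canonicalization rule in the codebase may differ:

```rust
use serde::Deserialize;

// Assumed shape of the fixture's target type: TOML `command = "x"` and
// `command = ["x"]` deserialize to different variants, so the two configs
// produce different content_hash values.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum CommandPattern {
    Single(String),
    List(Vec<String>),
}

// canonical_v1 then collapses a one-element list into the single-string
// form, so both spellings hash to the same v1_hash. A stand-in for the
// real rule:
fn canonicalize(cmd: CommandPattern) -> CommandPattern {
    match cmd {
        CommandPattern::List(mut v) if v.len() == 1 => CommandPattern::Single(v.remove(0)),
        other => other,
    }
}
```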
Doing so surfaced a real bug: the response was returning the rejected
publisher's `content_hash` instead of the existing row's. Fixed by
threading the resolved hash back through `UpsertResult { author,
content_hash, is_new }`.
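A sketch of the corrected handoff, using the names from the commit
message (`UpsertResult`, `upsert_filter_record`); the row type and the
lookup are stand-ins, not the PR's actual code:

```rust
struct Row {
    author: String,
    content_hash: String,
}

struct UpsertResult {
    author: String,
    content_hash: String,
    is_new: bool,
}

// `existing_by_v1` stands in for the real DB lookup on v1_hash.
fn upsert_filter_record(existing_by_v1: Option<Row>, incoming: Row) -> UpsertResult {
    match existing_by_v1 {
        // v1 collision: no insert. The response must carry the existing
        // row's author AND content_hash; the bug was returning the
        // rejected publisher's content_hash here.
        Some(existing) => UpsertResult {
            author: existing.author,
            content_hash: existing.content_hash,
            is_new: false,
        },
        None => UpsertResult {
            author: incoming.author,
            content_hash: incoming.content_hash,
            is_new: true,
        },
    }
}
```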
Adds two tests Agent 4 flagged as missing:
- `publish_stdlib_stores_v1_hash` — verifies stdlib publish populates v1
- `backfill_v1_reports_failure_when_r2_object_unparseable` — covers the
invalid-UTF-8 / unparseable-TOML failure path in `compute_and_store_v1`
Extracts `run_backfill_expecting_one_failure` to dedupe the two
backfill-failure tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleanup pass identified by three review agents:
- Extract `storage::get_utf8` helper to share the R2-fetch + UTF-8 decode
  pattern (sketched after this list); use it from `compute_and_store_v1`.
  Three older call sites (search, regenerate, update_tests) keep the
  inline form for now — scope-limited migration is a follow-up.
- Collapse `MAX_BACKFILL_ENTRIES` and `MAX_BACKFILL_V1_LIMIT` (both 500)
into one `MAX_BACKFILL_BATCH` constant.
- Add `expected_v1` test helper (accepts &[u8] or &str; see the sketch
  after this list) to dedupe the
  `tokf_common::canonical_v1::hash(std::str::from_utf8(...).unwrap()).unwrap()`
  pattern across 5 test sites.
- Combine `backfill_v1_reports_failure_when_r2_object_{missing,unparseable}`
into a single parameterised test driving both `CorruptR2` modes.
- Trim verbose comments on `UpsertResult`, `upsert_filter_record`, and
the v1-collision-handoff narrative in `publish_filter` (kept the
WHY content, dropped the WHAT-restating prose).
- Tighten the migration SQL header.
- Reshape `publish_stores_v1_hash_on_new_row` to use the existing
  `publish_filter_helper`, reducing its near-duplication with other
  publish tests.
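Hypothetical sketches of the two extracted helpers named above; the
signatures and error handling are assumptions, not copied from the PR:

```rust
// storage::get_utf8: shared "fetch from R2, then decode as UTF-8" step.
// The byte fetch itself is elided; a stand-in Option<Vec<u8>> models a
// possibly-missing R2 object, and the error type is a placeholder.
fn get_utf8(object: Option<Vec<u8>>) -> Result<String, &'static str> {
    let bytes = object.ok_or("object missing from R2")?;
    String::from_utf8(bytes).map_err(|_| "object is not valid UTF-8")
}

// expected_v1: test helper. `impl AsRef<[u8]>` lets call sites pass either
// &str or &[u8] fixtures without repeating the from_utf8/unwrap boilerplate.
fn expected_v1(toml: impl AsRef<[u8]>) -> String {
    let toml = std::str::from_utf8(toml.as_ref()).expect("fixture is UTF-8");
    tokf_common::canonical_v1::hash(toml).expect("fixture canonicalizes")
}
```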
Quality gate: 267 server tests pass, fmt + clippy + dupes (0.4% exact /
0.3% near, both under 0.5% limit) all clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Persists the schema-independent canonical TOML hash (`v1_hash`, ADR-0002) on the `filters` table and ships an operator-only backfill endpoint.

What changes

- `20260428000000_add_v1_hash.sql`: adds nullable `v1_hash TEXT` + non-partial index. Nullable on purpose — existing rows backfill operationally; `NOT NULL` and `UNIQUE` are deferred to the dedup PR.
- v1 is now computed alongside `content_hash` in `validate_and_prepare`. Both publish paths (regular + stdlib) now store v1 on insert.
- When a new submission carries the same `v1_hash` as an existing row but a different `content_hash`, the request returns 200 OK with the existing row's author and content_hash, without inserting. Stops new publishes from re-splitting canonically equivalent filters across `content_hash` variants — the going-forward fix for [BUG] Community filters cannot be installed #350.
- `POST /api/filters/backfill-v1-hashes`: service-token-protected, scan-and-compute. Iterates `WHERE v1_hash IS NULL` rows, fetches the TOML from R2, computes v1, writes it back. Per-row failures (missing R2 object, malformed TOML) surface in `failed[]` without aborting the batch. A sketch of this pass follows.
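To make the endpoint's shape concrete, a sketch of the scan-and-compute pass under stated assumptions: the row keying, error types, and write-back are simplified, and the `tokf_common::canonical_v1::hash` signature is inferred from the test pattern quoted in the commits.

```rust
// What one backfill batch reports back to the operator.
struct BackfillReport {
    processed: usize,
    failed: Vec<String>, // keys of rows whose v1 could not be computed
}

// `rows` stands in for the `WHERE v1_hash IS NULL` page: each entry is a
// row key plus the bytes fetched from R2 (None = object missing).
fn backfill_v1(rows: Vec<(String, Option<Vec<u8>>)>) -> BackfillReport {
    let mut report = BackfillReport { processed: 0, failed: Vec::new() };
    for (key, r2_bytes) in rows {
        // Missing object, invalid UTF-8, or unparseable TOML all land in
        // failed[] without aborting the rest of the batch.
        let v1 = r2_bytes
            .ok_or(())
            .and_then(|b| String::from_utf8(b).map_err(|_| ()))
            .and_then(|toml| tokf_common::canonical_v1::hash(&toml).map_err(|_| ()));
        match v1 {
            Ok(_hash) => {
                // UPDATE filters SET v1_hash = ... WHERE ... (elided)
                report.processed += 1;
            }
            Err(()) => report.failed.push(key),
        }
    }
    report
}
```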
Out of scope (explicit follow-ups)

- Dedup migration for existing duplicate rows
- `UNIQUE(v1_hash)` constraint
- `v1_hash NOT NULL` tightening

These all belong to the next PR after backfill has been run in production. Per-row decisions for that work were captured during planning: canonical row = oldest, stdlib breaks ties; non-canonical rows kept and pointed at the canonical row via the existing `successor_hash` column (no URLs broken).

Operator runbook (for after deploy)
```sh
# Repeat until a pass processes zero rows; only NULL v1_hash rows are
# touched, so re-running is safe.
while true; do
  out=$(curl -sS -X POST -H "Authorization: Bearer $TOKF_SERVICE_TOKEN" \
    "$SERVER/api/filters/backfill-v1-hashes" -d '{"limit":100}')
  echo "$out"
  [ "$(echo "$out" | jq .processed)" -eq 0 ] && break
done
```
Inspect any persistent entries in `failed[]` for manual triage (corrupt TOML in R2, missing object).
Test plan
Stack
This PR targets `feat/350-canonical-v1`. CI will run once #353 merges to main and this PR's base auto-rebases.
🤖 Generated with Claude Code