
feat(server): persist & backfill v1_hash on filters table #354

Open
mpecan wants to merge 3 commits into feat/350-canonical-v1 from feat/350-canonical-v1-backfill

Conversation

mpecan (Owner) commented Apr 29, 2026

Summary

Persists the schema-independent canonical TOML hash (v1_hash, ADR-0002) on the filters table and ships an operator-only backfill endpoint.

What changes

  • Migration 20260428000000_add_v1_hash.sql: adds nullable v1_hash TEXT + non-partial index. Nullable on purpose — existing rows backfill operationally; NOT NULL and UNIQUE are deferred to the dedup PR.
  • Publish computes v1 alongside content_hash in validate_and_prepare. Both publish paths (regular + stdlib) now store v1 on insert.
  • v1-collision rejection: when a new submission has the same v1_hash as an existing row but a different content_hash, the request returns 200 OK with the existing row's author and content_hash, without inserting (see the sketch after this list). Stops new publishes from re-splitting canonically equivalent filters across content_hash variants — the going-forward fix for [BUG] Community filters cannot be installed #350.
  • Backfill endpoint POST /api/filters/backfill-v1-hashes: service-token-protected, scan-and-compute. Iterates WHERE v1_hash IS NULL rows, fetches the TOML from R2, computes v1, writes it back. Per-row failures (missing R2 object, malformed TOML) surface in failed[] without aborting the batch.
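
The collision branch, in outline. This is a sketch: `UpsertResult` and the return-the-existing-row behaviour are from this PR, but the table shape and sqlx plumbing here are assumptions, not the real code.

```rust
use sqlx::{Postgres, Transaction};

pub struct UpsertResult {
    pub author: String,
    pub content_hash: String,
    pub is_new: bool,
}

pub async fn upsert_filter_record(
    tx: &mut Transaction<'_, Postgres>,
    name: &str,
    author: &str,
    content_hash: &str,
    v1_hash: &str,
) -> sqlx::Result<UpsertResult> {
    // Look for any row sharing the canonical hash. Same bytes hits the
    // pre-existing duplicate path; different bytes but same canonical form
    // is the new v1-collision rejection. Either way: no insert, and the
    // response carries the EXISTING row's author and content_hash (not the
    // rejected submitter's — the bug fixed in commit 2) with is_new = false.
    let existing: Option<(String, String)> =
        sqlx::query_as("SELECT author, content_hash FROM filters WHERE v1_hash = $1")
            .bind(v1_hash)
            .fetch_optional(&mut **tx)
            .await?;

    if let Some((author, content_hash)) = existing {
        return Ok(UpsertResult { author, content_hash, is_new: false });
    }

    sqlx::query(
        "INSERT INTO filters (name, author, content_hash, v1_hash) VALUES ($1, $2, $3, $4)",
    )
    .bind(name)
    .bind(author)
    .bind(content_hash)
    .bind(v1_hash)
    .execute(&mut **tx)
    .await?;

    Ok(UpsertResult {
        author: author.to_owned(),
        content_hash: content_hash.to_owned(),
        is_new: true,
    })
}
```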

Out of scope (explicit follow-ups)

  • Dedup migration for existing duplicate rows
  • UNIQUE(v1_hash) constraint
  • v1_hash NOT NULL tightening

These all belong to the next PR after backfill has been run in production. Per-row decisions for that work were captured during planning: canonical row = oldest, stdlib breaks ties; non-canonical rows kept and pointed at the canonical row via the existing successor_hash column (no URLs broken).
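
For concreteness, that per-row rule as a hypothetical helper (the row fields here are assumptions; the actual dedup migration is the follow-up PR's job):

```rust
use chrono::{DateTime, Utc};

// Hypothetical row shape for a group of filters sharing one v1_hash.
struct FilterRow {
    content_hash: String,
    created_at: DateTime<Utc>,
    is_stdlib: bool,
}

// Oldest row wins; a stdlib row wins a created_at tie (!is_stdlib makes
// stdlib sort first, since false < true). Every other row in the group
// would get successor_hash = canonical.content_hash, so no URLs break.
fn canonical_row(dupes: &[FilterRow]) -> Option<&FilterRow> {
    dupes.iter().min_by_key(|r| (r.created_at, !r.is_stdlib))
}
```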

Operator runbook (for after deploy)

```sh
while true; do
  out=$(curl -sS -X POST \
    -H "Authorization: Bearer $TOKF_SERVICE_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"limit":100}' \
    "$SERVER/api/filters/backfill-v1-hashes")
  echo "$out"
  [ "$(echo "$out" | jq .processed)" -eq 0 ] && break
done
```

Inspect any persistent entries in `failed[]` for manual triage (corrupt TOML in R2, missing object).
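
The per-row compute step, reduced to its pure core. A sketch only: `compute_v1_batch` and the tuple shapes are illustrative, not the real handler's signatures — the actual endpoint interleaves the R2 fetches (via `storage::get_utf8`, see commit 3) and per-row UPDATEs.

```rust
// One backfill batch: pair each v1_hash-less row's content_hash with the
// bytes fetched from R2 (None = missing object), compute v1 where
// possible, and split successes from failures.
fn compute_v1_batch(
    rows: Vec<(String, Option<Vec<u8>>)>,
) -> (Vec<(String, String)>, Vec<String>) {
    let mut ok = Vec::new(); // (content_hash, v1_hash) pairs to write back
    let mut failed = Vec::new(); // content_hashes surfaced in failed[]
    for (content_hash, bytes) in rows {
        let v1 = bytes
            .as_deref()
            .and_then(|b| std::str::from_utf8(b).ok()) // invalid UTF-8 -> failure
            .and_then(|toml| tokf_common::canonical_v1::hash(toml).ok()); // malformed TOML -> failure
        match v1 {
            Some(v1) => ok.push((content_hash, v1)),
            None => failed.push(content_hash),
        }
    }
    (ok, failed)
}
```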

Test plan

  • Migration applies cleanly to a fresh CockroachDB (`just db-reset && cargo sqlx migrate run --source crates/tokf-server/migrations`)
  • Publish v1 storage + collision rejection (DB integration tests; the fixture genuinely triggers the v1-collision branch via `command = "x"` vs `["x"]`; see the sketch after this list)
  • Backfill: populates NULL rows, idempotent, respects limit, caps at MAX, requires service token, rejects invalid bearer, reports failure for missing R2 object, reports failure for unparseable TOML
  • Stdlib publish stores v1 (separate INSERT path)
  • `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo dupes` all clean
  • Full server suite: 268 passed (was 266 before, +2 new tests)
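
The fixture pair from the collision test, shown as a standalone property. The TOMLs here are trimmed to the single distinguishing key and the assert shape is illustrative — the real fixture is a full filter config and the real coverage lives in the DB integration tests.

```rust
#[test]
fn command_string_and_singleton_array_share_v1() {
    // Different CommandPattern enum variants, so different content_hash...
    let alice = r#"command = "x""#;
    let bob = r#"command = ["x"]"#;
    // ...but canonical_v1 collapses the one-element array into the
    // single-string form, so the v1 hashes agree.
    assert_eq!(
        tokf_common::canonical_v1::hash(alice).unwrap(),
        tokf_common::canonical_v1::hash(bob).unwrap(),
    );
}
```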

Stack

This PR targets `feat/350-canonical-v1`. CI will run once #353 merges to main and this PR's base auto-rebases.

🤖 Generated with Claude Code

mpecan and others added 3 commits April 29, 2026 15:38

Adds the `v1_hash` column to the `filters` table (nullable; backfilled
operationally) and ships a service-token-protected backfill endpoint
that populates it for legacy rows by re-reading their TOML from R2 and
applying `tokf_common::canonical_v1::hash` (ADR-0002).

The publish path now computes v1 alongside `content_hash` and rejects
new submissions whose v1 collides with an existing row, returning the
original author and `is_new = false`. Stops new publishes from
re-splitting canonically equivalent filters across content_hash
variants — the going-forward fix for #350.

Out of scope (explicit follow-up PR):
- Dedup migration for existing duplicate rows
- UNIQUE(v1_hash) constraint
- v1_hash NOT NULL tightening

Also dedupes test boilerplate by extracting `post_json` into
filters/test_helpers.rs, which both backfill_tests and regenerate_tests
now use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The original `publish_v1_collision_returns_existing_author` test was
broken: alice's and bob's TOMLs (whitespace-only differences) parsed to
identical `FilterConfig` values and therefore identical `content_hash`,
so bob's submission was caught by the existing byte-duplicate path
rather than the new v1-collision branch. The test passed for the wrong
reason, leaving the headline behaviour of this PR untested.

Fixed the fixture to use `command = "x"` vs `command = ["x"]`, which
parses to different `CommandPattern` enum variants (different
content_hash) but canonical_v1 collapses both into the single-string
form (same v1_hash). This actually triggers the v1-collision branch
in `upsert_filter_record`.

Doing so surfaced a real bug: the response was returning the rejected
publisher's `content_hash` instead of the existing row's. Fixed by
threading the resolved hash back through `UpsertResult { author,
content_hash, is_new }`.

Adds two tests Agent 4 flagged as missing:
- `publish_stdlib_stores_v1_hash` — verifies stdlib publish populates v1
- `backfill_v1_reports_failure_when_r2_object_unparseable` — covers the
  invalid-UTF-8 / unparseable-TOML failure path in `compute_and_store_v1`

Extracts `run_backfill_expecting_one_failure` to dedupe the two
backfill-failure tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cleanup pass identified by three review agents:

- Extract `storage::get_utf8` helper to share the R2-fetch + UTF-8 decode
  pattern; use it from `compute_and_store_v1`. Three older call sites
  (search, regenerate, update_tests) keep the inline form for now —
  scope-limited migration is a follow-up.
- Collapse `MAX_BACKFILL_ENTRIES` and `MAX_BACKFILL_V1_LIMIT` (both 500)
  into one `MAX_BACKFILL_BATCH` constant.
- Add `expected_v1` test helper (accepts &[u8] or &str) to dedupe the
  `tokf_common::canonical_v1::hash(std::str::from_utf8(...).unwrap()).unwrap()`
  pattern across 5 test sites.
- Combine `backfill_v1_reports_failure_when_r2_object_{missing,unparseable}`
  into a single parameterised test driving both `CorruptR2` modes.
- Trim verbose comments on `UpsertResult`, `upsert_filter_record`, and
  the v1-collision-handoff narrative in `publish_filter` (kept the
  WHY content, dropped the WHAT-restating prose).
- Tighten the migration SQL header.
- Reshape `publish_stores_v1_hash_on_new_row` to use the existing
  `publish_filter_helper`, dropping similarity to other publish tests.

Quality gate: 267 server tests pass, fmt + clippy + dupes (0.4% exact /
0.3% near, both under 0.5% limit) all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>