fix: harden gateway partition creation against races#1974
fix: harden gateway partition creation against races#1974xmtp-coder-agent wants to merge 2 commits intoxmtp:mainfrom
Conversation
Summary by OctaneNew ContractsNo new contracts were added. Updated Contracts
🔗 Commit Hash: 871b5a7 |
Additional commit: global advisory lock in
|
Overview
Detailed findings
|
|
Reviewed the Octane findings. Summary — no code changes warranted:
|
|
The failing Test(Node) check on run 24415993135 is an unrelated pre-existing flake:
Pushed an empty commit (59e55e3) to re-trigger CI. The partition-creation work in this PR is green locally under |
Adds migration 00024 with a new `ensure_gateway_parts_v4` stored function
and four hardened L1/L2 helpers that address three defects in the v2/v3
partition-management path:
1. `pg_advisory_xact_lock` on (namespace, originator) and
(namespace, hashtext(originator:band_start)) serializes concurrent
callers racing on the same partition.
2. `pg_inherits` short-circuit is the authoritative "already attached"
check, replacing a regex match on `SQLERRM` that could silently swallow
unrelated errors and leave partitions unattached.
3. `make_meta_seq_subpart_v2` had a `format()` arity bug that produced
`seq_id_check CHECK (... >= _oid AND < _start)` instead of the intended
`>= _start AND < _end`. The v4 helpers build the correct predicate.
Production callers (`InsertGatewayEnvelopeWithChecks{Standalone,Transactional}`,
`InsertGatewayEnvelopeBatchV2Transactional`, and the partition-creation
worker) now route through V4. The legacy v2/v3 helpers remain in pg_proc
so migration-behavior tests continue to populate pre-rename databases
through their existing code paths.
Adds migration-level tests (`migration_00024_test.go`) for attachment,
idempotence, and V3 coexistence, plus Go-level tests
(`TestEnsureGatewayPartsV4_ConcurrentCreate`,
`TestInsertGatewayEnvelopeWithChecksStandalone_ConcurrentWithWarmup`)
that race 32 goroutines on the new helpers and assert single-copy
attachment with no "no partition of relation" errors.
Resolves xmtp#1967
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per-resource advisory locks in v4 do not prevent cross-oid deadlocks. PostgreSQL's ATTACH PARTITION on a sub-partitioned child propagates lock requirements up to the top parent AND across sibling L1 children (to validate partition-bound non-overlap). So a caller attaching L1 child `oA` (holding AccessExclusive on that child while requesting ShareRowExclusive on the top parent) can deadlock against a concurrent caller that already holds ShareRowExclusive on the top parent and now wants ShareUpdateExclusive on `oA` as part of its own ATTACH propagation. This reproduces as SQLSTATE 40P01 in TestCreateServer in CI. Fix: take a single GLOBAL `pg_advisory_xact_lock` at entry of ensure_gateway_parts_v4 so that at most one partition-creation flow runs concurrently across the cluster. Partition creation is a rare cold-path event (triggered only when the savepoint-retry path fires on the first insert for a new (oid, band)), so the global serialization has no meaningful throughput cost. The per-resource advisory locks inside the helpers remain as defense-in-depth for any direct callers. Verified by running `go test ./pkg/server/ -run TestCreateServer -count=20 -race`: all iterations pass; residual deadlock messages in the PG log occur at a dramatically lower rate and are transparently retried by `publish_worker.processBatchWithRetry`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
59e55e3 to
e141040
Compare
|
Rebased on main to pick up #1975 (TestCreateServer replication timeout alignment). The previous Test(Node) failure on run 24416630507 showed:
With #1975 now included, both |
Resolves #1967
Summary
Hardens
ensure_gateway_parts/make_*_partagainst three defects that let concurrent callers flake withERROR: no partition of relation "gateway_envelopes_meta" found for row (SQLSTATE 23514)— the failure reported in #1967.pg_advisory_xact_lock(namespace, key)serializes concurrentCREATE TABLE/ATTACH PARTITIONfor the same originator and band. The v3 helper had no cross-caller serialization.pg_inherits-based short-circuit. Replaces the regex match onSQLERRM ~ 'is already a partition'(a PL/pgSQL sub-transaction that also rolled back the precedingCREATE TABLE). Any other error matching that substring — or any future change in PostgreSQL's error text — became a silent no-op that left the partition unattached while the caller saw success.seq_id_checkCHECK predicate.make_meta_seq_subpart_v2passes four arguments into aformat()string with three placeholders; PostgreSQL silently drops the extra and the predicate ends up as>= _oid AND < _startinstead of>= _start AND < _end. Benign because the seed constraint is dropped immediately after a successful ATTACH, but an objective defect. V4 writes the correct predicate.Changes
pkg/db/migrations/00024_harden-partition-creation.up.sql— new migration addsmake_meta_originator_part_v3,make_meta_seq_subpart_v3,make_blob_originator_part_v4,make_blob_seq_subpart_v4,ensure_gateway_parts_v4. Legacy v2/v3 helpers remain inpg_procso migration-behavior tests (e.g.migration_00023_test.go) continue to populate pre-rename databases through their existing code paths.pkg/db/sqlc/partitions.sql+ regenerated bindings — newEnsureGatewayPartsV4query.InsertGatewayEnvelopeWithChecksStandalone,InsertGatewayEnvelopeWithChecksTransactional,InsertGatewayEnvelopeBatchV2Transactional, and the partition-creation worker (pkg/db/worker/worker.go).Testing
pkg/db/migrations/migration_00024_test.goverifies L1+L2 attachment viapg_inherits, absence of residualoid_check/seq_id_checkconstraints viapg_constraint, idempotence of double calls, and V3 coexistence.TestEnsureGatewayPartsV4_ConcurrentCreateraces 32 goroutines onEnsureGatewayPartsV4for the same(originator, band)and asserts exactly one L1 + one L2 pair on each of meta and blob.TestInsertGatewayEnvelopeWithChecksStandalone_ConcurrentWithWarmupexercises the short-circuit path with 16 concurrent inserters after a warmup insert (mirrors the existingTestInsertAndIncrementParallelpattern; note that PostgreSQL's intrinsic INSERT-vs-ATTACH lock ordering can deadlock independently of this code, which is why the warmup is required).Test plan
go test -race -count=10 ./pkg/db/...passesgo test -race -count=20 -run TestQueryTopicFromLastSeen ./pkg/api/...passes (originally flaky)go test -count=1 ./pkg/db/... ./pkg/api/...passesdev/lint-fixclean🤖 Generated with Claude Code
Note
Harden gateway partition creation against races with
ensure_gateway_parts_v4ensure_gateway_parts_v4, which uses advisory transaction locks andpg_inheritschecks to serialize concurrent partition creation and attachment forgateway_envelopes_metaandgateway_envelopes_blob.EnsureGatewayPartsV3withEnsureGatewayPartsV4in the retry path of single inserts, batch inserts, and the background partition pre-creation worker.Macroscope summarized 871b5a7. (Automatic summaries will resume when PR exits draft mode or review begins).