fix: harden gateway partition creation against races #1974

Draft
xmtp-coder-agent wants to merge 2 commits into xmtp:main from xmtp-coder-agent:fix/issue-1967

Conversation

@xmtp-coder-agent
Collaborator

@xmtp-coder-agent xmtp-coder-agent commented Apr 14, 2026

Resolves #1967

Summary

Hardens ensure_gateway_parts/make_*_part against three defects that let concurrent callers flake with ERROR: no partition of relation "gateway_envelopes_meta" found for row (SQLSTATE 23514) — the failure reported in #1967.

  • Per-(originator, band) advisory locks. pg_advisory_xact_lock(namespace, key) serializes concurrent CREATE TABLE / ATTACH PARTITION for the same originator and band. The v3 helper had no cross-caller serialization.
  • pg_inherits-based short-circuit. Replaces the regex match on SQLERRM ~ 'is already a partition', which ran in a PL/pgSQL sub-transaction that also rolled back the preceding CREATE TABLE. Any other error whose message matched that substring became a silent no-op that left the partition unattached while the caller saw success, and the check depended on PostgreSQL's error text never changing.
  • Correct seq_id_check CHECK predicate. make_meta_seq_subpart_v2 passes four arguments into a format() string with three placeholders; PostgreSQL silently drops the extra argument, so the predicate ends up as >= _oid AND < _start instead of >= _start AND < _end. This was benign in practice because the seed constraint is dropped immediately after a successful ATTACH, but it is still a defect. The V4 path writes the correct predicate.
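
To make the third defect concrete, here is a minimal Go sketch of the arity mismatch. It is an illustration only: `pgFormat` is a simplified stand-in for PostgreSQL's `format()` (it handles `%s` only), and the template text, the `seq_id` column name, and the sample values are assumptions, not the contents of `make_meta_seq_subpart_v2`.

```go
package main

import (
	"fmt"
	"strings"
)

// seedCheckTmpl is a hypothetical template with three placeholders:
// table name, lower bound, upper bound.
const seedCheckTmpl = "ALTER TABLE %s ADD CONSTRAINT seq_id_check CHECK (seq_id >= %s AND seq_id < %s)"

// pgFormat mimics one behavior of PostgreSQL's format(): each %s consumes the
// next argument in order, and surplus arguments are ignored without an error.
// Unlike the real function, it leaves a %s untouched when arguments run out.
func pgFormat(tmpl string, args ...string) string {
	var b strings.Builder
	next := 0
	for i := 0; i < len(tmpl); i++ {
		if tmpl[i] == '%' && i+1 < len(tmpl) && tmpl[i+1] == 's' && next < len(args) {
			b.WriteString(args[next])
			next++
			i++ // skip the 's'
			continue
		}
		b.WriteByte(tmpl[i])
	}
	return b.String()
}

func main() {
	table, oid, start, end := "part", "100", "0", "1000000"

	// Four arguments for three placeholders: end is dropped and the bounds shift.
	fmt.Println(pgFormat(seedCheckTmpl, table, oid, start, end))

	// Three arguments: the bounds land where they were intended.
	fmt.Println(pgFormat(seedCheckTmpl, table, start, end))
}
```

With the extra argument, the lower bound receives the originator ID and the upper bound receives the band start, which matches the shape of the wrong predicate described above.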

Changes

  • pkg/db/migrations/00024_harden-partition-creation.up.sql — new migration adds make_meta_originator_part_v3, make_meta_seq_subpart_v3, make_blob_originator_part_v4, make_blob_seq_subpart_v4, ensure_gateway_parts_v4. Legacy v2/v3 helpers remain in pg_proc so migration-behavior tests (e.g. migration_00023_test.go) continue to populate pre-rename databases through their existing code paths.
  • pkg/db/sqlc/partitions.sql + regenerated bindings — new EnsureGatewayPartsV4 query.
  • Production callers routed through V4: InsertGatewayEnvelopeWithChecksStandalone, InsertGatewayEnvelopeWithChecksTransactional, InsertGatewayEnvelopeBatchV2Transactional, and the partition-creation worker (pkg/db/worker/worker.go).

Testing

  • pkg/db/migrations/migration_00024_test.go verifies L1+L2 attachment via pg_inherits, absence of residual oid_check / seq_id_check constraints via pg_constraint, idempotence of double calls, and V3 coexistence.
  • TestEnsureGatewayPartsV4_ConcurrentCreate races 32 goroutines on EnsureGatewayPartsV4 for the same (originator, band) and asserts exactly one L1 + one L2 pair on each of meta and blob.
  • TestInsertGatewayEnvelopeWithChecksStandalone_ConcurrentWithWarmup exercises the short-circuit path with 16 concurrent inserters after a warmup insert (mirrors the existing TestInsertAndIncrementParallel pattern; note that PostgreSQL's intrinsic INSERT-vs-ATTACH lock ordering can deadlock independently of this code, which is why the warmup is required).
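
The shape of the concurrent-create assertion can be sketched without a database. In the Go sketch below, `registry` is a hypothetical in-memory stand-in for the partition catalog, and its mutex plays the role that the advisory lock plus the pg_inherits check play in the migration; none of these names come from the PR's test code.

```go
package main

import (
	"fmt"
	"sync"
)

// partKey identifies one (originator, band) pair. The field types are assumptions.
type partKey struct {
	originator uint32
	bandStart  int64
}

// registry is an in-memory stand-in for the set of attached partitions.
type registry struct {
	mu      sync.Mutex
	parts   map[partKey]bool
	creates int // number of times a partition was actually created
}

// ensure is idempotent: it creates the partition only if it is not yet present.
// Holding mu across the check and the create is what prevents double creation.
func (r *registry) ensure(k partKey) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.parts[k] {
		return // short-circuit: already attached
	}
	r.parts[k] = true
	r.creates++
}

// raceEnsure starts n goroutines that all ensure the same key and returns
// how many creations happened in total.
func raceEnsure(n int) int {
	r := &registry{parts: make(map[partKey]bool)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.ensure(partKey{originator: 100, bandStart: 0})
		}()
	}
	wg.Wait()
	return r.creates
}

func main() {
	fmt.Println("creations:", raceEnsure(32))
}
```

The real test asserts the same property against pg_inherits rather than a counter: however many callers race, exactly one partition per level ends up attached.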

Test plan

  • go test -race -count=10 ./pkg/db/... passes
  • go test -race -count=20 -run TestQueryTopicFromLastSeen ./pkg/api/... passes (originally flaky)
  • go test -count=1 ./pkg/db/... ./pkg/api/... passes
  • dev/lint-fix clean

🤖 Generated with Claude Code

Note

Harden gateway partition creation against races with ensure_gateway_parts_v4

  • Adds a new DB migration (00024_harden-partition-creation.up.sql) introducing ensure_gateway_parts_v4, which uses advisory transaction locks and pg_inherits checks to serialize concurrent partition creation and attachment for gateway_envelopes_meta and gateway_envelopes_blob.
  • Replaces all calls to EnsureGatewayPartsV3 with EnsureGatewayPartsV4 in the retry path of single inserts, batch inserts, and the background partition pre-creation worker.
  • Adds concurrency and idempotency tests verifying that racing goroutines produce exactly one L1 and one L2 partition, and that v4 coexists cleanly with existing v3-partitioned schemas.
  • Behavioral Change: partition creation on retry and background pre-creation now acquires an advisory lock per originator, which serializes concurrent callers but may increase contention under high parallelism.
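
For readers unfamiliar with the retry path mentioned above, the control flow looks roughly like the following Go sketch. It assumes the insert fails with SQLSTATE 23514 when the partition is missing (the code reported in the PR description); `sqlStateError`, `insertWithEnsure`, and the function-valued parameters are hypothetical, and the savepoint handling used by the transactional callers is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// sqlStateError is a hypothetical error type that carries a SQLSTATE code.
type sqlStateError struct{ code string }

func (e *sqlStateError) Error() string { return "SQLSTATE " + e.code }

// checkViolation is the SQLSTATE reported for "no partition of relation ... found for row".
const checkViolation = "23514"

// insertWithEnsure tries the insert once. Only when it fails with 23514 does it
// create the missing partitions and retry the insert a single time.
func insertWithEnsure(insert, ensure func() error) error {
	err := insert()
	var se *sqlStateError
	if err == nil || !errors.As(err, &se) || se.code != checkViolation {
		return err // success, or an error that ensure cannot fix
	}
	if err := ensure(); err != nil {
		return fmt.Errorf("ensure gateway parts: %w", err)
	}
	return insert()
}

func main() {
	attempts := 0
	insert := func() error {
		attempts++
		if attempts == 1 {
			return &sqlStateError{code: checkViolation} // partition missing on first try
		}
		return nil
	}
	ensure := func() error { return nil } // stands in for EnsureGatewayPartsV4

	fmt.Println("error:", insertWithEnsure(insert, ensure), "attempts:", attempts)
}
```

Because ensure runs only after a failed insert, the advisory locks are taken on the cold path and never on a successful first insert.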

Macroscope summarized 871b5a7. (Automatic summaries will resume when PR exits draft mode or review begins).

@octane-security-app

Summary by Octane

New Contracts

No new contracts were added.

Updated Contracts

  • gateway_envelope.go: Updated to use EnsureGatewayPartsV4 instead of V3 in transaction and standalone operations.
  • gateway_envelope_batch.go: Updated the function to use EnsureGatewayPartsV4 instead of EnsureGatewayPartsV3 for partition creation.
  • db.go: Added support for a new SQL statement, ensureGatewayPartsV4, in the smart contract.
  • partitions.sql.go: The smart contract upgrades to V4, adding advisory locking, improved constraints, and error handling enhancements.
  • worker.go: The smart contract update changes the function call from EnsureGatewayPartsV3 to EnsureGatewayPartsV4.

🔗 Commit Hash: 871b5a7

@xmtp-coder-agent
Collaborator Author

Additional commit: global advisory lock in ensure_gateway_parts_v4

Pushed 5c03acb in response to the flake recurrence reported on #1967 (a different test, TestCreateServer, failing with SQLSTATE 40P01 deadlock rather than 23514). Per-resource advisory locks plus pg_inherits short-circuit address the original no partition found symptom fully, but do not prevent a distinct cross-oid deadlock pattern that PG's multi-level ATTACH PARTITION can create:

  • Caller A in ensure for oid=A, holds AccessExclusive on its new oA L1 child, wants ShareRowExclusive on the top parent.
  • Caller B in ensure for oid=B, already holds ShareRowExclusive on the top parent (from its earlier step-1 ATTACH), now wants ShareUpdateExclusive on A's oA child for its own step-3 ATTACH's sibling-lock propagation.
  • Neither caller holds the same per-oid advisory key, so the v4 per-resource locks don't serialize them.

The added fix: take a single global pg_advisory_xact_lock(hashtext('xmtpd.ensure_gateway_parts')::bigint) at entry of ensure_gateway_parts_v4. This makes the four make_*_part steps atomic with respect to every other ensure_gateway_parts_v4 caller across the cluster. Cost is negligible because partition creation is a rare cold-path event (only on the first insert for each new (oid, band)).
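
Why a single entry lock removes the cross-oid deadlock can be shown with an in-process analogy. In this Go sketch the three mutexes are stand-ins only: `global` for the advisory lock, `top` and `child` for the table-level locks on the top parent and on an L1 child. PostgreSQL lock modes are not modeled, and `runSerialized` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sync"
)

// runSerialized starts two workers that take the same pair of locks in
// opposite orders, which is the classic deadlock pattern. Because each worker
// first takes the global lock, only one of them is ever inside the critical
// section, so the opposite ordering can no longer deadlock.
func runSerialized(rounds int) int {
	var global, top, child sync.Mutex
	var wg sync.WaitGroup
	done := 0 // guarded by global

	worker := func(first, second *sync.Mutex) {
		defer wg.Done()
		for i := 0; i < rounds; i++ {
			global.Lock() // analogous to the advisory lock at function entry
			first.Lock()
			second.Lock()
			done++
			second.Unlock()
			first.Unlock()
			global.Unlock()
		}
	}

	wg.Add(2)
	go worker(&child, &top) // caller A: child first, then top parent
	go worker(&top, &child) // caller B: top parent first, then child
	wg.Wait()
	return done
}

func main() {
	fmt.Println("completed rounds:", runSerialized(1000))
}
```

Removing the `global.Lock()` / `global.Unlock()` pair from this sketch reintroduces the possibility of each worker holding one lock while waiting for the other, which is the situation described in the bullets above.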

Verified locally with go test ./pkg/server/ -run TestCreateServer -race -count=20 — all iterations pass. Residual ensure-vs-concurrent-INSERT deadlocks still appear at a much lower rate in the PG log but are transparently retried by publish_worker.processBatchWithRetry (3 attempts) and do not surface to callers or the test.
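
The bounded retry that absorbs the residual deadlocks can be sketched as follows. This is not the code of `processBatchWithRetry`; `pgError` and `withDeadlockRetry` are hypothetical names, and the only facts taken from this thread are the SQLSTATE (40P01) and the limit of 3 attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// pgError is a hypothetical error type that carries a SQLSTATE code.
type pgError struct{ Code string }

func (e *pgError) Error() string { return "SQLSTATE " + e.Code }

// deadlockDetected is PostgreSQL's SQLSTATE for a detected deadlock.
const deadlockDetected = "40P01"

// withDeadlockRetry runs op up to maxAttempts times. It retries only when op
// fails with a deadlock; any other outcome is returned immediately. It returns
// the number of attempts made and the final error.
func withDeadlockRetry(maxAttempts int, op func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = op()
		var pe *pgError
		if err == nil || !errors.As(err, &pe) || pe.Code != deadlockDetected {
			return attempt, err
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	attempts, err := withDeadlockRetry(3, func() error {
		calls++
		if calls < 2 {
			return &pgError{Code: deadlockDetected} // first attempt loses the deadlock
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "error:", err)
}
```

A deadlock victim's transaction is rolled back by PostgreSQL, so a retry of this kind has to rerun the whole unit of work rather than a single statement.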

Full write-up on the issue: #1967 (comment).

@octane-security-app

Overview

Vulnerabilities found: 3
Severity breakdown: 1 Medium, 1 Low, 1 Informational
Warnings found: 4

Detailed findings

pkg/db/gateway_envelope.go

  • Ungated EnsureGatewayPartsV4 migration dependency in reserved-topic replication path causes silent drop of payer reports/attestations.
  • Fast-path ALTER TABLE in EnsureGatewayPartsV4 within transactional insert causes transient partition-level stalls.

pkg/db/worker/worker.go

  • Switch to advisory-locked EnsureGatewayPartsV4 in startup DB worker causes API startup hang under lock contention.

Warnings

pkg/db/gateway_envelope_batch.go

  • Identifier truncation breaks EnsureGatewayPartsV4 idempotence in per-envelope batch retry, causing publish pipeline stall.
  • Non-deterministic transaction-scoped advisory lock acquisition in InsertGatewayEnvelopeBatchV2Transactional retry loop causes deadlock and transient availability degradation.
  • Switch to EnsureGatewayPartsV4 without DB migration 00024 in gateway publish path causes persistent publish stall and payer‑report freeze.

pkg/db/migrations/00024_harden-partition-creation.up.sql

  • 32-bit hashed advisory-lock keying in EnsureGatewayPartsV4 causes rare cross-originator lock contention.

🔗 Commit Hash: 871b5a7
🛡️ Octane Dashboard: All vulnerabilities

@xmtp-coder-agent
Collaborator Author

Reviewed the Octane findings. Summary — no code changes warranted:

  • Reserved-topic silent drop (gateway_envelope.go): false positive. Reserved topics are routed through storeReservedEnvelope in pkg/sync/envelope_sink.go, which calls payerReportStore.StoreSync{edReport,edAttestation} — not InsertGatewayEnvelopeV3/EnsureGatewayPartsV4. No migration dependency on that path.
  • ALTER TABLE stalls / retry-loop non-determinism (transactional insert + batch retry): the residual ensure-vs-INSERT window is the known class documented in this PR's follow-up commit (5c03acb). It is mitigated by the global advisory lock and transparently retried by publish_worker.processBatchWithRetry (3 attempts). Full architectural isolation (ensure on a separate connection) is the deferred escalation, documented in #1974 (comment), to be taken if CI flakiness resurfaces.
  • Worker startup hang (worker.go): false positive. Worker.Start logs and continues on runDBCheck error (worker.go:77-78: "Not stopping on this error"). The worker runs every 30 min by default and its advisory-lock contention with publish-path ensure is bounded by the short ensure flow.
  • Migration 00024 missing (gateway_envelope_batch.go warning): false positive. Migration 00024 is the central change in this PR.
  • Identifier truncation in batch retry: preexisting concern, not introduced here — v2/v3 helpers used the same gateway_envelopes_meta_o<oid>_s<start>_<end> format. For realistic (oid, band_start, band_end) values (oid ≤ 10 digits, band ≤ 1M-granularity sequence IDs), names stay within PostgreSQL's 63-byte NAMEDATALEN limit.
  • 32-bit hash collisions for the global advisory lock key: verified no collision on the live schema. hashtext('xmtpd.ensure_gateway_parts') = 739363097, distinct from the other advisory-lock keys in use (hashtext('migration_dead_letter_box_sequence') = -1258075077, hashtext('staged_originator_envelopes_sequence') = 1146398905).
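
The identifier-length argument above can be checked mechanically. The Go sketch below builds a name in the `gateway_envelopes_meta_o<oid>_s<start>_<end>` format quoted in this comment and compares it against the 63-byte limit; the function names and the integer types (`uint32` for the originator, `int64` for the band bounds) are assumptions.

```go
package main

import "fmt"

// maxIdentBytes is the longest identifier PostgreSQL keeps with the default
// NAMEDATALEN of 64; longer names are truncated.
const maxIdentBytes = 63

// metaSubpartName renders the L2 meta partition name in the format quoted above.
func metaSubpartName(oid uint32, start, end int64) string {
	return fmt.Sprintf("gateway_envelopes_meta_o%d_s%d_%d", oid, start, end)
}

// fitsIdentifier reports whether name survives without truncation.
func fitsIdentifier(name string) bool {
	return len(name) <= maxIdentBytes
}

func main() {
	// Largest 10-digit originator with a small band: well inside the limit.
	realistic := metaSubpartName(4294967295, 0, 1000000)
	fmt.Println(len(realistic), fitsIdentifier(realistic))

	// Two 19-digit band bounds: past the limit, so truncation would occur.
	extreme := metaSubpartName(4294967295, 9000000000000000000, 9000000000001000000)
	fmt.Println(len(extreme), fitsIdentifier(extreme))
}
```

The fixed prefix and separators leave 26 digits to share between the two band bounds when the originator has 10 digits, so truncation needs sequence IDs far beyond the values described as realistic here.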

@xmtp-coder-agent
Collaborator Author

The failing Test(Node) check on run 24415993135 is an unrelated pre-existing flake:

Pushed an empty commit (59e55e3) to re-trigger CI. The partition-creation work in this PR is green locally under TestCreateServer -race -count=20.

xmtp-coder-agent and others added 2 commits April 14, 2026 19:00

Adds migration 00024 with a new `ensure_gateway_parts_v4` stored function
and four hardened L1/L2 helpers that address three defects in the v2/v3
partition-management path:

1. `pg_advisory_xact_lock` on (namespace, originator) and
   (namespace, hashtext(originator:band_start)) serializes concurrent
   callers racing on the same partition.
2. `pg_inherits` short-circuit is the authoritative "already attached"
   check, replacing a regex match on `SQLERRM` that could silently swallow
   unrelated errors and leave partitions unattached.
3. `make_meta_seq_subpart_v2` had a `format()` arity bug that produced
   `seq_id_check CHECK (... >= _oid AND < _start)` instead of the intended
   `>= _start AND < _end`. The v4 helpers build the correct predicate.

Production callers (`InsertGatewayEnvelopeWithChecks{Standalone,Transactional}`,
`InsertGatewayEnvelopeBatchV2Transactional`, and the partition-creation
worker) now route through V4. The legacy v2/v3 helpers remain in pg_proc
so migration-behavior tests continue to populate pre-rename databases
through their existing code paths.

Adds migration-level tests (`migration_00024_test.go`) for attachment,
idempotence, and V3 coexistence, plus Go-level tests
(`TestEnsureGatewayPartsV4_ConcurrentCreate`,
`TestInsertGatewayEnvelopeWithChecksStandalone_ConcurrentWithWarmup`)
that race 32 goroutines on the new helpers and assert single-copy
attachment with no "no partition of relation" errors.

Resolves xmtp#1967

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Per-resource advisory locks in v4 do not prevent cross-oid deadlocks.
PostgreSQL's ATTACH PARTITION on a sub-partitioned child propagates lock
requirements up to the top parent AND across sibling L1 children (to
validate partition-bound non-overlap). So a caller attaching L1 child
`oA` (holding AccessExclusive on that child while requesting
ShareRowExclusive on the top parent) can deadlock against a concurrent
caller that already holds ShareRowExclusive on the top parent and now
wants ShareUpdateExclusive on `oA` as part of its own ATTACH propagation.
This reproduces as SQLSTATE 40P01 in TestCreateServer in CI.

Fix: take a single GLOBAL `pg_advisory_xact_lock` at entry of
ensure_gateway_parts_v4 so that at most one partition-creation flow runs
concurrently across the cluster. Partition creation is a rare cold-path
event (triggered only when the savepoint-retry path fires on the first
insert for a new (oid, band)), so the global serialization has no
meaningful throughput cost. The per-resource advisory locks inside the
helpers remain as defense-in-depth for any direct callers.

Verified by running `go test ./pkg/server/ -run TestCreateServer -count=20
-race`: all iterations pass; residual deadlock messages in the PG log
occur at a dramatically lower rate and are transparently retried by
`publish_worker.processBatchWithRetry`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@xmtp-coder-agent
Collaborator Author

Rebased on main to pick up #1975 (TestCreateServer replication timeout alignment). The previous Test(Node) failure on run 24416630507 showed:

  • One ensure gateway parts: deadlock detected in publish-worker — transparently retried and succeeded in ~100ms via processBatchWithRetry (exactly the self-healing behavior this PR was designed to enable).
  • Test assertion failed at server_test.go:163 (first Eventually, 20s budget) — not on the deadlock path, but on cross-node replication lag.

With #1975 now included, both Eventually checks have matching 20s/500ms budgets. CI running.

Development

Successfully merging this pull request may close these issues.

Flaky CI Failure: TestQueryTopicFromLastSeen — no partition of relation gateway_envelopes_meta found for row