
feat(paradedb): expose replicationSlots on the CNPG Cluster #198

Open

philippemnoel wants to merge 3 commits into main from feat/replication-slots-failover

Conversation

@philippemnoel
Member

Ticket(s) Closed

  • Closes #

What

Exposes CloudNativePG's .spec.replicationSlots field on the ParadeDB Cluster chart via a new cluster.replicationSlots value.

Why

CNPG does not synchronize user-created logical replication slots to standbys by default, so they only exist on the current primary. When CNPG fails over (manual, upgrade-driven, or otherwise), CDC consumers that depend on those slots — e.g. Artie, Debezium — lose their replication position and stop streaming until the slot is recreated, usually requiring a re-snapshot.

PostgreSQL 17+ supports failover slots (sync_replication_slots = on, synchronized_standby_slots, slot-level failover = true). CNPG 1.22+ wires this up via .spec.replicationSlots.synchronizeReplicas, but the field was not surfaced by this chart, so chart users couldn't enable it without patching the rendered Cluster CR directly.
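
For reference, the CNPG-side block this exposes looks roughly like this on a Cluster CR (field names per the CNPG v1 API; values illustrative):

```yaml
# Sketch of CNPG's .spec.replicationSlots (values illustrative)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example
spec:
  replicationSlots:
    highAvailability:
      enabled: true        # CNPG-managed physical slots for streaming replicas
    updateInterval: 30     # seconds between slot-position updates on standbys
    synchronizeReplicas:
      enabled: true        # synchronize user-created logical slots to standbys
```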

A real customer (momogood, PG 18 + Artie -> Snowflake) hit this after a paradedb version bump: the rolling upgrade switched the primary, the logical slot didn't follow, and CDC stopped.

How

  • Added a cluster.replicationSlots block in values.yaml (default {}, with a commented-out example covering highAvailability, updateInterval, and synchronizeReplicas).
  • Rendered it on the Cluster CR in templates/cluster.yaml using {{- with ... }}, so existing deployments that don't set the value are unaffected.
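
A minimal sketch of the template wiring, assuming the usual toYaml/nindent pattern (the actual context in templates/cluster.yaml may differ):

```yaml
# templates/cluster.yaml (sketch; surrounding spec omitted)
spec:
  {{- with .Values.cluster.replicationSlots }}
  replicationSlots:
    {{- toYaml . | nindent 4 }}
  {{- end }}
```

Because an empty map is falsy in Go templates, the default {} renders nothing, which is what keeps existing deployments unchanged.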

Users opting in additionally need (see the combined sketch after this list):

  • PostgreSQL 17+ (already required for failover-slot syncing).
  • sync_replication_slots = on in cluster.postgresql.parameters.
  • Their CDC client to create the logical slot with failover = true.
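
Taken together, an opt-in configuration might look like the following sketch (slot name and CDC details are illustrative, not part of this PR):

```yaml
# values.yaml overrides for failover-slot support (illustrative)
cluster:
  postgresql:
    parameters:
      sync_replication_slots: "on"  # PG 17+ GUC; syncs failover slots to standbys
  replicationSlots:
    highAvailability:
      enabled: true
    synchronizeReplicas:
      enabled: true
# The CDC client then creates its slot with failover enabled, e.g. on PG 17+:
#   SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput',
#          temporary => false, twophase => false, failover => true);
```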

Tests

  • helm template with the value unset produces output identical to main (no replicationSlots key on the Cluster CR).
  • helm template with cluster.replicationSlots.highAvailability.enabled=true and cluster.replicationSlots.synchronizeReplicas.enabled=true produces a valid Cluster CR with the expected block (sketched below).
  • End-to-end on a PG 17+ cluster: create a logical slot with failover = true, trigger a CNPG switchover, confirm the slot is present on the new primary at the correct LSN.
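
For the second render, the asserted fragment of the Cluster CR would be roughly this (shape per the CNPG API; the exact assertion lives in the chainsaw test):

```yaml
# Expected fragment of the rendered Cluster CR (illustrative)
spec:
  replicationSlots:
    highAvailability:
      enabled: true
    synchronizeReplicas:
      enabled: true
```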

Commits

Surfaces CNPG's replicationSlots field so chart users can enable
synchronization of user-created logical replication slots between the
primary and standbys. With PostgreSQL 17+ failover slots, this lets CDC
consumers (e.g. Artie, Debezium) survive a CNPG failover without
losing the slot.

The block is omitted entirely when the value is empty, so existing
deployments are unaffected.

Extends the existing non-default-configuration chainsaw test with a
replicationSlots block (highAvailability + synchronizeReplicas) and
asserts it appears verbatim on the rendered CNPG Cluster.

The pooler test's 20s assert timeout was too tight: the Deployment's
status.readyReplicas occasionally hasn't been populated by the
kube-controller-manager within that window, causing flaky failures
unrelated to the chart change under test. Peer tests use 1m–3m
(paradedb-enterprise: 2m, replica: 3m). Bump cleanup to 1m for
consistency with the same peers.