
pgsql: add dynamic replication mode management for out-of-cluster standbys #2142

Open
y-ikeda-ha wants to merge 2 commits into ClusterLabs:main from y-ikeda-ha:pgsql-external-standby-sync-mode

Conversation

@y-ikeda-ha

Summary

This PR adds a new feature to the pgsql resource agent that dynamically manages the replication mode of PostgreSQL standbys connecting from outside the Pacemaker cluster, targeting multi-site disaster recovery (DR) use cases.

Motivation

In multi-site DR architectures where independent Pacemaker HA clusters run at separate sites, PostgreSQL data is replicated from the primary site to the DR site using synchronous streaming replication.

The current pgsql RA has no mechanism to manage replication connections from PostgreSQL instances outside the cluster. Administrators must manually change synchronous_standby_names to enable synchronous replication with the DR site. When the out-of-cluster synchronous standby disconnects, client transactions hang until an administrator manually reverts the configuration.

Solution

A new optional parameter external_standby_node_list is introduced.
When set, the RA automatically:

  1. Detects connections from listed external nodes via pg_stat_replication during the monitor action
  2. Adds connected nodes to synchronous_standby_names (using FIRST N (...) syntax when multiple sync targets exist)
  3. Removes disconnected nodes from synchronous_standby_names, preventing client transaction hangs without administrator intervention
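The detection step can be sketched roughly as follows. This is a hypothetical illustration, not the RA's actual code: in the real agent the connected standby names would come from a query such as `psql -Atc "SELECT application_name FROM pg_stat_replication"`, while here the result is hard-coded to keep the sketch self-contained.

```shell
#!/bin/sh
# Hypothetical sketch of the monitor-time decision, not the RA's actual
# code. The connected names would normally come from pg_stat_replication;
# a hard-coded stand-in is used here.
external_standby_node_list="dr-standby1 dr-standby2"
connected_standbys="standby1 dr-standby1"   # stand-in for the query result

# Keep only external nodes that are currently connected.
sync_externals=""
for node in $external_standby_node_list; do
    case " $connected_standbys " in
        *" $node "*) sync_externals="$sync_externals $node" ;;
    esac
done
sync_externals=${sync_externals# }

echo "$sync_externals"   # dr-standby1
```

Only `dr-standby1` survives the intersection: `dr-standby2` is pre-registered but not currently connected, so it is left out of the sync set until it actually appears in `pg_stat_replication`.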

Use case

Normal operation:

[Primary Site]                       [DR Site]
  Pacemaker Cluster                    Pacemaker Cluster
  +-----------------------+            +-----------------------+
  | primary1      (PRI)   |  sync rep  | dr-standby1     (HS)  |
  | standby1      (HS)    | -------->  |         |             |
  +-----------------------+            |  async rep (cascade)  |
                                       |         v             |
                                       | dr-standby2     (HS)  |
                                       +-----------------------+

When dr-standby1 fails:

[Primary Site]                       [DR Site]
  Pacemaker Cluster                    Pacemaker Cluster
  +-----------------------+            +-----------------------+
  | primary1      (PRI)   |  sync rep  | dr-standby1  (FAILED) |
  | standby1      (HS)    | -------->  |                       |
  +-----------------------+     |      | dr-standby2     (HS)  |
                                |      +-----------------------+
                                |              ^
                                +--------------+
node_list="primary1 standby1"
external_standby_node_list="dr-standby1 dr-standby2"

In this topology:

  • standby1 is an in-cluster synchronous standby managed by the existing node_list parameter.
  • dr-standby1 connects from outside the primary site's cluster via synchronous replication. It is listed in external_standby_node_list.
  • dr-standby2 normally replicates asynchronously from dr-standby1 (cascading replication) and does not connect directly to the primary. However, it is also listed in external_standby_node_list so that if dr-standby1 fails, dr-standby2 can connect directly to the primary and be automatically promoted to synchronous standby.

Key behaviors:

  1. When dr-standby1 connects to the primary, the RA adds it to synchronous_standby_names automatically.
  2. When dr-standby1 disconnects, the RA removes it, preventing transaction hangs.
  3. If dr-standby2 then connects directly to the primary (as a failover within the DR site), the RA detects this and adds dr-standby2 to synchronous_standby_names automatically.

This means external_standby_node_list serves as a pre-registered list of potential sync standby nodes — nodes do not need to be connected at the time of configuration.
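As a configuration sketch for the topology above (illustrative only; mandatory pgsql parameters such as pgctl, pgdata, and master_ip are omitted, and the exact pcs invocation depends on the pcs version), the new parameter would sit alongside the existing node_list:

```shell
# Illustrative pcs command; real deployments need the full set of
# required pgsql parameters (pgctl, pgdata, master_ip, ...).
pcs resource create pgsql ocf:heartbeat:pgsql \
    rep_mode="sync" \
    node_list="primary1 standby1" \
    external_standby_node_list="dr-standby1 dr-standby2" \
    promotable
```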

Changes

This PR contains two commits:

Commit 1: pgsql: enhance set_sync_mode to support multiple sync standby targets

Refactors set_sync_mode() as a prerequisite:

  • Accepts a space-separated list of node names (previously single node only)
  • Generates FIRST N (...) syntax when there are 2+ sync targets
  • Adds idempotency check to skip unnecessary pg_ctl reload
  • Parses both FIRST N (...) and plain quoted format from rep_mode.conf

No behavioral change when called with a single node argument (existing usage).
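The value-formatting part of this refactor can be sketched like so. `format_sync_standby_names` is an illustrative name, not the RA's actual function, and using N equal to the full node count in `FIRST N` is a simplification; the real agent may choose a smaller N.

```shell
#!/bin/sh
# Hypothetical sketch: one node yields the plain quoted form, two or
# more yield the PostgreSQL 9.6+ FIRST N (...) form.
format_sync_standby_names() {
    # Intentional word split: the argument is a space-separated node list.
    set -- $1
    if [ $# -le 1 ]; then
        printf '%s\n' "$1"
        return
    fi
    n=$#
    list="\"$1\""
    shift
    for node in "$@"; do
        list="$list, \"$node\""
    done
    printf 'FIRST %s (%s)\n' "$n" "$list"
}

format_sync_standby_names "standby1"
# standby1
format_sync_standby_names "standby1 dr-standby1"
# FIRST 2 ("standby1", "dr-standby1")
```

The single-node path emits exactly the node name, which is why existing single-target callers see no behavioral change.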

Commit 2: pgsql: add external_standby_node_list for out-of-cluster sync replication management

Adds the new feature:

  • New parameter external_standby_node_list (optional, default: empty)
  • Modified control_slave_status() to evaluate external nodes and make a consolidated sync mode decision
  • Warning log when a synchronous connection from an external node is lost
  • Variable initialization in validate_ocf_check_level_10()

Backward compatibility

  • When external_standby_node_list is not set (default), behavior is identical to the existing implementation
  • Designed for rep_mode="sync" configurations
  • FIRST N syntax requires PostgreSQL 9.6+; single-target mode works with PostgreSQL 9.1+
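For reference, the two synchronous_standby_names formats involved look like this (illustrative values):

```
# Single sync target, plain quoted form (PostgreSQL 9.1+):
synchronous_standby_names = 'standby1'

# Two or more sync targets, FIRST N form (PostgreSQL 9.6+):
synchronous_standby_names = 'FIRST 2 ("standby1", "dr-standby1")'
```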

Testing

Tested with:

  • Red Hat Enterprise Linux release 9.6
  • pacemaker-2.1.9-1.el9.x86_64
  • postgresql17-17.6-1PGDG.rhel9.x86_64

Tested topology: primary1 (PRI) + standby1 (sync HS, in-cluster) + dr-standby1 (sync HS, external) + dr-standby2 (async HS, cascading from dr-standby1).

Test scenarios:

  • dr-standby1 connects → automatically added to synchronous_standby_names
  • dr-standby1 disconnects → automatically removed (no transaction hang)
  • dr-standby1 fails, dr-standby2 connects directly to primary → automatically added to synchronous_standby_names
  • In-cluster standby + external standby connected simultaneously → FIRST N (...) syntax generated
  • external_standby_node_list not set → identical behavior to current code

AI disclosure

This PR description and commit messages were written with the assistance of Claude (Anthropic). The code itself was designed and implemented by the author. See the Assisted-by: trailer in each commit message.

Refactor set_sync_mode() to handle multiple synchronous standby nodes:

- Accept a space-separated list of node names as the argument
- Generate FIRST N (...) syntax for synchronous_standby_names when
  there are two or more sync targets
- Add idempotency check: skip configuration reload when the current
  settings already match the desired state
- Parse both FIRST N (...) format and plain quoted format from
  rep_mode.conf for comparison

This prepares for multi-target sync replication scenarios and also
reduces unnecessary pg_ctl reloads in the existing single-target case.

No behavioral change when called with a single node argument
(existing usage).

Assisted-by: Claude (Anthropic)
pgsql: add external_standby_node_list for out-of-cluster sync replication management

In multi-site disaster recovery architectures where independent
Pacemaker clusters run at separate sites, the pgsql RA needs to
manage synchronous replication connections from PostgreSQL instances
outside the local cluster.

Without this feature, administrators must manually modify
synchronous_standby_names to enable synchronous replication with
DR-site standbys. When such a standby disconnects, client transactions
hang until manual intervention.

Add a new optional parameter "external_standby_node_list" that
specifies standby node names connecting from outside the cluster:

- During monitor (control_slave_status), the RA checks
  pg_stat_replication for both in-cluster and external nodes
- Connected external nodes are added to synchronous_standby_names
- Disconnected external nodes are removed automatically, preventing
  transaction hangs
- A warning is logged when an external sync connection is lost

When external_standby_node_list is not set (default), behavior is
identical to the existing implementation.

Tested-on: RHEL 9.6, Pacemaker 2.1.9, PostgreSQL 17.6

Assisted-by: Claude (Anthropic)
@knet-jenkins

knet-jenkins bot commented Mar 31, 2026

Can one of the project admins check and authorise this run please: https://haci.fast.eng.rdu2.dc.redhat.com/job/resource-agents/job/resource-agents-pipeline/job/PR-2142/1/input
