pgsql: add dynamic replication mode management for out-of-cluster standbys #2142
Open
y-ikeda-ha wants to merge 2 commits into ClusterLabs:main
Refactor set_sync_mode() to handle multiple synchronous standby nodes:
- Accept a space-separated list of node names as the argument
- Generate FIRST N (...) syntax for synchronous_standby_names when there are two or more sync targets
- Add idempotency check: skip configuration reload when the current settings already match the desired state
- Parse both FIRST N (...) format and plain quoted format from rep_mode.conf for comparison

This prepares for multi-target sync replication scenarios and also reduces unnecessary pg_ctl reloads in the existing single-target case.

No behavioral change when called with a single node argument (existing usage).

Assisted-by: Claude (Anthropic)
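The value generation described in this commit message can be illustrated with a minimal sketch. It is not the PR's code: the helper name build_sync_standby_names is hypothetical, and it assumes that N equals the number of listed nodes and that names are comma-separated without spaces.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): build the value written to
# synchronous_standby_names from a space-separated list of node names.
build_sync_standby_names() (
    count=0
    quoted=""
    for node in $1; do
        count=$((count + 1))
        if [ -z "$quoted" ]; then
            quoted="\"$node\""
        else
            quoted="$quoted,\"$node\""
        fi
    done
    if [ "$count" -eq 0 ]; then
        # No sync target: empty value.
        printf '%s\n' ""
    elif [ "$count" -eq 1 ]; then
        # Single target: plain quoted format, as in existing usage.
        printf '%s\n' "$quoted"
    else
        # Assumption: N is the number of listed nodes.
        printf 'FIRST %s (%s)\n' "$count" "$quoted"
    fi
)

build_sync_standby_names "standby1"              # prints "standby1" (with the double quotes)
build_sync_standby_names "standby1 dr-standby1"  # prints FIRST 2 ("standby1","dr-standby1")
```

Quoting each name matters here because names such as dr-standby1 contain a hyphen.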
…tion management

In multi-site disaster recovery architectures where independent Pacemaker clusters run at separate sites, the pgsql RA needs to manage synchronous replication connections from PostgreSQL instances outside the local cluster.

Without this feature, administrators must manually modify synchronous_standby_names to enable synchronous replication with DR-site standbys. When such a standby disconnects, client transactions hang until manual intervention.

Add a new optional parameter "external_standby_node_list" that specifies standby node names connecting from outside the cluster:
- During monitor (control_slave_status), the RA checks pg_stat_replication for both in-cluster and external nodes
- Connected external nodes are added to synchronous_standby_names
- Disconnected external nodes are removed automatically, preventing transaction hangs
- A warning is logged when an external sync connection is lost

When external_standby_node_list is not set (default), behavior is identical to the existing implementation.

Tested-on: RHEL 9.6, Pacemaker 2.1.9, PostgreSQL 17.6
Assisted-by: Claude (Anthropic)
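A rough sketch of the add/remove decision described above, under stated assumptions: the function name desired_sync_nodes and its three arguments are hypothetical, and the list of connected application_name values is passed in as a plain string rather than queried from pg_stat_replication. Warning logging is left out.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): decide which nodes should be
# listed in synchronous_standby_names.
#   $1: in-cluster sync node already selected by the RA (may be empty)
#   $2: value of external_standby_node_list (space-separated, assumed)
#   $3: application_name values currently present in pg_stat_replication
desired_sync_nodes() (
    result="$1"
    for ext in $2; do
        found=0
        for conn in $3; do
            if [ "$conn" = "$ext" ]; then
                found=1
            fi
        done
        # Only connected external nodes are kept; disconnected ones drop out.
        if [ "$found" -eq 1 ]; then
            result="${result:+$result }$ext"
        fi
    done
    printf '%s\n' "$result"
)

# dr-standby1 connected: it joins the in-cluster standby.
desired_sync_nodes "standby1" "dr-standby1 dr-standby2" "standby1 dr-standby1"
# dr-standby1 gone, dr-standby2 connected directly: dr-standby2 takes its place.
desired_sync_nodes "standby1" "dr-standby1 dr-standby2" "standby1 dr-standby2"
```

Because the result is recomputed from the connected set on every call, a disconnected external node disappears from the list without any separate removal step.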
Can one of the project admins check and authorise this run please: https://haci.fast.eng.rdu2.dc.redhat.com/job/resource-agents/job/resource-agents-pipeline/job/PR-2142/1/input
Summary
This PR adds a new feature to the pgsql resource agent that dynamically manages the replication mode of PostgreSQL standbys connecting from outside the Pacemaker cluster, targeting multi-site disaster recovery (DR) use cases.
Motivation
In multi-site DR architectures where independent Pacemaker HA clusters run at separate sites, PostgreSQL data is replicated from the primary site to the DR site using synchronous streaming replication.
The current pgsql RA has no mechanism to manage replication connections from PostgreSQL instances outside the cluster. Administrators must manually change synchronous_standby_names to enable synchronous replication with the DR site. When the out-of-cluster synchronous standby disconnects, client transactions hang until an administrator manually reverts the configuration.
Solution
A new optional parameter external_standby_node_list is introduced. When set, the RA automatically:
- Detects connected external standbys via pg_stat_replication during the monitor action
- Adds connected external standbys to synchronous_standby_names (using FIRST N (...) syntax when multiple sync targets exist)
- Removes disconnected external standbys from synchronous_standby_names, preventing client transaction hangs without administrator intervention
Use case
Normal operation:
When dr-standby1 fails:
In this topology:
- standby1 is an in-cluster synchronous standby managed by the existing node_list parameter.
- dr-standby1 connects from outside the primary site's cluster via synchronous replication. It is listed in external_standby_node_list.
- dr-standby2 normally replicates asynchronously from dr-standby1 (cascading replication) and does not connect directly to the primary. However, it is also listed in external_standby_node_list so that if dr-standby1 fails, dr-standby2 can connect directly to the primary and be automatically promoted to synchronous standby.
Key behaviors:
- When dr-standby1 connects to the primary, the RA adds it to synchronous_standby_names automatically.
- When dr-standby1 disconnects, the RA removes it, preventing transaction hangs.
- If dr-standby2 then connects directly to the primary (as a failover within the DR site), the RA detects this and adds dr-standby2 to synchronous_standby_names automatically.
This means external_standby_node_list serves as a pre-registered list of potential sync standby nodes: nodes do not need to be connected at the time of configuration.
Changes
This PR contains two commits:
Commit 1: pgsql: enhance set_sync_mode to support multiple sync standby targets
Refactors set_sync_mode() as a prerequisite:
- Accepts a space-separated list of node names as the argument
- Generates FIRST N (...) syntax when there are 2+ sync targets
- Adds an idempotency check to skip an unnecessary pg_ctl reload
- Parses both FIRST N (...) and plain quoted format from rep_mode.conf
No behavioral change when called with a single node argument (existing usage).
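The parsing and idempotency steps listed for this commit might look roughly like the following. This is a sketch, not the PR's implementation: parse_sync_nodes and needs_reload are hypothetical names, the setting is assumed to sit on a single line, and the comparison is assumed to be order-sensitive.

```shell
#!/bin/sh
# Hypothetical helpers (not taken from the PR).
# parse_sync_nodes: extract node names from a rep_mode.conf line in either
# format and print them space-separated, in file order:
#   synchronous_standby_names = '"standby1"'
#   synchronous_standby_names = 'FIRST 2 ("standby1","dr-standby1")'
parse_sync_nodes() (
    # Keep only the text between the single quotes.
    value=$(printf '%s\n' "$1" | cut -d "'" -f 2)
    # Split on commas, pull out each double-quoted name, join with spaces.
    printf '%s\n' "$value" | tr ',' '\n' |
        sed -n 's/.*"\([^"]*\)".*/\1/p' | tr '\n' ' ' | sed 's/ $//'
)

# needs_reload: succeed (exit 0) only when the configured nodes differ from
# the desired ones, so the caller can skip the reload otherwise.
#   $1: current rep_mode.conf line, $2: desired space-separated node list
needs_reload() (
    current=$(parse_sync_nodes "$1")
    [ "$current" != "$2" ]
)

conf_line="synchronous_standby_names = 'FIRST 2 (\"standby1\",\"dr-standby1\")'"
if needs_reload "$conf_line" "standby1 dr-standby1"; then
    echo "reload required"
else
    echo "already up to date, skipping reload"
fi
```

Normalizing both formats to the same space-separated list keeps the comparison independent of whether the file currently holds the single-target or the multi-target form.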
Commit 2: pgsql: add external_standby_node_list for out-of-cluster sync replication management
Adds the new feature:
- New parameter external_standby_node_list (optional, default: empty)
- Changes to control_slave_status() to evaluate external nodes and make a consolidated sync mode decision
- Corresponding changes in validate_ocf_check_level_10()
Backward compatibility
- When external_standby_node_list is not set (default), behavior is identical to the existing implementation
- Existing rep_mode="sync" configurations are unaffected
- FIRST N syntax requires PostgreSQL 9.6+; single-target mode works with PostgreSQL 9.1+
Testing
Tested with: RHEL 9.6, Pacemaker 2.1.9, PostgreSQL 17.6
Tested topology: primary1 (PRI) + standby1 (sync HS, in-cluster) + dr-standby1 (sync HS, external) + dr-standby2 (async HS, cascading from dr-standby1).
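For orientation, a topology like this could be configured and inspected roughly as follows. This is a sketch under assumptions: the resource id pgsql and the postgres user are hypothetical, and the space-separated value format for external_standby_node_list is assumed to mirror the existing node_list parameter.

```shell
# Hypothetical resource id "pgsql"; the value format of
# external_standby_node_list is an assumption modeled on node_list.
pcs resource update pgsql \
    node_list="primary1 standby1" \
    external_standby_node_list="dr-standby1 dr-standby2"

# On the primary: which standbys are connected, and in which sync state.
psql -U postgres -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"

# On the primary: the value currently in effect.
psql -U postgres -c "SHOW synchronous_standby_names;"
```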
Test scenarios:
- dr-standby1 connects → automatically added to synchronous_standby_names
- dr-standby1 disconnects → automatically removed (no transaction hang)
- dr-standby1 fails, dr-standby2 connects directly to primary → automatically added to synchronous_standby_names
- FIRST N (...) syntax generated with multiple sync targets
- external_standby_node_list not set → identical behavior to current code
AI disclosure
This PR description and commit messages were written with the assistance of Claude (Anthropic). The code itself was designed and implemented by the author. See the Assisted-by: trailer in each commit message.