
Feature: Add reload feature for trace_ra to prevent resource restarts #2147

Open

Reshma676 wants to merge 3 commits into ClusterLabs:main from Reshma676:feature/zero-downtime-trace-v2

Reshma676 commented Apr 10, 2026

Note on AI usage: For full transparency, the code and architecture in this PR were developed with the assistance of the Claude AI model.

Currently, changing trace_ra (for example, setting trace_ra=1) forces Pacemaker to perform a full stop/start cycle on the resource. In production environments, this causes unnecessary service downtime just to enable debug logging.

This PR introduces a zero-downtime solution by leveraging the reload action:

Added a shared ocf_trace_reload function to ocf-shellfuncs.in.

Updated the IPaddr2 and anything agents to include the trace_ra parameter, the reload action in their metadata, and the reload handler.
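
For readers unfamiliar with the reload mechanism, here is a minimal, hypothetical agent skeleton (not the PR's actual diff) showing the three pieces listed above: a trace_ra parameter and a reload action in the metadata, plus a reload handler. It assumes Pacemaker's reload handling for agents that advertise a reload action, where a change to a parameter not marked unique can be applied by a reload instead of a restart; that would match the "Actions: Reload" lines in the logs below. The agent name dummy_trace, the function names, and the locally defined exit codes are illustrative assumptions; a real agent sources ocf-shellfuncs and logs with ocf_log.

```shell
#!/bin/sh
# Hypothetical OCF agent skeleton illustrating reload support for trace_ra.
# Assumption: a real agent sources ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
# instead of defining these exit codes itself.
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3

# Metadata advertising trace_ra (non-unique) and a reload action.
meta_data() {
    cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="dummy_trace" version="1.0">
<version>1.0</version>
<parameters>
<parameter name="trace_ra" unique="0" required="0">
<longdesc lang="en">Set to 1 to enable resource agent tracing for this resource instance.</longdesc>
<shortdesc lang="en">Enable agent tracing</shortdesc>
<content type="integer" default="0"/>
</parameter>
</parameters>
<actions>
<action name="start" timeout="20s"/>
<action name="stop" timeout="20s"/>
<action name="monitor" timeout="20s" interval="10s"/>
<action name="reload" timeout="20s"/>
<action name="meta-data" timeout="5s"/>
</actions>
</resource-agent>
EOF
}

# Reload handler: report the new trace_ra value and succeed without
# touching the managed service.
agent_reload() {
    if [ "${OCF_RESKEY_trace_ra:-0}" = "1" ]; then
        echo "INFO: Enabling trace_ra"    # a real agent would use ocf_log info
    else
        echo "INFO: Disabling trace_ra"
    fi
    return $OCF_SUCCESS
}

agent_main() {
    case "$1" in
        meta-data) meta_data ;;
        reload)    agent_reload ;;
        *)         return $OCF_ERR_UNIMPLEMENTED ;;  # start/stop/monitor omitted
    esac
}

# Dispatch only when called with an action, so the functions can be sourced.
if [ $# -gt 0 ]; then
    agent_main "$1"
    exit $?
fi
```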

Tested on a live RHEL 10 Pacemaker cluster: updating the trace_ra parameter causes a reload that enables or disables trace output while the IPaddr2 resource stays in the Started state, with no downtime and no restart in the resulting transition.

[root@rhel10node1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 10.0 (Coughlan)
[root@rhel10node1 ~]# date;pcs resource config VirtualIP
Fri Apr 10 05:50:08 AM BST 2026
Resource: VirtualIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: VirtualIP-instance_attributes
    ip=192.168.21.211
    trace_ra=0
  Meta Attributes: VirtualIP-meta_attributes
    resource-stickiness=100
  Operations:
    start: VirtualIP-start-interval-0s
      interval=0s timeout=20s
    stop: VirtualIP-stop-interval-0s
      interval=0s timeout=20s
    monitor: VirtualIP-monitor-interval-10s
      interval=10s
[root@rhel10node1 ~]# pcs resource update VirtualIP trace_ra=1
  • Logs from the DC node:
Apr 10 05:50:51 localhost pacemaker-schedulerd[6351]: notice: Actions: Reload     VirtualIP    ( rhel10node2 )
Apr 10 05:50:51 localhost pacemaker-schedulerd[6351]: notice: Calculated transition 58, saving inputs in /var/lib/pacemaker/pengine/pe-input-14.bz2
Apr 10 05:50:51 localhost pacemaker-controld[6352]: notice: Requesting local execution of reload operation for VirtualIP on rhel10node2
Apr 10 05:50:52 localhost IPaddr2(VirtualIP)[425166]: INFO: Enabling trace_ra
Apr 10 05:50:52 localhost pacemaker-controld[6352]: notice: Result of reload operation for VirtualIP on rhel10node2: OK
Apr 10 05:50:52 localhost pacemaker-controld[6352]: notice: Requesting local execution of monitor operation for VirtualIP on rhel10node2
Apr 10 05:50:52 localhost pacemaker-controld[6352]: notice: Result of monitor operation for VirtualIP on rhel10node2: OK
Apr 10 05:50:52 localhost pacemaker-controld[6352]: notice: Transition 58 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-14.bz2): Complete
Apr 10 05:50:52 localhost pacemaker-controld[6352]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Apr 10 05:54:49 localhost systemd[423612]: Created slice background.slice - User Background Tasks Slice.
Apr 10 05:54:49 localhost systemd[423612]: Starting systemd-tmpfiles-clean.service - Cleanup of User's Temporary Files and Directories...
Apr 10 05:54:49 localhost systemd[423612]: Finished systemd-tmpfiles-clean.service - Cleanup of User's Temporary Files and Directories.
[root@rhel10node1 ~]# pcs resource describe IPaddr2 | egrep -i 'trace|reload'
Assumed agent name 'ocf:heartbeat:IPaddr2' (deduced from 'IPaddr2')
  trace_ra
    Description: Set to 1 to enable resource agent tracing for this resource instance. This parameter can be modified at runtime via the reload action to enable or disable tracing without service interruption.
  reload:
[root@rhel10node2 ~]# cd /var/lib/heartbeat/trace_ra/IPaddr2/VirtualIP.
Display all 131 possibilities? (y or n)
VirtualIP.monitor.2026-04-09.19:07:34  VirtualIP.monitor.2026-04-09.19:13:10  VirtualIP.monitor.2026-04-10.05:53:25  VirtualIP.monitor.2026-04-10.05:57:27  VirtualIP.monitor.2026-04-10.06:01:29  VirtualIP.monitor.2026-04-10.06:05:31
...
  • Disabling trace_ra:
[root@rhel10node1 ~]# date;pcs resource config VirtualIP
Fri Apr 10 06:07:35 AM BST 2026
Resource: VirtualIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: VirtualIP-instance_attributes
    ip=192.168.21.211
    trace_ra=1
  Meta Attributes: VirtualIP-meta_attributes
    resource-stickiness=100
  Operations:
    start: VirtualIP-start-interval-0s
      interval=0s timeout=20s
    stop: VirtualIP-stop-interval-0s
      interval=0s timeout=20s
    monitor: VirtualIP-monitor-interval-10s
      interval=10s
[root@rhel10node1 ~]# date;pcs resource update VirtualIP trace_ra=0
Fri Apr 10 06:08:16 AM BST 2026
  • Logs from the DC node showing the resource being reloaded instead of restarted:
Apr 10 06:08:18 localhost pacemaker-controld[6352]: notice: Populating nodes and starting an election after cib_diff_notify event triggered by cibadmin
Apr 10 06:08:18 localhost pacemaker-controld[6352]: notice: State transition S_IDLE -> S_ELECTION
Apr 10 06:08:18 localhost pacemaker-controld[6352]: notice: State transition S_ELECTION -> S_INTEGRATION
Apr 10 06:08:18 localhost pacemaker-controld[6352]: notice: Finalizing join-10 for 2 nodes (sync'ing CIB 0.17.1 with schema pacemaker-4.0 and feature set 3.20.1 from rhel10node2)
Apr 10 06:08:18 localhost pacemaker-schedulerd[6351]: notice: Actions: Reload     VirtualIP    ( rhel10node2 )
Apr 10 06:08:18 localhost pacemaker-schedulerd[6351]: notice: Calculated transition 60, saving inputs in /var/lib/pacemaker/pengine/pe-input-16.bz2
Apr 10 06:08:18 localhost pacemaker-controld[6352]: notice: Requesting local execution of reload operation for VirtualIP on rhel10node2
Apr 10 06:08:18 localhost IPaddr2(VirtualIP)[474230]: INFO: Disabling trace_ra
Apr 10 06:08:18 localhost pacemaker-controld[6352]: notice: Result of reload operation for VirtualIP on rhel10node2: OK
Apr 10 06:08:18 localhost pacemaker-controld[6352]: notice: Requesting local execution of monitor operation for VirtualIP on rhel10node2
Apr 10 06:08:18 localhost pacemaker-controld[6352]: notice: Result of monitor operation for VirtualIP on rhel10node2: OK
Apr 10 06:08:18 localhost pacemaker-controld[6352]: notice: Transition 60 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-16.bz2): Complete
Apr 10 06:08:18 localhost pacemaker-controld[6352]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
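
To check mechanically that a parameter change was scheduled as a reload rather than a restart, the scheduler's "Actions:" notices shown above can be filtered per resource. The following is a hypothetical helper (classify_action is not part of Pacemaker or this PR); it assumes the "Actions: <action> <resource> ( <node> )" format seen in these logs and that a stop/start would be logged with the action name "Restart".

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and report how
# pacemaker-schedulerd planned the change for one resource.
# Assumptions: notice format as in the logs above; "Restart" is the
# label used when a stop/start is scheduled.
classify_action() {
    rsc="$1"    # resource name, e.g. VirtualIP
    # Keep only scheduler "Actions:" lines that mention this resource.
    actions=$(grep "pacemaker-schedulerd.*Actions: .* $rsc ")
    if printf '%s\n' "$actions" | grep -q "Actions: Restart "; then
        echo restart
    elif printf '%s\n' "$actions" | grep -q "Actions: Reload "; then
        echo reload
    else
        echo none
    fi
}
```

For example, on the DC, journalctl --since "2026-04-10 06:08" | classify_action VirtualIP should print reload for transition 60 above, assuming the scheduler's notices are captured in the journal.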


knet-jenkins bot commented Apr 10, 2026

Can one of the project admins check and authorise this run please: https://haci.fast.eng.rdu2.dc.redhat.com/job/resource-agents/job/resource-agents-pipeline/job/PR-2147/1/input

Note: This implementation was developed with the assistance of Claude AI.
Reshma676 force-pushed the feature/zero-downtime-trace-v2 branch from 6d0f63c to f68ab82 on April 10, 2026 05:40

knet-jenkins bot commented Apr 10, 2026

Can one of the project admins check and authorise this run please: https://haci.fast.eng.rdu2.dc.redhat.com/job/resource-agents/job/resource-agents-pipeline/job/PR-2147/2/input

Reshma676 and others added 2 commits April 10, 2026 11:35
Based on upstream maintainer feedback, simplify the architecture by:
- Removing ocf_trace_reload helper function from ocf-shellfuncs
- Removing agent-specific reload functions (ip_reload, anything_reload)
- Making reload action return OCF_SUCCESS directly
- Changing trace_ra parameter type from integer to boolean

This achieves the same zero-downtime trace configuration updates with
a cleaner, more straightforward implementation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
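
The simplified design in this commit can be sketched as follows. This is a hypothetical illustration, not the PR's diff: it assumes tracing is decided afresh on every agent invocation from OCF_RESKEY_trace_ra (consistent with the per-action trace files such as VirtualIP.monitor.<timestamp> listed earlier), so a reload has nothing to apply and can return OCF_SUCCESS immediately. is_true only approximates ocf_is_true for the now boolean-typed parameter; maybe_trace and dispatch are illustrative names.

```shell
#!/bin/sh
# Sketch of the simplified design: reload does no work of its own.
# Assumption: tracing is switched on per invocation from OCF_RESKEY_trace_ra,
# so the next operation after a reload already sees the new value.
# These exit codes stand in for the ones provided by ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3

# Approximation of ocf_is_true for a boolean-typed parameter.
is_true() {
    case "$1" in
        1|yes|YES|true|TRUE|on|ON) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical per-invocation trace decision (the real library writes a
# trace file per action; this sketch only reports the decision).
maybe_trace() {
    if is_true "${OCF_RESKEY_trace_ra:-false}"; then
        echo "tracing $1"
    fi
}

dispatch() {
    maybe_trace "$1"
    case "$1" in
        reload) return $OCF_SUCCESS ;;            # nothing to apply
        *)      return $OCF_ERR_UNIMPLEMENTED ;;  # other actions omitted
    esac
}

# Dispatch only when called with an action, so the functions can be sourced.
if [ $# -gt 0 ]; then
    dispatch "$1"
    exit $?
fi
```

In this sketch the reload itself no longer logs whether tracing was enabled or disabled; the appearance or absence of new trace files on subsequent operations is the observable signal.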

knet-jenkins bot commented Apr 10, 2026

Can one of the project admins check and authorise this run please: https://haci.fast.eng.rdu2.dc.redhat.com/job/resource-agents/job/resource-agents-pipeline/job/PR-2147/3/input
