
Initial import: Azure SAP zone alignment Pacemaker/OCF resource agent #2122

Open
sanoop-t wants to merge 3 commits into ClusterLabs:main from sanoop-t:sanoopt_azure-sap-zone

Conversation

@sanoop-t

PR Summary

This PR adds a new Python-based OCF resource agent under heartbeat/ named azure-sap-zone. The agent helps align SAP application-server activity with the active (PROMOTED) HANA node’s Azure Availability Zone (or a logical “zone group” for non-zonal/PPG deployments), to reduce cross-zone latency and to automate app-tier switching during HANA failover.

Key behavior

  • Runs under Pacemaker as a cloned resource and acts only when the local HANA node is PROMOTED (primary).
  • Determines the “active zone/group” from:
    • Azure IMDS zone metadata (zonal deployments), or
    • hana_vm_zones mapping (non-zonal/PPG deployments).
  • Ensures app VMs in the same zone/group are started and SAP is started/activated.
  • For app VMs in the other zone/group:
    • stop_vms=false: deactivates SAP (passive mode) without stopping VMs.
    • stop_vms=true: stops SAP, waits for shutdown, then deallocates those VMs.
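
The per-VM decision described above can be sketched roughly as follows. This is a minimal illustration, not the agent's code: the function name `plan_actions` and the action labels are hypothetical, and the real agent performs the Azure and SAP calls that these labels only name.

```python
from typing import Dict, List, Tuple


def plan_actions(active_zone: str,
                 app_vm_zones: Dict[str, str],
                 stop_vms: bool) -> List[Tuple[str, str]]:
    """Return (vm_name, action_label) pairs for the given active zone/group.

    Hypothetical helper: the labels only name the behavior described in
    the PR summary; no Azure or SAP call is made here.
    """
    plan = []
    for vm, zone in sorted(app_vm_zones.items()):
        if zone == active_zone:
            # Same zone/group as the PROMOTED HANA node.
            plan.append((vm, "start_vm_and_activate_sap"))
        elif stop_vms:
            # Other zone/group, stop_vms=true.
            plan.append((vm, "stop_sap_and_deallocate_vm"))
        else:
            # Other zone/group, stop_vms=false (passive mode).
            plan.append((vm, "deactivate_sap"))
    return plan


if __name__ == "__main__":
    zones = {"sapapp01": "1", "sapapp02": "1", "sapapp03": "2"}
    for vm, action in plan_actions("1", zones, stop_vms=False):
        print(vm, action)
```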

Notable parameters

  • hana_resource (required): Pacemaker HANA resource name used to read status attributes.
  • App VM selection: provide at least one of:
    • app_vm_names (comma-separated), or
    • app_vm_name_pattern (regex), or
    • app_vm_zones (mapping; can also supply names).
  • Non-zonal/PPG mappings:
    • hana_vm_zones="hanavm1:1,hanavm2:2"
    • app_vm_zones="sapapp01:1,sapapp02:1,sapapp03:2,..."
  • Managed Identity:
    • client_id (optional): user-assigned MI; if omitted, system-assigned MI is used.
  • Operational controls:
    • stop_vms, wait_before_stop_sap, wait_time, soft_shutdown_timeout
    • retry_count, retry_wait for ARM API retries
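
For illustration, a mapping string such as hana_vm_zones or app_vm_zones could be parsed along these lines. The helper name `parse_zone_mapping` and its error handling are assumptions made for this sketch and may differ from what the agent does.

```python
from typing import Dict


def parse_zone_mapping(raw: str) -> Dict[str, str]:
    """Parse "vm1:1,vm2:2" into {"vm1": "1", "vm2": "2"}.

    Hypothetical helper: rejects entries without a "name:zone" shape and
    conflicting duplicates, and ignores empty items between commas.
    """
    mapping: Dict[str, str] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, zone = item.partition(":")
        name, zone = name.strip(), zone.strip()
        if not sep or not name or not zone:
            raise ValueError("expected 'vmname:zone', got %r" % item)
        if mapping.get(name, zone) != zone:
            raise ValueError("conflicting zones for %r" % name)
        mapping[name] = zone
    return mapping


if __name__ == "__main__":
    print(parse_zone_mapping("hanavm1:1,hanavm2:2"))
```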

Safety/correctness notes

  • Fail-fast validation on start: when app_vm_zones/hana_vm_zones are provided and Azure zone metadata exists, the agent verifies the mapping matches Azure and fails early on mismatch.
  • Uses managed identity token caching with expiry tracking + refresh.
  • ARM calls include retries for common transient failures (429/5xx).
  • Uses stable VM power-state codes (PowerState/*) from instanceView.
  • Azure Run Command execution includes a payload variation to avoid stale cached results when reusing a fixed runCommand name.
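
The retry behavior for transient ARM failures might look roughly like the sketch below. It assumes that retry_count means the number of retries after the first attempt and that retry_wait is a fixed delay in seconds; the defaults, the `send` callable, and the name `call_with_retries` are hypothetical, and no real HTTP request is issued.

```python
import time
from typing import Callable, Tuple


def is_transient(status: int) -> bool:
    """HTTP 429 (throttling) and 5xx responses are treated as retryable."""
    return status == 429 or 500 <= status <= 599


def call_with_retries(send: Callable[[], Tuple[int, str]],
                      retry_count: int = 3,
                      retry_wait: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep
                      ) -> Tuple[int, str]:
    """Call send() until it returns a non-transient status or retries run out.

    `send` is a stand-in for one ARM request and returns (status, body).
    """
    attempt = 0
    while True:
        status, body = send()
        if not is_transient(status) or attempt >= retry_count:
            return status, body
        attempt += 1
        sleep(retry_wait)


if __name__ == "__main__":
    replies = [(429, ""), (503, ""), (200, "ok")]
    # Canned replies stand in for real responses; sleeping is disabled.
    print(call_with_retries(lambda: replies.pop(0), sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the loop testable without real delays.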

Dependencies / packaging

  • requests is treated as an optional import so meta-data discovery can still work; Azure API operations require python3-requests.
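
One way to treat requests as an optional import is sketched below. The helper names `optional_import` and `require` are hypothetical; the agent may structure this differently.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# None when python3-requests is missing; meta-data output does not need it.
requests = optional_import("requests")


def require(module, package):
    """Fail with a clear message when an Azure API path needs the module."""
    if module is None:
        raise RuntimeError("%s is required for Azure API operations" % package)
    return module


if __name__ == "__main__":
    print("requests available:", requests is not None)
```

Actions that call Azure would go through `require(requests, "python3-requests")` first, while meta-data never touches it.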

Testing / validation performed

  • Verified meta-data output generation and validate-all parameter validation paths.
  • Reviewed monitor/start/stop flows for zone resolution, app VM selection, and stop/deallocate logic.

Notes

  • Script currently ships with the upstream-style shebang placeholder #!@PYTHON@ -tt (consistent with other Python agents that are patched/installed by packaging or install instructions).

@knet-jenkins

knet-jenkins bot commented Jan 30, 2026

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2122/1/input

@oalbrigt oalbrigt marked this pull request as draft February 9, 2026 07:04
@sanoop-t sanoop-t marked this pull request as ready for review February 12, 2026 18:26
@sanoop-t
Author

Hi @oalbrigt, I noticed the PR was moved to draft a few days ago. I didn’t see any comments, so I wanted to check whether there’s anything specific you’d like me to address, or whether more work is needed before review. Happy to make updates. Thanks!

@oalbrigt
Contributor

We are discussing where it would make the most sense to host this: with the SAP agents or in this repo.

I'll come back to you when we have made a decision.

@sanoop-t
Author

sanoop-t commented Mar 4, 2026

@oalbrigt Hello, please let me know if there are any updates from your discussion.

@sanoop-t
Author

@oalbrigt Hello, following up on this. Let me know if there are any updates from your discussion.

@oalbrigt
Contributor

I'm still waiting for an update from our internal discussion, and have asked for an update again.

@sanoop-t
Author

sanoop-t commented Apr 1, 2026

@oalbrigt Hello, let me know if there are any updates from your discussion.

@oalbrigt
Contributor

@sanoop-t We've decided to merge it here. So I'll do a full review next week.

From a quick glance, you have to add it to the Makefile.am files and to configure.ac, so that it isn't built when Python isn't new enough or in similar situations during the build.

See https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/ra-dev-guide.asc#submitting-resource-agents for more info.

@knet-jenkins

knet-jenkins bot commented Apr 13, 2026

Can one of the project admins check and authorise this run please: https://haci.fast.eng.rdu2.dc.redhat.com/job/resource-agents/job/resource-agents-pipeline/job/PR-2122/2/input

Add azure-sap-zone to configure.ac, heartbeat/Makefile.am,
doc/man/Makefile.am, and .gitignore following the existing
azure-events pattern. The agent is conditionally built when
Python 3.6+ is available.
@sanoop-t sanoop-t force-pushed the sanoopt_azure-sap-zone branch from 2cfa0e7 to 1060cc6 Compare April 13, 2026 15:07
@knet-jenkins

knet-jenkins bot commented Apr 13, 2026

Can one of the project admins check and authorise this run please: https://haci.fast.eng.rdu2.dc.redhat.com/job/resource-agents/job/resource-agents-pipeline/job/PR-2122/3/input

@sanoop-t
Author

Thanks @oalbrigt! I've updated the PR with the build system integration:

  • Added conditional build block in configure.ac (Python 3.6+ check)
  • Added azure-sap-zone to ocf_SCRIPTS in heartbeat/Makefile.am
  • Added man page entry in doc/man/Makefile.am
  • Added generated file to .gitignore

All following the existing azure-events pattern. Ready for your full review whenever you get a chance.

@oalbrigt
Contributor

Great. Thanks.

Comment thread heartbeat/azure-sap-zone.in Outdated
# Pacemaker/OCF tooling does not always populate OCF_ROOT/OCF_FUNCTIONS_DIR when querying meta-data.
# Fall back to common distro paths so the `ocf` Python helper can be imported reliably.
_ocf_root = os.environ.get("OCF_ROOT") or "/usr/lib/ocf"
OCF_FUNCTIONS_DIR = os.environ.get("OCF_FUNCTIONS_DIR") or os.path.join(_ocf_root, "lib", "heartbeat")
Contributor

Let's use the simpler approach that all our agents use (I've not had anyone complain about the default path yet):
https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/writing-python-agents.md#run-loop-and-metadata-example

Comment thread heartbeat/azure-sap-zone.in Outdated
sys.path.append(OCF_FUNCTIONS_DIR)
try:
    import ocf  # type: ignore
except Exception:  # pragma: no cover
Contributor

Let's get rid of this and any other duplicate code (use logger(), ocf_exit_reason(), is_probe(), etc. from ocf.py). There's no reason to have a fallback anymore now that the agent will be in the repository with the OCF library.

When you've done that I'll do a full review of the agent.

Author

@oalbrigt Done! I've updated the agent to use the standard ocf.py library:

  • Replaced the custom OCF path-searching and _OCFShim fallback class with the standard 3-line import pattern used by other agents in the repo
  • Removed the custom _is_probe_operation() function; now uses ocf.is_probe()
  • Adjusted validate_action to handle string parameter values from the ocf.py run loop

Successfully validated the changes through our testing cycle. Ready for your full review.
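
For illustration only, handling string-valued parameters during validation might look like the sketch below. The helper names `to_bool` and `to_non_negative_int` and the accepted spellings are hypothetical and are not taken from the agent.

```python
TRUE_WORDS = {"1", "true", "yes", "on"}
FALSE_WORDS = {"0", "false", "no", "off"}


def to_bool(value, default=False):
    """Interpret a parameter that may arrive as a string (e.g. "true")."""
    if isinstance(value, bool):
        return value
    text = "" if value is None else str(value).strip().lower()
    if text == "":
        return default
    if text in TRUE_WORDS:
        return True
    if text in FALSE_WORDS:
        return False
    raise ValueError("not a boolean: %r" % (value,))


def to_non_negative_int(value, name):
    """Convert a string such as "30" to an int and reject negatives."""
    try:
        number = int(str(value).strip())
    except ValueError:
        raise ValueError("%s must be an integer, got %r" % (name, value))
    if number < 0:
        raise ValueError("%s must not be negative, got %d" % (name, number))
    return number


if __name__ == "__main__":
    print(to_bool("true"), to_non_negative_int("30", "wait_time"))
```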

Remove the _OCFShim fallback class and custom OCF path-searching
logic. Use the standard ocf.py import pattern consistent with
other Python agents in the repository. Remove the custom
_is_probe_operation() function in favor of ocf.is_probe().
Adjust validate_action to handle string parameter values from
the ocf.py run loop.
@knet-jenkins

knet-jenkins bot commented Apr 16, 2026

Can one of the project admins check and authorise this run please: https://haci.fast.eng.rdu2.dc.redhat.com/job/resource-agents/job/resource-agents-pipeline/job/PR-2122/4/input
