IRE watchdog script to monitor IRE errors and automatically shutdown affected ports#399

Draft
senthil-nexthop wants to merge 2 commits into
sonic-net:masterfrom
nexthop-ai:senthil.ire-watchdog

@senthil-nexthop

On Broadcom DNX ASICs, errors reported via the IRE_ErrorDataPathCrc interrupt can lead to a downstream cascade: the egress MACsec ACL flips to DROP, LACP partners time out, and BGP eventually reports %ADJCHANGE Down with reason "Interface down". Convergence took ~75s on observed incidents because LACP's 90s timeout was the fastest detector.

This change adds a host-side systemd-managed daemon that detects the fault much earlier and fail-stops the affected DNX core's front-panel ports, dropping convergence to ~1-2s:

ire_watchdog polls IRE_DATA_PATH_CRC_ERROR_COUNTER per-core via docker exec syncd bcmcmd 'getreg ...'. On increment above baseline, shuts all front-panel ports on the affected core via sudo config interface shutdown, logs CRITICAL, and latches.
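For reviewers, the baseline-compare-and-latch behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/ire_watchdog. The names `IreWatch`, `ports_on_core`, `read_counter`, and `shutdown_port` are hypothetical, and the sketch assumes that the counter is monotonic between polls, that the latch is per-core, and that each PORT entry carries a `core_id` field. The real script reads the register through bcmcmd and shuts ports through the config CLI; here both are injected as callables so the logic can be exercised without hardware.

```python
import re
from typing import Callable, Dict, Iterable, List, Set

# Assumption: front-panel ports are the ones named "Ethernet<N>".
FRONT_PANEL = re.compile(r"^Ethernet\d+$")


def ports_on_core(port_table: Dict[str, Dict[str, str]], core: int) -> List[str]:
    """Return front-panel port names whose (assumed) core_id matches `core`."""
    return sorted(
        name
        for name, attrs in port_table.items()
        if FRONT_PANEL.match(name) and attrs.get("core_id") == str(core)
    )


class IreWatch:
    """Per-core baseline comparison with a one-shot latch (illustrative only)."""

    def __init__(
        self,
        read_counter: Callable[[int], int],    # hypothetical: returns counter for a core
        shutdown_port: Callable[[str], None],  # hypothetical: admin-downs one port
        port_table: Dict[str, Dict[str, str]],
        cores: Iterable[int],
    ) -> None:
        self.read_counter = read_counter
        self.shutdown_port = shutdown_port
        self.port_table = port_table
        # The baseline is whatever each counter reads at startup.
        self.baseline: Dict[int, int] = {c: read_counter(c) for c in cores}
        self.latched: Set[int] = set()

    def poll_once(self) -> List[int]:
        """Poll every core once; return the cores that tripped on this pass."""
        tripped: List[int] = []
        for core, base in self.baseline.items():
            if core in self.latched:
                continue  # already acted on this core; do not act again
            if self.read_counter(core) > base:
                for port in ports_on_core(self.port_table, core):
                    self.shutdown_port(port)
                self.latched.add(core)
                tripped.append(core)
        return tripped
```

Calling `poll_once()` from a timed loop gives the polling behavior. The latch means a core that has tripped is skipped on later passes, so ports an operator has brought back up are not shut again by the same process.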

The service is gated to switch_type == "voq" (DNX) via ExecCondition; the script also exits cleanly if the register isn't recognized by the SDK (defense-in-depth for non-DNX images). Operators must explicitly run sudo config interface startup EthernetX after reboot or syncd reload to recover.
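The two startup guards can be read as a single decision. The PR applies the first one in the systemd unit via ExecCondition, not in Python; the sketch below expresses both checks as one function purely for illustration. `decide_startup` and `probe_register` are hypothetical names, and the metadata mapping is assumed to mirror the DEVICE_METADATA localhost entry.

```python
from typing import Callable, Mapping


def decide_startup(
    device_metadata: Mapping[str, str],
    probe_register: Callable[[], bool],  # hypothetical: True if the SDK knows the register
) -> str:
    """Return "run", or a reason string for exiting cleanly without polling."""
    # Guard 1: only VOQ (DNX) systems are in scope.
    if device_metadata.get("switch_type") != "voq":
        return "skip-not-voq"
    # Guard 2: defense-in-depth for images whose SDK lacks the register.
    if not probe_register():
        return "skip-no-register"
    return "run"
```

Both skip outcomes correspond to a clean exit rather than a failure, so a non-DNX image does not end up with a failed unit.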

On Broadcom DNX/J3 ASICs, a single-bit error reported via the
IRE_ErrorDataPathCrc interrupt can corrupt in-memory state (e.g. MACsec
CKN) and lead to a downstream cascade — egress MACsec ACL flipping to
DROP, LACP partner timeouts, eventually BGP %ADJCHANGE Down with reason
"Interface down". Time-to-converge ~75s on observed incidents because
LACP's 90s timeout was the fastest detector.

This change adds two host-side systemd-managed daemons that detect the
fault much earlier and fail-stop the affected DNX core's front-panel
ports, dropping convergence to ~1-2s:

* ire_watchdog (poll mode)
  Polls IRE_DATA_PATH_CRC_ERROR_COUNTER per-core at 1Hz via
  `docker exec syncd bcmcmd 'getreg ...'`. On increment above baseline,
  shuts all front-panel ports on the affected core via
  `sudo config interface shutdown`, logs CRITICAL, and latches.

* ire_watchdog_syslog (syslog-tail mode, alternate)
  Tails /var/log/syslog for the SDK's `dnxc_interrupt_print_info:
  name=IRE_ErrorDataPathCrc` message. Same shutdown action. Zero
  ongoing syslog pollution since it doesn't invoke bcmcmd.

Both are gated to switch_type == "voq" (DNX) via ExecCondition; the
poll variant also exits cleanly if the register isn't recognized by
the SDK (defense-in-depth for non-DNX images). Operators must
explicitly `sudo config interface startup EthernetX` after reboot or
syncd reload to recover — the shutdown is intentionally durable in
CONFIG_DB to signal that manual review of the hardware is required.

Both variants ship by default but only one should be enabled per DUT.
Recommended: enable ire_watchdog_syslog (no syncd log pressure); use
ire_watchdog if the SDK ever changes the dnxc_interrupt_print_info
log format.

Tested on Nexthop 5010 (DNX/J3, NH-5010-F-O64) — verified end-to-end
that simulated counter increments on a specific core result in the
correct set of front-panel ports being admin-down, with CRITICAL
journal output, latching behavior, and clean recovery via manual
startup.
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Ship only the poll-register variant. The syslog-tail companion
(ire_watchdog_syslog) is removed pending a separate discussion about
which detection mechanism upstream prefers. The polling daemon is the
more robust of the two (no coupling to SDK log-message format,
unaffected by rsyslog issues) and is the variant validated on the
production hardware where the original incident was observed.

Removed:
* scripts/ire_watchdog_syslog
* data/debian/sonic-host-services-data.ire-watchdog-syslog.service
* setup.py entry for the syslog variant
* data/debian/rules dh_installsystemd entry for ire-watchdog-syslog
* Reference in scripts/ire_watchdog docstring to the companion variant
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
