IRE watchdog script to monitor IRE errors and automatically shutdown affected ports #399
Draft
senthil-nexthop wants to merge 2 commits into
On Broadcom DNX/J3 ASICs, a single-bit error reported via the IRE_ErrorDataPathCrc interrupt can corrupt in-memory state (e.g. MACsec CKN) and lead to a downstream cascade: egress MACsec ACL flipping to DROP, LACP partner timeouts, eventually BGP %ADJCHANGE Down with reason "Interface down". Time-to-converge ~75s on observed incidents because LACP's 90s timeout was the fastest detector.

This change adds two host-side systemd-managed daemons that detect the fault much earlier and fail-stop the affected DNX core's front-panel ports, dropping convergence to ~1-2s:

* ire_watchdog (poll mode)
  Polls IRE_DATA_PATH_CRC_ERROR_COUNTER per-core at 1Hz via `docker exec syncd bcmcmd 'getreg ...'`. On increment above baseline, shuts all front-panel ports on the affected core via `sudo config interface shutdown`, logs CRITICAL, and latches.

* ire_watchdog_syslog (syslog-tail mode, alternate)
  Tails /var/log/syslog for the SDK's `dnxc_interrupt_print_info: name=IRE_ErrorDataPathCrc` message. Same shutdown action. Zero ongoing syslog pollution since it doesn't invoke bcmcmd.

Both are gated to switch_type == "voq" (DNX) via ExecCondition; the poll variant also exits cleanly if the register isn't recognized by the SDK (defense-in-depth for non-DNX images).

Operators must explicitly `sudo config interface startup EthernetX` after reboot or syncd reload to recover; the shutdown is intentionally durable in CONFIG_DB to signal that manual review of the hardware is required.

Both variants ship by default but only one should be enabled per DUT. Recommended: enable ire_watchdog_syslog (no syncd log pressure); use ire_watchdog if the SDK ever changes the dnxc_interrupt_print_info log format.

Tested on Nexthop 5010 (DNX/J3, NH-5010-F-O64): verified end-to-end that simulated counter increments on a specific core result in the correct set of front-panel ports being admin-down, with CRITICAL journal output, latching behavior, and clean recovery via manual startup.
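The poll-mode behavior described above (per-core baseline, trip on increment, then latch) can be sketched roughly as follows. This is an illustration only, not the code in this PR: `CoreWatch`, `poll_once`, and the injected `read_counters` / `shutdown_core` callables are hypothetical names, and the actual register read through `bcmcmd` and the parsing of its output are deliberately left out. Adopting the first reading as the baseline for a core that has none yet is also an assumption.

```python
from typing import Callable, Dict, List, Set


class CoreWatch:
    """Track per-core counter baselines and latch on the first increment.

    Hypothetical helper: the names and structure are assumptions made for
    illustration, not the PR's implementation.
    """

    def __init__(self, baseline: Dict[int, int]) -> None:
        self.baseline: Dict[int, int] = dict(baseline)
        self.latched: Set[int] = set()

    def check(self, sample: Dict[int, int]) -> List[int]:
        """Return cores that newly rose above baseline, in core order."""
        tripped: List[int] = []
        for core, value in sorted(sample.items()):
            if core in self.latched:
                # Already acted on this core; stay quiet until restarted.
                continue
            if core not in self.baseline:
                # Assumption: first reading for an unknown core is its baseline.
                self.baseline[core] = value
                continue
            if value > self.baseline[core]:
                self.latched.add(core)
                tripped.append(core)
        return tripped


def poll_once(
    watch: CoreWatch,
    read_counters: Callable[[], Dict[int, int]],
    shutdown_core: Callable[[int], None],
) -> List[int]:
    """Take one sample and invoke the shutdown action for each tripped core."""
    tripped = watch.check(read_counters())
    for core in tripped:
        shutdown_core(core)
    return tripped
```

Injecting the reader and the shutdown action keeps the latch logic testable without a DUT; a real daemon would call `poll_once` from a loop that sleeps for the poll interval between samples.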
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
e0ef376 to 18b5e0b
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Ship only the poll-register variant. The syslog-tail companion (ire_watchdog_syslog) is removed pending a separate discussion about which detection mechanism upstream prefers. The polling daemon is the more robust of the two (no coupling to SDK log-message format, unaffected by rsyslog issues) and is the variant validated on the production hardware where the original incident was observed.

Removed:
* scripts/ire_watchdog_syslog
* data/debian/sonic-host-services-data.ire-watchdog-syslog.service
* setup.py entry for the syslog variant
* data/debian/rules dh_installsystemd entry for ire-watchdog-syslog
* Reference in scripts/ire_watchdog docstring to the companion variant
18b5e0b to 82e44c7
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
On Broadcom DNX ASICs, errors reported via the IRE_ErrorDataPathCrc interrupt can potentially lead to a downstream cascade: egress MACsec ACL flipping to DROP, LACP partner timeouts, and eventually BGP %ADJCHANGE Down with reason "Interface down". Time-to-converge was ~75s in observed incidents because LACP's 90s timeout was the fastest detector.
This change adds a host-side systemd-managed daemon that detects the fault much earlier and fail-stops the affected DNX core's front-panel ports, dropping convergence to ~1-2s:
`ire_watchdog` polls `IRE_DATA_PATH_CRC_ERROR_COUNTER` per-core via `docker exec syncd bcmcmd 'getreg ...'`. On increment above baseline, it shuts all front-panel ports on the affected core via `sudo config interface shutdown`, logs CRITICAL, and latches.
The daemon is gated to switch_type == "voq" (DNX) via ExecCondition; the script also exits cleanly if the register isn't recognized by the SDK (defense-in-depth for non-DNX images). Operators must explicitly run `sudo config interface startup EthernetX` after reboot or syncd reload to recover.
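As a rough sketch of the shutdown step, the snippet below picks the front-panel ports that belong to one core and builds the `sudo config interface shutdown` argument lists for them. It rests on several assumptions that the description does not confirm: that the port table is a plain dict shaped like CONFIG_DB's PORT table, that each entry carries a `core_id` field, and that front-panel ports are exactly the names matching `Ethernet<N>`. The function names are hypothetical, and the commands are only built here, not executed.

```python
import re
from typing import Dict, List

# Assumption: front-panel ports are named Ethernet<N>; any other port name
# is treated as internal and skipped.
FRONT_PANEL = re.compile(r"^Ethernet(\d+)$")


def ports_on_core(port_table: Dict[str, Dict[str, str]], core: int) -> List[str]:
    """Return front-panel port names on the given core, in numeric order.

    Assumption: each port entry has a "core_id" field holding the core
    number as a string.
    """
    matches = []
    for name, attrs in port_table.items():
        m = FRONT_PANEL.match(name)
        if m and attrs.get("core_id") == str(core):
            matches.append((int(m.group(1)), name))
    return [name for _, name in sorted(matches)]


def shutdown_commands(ports: List[str]) -> List[List[str]]:
    """Build one argv list per port, suitable for passing to subprocess.run()."""
    return [["sudo", "config", "interface", "shutdown", port] for port in ports]
```

Returning argument lists rather than shell strings avoids quoting problems, and recovery is the mirror image: the operator runs `sudo config interface startup` for each port that was shut.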