[system-health] Add periodic full-scan backstop so transient service failures are not missed #27887
Open
BYGX-wcr wants to merge 1 commit into
…failures are not missed Signed-off-by: BYGX-wcr <wcr@live.cn>
Collaborator
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Contributor
Pull request overview
This PR enhances sysmonitor in SONiC’s system-health component to avoid missing short-lived systemd service failures (e.g., crash + auto-restart) by adding a periodic full rescan backstop and making full-scan state tracking safe to run repeatedly.
Changes:
- Added a periodic (15s) full-scan backstop in Sysmonitor.system_service() using time.monotonic() to catch transient failures that may not be observed via JobRemoved/STATE_DB events.
- Updated get_all_system_status() to rebuild dnsrvs_name from scratch each sweep so recovered services can be cleared during periodic polling.
- Introduced PERIODIC_POLL_INTERVAL_SECS constant to control the backstop cadence.
Comment on lines +406 to +410
    fresh_down = set()
    for service in self.get_all_service_list():
        ustate = self.get_unit_status(service)
        if ustate == "NOT OK":
            if service not in self.dnsrvs_name:
                self.dnsrvs_name.add(service)
            fresh_down.add(service)
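To make the difference between append-only and rebuild-from-scratch tracking concrete, here is a minimal standalone sketch. It does not reproduce the PR's code: append_only_sweep, rebuild_sweep and aggregate are hypothetical helpers, and each sweep is modelled as a plain dict of service name to status string.

```python
def append_only_sweep(down: set, statuses: dict) -> set:
    """Append-only tracking: entries are added but never removed."""
    for name, state in statuses.items():
        if state == "NOT OK":
            down.add(name)
    return down


def rebuild_sweep(statuses: dict) -> set:
    """Rebuild tracking: the down set is computed from scratch on each sweep."""
    return {name for name, state in statuses.items() if state == "NOT OK"}


def aggregate(down: set) -> str:
    """Aggregate system state: DOWN if any tracked service is down."""
    return "DOWN" if down else "UP"


# Two consecutive sweeps: bgp fails, then recovers.
sweep_1 = {"bgp": "NOT OK", "swss": "OK"}
sweep_2 = {"bgp": "OK", "swss": "OK"}

tracked = set()
append_only_sweep(tracked, sweep_1)
append_only_sweep(tracked, sweep_2)
stuck_state = aggregate(tracked)                     # bgp is never cleared
recovered_state = aggregate(rebuild_sweep(sweep_2))  # bgp drops out
```

With repeated sweeps, the append-only variant keeps reporting DOWN after the service recovers, which is the wedged state the rebuild approach is meant to avoid.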
Comment on lines +544 to +551
    # Backstop: if no JobRemoved event arrived for a transient failure
    # (crash + auto-restart inside one window), the next full scan catches it.
    if time.monotonic() - last_full_scan_ts >= PERIODIC_POLL_INTERVAL_SECS:
        try:
            self.update_system_status()
        except Exception as e:
            logger.log_error("periodic update_system_status: "+str(e))
        last_full_scan_ts = time.monotonic()
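The hunk above shows only the backstop check itself. A self-contained sketch of how such a check might sit inside an event loop follows. Everything except PERIODIC_POLL_INTERVAL_SECS is an assumption made for illustration: run_loop, handle_event, full_scan, should_stop, the injectable clock, and the 1 s queue timeout are hypothetical and are not taken from sysmonitor.py.

```python
import queue
import time

PERIODIC_POLL_INTERVAL_SECS = 15  # value stated in the PR
QUEUE_TIMEOUT_SECS = 1            # assumed; the PR does not state the queue timeout


def run_loop(events, handle_event, full_scan, clock=time.monotonic,
             should_stop=lambda: False, queue_timeout=QUEUE_TIMEOUT_SECS):
    """Process queued events and run full_scan once the poll interval has elapsed."""
    last_full_scan_ts = clock()
    while not should_stop():
        try:
            # The timeout keeps the loop turning even when no events arrive,
            # so the backstop check below is still reached on an idle queue.
            handle_event(events.get(timeout=queue_timeout))
        except queue.Empty:
            pass

        if clock() - last_full_scan_ts >= PERIODIC_POLL_INTERVAL_SECS:
            try:
                full_scan()
            except Exception as exc:
                # A failing scan is reported but must not kill the loop.
                print(f"periodic full scan failed: {exc}")
            last_full_scan_ts = clock()
```

Passing the clock as a parameter is a design choice of this sketch: it lets a test advance time deterministically instead of sleeping, while the default remains time.monotonic() so wall-clock adjustments do not affect the cadence.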
zjswhhh approved these changes Jun 15, 2026
Why I did it
sysmonitor (under src/system-health/health_checker/sysmonitor.py) detects service-state changes through two event sources only:
- systemd D-Bus JobRemoved signals (MonitorSystemBusTask)
- STATE_DB FEATURE table updates (MonitorStateDbTask)
A JobRemoved signal is emitted only when an explicit start/stop/restart job completes. An unsolicited main-process death does not emit JobRemoved by itself; only the subsequent auto-restart job does. Combined with Restart=always + RestartSec=30 on most SONiC service units, this creates a window where:
- a critical process dies (e.g. bgpd in the bgp container),
- the container's .service enters failed/inactive,
- systemd auto-restarts the unit and it returns to active/running,
- the only JobRemoved event sysmonitor receives describes the restart's completion,
- by the time get_unit_status() queries systemctl show, the unit is back to active/running,
- SYSTEM_READY|SYSTEM_STATE = UP is never flipped to DOWN.
On smartswitches, the chassisd DPU _get_control_plane_state_common() reads SYSTEM_READY|SYSTEM_STATE, so this also means CHASSIS_STATE_DB|DPU_STATE|DPUx:dpu_control_plane_state does not flip to down for short transient failures of a single service, hiding them from the NPU / DASH HA.
Reproduced live on a BlueField DPU running SONiC: killing bgpd produced no system#monitor log entry and no SYSTEM_READY transition, while killing orchagent did, but only because the swss kill cascaded to multiple services and gave sysmonitor enough time to sample one of them in failed.
Work item tracking
How I did it
Two minimal changes in src/system-health/health_checker/sysmonitor.py:
- Periodic backstop poll. Added a 15 s elapsed-time-driven full sweep in the Sysmonitor.system_service() main loop, using time.monotonic() so it is independent of system clock changes and unaffected by event-queue idleness. For units with RestartSec=30, at least one sweep lands inside the down window, so even if JobRemoved is missed, systemctl show samples the unit while it is still failed/inactive and the aggregate state flips correctly. Worst-case detection latency: ~15 s + one queue cycle.
- get_all_system_status() now rebuilds self.dnsrvs_name from scratch on every call instead of being append-only. The previous implementation only ever added entries, which was fine for the single boot-time invocation but would have wedged the aggregate state at DOWN once periodic polling was introduced. Recovered services are now removed by the next sweep.
A new module constant PERIODIC_POLL_INTERVAL_SECS = 15 controls the cadence. No new dependencies, no API changes, no schema changes.
How to verify it
On any SONiC device (smartswitch DPU recommended, since the effect on dpu_control_plane_state is the original motivator):
- Confirm baseline:
- Kill a critical process so the container exits while leaving Restart=always in place:
    docker exec bgp supervisorctl signal KILL bgpd
- Within ~15 s, system#monitor should now log the transition and flip the state:
- On a smartswitch, the NPU then sees the DPU's control plane go down:
- After the auto-restart completes (~30 s later), the state recovers and a System is ready log line is emitted; SYSTEM_READY returns to UP. Before this PR, neither the DOWN log nor the transition happened in this scenario.
Unit tests under tests/test_system_health.py that exercise the touched functions (test_get_all_system_status_ok, test_get_all_system_status_not_ok, test_check_unit_status*, test_post_unit_status, test_update_system_status) instantiate a fresh Sysmonitor(), so the rebuild semantics in get_all_system_status() produce the same result as the old append-only logic; no test changes required.
Which release branch to backport (please specify)
The change is small and self-contained, so it should backport cleanly anywhere sysmonitor.py exists.
Tested branch (please specify)
master (live BlueField DPU image, str2-mlnx-4280-smartswitch-02-dpu-0)
Description for the changelog
[system-health] sysmonitor periodic full re-scan to catch service failures missed by systemd D-Bus events
Link to config_db schema for YANG module changes
N/A — no schema or YANG changes.