[system-health] Add periodic full-scan backstop so transient service failures are not missed by BYGX-wcr · Pull Request #27887 · sonic-net/sonic-buildimage

BYGX-wcr · 2026-06-14T18:23:43Z

Why I did it

sysmonitor (under src/system-health/health_checker/sysmonitor.py) detects service-state changes through two event sources only:

- systemd D-Bus JobRemoved signals (MonitorSystemBusTask)
- STATE_DB FEATURE table updates (MonitorStateDbTask)

A JobRemoved signal is emitted only when an explicit start/stop/restart job completes. An unsolicited main-process death does not emit JobRemoved by itself; only the subsequent auto-restart job does. Combined with Restart=always and RestartSec=30 on most SONiC service units, this creates a window in which:

1. A critical process inside a feature container is killed (e.g. bgpd in the bgp container).
2. The container exits and the .service enters failed/inactive.
3. systemd silently waits up to 30 s, then auto-restarts, and the service returns to active/running.
4. The only JobRemoved event sysmonitor receives describes the restart's completion.
5. By the time get_unit_status() queries systemctl show, the unit is back to active/running.
6. SYSTEM_READY|SYSTEM_STATE = UP is never flipped to DOWN.

On smartswitches, chassisd's _get_control_plane_state_common() on the DPU reads SYSTEM_READY|SYSTEM_STATE, so CHASSIS_STATE_DB|DPU_STATE|DPUx:dpu_control_plane_state also fails to flip to down for short transient failures of a single service, hiding them from the NPU and DASH HA.

Reproduced live on a BlueField DPU running SONiC: killing bgpd produced no system#monitor log entry and no SYSTEM_READY transition, while killing orchagent did, but only because the swss kill cascaded to multiple services and gave sysmonitor enough time to sample one of them in the failed state.
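The window described above can be illustrated with a toy timeline. This is only a sketch under stated assumptions, not sysmonitor code: `unit_state`, `event_only_samples`, and `polled_samples` are hypothetical names, the unit is modelled as simply "failed" for the whole RestartSec period, and the single event-driven sample is assumed to be taken when the restart job completes.

```python
RESTART_SEC = 30  # RestartSec value quoted in the description


def unit_state(t, crash_at=0.0, restart_sec=RESTART_SEC):
    """Toy model: the unit is 'failed' from the crash until the auto-restart."""
    if crash_at <= t < crash_at + restart_sec:
        return "failed"
    return "active"


def event_only_samples(crash_at=0.0, restart_sec=RESTART_SEC):
    """Event-driven path: the only JobRemoved event describes the restart job,
    so the single status query happens once the unit is running again."""
    t = crash_at + restart_sec
    return [unit_state(t, crash_at, restart_sec)]


def polled_samples(interval, horizon, phase=0.0,
                   crash_at=0.0, restart_sec=RESTART_SEC):
    """Polling path: sample the unit every `interval` seconds up to `horizon`.
    `phase` is the (arbitrary) offset of the first poll after the crash."""
    samples = []
    t = phase
    while t <= horizon:
        samples.append(unit_state(t, crash_at, restart_sec))
        t += interval
    return samples
```

In this model the event-driven path never observes "failed", while any poll interval shorter than the restart delay places at least one sample inside the failed window regardless of phase.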

Work item tracking

Microsoft ADO: 38379400

How I did it

Two minimal changes in src/system-health/health_checker/sysmonitor.py:

Periodic backstop poll. Added a full sweep every 15 s of elapsed time in the Sysmonitor.system_service() main loop, timed with time.monotonic() so it is unaffected by system clock changes and still fires when the event queue is idle:
```
if time.monotonic() - last_full_scan_ts >= PERIODIC_POLL_INTERVAL_SECS:
    self.update_system_status()
    last_full_scan_ts = time.monotonic()
```
With RestartSec=30, the failed/inactive window is longer than the 15 s poll interval, so even if no JobRemoved event arrives in time, a sweep runs systemctl show while the unit is still failed/inactive and flips the aggregate state correctly. Worst-case detection latency: ~15 s + one queue cycle.
get_all_system_status() now rebuilds self.dnsrvs_name from scratch on every call instead of only appending to it. The previous implementation only ever added entries, which was fine for the single boot-time invocation but would have wedged the aggregate state at DOWN once periodic polling was introduced, because a recovered service was never removed. Recovered services are now dropped by the next sweep.
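The difference between the two bookkeeping strategies can be shown with a small stand-alone sketch. The class names and the `sweep()` method are hypothetical stand-ins for illustration; only the "NOT OK" status string and the idea of a set of down services come from the change itself.

```python
class AppendOnlyTracker:
    """Mimics the old behaviour: down services are added but never removed."""

    def __init__(self):
        self.down = set()

    def sweep(self, statuses):
        for name, status in statuses.items():
            if status == "NOT OK":
                self.down.add(name)
        return "DOWN" if self.down else "UP"


class RebuildTracker:
    """Mimics the new behaviour: the down set is rebuilt on every sweep."""

    def __init__(self):
        self.down = set()

    def sweep(self, statuses):
        fresh_down = {name for name, status in statuses.items()
                      if status == "NOT OK"}
        self.down = fresh_down  # recovered services drop out here
        return "DOWN" if self.down else "UP"
```

Feeding both trackers a failing sweep followed by a healthy one shows the effect: the append-only tracker keeps reporting DOWN after recovery, while the rebuilding tracker returns to UP. On the first sweep of a fresh instance the two give the same answer, which matches the note about the existing unit tests.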

A new module constant PERIODIC_POLL_INTERVAL_SECS = 15 controls the cadence. No new dependencies, no API changes, no schema changes.
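How the backstop timer interacts with an idle event queue can be sketched as follows. This is a simplified model, not the actual Sysmonitor.system_service() loop: `service_loop`, `handle_event`, `full_scan`, `iterations`, `queue_timeout`, and the injectable `clock` are all assumptions made so the example is self-contained; only PERIODIC_POLL_INTERVAL_SECS and the use of time.monotonic() come from the change.

```python
import queue
import time

PERIODIC_POLL_INTERVAL_SECS = 15  # constant named in the PR


def service_loop(events, handle_event, full_scan, iterations,
                 queue_timeout=1.0, clock=time.monotonic):
    """Process queued events, and run full_scan() at least every
    PERIODIC_POLL_INTERVAL_SECS even if no event ever arrives."""
    last_full_scan_ts = clock()
    for _ in range(iterations):  # bounded here; a daemon would loop forever
        try:
            handle_event(events.get(timeout=queue_timeout))
        except queue.Empty:
            pass  # an idle queue must not postpone the backstop
        if clock() - last_full_scan_ts >= PERIODIC_POLL_INTERVAL_SECS:
            full_scan()
            last_full_scan_ts = clock()
```

Because the elapsed-time check runs after every queue wait, the sweep is delayed by at most one queue wait beyond the interval in this model, which is where a figure like "~15 s + one queue cycle" comes from.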

How to verify it

On any SONiC device (smartswitch DPU recommended, since the effect on dpu_control_plane_state is the original motivator):

Confirm baseline:

redis-cli -n 6 hget "SYSTEM_READY|SYSTEM_STATE" Status     # UP
redis-cli -n 6 hgetall "ALL_SERVICE_STATUS|bgp"            # service_status=OK

Kill a critical process so the container exits while leaving Restart=always in place:
```
docker exec bgp supervisorctl signal KILL bgpd
```

Within ~15 s, system#monitor should now log the transition and flip the state:

healthd[…]: bgp.service service state changed to [failed/failed]   (or inactive/dead)
healthd[…]: System is not ready - one or more services are not up

redis-cli -n 6 hget "SYSTEM_READY|SYSTEM_STATE" Status     # DOWN

On a smartswitch, the NPU then sees the DPU's control plane go down:

redis-cli -n 13 hget "DPU_STATE|DPU0" dpu_control_plane_state   # down

After the auto-restart completes (~30 s later), the state recovers and a System is ready log line is emitted; SYSTEM_READY returns to UP. Before this PR, neither the DOWN log nor the transition happened in this scenario.
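The manual checks above could be scripted roughly as follows. This is a hedged sketch: the redis-cli invocation is the one shown in the steps, but `read_system_state`, `wait_for_state`, and the timeout values are hypothetical helpers and numbers chosen for illustration, and the script assumes it runs on the device where redis-cli can reach STATE_DB.

```python
import subprocess
import time


def read_system_state():
    """Read SYSTEM_READY|SYSTEM_STATE Status from DB 6 via redis-cli."""
    result = subprocess.run(
        ["redis-cli", "-n", "6", "hget", "SYSTEM_READY|SYSTEM_STATE", "Status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_state(expected, timeout_secs, read_state=read_system_state,
                   poll_secs=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until read_state() returns `expected`; False if the timeout passes."""
    deadline = clock() + timeout_secs
    while True:
        if read_state() == expected:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_secs)
```

After killing bgpd, one might call `wait_for_state("DOWN", 20)` and then `wait_for_state("UP", 60)`; the 20 s and 60 s budgets are guesses that leave headroom over the ~15 s detection and ~30 s restart figures quoted above.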

Unit tests under tests/test_system_health.py that exercise the touched functions (test_get_all_system_status_ok, test_get_all_system_status_not_ok, test_check_unit_status*, test_post_unit_status, test_update_system_status) instantiate a fresh Sysmonitor(), so the rebuild semantics in get_all_system_status() produce the same result as the old append-only logic; no test changes required.

Which release branch to backport (please specify)

The change is small and self-contained, so it should backport cleanly anywhere sysmonitor.py exists.

Tested branch (please specify)

Tested on: master (live BlueField DPU image, str2-mlnx-4280-smartswitch-02-dpu-0)

Description for the changelog

[system-health] sysmonitor periodic full re-scan to catch service failures missed by systemd D-Bus events

Link to config_db schema for YANG module changes

N/A — no schema or YANG changes.

mssonicbld · 2026-06-14T18:23:51Z

/azp run Azure.sonic-buildimage

azure-pipelines · 2026-06-14T18:24:01Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

This PR enhances sysmonitor in SONiC’s system-health component to avoid missing short-lived systemd service failures (e.g., crash + auto-restart) by adding a periodic full rescan backstop and making full-scan state tracking safe to run repeatedly.

Changes:

Added a periodic (15s) full-scan backstop in Sysmonitor.system_service() using time.monotonic() to catch transient failures that may not be observed via JobRemoved/STATE_DB events.
Updated get_all_system_status() to rebuild dnsrvs_name from scratch each sweep so recovered services can be cleared during periodic polling.
Introduced PERIODIC_POLL_INTERVAL_SECS constant to control the backstop cadence.

+        fresh_down = set()
+        for service in self.get_all_service_list():
            ustate = self.get_unit_status(service)
            if ustate == "NOT OK":
-                if service not in self.dnsrvs_name:
-                    self.dnsrvs_name.add(service)
+                fresh_down.add(service)


+            # Backstop: if no JobRemoved event arrived for a transient failure
+            # (crash + auto-restart inside one window), the next full scan catches it.
+            if time.monotonic() - last_full_scan_ts >= PERIODIC_POLL_INTERVAL_SECS:
+                try:
+                    self.update_system_status()
+                except Exception as e:
+                    logger.log_error("periodic update_system_status: "+str(e))
+                last_full_scan_ts = time.monotonic()


[system-health] Add periodic full-scan backstop so transient service …

cbccba1

…failures are not missed Signed-off-by: BYGX-wcr <wcr@live.cn>

Copilot AI review requested due to automatic review settings June 14, 2026 18:23

BYGX-wcr requested a review from lguohan as a code owner June 14, 2026 18:23

BYGX-wcr requested a review from vivekrnv June 14, 2026 18:24

Copilot started reviewing on behalf of BYGX-wcr June 14, 2026 18:24

BYGX-wcr requested a review from zjswhhh June 14, 2026 18:24

BYGX-wcr added the Request for 202605 Branch label Jun 14, 2026

Copilot AI reviewed Jun 14, 2026

zjswhhh approved these changes Jun 15, 2026

BYGX-wcr wants to merge 1 commit into sonic-net:master from BYGX-wcr:enhance-sysmonitor
