[system-health] Add periodic full-scan backstop so transient service failures are not missed #27887

Open
BYGX-wcr wants to merge 1 commit into sonic-net:master from BYGX-wcr:enhance-sysmonitor

@BYGX-wcr BYGX-wcr commented Jun 14, 2026

Contributor

Why I did it

sysmonitor (under src/system-health/health_checker/sysmonitor.py) detects service-state changes through two event sources only:

  1. systemd D-Bus JobRemoved signals (MonitorSystemBusTask)
  2. STATE_DB FEATURE table updates (MonitorStateDbTask)

A JobRemoved signal is emitted only when an explicit start/stop/restart job completes. An unsolicited main-process death does not emit JobRemoved by itself — only the subsequent auto-restart job does. Combined with Restart=always + RestartSec=30 on most SONiC service units, this creates a window where:

  • a critical process inside a feature container is killed (e.g. bgpd in the bgp container),
  • the container exits and the .service enters failed/inactive,
  • systemd silently waits up to 30 s, then auto-restarts and the service returns to active/running,
  • the only JobRemoved event sysmonitor receives describes the restart's completion,
  • by the time get_unit_status() queries systemctl show, the unit is back to active/running,
  • SYSTEM_READY|SYSTEM_STATE = UP is never flipped to DOWN.
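
The window described above can be illustrated with a toy timeline. This is only a sketch under assumed timings, not the sysmonitor code: `unit_state_at` is a hypothetical stand-in for what `systemctl show` would report, and the crash time and 2 s restart duration are invented for illustration.

```python
# Toy model of the race. All timings and helper names are illustrative assumptions.
CRASH_AT = 5.0          # moment the critical process is killed (seconds)
RESTART_SEC = 30.0      # RestartSec=30 from the unit file
RESTART_DURATION = 2.0  # assumed time for the auto-restart job to finish

RESTART_DONE_AT = CRASH_AT + RESTART_SEC + RESTART_DURATION


def unit_state_at(t):
    """Hypothetical stand-in for the unit state reported at time t."""
    if CRASH_AT <= t < RESTART_DONE_AT:
        return "failed"
    return "active"


# JobRemoved is emitted only when the restart job completes, so an
# event-driven monitor samples the unit no earlier than that moment.
event_driven_sample = unit_state_at(RESTART_DONE_AT)

# A 15 s periodic sweep samples the unit regardless of events.
periodic_samples = [unit_state_at(t) for t in (0.0, 15.0, 30.0, 45.0)]

print(event_driven_sample)  # active
print(periodic_samples)     # ['active', 'failed', 'failed', 'active']
```

In this model the event-driven sample only ever sees `active`, while two of the periodic samples land inside the failed window.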

On smartswitches, the DPU code path in chassisd (_get_control_plane_state_common()) reads SYSTEM_READY|SYSTEM_STATE, so CHASSIS_STATE_DB|DPU_STATE|DPUx:dpu_control_plane_state also fails to flip to down for short transient failures of a single service, hiding them from the NPU / DASH HA.

Reproduced live on a BlueField DPU running SONiC: killing bgpd produced no system#monitor log entry and no SYSTEM_READY transition, while killing orchagent did — only because the swss kill cascaded to multiple services and gave sysmonitor enough time to sample one of them in failed.

Work item tracking

  • Microsoft ADO: 38379400

How I did it

Two minimal changes in src/system-health/health_checker/sysmonitor.py:

  1. Periodic backstop poll. Added a 15 s full sweep to the Sysmonitor.system_service() main loop. The interval is measured with time.monotonic(), so it is independent of system clock changes and still fires when the event queue is idle:

    if time.monotonic() - last_full_scan_ts >= PERIODIC_POLL_INTERVAL_SECS:
        self.update_system_status()
        last_full_scan_ts = time.monotonic()

    For units with RestartSec=30, at least one sweep runs inside the restart delay, so even if JobRemoved is missed, systemctl show samples the unit while it is still failed/inactive and the aggregate state flips correctly. Worst-case detection latency: ~15 s + one queue cycle.

  2. get_all_system_status() now rebuilds self.dnsrvs_name from scratch on every call instead of only appending to it. The previous implementation only ever added entries, which was fine for the single boot-time invocation but would have wedged the aggregate state at DOWN once periodic polling was introduced. Recovered services are now removed by the next sweep.
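
The difference between the two bookkeeping strategies in item 2 can be sketched as follows. The function names and the status dictionaries are hypothetical and stand in for the real sysmonitor methods; only the "NOT OK" status string is taken from the PR.

```python
def sweep_append_only(down, statuses):
    """Old behaviour as described: only ever adds entries to the down list."""
    for name, status in statuses.items():
        if status == "NOT OK" and name not in down:
            down.append(name)
    return down


def sweep_rebuild(statuses):
    """New behaviour as described: recompute the down list on every call."""
    return [name for name, status in statuses.items() if status == "NOT OK"]


first = {"bgp": "NOT OK", "swss": "OK"}   # bgp is down during the first sweep
second = {"bgp": "OK", "swss": "OK"}      # bgp has recovered by the second sweep

down_old = sweep_append_only([], first)
down_old = sweep_append_only(down_old, second)

down_new = sweep_rebuild(first)
down_new = sweep_rebuild(second)

print(down_old)  # ['bgp']
print(down_new)  # []
```

With repeated sweeps, the append-only list still reports `bgp` as down after it has recovered, which is the wedged-at-DOWN behaviour described above; the rebuilt list is empty again.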

A new module constant PERIODIC_POLL_INTERVAL_SECS = 15 controls the cadence. No new dependencies, no API changes, no schema changes.
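
A minimal sketch of how such a backstop can sit in an event loop is shown below. It is not the sysmonitor implementation: `run_loop`, `handle_event`, `full_scan`, `queue_timeout`, and `max_iterations` are hypothetical names, and the assumption that the loop blocks on a queue with a timeout is inferred from the "one queue cycle" wording. The clock is injectable so the loop can be exercised without waiting 15 s.

```python
import queue
import time

PERIODIC_POLL_INTERVAL_SECS = 15  # value taken from the PR description


def run_loop(events, full_scan, handle_event, *,
             interval=PERIODIC_POLL_INTERVAL_SECS,
             queue_timeout=1.0, clock=time.monotonic, max_iterations=None):
    """Handle queued events and call full_scan() once `interval` seconds have elapsed."""
    last_full_scan_ts = clock()
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            handle_event(events.get(timeout=queue_timeout))
        except queue.Empty:
            pass  # an idle queue must not stop the backstop from firing
        if clock() - last_full_scan_ts >= interval:
            full_scan()
            last_full_scan_ts = clock()


# Usage with a fake clock that advances 10 s per reading and an empty queue.
ticks = iter(range(0, 1000, 10))
scans = []
run_loop(queue.Queue(), lambda: scans.append("scan"), print,
         queue_timeout=0.01, clock=lambda: next(ticks), max_iterations=3)
print(scans)  # ['scan']
```

Because the elapsed-time check runs after every queue wait, whether or not an event arrived, the worst-case delay is the interval plus one queue timeout.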

How to verify it

On any SONiC device (smartswitch DPU recommended, since the effect on dpu_control_plane_state is the original motivator):

  1. Confirm baseline:

    redis-cli -n 6 hget "SYSTEM_READY|SYSTEM_STATE" Status     # UP
    redis-cli -n 6 hgetall "ALL_SERVICE_STATUS|bgp"            # service_status=OK
  2. Kill a critical process so the container exits while leaving Restart=always in place:

    docker exec bgp supervisorctl signal KILL bgpd
  3. Within ~15 s, system#monitor should now log the transition and flip the state:

    healthd[…]: bgp.service service state changed to [failed/failed]   (or inactive/dead)
    healthd[…]: System is not ready - one or more services are not up
    
    redis-cli -n 6 hget "SYSTEM_READY|SYSTEM_STATE" Status     # DOWN
  4. On a smartswitch, the NPU then sees the DPU's control plane go down:

    redis-cli -n 13 hget "DPU_STATE|DPU0" dpu_control_plane_state   # down
  5. After the auto-restart completes (~30 s later), the state recovers: a "System is ready" log line is emitted and SYSTEM_READY returns to UP. Before this PR, neither the DOWN log line nor the state transition occurred in this scenario.
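
The manual checks above could be scripted with a small polling helper. This is a sketch, not part of the PR: `read_system_state` and `wait_for_state` are hypothetical helpers, the timeouts are assumptions, and the only external command used is the `redis-cli` invocation from step 1.

```python
import subprocess
import time


def read_system_state():
    """Read SYSTEM_READY|SYSTEM_STATE Status via redis-cli, as in step 1."""
    result = subprocess.run(
        ["redis-cli", "-n", "6", "hget", "SYSTEM_READY|SYSTEM_STATE", "Status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_state(expected, read=read_system_state, timeout=20.0, interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll read() until it returns `expected`.

    Returns the seconds waited, or None if `timeout` elapsed first.
    """
    start = clock()
    while True:
        if read() == expected:
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

After step 2, calling `wait_for_state("DOWN")` and then `wait_for_state("UP", timeout=60.0)` would cover steps 3 and 5; the 20 s and 60 s limits are assumed margins around the 15 s poll and the ~30 s restart delay.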

Unit tests under tests/test_system_health.py that exercise the touched functions (test_get_all_system_status_ok, test_get_all_system_status_not_ok, test_check_unit_status*, test_post_unit_status, test_update_system_status) instantiate a fresh Sysmonitor(), so the rebuild semantics in get_all_system_status() produce the same result as the old append-only logic; no test changes required.

Which release branch to backport (please specify)

  • 202205
  • 202211
  • 202305
  • 202311
  • 202405
  • 202605

The change is small and self-contained, so it should backport cleanly anywhere sysmonitor.py exists.

Tested branch (please specify)

  • Tested on: master (live BlueField DPU image, str2-mlnx-4280-smartswitch-02-dpu-0)

Description for the changelog

[system-health] sysmonitor periodic full re-scan to catch service failures missed by systemd D-Bus events

Link to config_db schema for YANG module changes

N/A — no schema or YANG changes.

…failures are not missed

Signed-off-by: BYGX-wcr <wcr@live.cn>
Copilot AI review requested due to automatic review settings June 14, 2026 18:23
@BYGX-wcr BYGX-wcr requested a review from lguohan as a code owner June 14, 2026 18:23
@mssonicbld

Collaborator

/azp run Azure.sonic-buildimage

@BYGX-wcr BYGX-wcr requested a review from vivekrnv June 14, 2026 18:24
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Contributor

Pull request overview

This PR enhances sysmonitor in SONiC’s system-health component to avoid missing short-lived systemd service failures (e.g., crash + auto-restart) by adding a periodic full rescan backstop and making full-scan state tracking safe to run repeatedly.

Changes:

  • Added a periodic (15s) full-scan backstop in Sysmonitor.system_service() using time.monotonic() to catch transient failures that may not be observed via JobRemoved/STATE_DB events.
  • Updated get_all_system_status() to rebuild dnsrvs_name from scratch each sweep so recovered services can be cleared during periodic polling.
  • Introduced PERIODIC_POLL_INTERVAL_SECS constant to control the backstop cadence.

Comment on lines +406 to +410
    fresh_down = set()
    for service in self.get_all_service_list():
        ustate = self.get_unit_status(service)
        if ustate == "NOT OK":
            if service not in self.dnsrvs_name:
                self.dnsrvs_name.add(service)
            fresh_down.add(service)
Comment on lines +544 to +551
    # Backstop: if no JobRemoved event arrived for a transient failure
    # (crash + auto-restart inside one window), the next full scan catches it.
    if time.monotonic() - last_full_scan_ts >= PERIODIC_POLL_INTERVAL_SECS:
        try:
            self.update_system_status()
        except Exception as e:
            logger.log_error("periodic update_system_status: "+str(e))
        last_full_scan_ts = time.monotonic()