Skip to content

[Link Event Damping] Add per-port link event damper with syslog and monitor-only mode#1936

Closed
DendroLabs wants to merge 3 commits into
sonic-net:masterfrom
DendroLabs:link-event-damping/per-port-damper
Closed

[Link Event Damping] Add per-port link event damper with syslog and monitor-only mode#1936
DendroLabs wants to merge 3 commits into
sonic-net:masterfrom
DendroLabs:link-event-damping/per-port-damper

Conversation

@DendroLabs

Copy link
Copy Markdown
Contributor

Summary

Supersedes #1334 by @Ashish1805. Adds the per-port link event damper class that implements RFC 2439 link dampening in syncd. This is the core damping algorithm — the component that decides whether to forward or suppress SAI port state change notifications based on penalty tracking.

All review feedback from #1334 has been addressed (see table below). Additionally, this PR adds two differentiating features that no vendor currently provides:

  1. Syslog on suppress/unsuppress — The add README.md #1 industry complaint (Cisco, Juniper, Arista forums) is that dampening activates silently. We emit LED_SUPPRESS (WARNING), LED_UNSUPPRESS (NOTICE), and LED_CLEAR (NOTICE) syslog messages on state transitions.

  2. Monitor-only mode (algorithm=aied-monitor) — RFC 7196 recommends a "Calculate But Do Not Damp" mode. No vendor has implemented it. This allows operators to safely tune damping parameters in production without risking outages. Penalty is tracked and logged, but events are never suppressed.

Depends on: #1798 (SelectablesTracker, merged), #1935 (RedisInterface, open)

Notification Lifecycle — Addressing @kcudnik's Design Concern

@kcudnik requested a flow diagram for the notification lifecycle. The damper does NOT fabricate new SAI state — it operates as a store-and-forward interceptor:

1. SAI vendor lib fires on_port_state_change({port_id, status=DOWN})
   This is the ONLY source of truth. The damper never invents state.

2. Damper intercepts: decay penalty, add flap_penalty if DOWN, store {port_id, status}

3a. penalty < suppress_threshold → FORWARD the original notification as-is
3b. penalty >= suppress_threshold → SUPPRESS: store last_event, start decay timer

4. Timer fires, penalty < reuse_threshold → UNSUPPRESS:
   Forward the stored notification data (same port_id + last SAI-reported status)

The "manual construction" at step 4 is simply de-queuing a stored event — every field originates from a real SAI notification received in step 1. This is the standard implementation pattern used by Cisco IOS-XR and Juniper JUNOS.

Safety guarantees:

  • Last-event verification before forwarding on timer expiry
  • Cleanup on port removal
  • Immediate release when damping is disabled via config update
  • No amplification — damper can only forward fewer notifications than received

Review Feedback Addressed

# @kcudnik Comment Resolution
1 "use std::max?" Used std::min (equivalent logic) in updatePenalty()
2 "comment in weird place, please remove" Removed
3 "join those lines, why cast to 64?" Joined, cast removed
4 "empty line not needed" Removed
5 "move code to cpp file" getCurrentTimeUsecs() moved to .cpp, made virtual protected
6 "each struct/class should have its own file" DampingStats split to DampingStats.h
7 "add empty line before comment" Comment formatting fixed throughout
8 "don't use friend class" Removed. Public getters + TestablePortLinkEventDamper subclass with virtual time
9 "mock classes in separated files" MockNotificationHandler.h in separate file
10 "manually generating port notification" Flow diagram above; see notification lifecycle detail

Files Changed

New:

  • syncd/DampingStats.h — Statistics struct (split per review)
  • syncd/PortLinkEventDamper.h — Damper class header
  • syncd/PortLinkEventDamper.cpp — Damper implementation (~350 lines)
  • unittest/syncd/MockNotificationHandler.h — GMock class
  • unittest/syncd/TestPortLinkEventDamper.cpp — 19 test cases (~600 lines)

Modified:

  • syncd/NotificationHandler.h/.cpp — Add onPortStateChangePostLinkEventDamping() virtual callback
  • syncd/Makefile.am — Add PortLinkEventDamper.cpp
  • unittest/syncd/Makefile.am — Add test file

Test Plan

Unit tests cover:

  • Setup with valid/disabled configs
  • Penalty ceiling calculation (parameterized, 4 configs)
  • Timer expiry with/without state advertisement
  • Event forwarding on config-disabled ports
  • UP/DOWN events without active damping
  • DOWN event activating damping + syslog
  • Events suppressed during active damping
  • Config update clearing active damping + syslog
  • Monitor-only mode: all events forwarded, damping never activates
  • Disabled config parameterized tests (5 variants)

🤖 Generated with Claude Code

Add PortLinkEventDamper class that implements RFC 2439 link event damping
on a per-port basis in syncd. This is the core damping algorithm that
intercepts SAI port state change notifications, applies penalty-based
suppression logic, and either forwards or delays forwarding of events.

Key features beyond the original sonic-net#1334:
- Syslog on suppress/unsuppress transitions (LED_SUPPRESS, LED_UNSUPPRESS)
- Monitor-only mode (algorithm=aied-monitor) per RFC 7196
- Implements SelectableEventHandler interface for SelectablesTracker (sonic-net#1798)
- Public inspection methods replacing friend class pattern

All review feedback from @kcudnik on sonic-net#1334 addressed:
- Use std::max in updatePenalty()
- Remove comment on setInterval
- Join lines, remove unnecessary uint64 cast
- Move getCurrentTimeUsecs from header to cpp
- Split DampingStats into own header file
- Remove friend class usage
- Fix comment formatting

Supersedes: sonic-net#1334

Signed-off-by: DendroLabs <info@dendrolabs.com>
Add comprehensive unit tests for PortLinkEventDamper covering:
- Setup with valid and disabled configs
- Penalty ceiling calculation (parameterized)
- Timer expiration with/without state advertisement
- Event forwarding on config-disabled ports
- UP/DOWN events with damping not active
- DOWN event activating damping
- UP/DOWN events suppressed while damping active
- Config update clearing active damping
- Monitor-only mode forwarding all events
- Monitor-only mode never activating damping

Uses TestablePortLinkEventDamper subclass with virtual
getCurrentTimeUsecs() for deterministic time control, replacing
the friend class / Peer pattern from the original sonic-net#1334.

Signed-off-by: DendroLabs <info@dendrolabs.com>
@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@DendroLabs

Copy link
Copy Markdown
Contributor Author

cc @kcudnik @mikeberesford — requesting review. This supersedes #1334 with all review feedback addressed, plus syslog and monitor-only mode. The flow diagram above addresses the notification lifecycle concern from the original PR.

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

Add missing #include <cinttypes> for PRIu64 format macro used in
SWSS_LOG calls at lines 128, 213, and 281.

Fix a bug where resetTimer(0) permanently suppresses a port. When a
DOWN event arrives while damping is active and the accumulated penalty
has decayed below the reuse threshold, timeToReachTargetValue returns 0.
timerfd_settime with it_value={0,0} disarms the timer per the Linux man
page, leaving m_dampingActive=true with no mechanism to clear it. The
fix checks whether the penalty is already at or below the reuse
threshold after updatePenalty() and clears damping immediately instead
of scheduling a zero-interval timer.

Fix typo in test case name (MaxSuppresssTimeIsZero -> MaxSuppressTimeIsZero).

Signed-off-by: DendroLabs <info@dendrolabs.com>
@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@DendroLabs

Copy link
Copy Markdown
Contributor Author

Closing this in favor of #1906, which merged on June 12 and lands the per port link event damper in syncd. Thanks @prsunny for steering the consolidation and @sivat6 for getting it across the line. Glad the notes I left on the CI failures (the -Wshadow and CodeQL shadowing issues) were useful.

Two enhancements I had built here are not in #1906: syslog on suppress and unsuppress, and the RFC 7196 monitor only mode. I think both are worth considering as follow ups, and I have raised them, along with a couple of STATE_DB schema questions, on the CLI PR (sonic-net/sonic-utilities#4367) so they can be settled in one place.

Thanks also @kcudnik and @mikeberesford for the earlier review attention on this line of work.

@DendroLabs DendroLabs closed this Jun 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants