Skip to content

zebra: add dplane helpers to provide interface speed#233

Open
mwinter-osr wants to merge 4 commits into
opensourcerouting:masterfrom
mwinter-osr:PR19412
Open

zebra: add dplane helpers to provide interface speed#233
mwinter-osr wants to merge 4 commits into
opensourcerouting:masterfrom
mwinter-osr:PR19412

Conversation

@mwinter-osr

Copy link
Copy Markdown
Member

Add dplane ctx helpers to carry interface speed and skip ethtool polling when the dataplane provides a value. Avoid scheduling speed update timers in this case.

Original PR19412 by maxime-leroy

Add dplane ctx helpers to carry interface speed.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
kernel_get_speed() doesn't need the full interface object; it only needs
the interface name and the VRF id to open the right socket/ioctl. Update
the prototype and callers accordingly.

This will be used in the next commit.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
This introduces DPLANE_OP_INTF_SPEED_GET so link speed is resolved in the
dataplane via ethtool and reported to zebra. Zebra no longer performs
synchronous speed reads; it simply applies the value provided by the
dataplane.

If speed is already known during interface creation or modification, it
can be included in INTF_INSTALL/INTF_UPDATE and zebra will use it
directly. If speed is not provided, the zebra main thread
issues a follow-up INTF_SPEED_GET to request the dataplane to fetch the
speed asynchronously.

For dataplane providers that implement only INTF_INSTALL/INTF_UPDATE and
do not support INTF_SPEED_GET, zebra relies on any speed value provided
by install/update. If speed is missing, zebra attempts a single
INTF_SPEED_GET query and stops if the operation is unsupported or fails.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
The 15-second timer used to re-query interface speed is currently
scheduled in if_zebra_new_hook() for every newly created
interface. However, at that point the interface may not yet exist in the
OS, and in some cases it may never be created.

Because of this, the speed query will usually
fail (e.g. INTERFACE_SPEED_ERROR_READ) since the interface doesn't
exist. There is also a race condition: even if the interface is created,
the timer may run before the RTM_NEWLINK message is processed.

As a result, ifp->ifp_index can remain IFINDEX_INTERNAL (0). When
if_add_update() calls zebra_ns_link_ifp(), the interface tree is updated
with this incorrect index. If this happens for multiple interfaces, the
tree can end up with duplicate keys, eventually causing a zebra crash.

A check was added to zebra_ns_link_ifp() to avoid adding an interface
with an IFINDEX_INTERNAL index, but the root cause remained.

This change fixes the underlying issue by scheduling the speed-update
timer only when a valid RTM_NEWLINK has been received. The scheduling
logic is moved from if_zebra_new_hook() to
zebra_if_dplane_ifp_handling(), and runs only once the interface has a
correct ifindex.

Fixes: dc7b3ca ("zebra: Add one-shot thread to recheck speed")
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
@greptile-apps

greptile-apps Bot commented Apr 14, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds a DPLANE_OP_INTF_SPEED_GET dataplane operation so interface speed queries run asynchronously on the dplane thread rather than blocking the main thread, and carries the ethtool result back via ctx helpers. At startup on Linux, speed is fetched synchronously during netlink parsing and embedded in the ctx, skipping timer-based retries when a value is already known.

  • P1 – BSD regression (rt_socket.c): The old BSD stub returned ifp->speed (which kernel_socket.c populates from ifi_baudrate); the new stub always returns INTERFACE_SPEED_ERROR_READ. zebra_if_speed_update_ctx short-circuits on failure without updating the speed or scheduling a retry, permanently silencing speed-change client notifications on BSD.
  • P0 – No topotests: New dplane speed-helper functionality has no accompanying topotest coverage, which is required before merge per project policy.

Confidence Score: 4/5

  • Do not merge: P1 BSD regression in rt_socket.c and missing mandatory topotests must be addressed first.
  • The Linux/netlink path is structurally sound, but the rt_socket.c stub change definitively breaks interface speed on BSD by always returning INTERFACE_SPEED_ERROR_READ. Additionally, no topotests are provided for the new dplane speed-get operation, which is a mandatory requirement per project policy. Both issues must be resolved before this is safe to merge.
  • zebra/rt_socket.c (P1 BSD regression) and zebra/if_netlink.c (startup NULL-error suppresses retry).

Important Files Changed

Filename Overview
zebra/rt_socket.c P1 regression: kernel_get_speed now always returns INTERFACE_SPEED_ERROR_READ instead of ifp->speed, breaking BSD interface speed and client notifications.
zebra/if_netlink.c Startup path calls kernel_get_speed synchronously with NULL error, silently forcing speed_set=true even on EOPNOTSUPP, suppressing the async retry mechanism for interfaces with wrong startup speed.
zebra/interface.c Core speed-update logic refactored to use async dplane ctx; timer and callback plumbing look correct; minor %u/%d format mismatch in default error log.
zebra/zebra_dplane.c New DPLANE_OP_INTF_SPEED_GET op, ctx helpers, counters, and kernel handler all look structurally correct; debug log reads unset speed field (always 0) before the kernel query runs.
zebra/zebra_dplane.h New op enum value and four ctx accessor declarations look correct; no issues found.
zebra/kernel_netlink.c DPLANE_OP_INTF_SPEED_GET added to the FRR_NETLINK_SUCCESS group; correct since the speed op is handled before reaching nl_put_msg.
zebra/kernel_socket.c New op added to the no-op break group in kernel_update_multi; this code is dead (op is intercepted earlier in kernel_dplane_process_func) but harmless.
zebra/dplane_fpm_nl.c INTF_SPEED_GET added to the skip-FPM-message group; correct.

Sequence Diagram

sequenceDiagram
    participant KNL as kernel_netlink / kernel_socket
    participant MAIN as Zebra Main Thread
    participant DPLANE as Dplane Thread
    participant CB as Result Callback

    Note over KNL,MAIN: Startup (Linux / netlink path)
    KNL->>KNL: kernel_get_speed() [synchronous ethtool ioctl]
    KNL->>MAIN: "dplane_ctx with speed_set=true, speed=N"
    MAIN->>MAIN: "zebra_if_dplane_ifp_handling()<br/>if_update_state_speed(ifp, speed)<br/>no timer scheduled"

    Note over MAIN,CB: Runtime / non-startup path
    MAIN->>DPLANE: "dplane_intf_speed_get(ifp)<br/>[enqueue INTF_SPEED_GET ctx]"
    DPLANE->>DPLANE: "zebra_if_speed_process(ctx)<br/>kernel_get_speed(vrf_id, ifname, &err)"
    DPLANE->>CB: enqueue result ctx
    CB->>MAIN: "zebra_if_dplane_result()<br/>→ zebra_if_speed_update_ctx()"
    alt speed changed or unknown
        MAIN->>MAIN: if_update_state_speed() + if_add_update()
        MAIN->>MAIN: schedule retry timer (5s, up to 4 min)
    end

    Note over MAIN,CB: if_up() path
    MAIN->>MAIN: event_cancel(speed_update)
    MAIN->>DPLANE: dplane_intf_speed_get(ifp) [immediate]
    MAIN->>MAIN: "if_handle_bond_speed_change()<br/>→ zebra_if_schedule_speed_update(bond_master, 1s)"
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: zebra/rt_socket.c
Line: 384-392

Comment:
**BSD speed regression: always returning INTERFACE_SPEED_ERROR_READ**

The old stub returned `ifp->speed`, which works correctly on BSD because `kernel_socket.c` populates `ifp->speed` directly from `ifm->ifm_data.ifi_baudrate / 1000000` on RTM_IFINFO messages (`kernel_socket.c` line 730). That assignment does not call `if_update_state_speed`, so the timer-based path was the only way clients received the speed-change notification via `if_add_update`.

With this change, `kernel_get_speed` always returns `INTERFACE_SPEED_ERROR_READ` on BSD. `zebra_if_speed_update_ctx` then hits the `ZEBRA_DPLANE_REQUEST_FAILURE` early-return path, skips the speed update, and never reschedules — leaving BSD clients without any speed-change notification. The correct fix is to restore the `ifp->speed` return for BSD:

```c
uint32_t kernel_get_speed(vrf_id_t vrf_id, const char *ifname, int *error)
{
	struct interface *ifp;

	ifp = if_lookup_by_name(ifname, VRF_DEFAULT);
	if (!ifp) {
		if (error)
			*error = INTERFACE_SPEED_ERROR_READ;
		return 0;
	}

	if (error)
		*error = 0;
	return ifp->speed;
}
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: zebra/if_netlink.c
Line: 1509-1513

Comment:
**Startup error silently suppresses the retry mechanism**

`kernel_get_speed` is called with `NULL` for the `error` pointer, so any error return (including `EOPNOTSUPP` for virtual/bridge interfaces, which sets `speed=0, error=INTERFACE_SPEED_ERROR_UNKNOWN`) is silently swallowed. The caller then unconditionally sets `speed_set=true`, which in `zebra_if_dplane_ifp_handling` skips both `dplane_intf_speed_get` and the 15-second retry timer. An interface that reports the wrong speed at startup will never be corrected.

Passing `&error` and treating `INTERFACE_SPEED_ERROR_UNKNOWN` as `speed_set=false` would preserve the retry logic for uncertain speeds:

```c
		int sp_err = 0;
		uint32_t sp = kernel_get_speed(vrf_id, name, &sp_err);

		if (sp_err == 0) {
			dplane_ctx_set_ifp_speed(ctx, sp);
			dplane_ctx_set_ifp_speed_set(ctx, true);
		} else {
			dplane_ctx_set_ifp_speed_set(ctx, false);
		}
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: zebra/zebra_dplane.c
Line: 7104-7108

Comment:
**Debug log shows stale speed=0 on the request path**

`kernel_dplane_log_detail` is called from `kernel_dplane_process_func` before `kernel_dplane_process_if_speed` has been invoked, so `dplane_ctx_get_ifp_speed(ctx)` will always return 0 here. Logging the speed at this stage is misleading. The meaningful speed is available only after the kernel query completes. Consider logging the ifname and vrf instead, which are set at enqueue time:

```suggestion
	case DPLANE_OP_INTF_SPEED_GET:
		zlog_debug("Dplane intf %s, idx %u, ifname %s",
			   dplane_op2str(dplane_ctx_get_op(ctx)), dplane_ctx_get_ifindex(ctx),
			   dplane_ctx_get_ifname(ctx));
		break;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: zebra/interface.c
Line: 154-156

Comment:
**Format specifier mismatch in default error-case log**

`error` is declared as `int` but formatted with `%u` (unsigned). While benign for small values, a negative `error` code would log as a large unsigned number, making debugging harder.

```suggestion
		if (IS_ZEBRA_DEBUG_KERNEL)
			zlog_debug("kernel_get_speed returned an unknown error %d", error);
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "zebra: schedule speed-query timer only w..." | Re-trigger Greptile

Comment thread zebra/rt_socket.c
Comment on lines 384 to +392
return 0;
}

uint32_t kernel_get_speed(struct interface *ifp, int *error)
uint32_t kernel_get_speed(vrf_id_t vrf_id, const char *ifname, int *error)
{
return ifp->speed;
if (error)
*error = INTERFACE_SPEED_ERROR_READ;

return 0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 BSD speed regression: always returning INTERFACE_SPEED_ERROR_READ

The old stub returned ifp->speed, which works correctly on BSD because kernel_socket.c populates ifp->speed directly from ifm->ifm_data.ifi_baudrate / 1000000 on RTM_IFINFO messages (kernel_socket.c line 730). That assignment does not call if_update_state_speed, so the timer-based path was the only way clients received the speed-change notification via if_add_update.

With this change, kernel_get_speed always returns INTERFACE_SPEED_ERROR_READ on BSD. zebra_if_speed_update_ctx then hits the ZEBRA_DPLANE_REQUEST_FAILURE early-return path, skips the speed update, and never reschedules — leaving BSD clients without any speed-change notification. The correct fix is to restore the ifp->speed return for BSD:

uint32_t kernel_get_speed(vrf_id_t vrf_id, const char *ifname, int *error)
{
	struct interface *ifp;

	ifp = if_lookup_by_name(ifname, VRF_DEFAULT);
	if (!ifp) {
		if (error)
			*error = INTERFACE_SPEED_ERROR_READ;
		return 0;
	}

	if (error)
		*error = 0;
	return ifp->speed;
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: zebra/rt_socket.c
Line: 384-392

Comment:
**BSD speed regression: always returning INTERFACE_SPEED_ERROR_READ**

The old stub returned `ifp->speed`, which works correctly on BSD because `kernel_socket.c` populates `ifp->speed` directly from `ifm->ifm_data.ifi_baudrate / 1000000` on RTM_IFINFO messages (`kernel_socket.c` line 730). That assignment does not call `if_update_state_speed`, so the timer-based path was the only way clients received the speed-change notification via `if_add_update`.

With this change, `kernel_get_speed` always returns `INTERFACE_SPEED_ERROR_READ` on BSD. `zebra_if_speed_update_ctx` then hits the `ZEBRA_DPLANE_REQUEST_FAILURE` early-return path, skips the speed update, and never reschedules — leaving BSD clients without any speed-change notification. The correct fix is to restore the `ifp->speed` return for BSD:

```c
uint32_t kernel_get_speed(vrf_id_t vrf_id, const char *ifname, int *error)
{
	struct interface *ifp;

	ifp = if_lookup_by_name(ifname, VRF_DEFAULT);
	if (!ifp) {
		if (error)
			*error = INTERFACE_SPEED_ERROR_READ;
		return 0;
	}

	if (error)
		*error = 0;
	return ifp->speed;
}
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread zebra/if_netlink.c
Comment on lines +1509 to +1513
if (startup) {
dplane_ctx_set_ifp_speed(ctx, kernel_get_speed(vrf_id, name, NULL));
dplane_ctx_set_ifp_speed_set(ctx, true);
} else
dplane_ctx_set_ifp_speed_set(ctx, false);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Startup error silently suppresses the retry mechanism

kernel_get_speed is called with NULL for the error pointer, so any error return (including EOPNOTSUPP for virtual/bridge interfaces, which sets speed=0, error=INTERFACE_SPEED_ERROR_UNKNOWN) is silently swallowed. The caller then unconditionally sets speed_set=true, which in zebra_if_dplane_ifp_handling skips both dplane_intf_speed_get and the 15-second retry timer. An interface that reports the wrong speed at startup will never be corrected.

Passing &error and treating INTERFACE_SPEED_ERROR_UNKNOWN as speed_set=false would preserve the retry logic for uncertain speeds:

		int sp_err = 0;
		uint32_t sp = kernel_get_speed(vrf_id, name, &sp_err);

		if (sp_err == 0) {
			dplane_ctx_set_ifp_speed(ctx, sp);
			dplane_ctx_set_ifp_speed_set(ctx, true);
		} else {
			dplane_ctx_set_ifp_speed_set(ctx, false);
		}
Prompt To Fix With AI
This is a comment left during a code review.
Path: zebra/if_netlink.c
Line: 1509-1513

Comment:
**Startup error silently suppresses the retry mechanism**

`kernel_get_speed` is called with `NULL` for the `error` pointer, so any error return (including `EOPNOTSUPP` for virtual/bridge interfaces, which sets `speed=0, error=INTERFACE_SPEED_ERROR_UNKNOWN`) is silently swallowed. The caller then unconditionally sets `speed_set=true`, which in `zebra_if_dplane_ifp_handling` skips both `dplane_intf_speed_get` and the 15-second retry timer. An interface that reports the wrong speed at startup will never be corrected.

Passing `&error` and treating `INTERFACE_SPEED_ERROR_UNKNOWN` as `speed_set=false` would preserve the retry logic for uncertain speeds:

```c
		int sp_err = 0;
		uint32_t sp = kernel_get_speed(vrf_id, name, &sp_err);

		if (sp_err == 0) {
			dplane_ctx_set_ifp_speed(ctx, sp);
			dplane_ctx_set_ifp_speed_set(ctx, true);
		} else {
			dplane_ctx_set_ifp_speed_set(ctx, false);
		}
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread zebra/zebra_dplane.c
Comment on lines +7104 to +7108
case DPLANE_OP_INTF_SPEED_GET:
zlog_debug("Dplane intf %s, idx %u, speed %u",
dplane_op2str(dplane_ctx_get_op(ctx)), dplane_ctx_get_ifindex(ctx),
dplane_ctx_get_ifp_speed(ctx));
break;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Debug log shows stale speed=0 on the request path

kernel_dplane_log_detail is called from kernel_dplane_process_func before kernel_dplane_process_if_speed has been invoked, so dplane_ctx_get_ifp_speed(ctx) will always return 0 here. Logging the speed at this stage is misleading. The meaningful speed is available only after the kernel query completes. Consider logging the ifname and vrf instead, which are set at enqueue time:

Suggested change
case DPLANE_OP_INTF_SPEED_GET:
zlog_debug("Dplane intf %s, idx %u, speed %u",
dplane_op2str(dplane_ctx_get_op(ctx)), dplane_ctx_get_ifindex(ctx),
dplane_ctx_get_ifp_speed(ctx));
break;
case DPLANE_OP_INTF_SPEED_GET:
zlog_debug("Dplane intf %s, idx %u, ifname %s",
dplane_op2str(dplane_ctx_get_op(ctx)), dplane_ctx_get_ifindex(ctx),
dplane_ctx_get_ifname(ctx));
break;
Prompt To Fix With AI
This is a comment left during a code review.
Path: zebra/zebra_dplane.c
Line: 7104-7108

Comment:
**Debug log shows stale speed=0 on the request path**

`kernel_dplane_log_detail` is called from `kernel_dplane_process_func` before `kernel_dplane_process_if_speed` has been invoked, so `dplane_ctx_get_ifp_speed(ctx)` will always return 0 here. Logging the speed at this stage is misleading. The meaningful speed is available only after the kernel query completes. Consider logging the ifname and vrf instead, which are set at enqueue time:

```suggestion
	case DPLANE_OP_INTF_SPEED_GET:
		zlog_debug("Dplane intf %s, idx %u, ifname %s",
			   dplane_op2str(dplane_ctx_get_op(ctx)), dplane_ctx_get_ifindex(ctx),
			   dplane_ctx_get_ifname(ctx));
		break;
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread zebra/interface.c
Comment on lines +154 to +156
if (IS_ZEBRA_DEBUG_KERNEL)
zlog_debug("kernel_get_speed returned an unknown error %u", error);
break;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Format specifier mismatch in default error-case log

error is declared as int but formatted with %u (unsigned). While benign for small values, a negative error code would log as a large unsigned number, making debugging harder.

Suggested change
if (IS_ZEBRA_DEBUG_KERNEL)
zlog_debug("kernel_get_speed returned an unknown error %u", error);
break;
if (IS_ZEBRA_DEBUG_KERNEL)
zlog_debug("kernel_get_speed returned an unknown error %d", error);
Prompt To Fix With AI
This is a comment left during a code review.
Path: zebra/interface.c
Line: 154-156

Comment:
**Format specifier mismatch in default error-case log**

`error` is declared as `int` but formatted with `%u` (unsigned). While benign for small values, a negative `error` code would log as a large unsigned number, making debugging harder.

```suggestion
		if (IS_ZEBRA_DEBUG_KERNEL)
			zlog_debug("kernel_get_speed returned an unknown error %d", error);
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants