Add eBPF dataplane design & review guide (felix/bpf/DESIGN.md) by tomastigera · Pull Request #12511 · projectcalico/calico

tomastigera · 2026-04-16T19:53:39Z

Summary

Adds felix/bpf/DESIGN.md: a code-grounded design & review guide for the eBPF dataplane. Intended as the in-tree companion to the internal "eBPF Dataplane Networking" reference document.

Each of the 16 topic sections has:

- An architecture description with pointers into the current code (files, function/struct/map/constant names).
- A short Review notes block listing the invariants a future PR in that area should respect.

Sections:

1. Packet path overview
2. TC program layout
3. Intra-cluster traffic & service NAT
4. External traffic (NodePort, DSR)
5. Maglev load balancer
6. Connect-Time Load Balancer (CTLB)
7. Host-networked workaround (bpfnat veth)
8. VXLAN in eBPF mode
9. Reverse-path filter (RPF)
10. Conntrack & cleanup
11. IP fragmentation
12. Switching from *tables to eBPF
13. 3rd-party DNAT on host traffic
14. Debug log filters
15. QoS
16. Cross-cutting review notes

Notes

- Opened as draft for review of structure, tone and content. Treat comments on specific sections as the primary review mechanism.
- Uses *tables as shorthand for "iptables or nftables" throughout (defined in Conventions).
- Uses the actual device names (bpfin.cali / bpfout.cali) rather than the reference-doc names (bpfnatin / bpfnatout), with a parenthetical for discoverability.

Test plan

- All file paths referenced in the doc exist in the repo.
- No TODOs remain in the file.
- Human review of each section against current code.

Release Note

None (internal developer documentation, no runtime behaviour change).

Code-grounded companion to the internal "eBPF Dataplane Networking" reference. Describes how the dataplane is organised (packet path, TC program layout, service NAT, Maglev, CTLB, bpfnat workaround, VXLAN flow mode, RPF, conntrack cleanup, IP fragmentation, *tables->BPF switch, 3rd-party DNAT, log filters, QoS) with pointers into the code, and a per-section "Review notes" block listing the invariants a future PR in that area should respect.

Release Note: None

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tomastigera · 2026-04-16T19:56:38Z

+The reference for the _principles_ behind the design is the
+"eBPF Dataplane Networking" internal document. This file is the
+in-tree, code-grounded counterpart to that reference.


remove this, since this is a public-facing document in an open-source codebase

tomastigera · 2026-04-16T19:57:37Z

+**In scope:**
+
+- Linux TC, XDP and cgroup BPF programs under `felix/bpf-gpl/` and
+  `felix/bpf-apache/`.


bpf-apache is legacy code only used by iptables mode and is not part of the eBPF dataplane

tomastigera · 2026-04-16T20:00:36Z

+  treats a small number of special interfaces (the main route
+  interface, tunnel devices, the `bpfnat` veth pair) as HEP-like even
+  when no HostEndpoint CRD exists.
+- "Kernel ingress"/"kernel egress" refers to the direction the kernel


not quite the kernel, but the host namespace. The dataplane looks at everything from the host namespace perspective

tomastigera · 2026-04-16T20:04:37Z

+
+The practical consequence is that BPF can bypass the host network
+stack completely when it has enough information to forward directly.
+Any packet it forwards via `bpf_redirect` or `bpf_redirect_peer`


and `bpf_redirect_neigh`. `bpf_redirect_peer` skips the program on the host side of the veth.

tomastigera · 2026-04-16T20:06:37Z

+- **FIB lookup miss.** If `bpf_fib_lookup` fails, BPF lets the kernel
+  route the packet so the kernel can populate its FIB and neigh caches.
+  Subsequent packets of the same flow can then be forwarded directly.


true, but less frequent now since we have `bpf_redirect_neigh`. We now support kernels 5.10+, so the helper is always available. Before that, the FIB lookup would fail with no neigh entry and we would need to defer to the kernel.
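
To make the comment above concrete, here is a small Python model of the forwarding decision it describes. It is only a sketch of the reviewer's description, not Felix's code: the function name `forward_decision`, the `have_redirect_neigh` flag and the returned strings are invented for illustration, and the two result codes are meant to mirror the kernel's `BPF_FIB_LKUP_RET_SUCCESS` and `BPF_FIB_LKUP_RET_NO_NEIGH` values.

```python
from enum import IntEnum


class FibResult(IntEnum):
    # Assumed to mirror BPF_FIB_LKUP_RET_* in the kernel UAPI header.
    SUCCESS = 0
    NO_NEIGH = 7


def forward_decision(fib_result, have_redirect_neigh=True):
    """Hypothetical model: choose how a packet leaves the BPF program."""
    if fib_result == FibResult.SUCCESS:
        # Route and neighbour are both known: forward directly.
        return "redirect"
    if fib_result == FibResult.NO_NEIGH and have_redirect_neigh:
        # Route known, neighbour missing: let the helper resolve it.
        return "redirect_neigh"
    # Anything else (or an old kernel without the helper): defer to the host stack.
    return "pass_to_host_stack"


print(forward_decision(FibResult.NO_NEIGH))         # helper available (5.10+)
print(forward_decision(FibResult.NO_NEIGH, False))  # pre-5.10 behaviour
```

With the helper always available, the host-stack detour is left for lookup results other than a missing neighbour, which is why the reviewer calls it less frequent.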

tomastigera · 2026-04-16T20:37:37Z

+node decapsulates and routes the packet to the client.
+
+The asymmetry is deliberate: the return path follows the forward path
+because the ingress node is the only place that can un-DNAT the source


not true, DSR does it on the other host.

tomastigera · 2026-04-16T20:39:12Z

+
+DSR requires a supporting underlay: the client must be willing to
+accept a reply from a node that is not the one it originally talked
+to. The cluster admin opts in via `DSRSupport` configuration or, more


DSRSupport?

note that it is not only the client that needs to be able to receive the response from a different node (it is probably far enough away not to notice); the local network must also allow packets from a different node

tomastigera · 2026-04-16T20:40:38Z

+
+### Conflicting nodeport connections
+
+When many clients hit a node port from the same source-IP (for


the key here is that the LB is connecting to different nodes so it can reuse the source port

tomastigera · 2026-04-16T20:42:22Z

+- A new encapsulation path (e.g. a new overlay protocol) must
+  preserve the "ingress node owns the CT reverse entry" invariant
+  for the non-DSR case, otherwise return traffic cannot be un-NATed.


i do not understand, seems useless, remove

tomastigera · 2026-04-16T20:44:39Z

+
+### What it is for
+
+Ordinary services pick a backend per connection more-or-less at random


make a note that maglev mostly reuses the nodeport VXLAN machinery: it mostly follows the nodeport way of forwarding, except for the extra backend selection and policy
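
For readers unfamiliar with the "extra backend selection" step, the sketch below shows the lookup-table population scheme from the published Maglev design, written in Python. It is not taken from Calico: the hash function, the table size of 31 and the names `build_maglev_table` and `pick_backend` are assumptions made for illustration. The property it demonstrates is that any node building the table from the same backend list maps a given flow to the same backend.

```python
import hashlib


def _h(value: str, salt: str) -> int:
    """Deterministic 64-bit hash (illustrative choice, not Calico's)."""
    digest = hashlib.sha256((salt + value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_maglev_table(backends, size=31):
    """Populate a lookup table; `size` must be prime and exceed len(backends)."""
    if not backends:
        raise ValueError("need at least one backend")
    # Each backend gets its own permutation of slots: offset + j * skip (mod size).
    offsets = [_h(b, "offset") % size for b in backends]
    skips = [_h(b, "skip") % (size - 1) + 1 for b in backends]
    next_j = [0] * len(backends)
    table = [None] * size
    filled = 0
    while True:
        # Backends take turns claiming their next preferred free slot.
        for i, backend in enumerate(backends):
            while True:
                slot = (offsets[i] + next_j[i] * skips[i]) % size
                next_j[i] += 1
                if table[slot] is None:
                    break
            table[slot] = backend
            filled += 1
            if filled == size:
                return table


def pick_backend(table, flow):
    """Hash the flow tuple into the table; same flow, same backend."""
    return table[_h(repr(flow), "flow") % len(table)]


backends = ["10.65.0.1:8080", "10.65.0.2:8080", "10.65.0.3:8080"]
table = build_maglev_table(backends)
flow = ("tcp", "192.0.2.10", 40000, "10.0.0.1", 30080)
print(pick_backend(table, flow))
```

Because the backends claim slots in turn, their shares of the table differ by at most one slot.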

Addresses the 18 review comments on projectcalico#12511 plus two factual corrections spotted while verifying them, and adds a new "Fast-path performance discipline" section promoted from the Cross-cutting notes.

Review comments:

- Drop the reference to the internal design document from a public open-source doc (L41).
- Move felix/bpf-apache/ out of scope; it is legacy *tables-mode code, not part of the eBPF dataplane (L48).
- Reframe "kernel ingress/egress" as "host-ingress/host-egress" throughout: the dataplane reasons from the host namespace (L80).
- List bpf_redirect_neigh alongside bpf_redirect/bpf_redirect_peer in §1 and note that redirect_peer skips the peer's host-side program (L145).
- Update the FIB-miss bullet: redirect_neigh (kernel 5.10+, always available) makes the host-stack detour rare now (L156).
- Scope bpfnat deferral to host→service traffic; host-local endpoints cannot skip the host stack (L168).
- Fix the jump-map sizing rationale: policy is shared across fast/debug paths, only generic sub-programs have separate variants (L270).
- Reframe skb->cb[0,1]: it makes policy callable from generic programs and lets the debug path reuse the same policy (L306).
- Document loglevel=debug-with-no-user-filter: a match-all filter is installed so the preamble path is uniform (L326, §2 + §14).
- Add CTLB callout at the top of §3 "common case": with CTLB on, the TC pod→service path is never taken (L380).
- Rewrite "Same host" to describe the bpf_redirect_peer shortcut as a deliberate optimisation over *tables (L410).
- Emphasise that with CTLB, pod-service-self traffic does not leave the pod at all, a substantial deviation from *tables (L446).
- Flag the NodePort-forwarding-VXLAN vs pod-to-pod-overlay-VXLAN distinction; both use the same device (L478).
- Fix the "ingress node is the only place to un-DNAT" claim: DSR un-DNATs on the backend node (L498).
- Replace non-existent DSRSupport with BPFExternalServiceMode, BPFDSROptoutCIDRs; expand the asymmetric-path requirement to cover every hop, not just the client (L512).
- Rewrite the nodeport port-collision explanation: the LB reuses source ports legitimately; they collapse onto the ingress-node IP at the backend (L531).
- Remove the useless encapsulation review note (L548).
- Frame Maglev as layered on top of the NodePort VXLAN forwarding; the new bits are consistent-hash backend selection and mid-flow policy re-run (L561).

Fast-path performance discipline (new §16):

- Explicit rule: per-packet fast-path work needs justification.
- Cost tiers (cheap / borderline / expensive).
- Fast-path vs flow-creation vs slow-path guidance.
- Patterns to prefer (CT flags, skb marks, compile-time gates, dedicated sub-programs for slow work).
- Concrete review-time checks.

Cross-cutting notes (§17):

- Map-versioning rule loosened to "bump only when new programs are incompatible with the old map"; repurposing padding is explicitly OK.

Factual corrections found while verifying review comments:

- BPFHostNetworkedNAT → BPFHostNetworkedNATWithoutCTLB.
- BPFConnectTimeLB → BPFConnectTimeLoadBalancing.

Release Note: None

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tomastigera · 2026-04-16T21:34:13Z

+arrive over the VXLAN tunnel, and after decap they look like they
+come from the same ingress-node tunnel IP. Source IPs that were


after DNAT they come from the same client ip:port to the same destination. That was not true when the client made the connection
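
The collision described in this comment can be shown with a toy Python model. All addresses and ports are hypothetical, and a single `nat_table` dictionary stands in for the DNAT each ingress node performs on its own; the `FiveTuple` and `dnat` names are illustrative and do not come from the Calico code.

```python
from collections import namedtuple

FiveTuple = namedtuple("FiveTuple", "proto src_ip src_port dst_ip dst_port")


def dnat(pkt, nat_table):
    """Rewrite the destination from a (node IP, node port) frontend to its backend."""
    backend_ip, backend_port = nat_table[(pkt.dst_ip, pkt.dst_port)]
    return pkt._replace(dst_ip=backend_ip, dst_port=backend_port)


# Hypothetical addresses, for illustration only.
LB_IP = "192.0.2.10"
NODE_A, NODE_B = "10.0.0.1", "10.0.0.2"
NODE_PORT = 30080
BACKEND = ("10.65.0.5", 8080)

# Both node ports front the same backend pod.
nat_table = {(NODE_A, NODE_PORT): BACKEND, (NODE_B, NODE_PORT): BACKEND}

# The LB legitimately reuses source port 40000 because the destinations differ.
conn_a = FiveTuple("tcp", LB_IP, 40000, NODE_A, NODE_PORT)
conn_b = FiveTuple("tcp", LB_IP, 40000, NODE_B, NODE_PORT)

after_a = dnat(conn_a, nat_table)
after_b = dnat(conn_b, nat_table)

print(conn_a != conn_b)    # distinct 5-tuples when the client connected
print(after_a == after_b)  # identical 5-tuple at the backend after DNAT
```

The two connections only become indistinguishable once the destination is rewritten, which is the point the comment makes about the client side.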

Folds in the L636 review comment and adds five new top-level sections that the gap review identified as missing.

L636 fix (§5 Conflicting nodeport connections): the collision mechanism was described wrong. Sources don't collapse to the ingress-node tunnel IP after decap (the inner packet still carries the original client src). The collision happens after DNAT: multiple formerly-distinct (node-IP, node-port) destinations at different nodes can DNAT to the same backend pod, and the LB's legitimate source-port reuse across different node destinations collapses to an identical 5-tuple at the backend's conntrack.

New §3 XDP programs and the XDP→TC handoff (folds in topic 10 "untracked/PreDNAT"): XDP's role as early-drop for untracked policy, the xdp2tc metadata handoff via CALI_META_ACCEPTED_BY_XDP and the CALI_SKB_MARK_BYPASS_XDP mark on the TC side, and the BPFForceTrackPacketsFromIfaces per-interface opt-out for 3rd-party DNAT interoperability.

New §7 Service session affinity: the cali_v4_nat_aff / cali_v6_nat_aff map, how sessionAffinity=ClientIP is resolved before the normal backend pick, interactions with Maglev and CTLB, applicability to both TC and CTLB paths.

New §8 Service syncing & BPF kube-proxy replacement: the role of felix/bpf/proxy/, the maps it owns, the Kubernetes semantics it enforces (externalTrafficPolicy, topology awareness, health checks), and where to look when a feature is at the syncer level rather than the BPF-program level.

New §15 BPF-synthesised ICMP errors: why BPF has to synthesise errors (BPF-forwarded paths bypass the kernel stack that would normally emit them), the sub-program calico_tc_skb_send_icmp_replies and the icmp_v4_reply/icmp_v6_reply builders, the TTL-exceeded and too-big cases, and how this sits on the slow path rather than the fast path.

New §19 Flow logs & event ring buffer: distinct from the debug log filters (§18). Flow logs are per-flow events emitted into a BPF ring buffer and consumed by felix/bpf/events/ / felix/bpf/ringbuf/. Gated by FLOWLOGS_ENABLED, emitted on flow creation / closure, shipped to Calico's observability pipeline.

Sections 3 through 22 are renumbered; all §N cross-references shifted accordingly (sed-driven renumber, ordered highest first to avoid cascading).

Release Note: None

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
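
As an illustration of the "resolved before the normal backend pick" ordering mentioned for session affinity, the Python sketch below models an affinity table consulted ahead of the regular selection. It is a rough model under stated assumptions, not the layout of cali_v4_nat_aff: the class `AffinityTable`, its key shape, the timeout handling and the choice to refresh the timestamp on a hit are all hypothetical.

```python
import time


class AffinityTable:
    """Hypothetical model of a ClientIP affinity map consulted before the normal pick."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._entries = {}  # (client_ip, frontend) -> (backend, last_used)

    def select(self, client_ip, frontend, backends, pick, now=None):
        now = time.monotonic() if now is None else now
        key = (client_ip, frontend)
        hit = self._entries.get(key)
        if hit is not None:
            backend, stamp = hit
            # Reuse the remembered backend only if it is fresh and still exists.
            if now - stamp <= self.timeout_s and backend in backends:
                self._entries[key] = (backend, now)  # assumption: a hit refreshes the entry
                return backend
        # Miss, expired, or stale backend: fall back to the normal selection.
        backend = pick(backends)
        self._entries[key] = (backend, now)
        return backend


aff = AffinityTable(timeout_s=10)
svc = ("10.96.0.1", 80)
first = aff.select("192.0.2.10", svc, ["a", "b"], lambda b: b[0], now=0)
again = aff.select("192.0.2.10", svc, ["a", "b"], lambda b: b[1], now=5)
print(first, again)
```

The `pick` callback stands in for whichever regular selection applies, so the same ordering could sit in front of either the TC or the CTLB path.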

- New .github/instructions/ebpf-dataplane.instructions.md with applyTo globs for felix/bpf-gpl/, felix/bpf/, and the BPF-side Go code under felix/dataplane/linux/. GitHub Copilot's code-review feature picks these up automatically for matching files. Content: condensed must-check principles drawn from DESIGN.md's Review notes sections, plus the "update DESIGN.md on dataplane-behaviour changes" rule with a concrete exemption list (bug-fix-that-restores-described-behaviour, mechanical refactor, comments/log edits, dependency bump).
- .github/copilot-instructions.md: new "eBPF Dataplane Review" subsection pointing at DESIGN.md and the path-specific instructions file, restating the update rule in the same words.
- felix/CLAUDE.md: new "eBPF Dataplane Design & Review Guide" subsection inside the BPF Dataplane block. /review and the general-purpose review skill read CLAUDE.md files, so adding it here makes the rules available for Claude-driven reviews.
- felix/bpf/DESIGN.md §22 Cross-cutting review notes: new "Keep this document in sync with the code" bullet that self-declares the update rule and lists the same exemptions.

Wording is identical across all four places so a reviewer (human or AI) sees one rule, not four slightly-different ones. No CI enforcement: review-time judgement is the right level.

Release Note: None

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewers (human and AI) consistently miss the performance impact of changes that suppress an existing fast-path shortcut for a class of flows: no new code is added on the hot path, but more flows now pay the cost the shortcut was designed to skip. The §21 framing was built around adding per-packet work; the inverse case wasn't named.

- DESIGN.md §21 Review notes: add the inverse-case bullet: narrowing an existing fast-path shortcut needs the same justification as a new lookup (benchmark, scoping mechanism, or quantified flow-class argument).
- Copilot must-check: replace "new per-packet work" framing with a per-packet cost question that covers both directions.
- felix/CLAUDE.md review guidance: require the question to be answered in plain prose before signoff, with the suppress-bypass case named explicitly.

marvin-tigera added this to the Calico v3.33.0 milestone Apr 16, 2026

marvin-tigera added the release-note-required and docs-pr-required labels Apr 16, 2026


tomastigera and others added 2 commits April 16, 2026 14:48

tomastigera added the docs-not-required and release-note-not-required labels and removed the release-note-required and docs-pr-required labels Apr 16, 2026

tomastigera wants to merge 5 commits into projectcalico:master from tomastigera:ebpf-dataplane-design-doc
