Add eBPF dataplane design & review guide (felix/bpf/DESIGN.md) #12511
tomastigera wants to merge 5 commits into projectcalico:master
Conversation
Code-grounded companion to the internal "eBPF Dataplane Networking" reference. Describes how the dataplane is organised (packet path, TC program layout, service NAT, Maglev, CTLB, bpfnat workaround, VXLAN flow mode, RPF, conntrack cleanup, IP fragmentation, *tables->BPF switch, 3rd-party DNAT, log filters, QoS) with pointers into the code, and a per-section "Review notes" block listing the invariants a future PR in that area should respect.
Release Note: None
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| The reference for the _principles_ behind the design is the
| "eBPF Dataplane Networking" internal document. This file is the
| in-tree, code-grounded counterpart to that reference.
Remove this, since this is a public-facing document in an open-source codebase.
| **In scope:**
|
| - Linux TC, XDP and cgroup BPF programs under `felix/bpf-gpl/` and
| `felix/bpf-apache/`.
bpf-apache is legacy code only used by iptables mode and is not part of the eBPF dataplane.
| treats a small number of special interfaces (the main route
| interface, tunnel devices, the `bpfnat` veth pair) as HEP-like even
| when no HostEndpoint CRD exists.
| - "Kernel ingress"/"kernel egress" refers to the direction the kernel
Not quite the kernel, but the host namespace. The dataplane looks at everything from the host namespace perspective.
| The practical consequence is that BPF can bypass the host network
| stack completely when it has enough information to forward directly.
| Any packet it forwards via `bpf_redirect` or `bpf_redirect_peer`
And bpf_redirect_neigh. Redirect peer skips the program on the host side of the veth.
| - **FIB lookup miss.** If `bpf_fib_lookup` fails, BPF lets the kernel
| route the packet so the kernel can populate its FIB and neigh caches.
| Subsequent packets of the same flow can then be forwarded directly.
True, but less frequent now since we have redirect neigh. We now support kernels 5.10+, so the helper is always available. Before that, the FIB lookup would fail with no neigh entry and we would need to defer to the kernel.
| node decapsulates and routes the packet to the client.
|
| The asymmetry is deliberate: the return path follows the forward path
| because the ingress node is the only place that can un-DNAT the source
Not true, DSR does it on the other host.
| DSR requires a supporting underlay: the client must be willing to
| accept a reply from a node that is not the one it originally talked
| to. The cluster admin opts in via `DSRSupport` configuration or, more
DSRSupport?
Note that it is not only the client that must be able to receive the response from a different node (it is probably far enough away not to notice); the local network must also allow packets from a different node.
| ### Conflicting nodeport connections
|
| When many clients hit a node port from the same source-IP (for
The key here is that the LB is connecting to different nodes, so it can reuse the source port.
| - A new encapsulation path (e.g. a new overlay protocol) must
| preserve the "ingress node owns the CT reverse entry" invariant
| for the non-DSR case, otherwise return traffic cannot be un-NATed.
I do not understand this; it seems useless. Remove.
| ### What it is for
|
| Ordinary services pick a backend per connection more-or-less at random
Make a note that Maglev mostly reuses the nodeport VXLAN machinery: it mostly follows the nodeport way of forwarding, except for the extra backend selection and policy.
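For readers unfamiliar with the "extra backend selection" mentioned in the comment above, the following is a minimal sketch of Maglev-style consistent-hash table population. It is not Calico's implementation: the FNV hash, the table size of 251, and the names `hashWithSeed`, `buildTable` and `pick` are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashWithSeed is a hypothetical helper: FNV-1a over a seed byte plus a name.
func hashWithSeed(name string, seed byte) uint64 {
	h := fnv.New64a()
	h.Write([]byte{seed})
	h.Write([]byte(name))
	return h.Sum64()
}

// buildTable fills a lookup table of size m using Maglev-style per-backend
// permutations. m must be prime (and at least 2) so that every permutation
// visits all slots. Each entry is an index into backends; -1 means empty,
// which only remains when there are no backends.
func buildTable(backends []string, m uint64) []int {
	n := len(backends)
	table := make([]int, m)
	for i := range table {
		table[i] = -1
	}
	if n == 0 {
		return table
	}
	offset := make([]uint64, n)
	skip := make([]uint64, n)
	next := make([]uint64, n)
	for i, b := range backends {
		offset[i] = hashWithSeed(b, 0) % m
		skip[i] = hashWithSeed(b, 1)%(m-1) + 1
	}
	filled := uint64(0)
	for {
		// Backends take turns claiming their next preferred free slot.
		for i := 0; i < n; i++ {
			c := (offset[i] + next[i]*skip[i]) % m
			for table[c] >= 0 {
				next[i]++
				c = (offset[i] + next[i]*skip[i]) % m
			}
			table[c] = i
			next[i]++
			filled++
			if filled == m {
				return table
			}
		}
	}
}

// pick maps a flow key to a backend through the table. It assumes the table
// was built from a non-empty backend list.
func pick(table []int, backends []string, flowKey string) string {
	return backends[table[hashWithSeed(flowKey, 2)%uint64(len(table))]]
}

func main() {
	backends := []string{"10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80"}
	table := buildTable(backends, 251)
	counts := make([]int, len(backends))
	for _, idx := range table {
		counts[idx]++
	}
	fmt.Println("slots per backend:", counts)
	fmt.Println("flow-1 ->", pick(table, backends, "flow-1"))
}
```

Because every node that builds the table from the same backend list gets the same table, any ingress node maps a given flow key to the same backend, which is the property that distinguishes this from the more-or-less random per-connection pick.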
Addresses the 18 review comments on projectcalico#12511 plus two factual corrections spotted while verifying them, and adds a new "Fast-path performance discipline" section promoted from the Cross-cutting notes.

Review comments:
- Drop the reference to the internal design document from a public open-source doc (L41).
- Move felix/bpf-apache/ out of scope; it is legacy *tables-mode code, not part of the eBPF dataplane (L48).
- Reframe "kernel ingress/egress" as "host-ingress/host-egress" throughout; the dataplane reasons from the host namespace (L80).
- List bpf_redirect_neigh alongside bpf_redirect/bpf_redirect_peer in §1 and note that redirect_peer skips the peer's host-side program (L145).
- Update the FIB-miss bullet: redirect_neigh (kernel 5.10+, always available) makes the host-stack detour rare now (L156).
- Scope bpfnat deferral to host→service traffic; host-local endpoints cannot skip the host stack (L168).
- Fix the jump-map sizing rationale: policy is shared across fast/debug paths, only generic sub-programs have separate variants (L270).
- Reframe skb->cb[0,1]: makes policy callable from generic programs and lets the debug path reuse the same policy (L306).
- Document loglevel=debug-with-no-user-filter: a match-all filter is installed so the preamble path is uniform (L326, §2 + §14).
- Add CTLB callout at the top of §3 "common case": with CTLB on, the TC pod→service path is never taken (L380).
- Rewrite "Same host" to describe the bpf_redirect_peer shortcut as a deliberate optimisation over *tables (L410).
- Emphasise that with CTLB, pod-service-self traffic does not leave the pod at all, a substantial deviation from *tables (L446).
- Flag the NodePort-forwarding-VXLAN vs pod-to-pod-overlay-VXLAN distinction; both use the same device (L478).
- Fix the "ingress node is the only place to un-DNAT" claim: DSR un-DNATs on the backend node (L498).
- Replace non-existent DSRSupport with BPFExternalServiceMode, BPFDSROptoutCIDRs; expand the asymmetric-path requirement to cover every hop, not just the client (L512).
- Rewrite the nodeport port-collision explanation: the LB reuses source ports legitimately; they collapse onto the ingress-node IP at the backend (L531).
- Remove the useless encapsulation review note (L548).
- Frame Maglev as layered on top of the NodePort VXLAN forwarding; the new bits are consistent-hash backend selection and mid-flow policy re-run (L561).

Fast-path performance discipline (new §16):
- Explicit rule: per-packet fast-path work needs justification.
- Cost tiers (cheap / borderline / expensive).
- Fast-path vs flow-creation vs slow-path guidance.
- Patterns to prefer (CT flags, skb marks, compile-time gates, dedicated sub-programs for slow work).
- Concrete review-time checks.

Cross-cutting notes (§17):
- Map-versioning rule loosened to "bump only when new programs are incompatible with the old map"; repurposing padding is explicitly OK.

Factual corrections found while verifying review comments:
- BPFHostNetworkedNAT → BPFHostNetworkedNATWithoutCTLB.
- BPFConnectTimeLB → BPFConnectTimeLoadBalancing.

Release Note: None
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| arrive over the VXLAN tunnel, and after decap they look like they
| come from the same ingress-node tunnel IP. Source IPs that were
After DNAT they come from the same client ip:port to the same destination. That was not true when the client made the connection.
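The point made in the comment above can be illustrated with a small sketch. The `FiveTuple` type, the `dnat` helper and all addresses are hypothetical and exist only for this example; they are not Calico's conntrack key or NAT code.

```go
package main

import "fmt"

// FiveTuple is a hypothetical conntrack key used only for this illustration.
type FiveTuple struct {
	Proto   string
	SrcIP   string
	SrcPort uint16
	DstIP   string
	DstPort uint16
}

// dnat rewrites the destination of a node-port connection to the chosen backend.
func dnat(t FiveTuple, backendIP string, backendPort uint16) FiveTuple {
	t.DstIP, t.DstPort = backendIP, backendPort
	return t
}

func main() {
	// The load balancer may legitimately reuse source port 40000 because the
	// two connections go to different node IPs.
	viaNodeA := FiveTuple{Proto: "tcp", SrcIP: "192.0.2.10", SrcPort: 40000, DstIP: "10.0.0.1", DstPort: 30080}
	viaNodeB := FiveTuple{Proto: "tcp", SrcIP: "192.0.2.10", SrcPort: 40000, DstIP: "10.0.0.2", DstPort: 30080}
	fmt.Println("distinct before DNAT:", viaNodeA != viaNodeB)

	// Both node ports resolve to the same backend pod, so the destination
	// that kept the tuples apart is rewritten to the same value.
	a := dnat(viaNodeA, "10.244.1.5", 8080)
	b := dnat(viaNodeB, "10.244.1.5", 8080)
	fmt.Println("identical after DNAT:", a == b)
}
```

The two connections are distinguishable only by their destination while they are addressed to the nodes; once both are DNATed to the same backend, nothing in the 5-tuple separates them.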
Folds in the L636 review comment and adds five new top-level sections that the gap review identified as missing.

L636 fix (§5 Conflicting nodeport connections): the collision mechanism was described wrong. Sources don't collapse to the ingress-node tunnel IP after decap (the inner packet still carries the original client src). The collision happens after DNAT: multiple formerly-distinct (node-IP, node-port) destinations at different nodes can DNAT to the same backend pod, and the LB's legitimate source-port reuse across different node destinations collapses to an identical 5-tuple at the backend's conntrack.

New §3 XDP programs and the XDP→TC handoff (folds in topic 10 "untracked/PreDNAT"): XDP's role as early-drop for untracked policy, the xdp2tc metadata handoff via CALI_META_ACCEPTED_BY_XDP and the CALI_SKB_MARK_BYPASS_XDP mark on the TC side, and the BPFForceTrackPacketsFromIfaces per-interface opt-out for 3rd-party DNAT interoperability.

New §7 Service session affinity: the cali_v4_nat_aff / cali_v6_nat_aff map, how sessionAffinity=ClientIP is resolved before the normal backend pick, interactions with Maglev and CTLB, applicability to both TC and CTLB paths.

New §8 Service syncing & BPF kube-proxy replacement: the role of felix/bpf/proxy/, the maps it owns, the Kubernetes semantics it enforces (externalTrafficPolicy, topology awareness, health checks), and where to look when a feature is at the syncer level rather than the BPF-program level.

New §15 BPF-synthesised ICMP errors: why BPF has to synthesise errors (BPF-forwarded paths bypass the kernel stack that would normally emit them), the sub-program calico_tc_skb_send_icmp_replies and the icmp_v4_reply/icmp_v6_reply builders, the TTL-exceeded and too-big cases, and how this sits on the slow path rather than the fast path.

New §19 Flow logs & event ring buffer: distinct from the debug log filters (§18); flow logs are per-flow events emitted into a BPF ring buffer and consumed by felix/bpf/events/ / felix/bpf/ringbuf/. Gated by FLOWLOGS_ENABLED, emitted on flow creation / closure, shipped to Calico's observability pipeline.

Sections 3 through 22 are renumbered; all §N cross-references shifted accordingly (sed-driven renumber, ordered highest first to avoid cascading).

Release Note: None
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New .github/instructions/ebpf-dataplane.instructions.md with applyTo globs for felix/bpf-gpl/, felix/bpf/, and the BPF-side Go code under felix/dataplane/linux/. GitHub Copilot's code-review feature picks these up automatically for matching files. Content: condensed must-check principles drawn from DESIGN.md's Review notes sections, plus the "update DESIGN.md on dataplane-behaviour changes" rule with a concrete exemption list (bug-fix-that-restores-described-behaviour, mechanical refactor, comments/log edits, dependency bump).
- .github/copilot-instructions.md: new "eBPF Dataplane Review" subsection pointing at DESIGN.md and the path-specific instructions file, restating the update rule in the same words.
- felix/CLAUDE.md: new "eBPF Dataplane Design & Review Guide" subsection inside the BPF Dataplane block. /review and the general-purpose review skill read CLAUDE.md files, so adding it here makes the rules available for Claude-driven reviews.
- felix/bpf/DESIGN.md §22 Cross-cutting review notes: new "Keep this document in sync with the code" bullet that self-declares the update rule and lists the same exemptions.

Wording is identical across all four places so a reviewer (human or AI) sees one rule, not four slightly-different ones. No CI enforcement: review-time judgement is the right level.

Release Note: None
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewers (human and AI) consistently miss the performance impact of changes that suppress an existing fast-path shortcut for a class of flows: no new code is added on the hot path, but more flows now pay the cost the shortcut was designed to skip. The §21 framing was built around adding per-packet work; the inverse case wasn't named.

- DESIGN.md §21 Review notes: add the inverse-case bullet: narrowing an existing fast-path shortcut needs the same justification as a new lookup (benchmark, scoping mechanism, or quantified flow-class argument).
- Copilot must-check: replace "new per-packet work" framing with a per-packet cost question that covers both directions.
- felix/CLAUDE.md review guidance: require the question to be answered in plain prose before signoff, with the suppress-bypass case named explicitly.
Summary
Adds felix/bpf/DESIGN.md: a code-grounded design & review guide for the eBPF dataplane. Intended as the in-tree companion to the internal "eBPF Dataplane Networking" reference document.
Each of the 16 topic sections has:
Sections:
*tables to eBPF
Notes
*tables as shorthand for "iptables or nftables" throughout (defined in Conventions).
bpfin.cali/bpfout.cali) rather than the reference-doc names (bpfnatin/bpfnatout), with a parenthetical for discoverability.
Test plan
Release Note
None (internal developer documentation, no runtime behaviour change).