
Add eBPF dataplane design & review guide (felix/bpf/DESIGN.md) #12511

Draft
tomastigera wants to merge 5 commits into projectcalico:master from tomastigera:ebpf-dataplane-design-doc

Conversation

@tomastigera
Contributor

Summary

Adds felix/bpf/DESIGN.md: a code-grounded design & review guide for the eBPF dataplane. Intended as the in-tree companion to the internal "eBPF Dataplane Networking" reference document.

Each of the 16 topic sections has:

  • An architecture description with pointers into the current code (files, function/struct/map/constant names).
  • A short Review notes block listing the invariants a future PR in that area should respect.

Sections:

  1. Packet path overview
  2. TC program layout
  3. Intra-cluster traffic & service NAT
  4. External traffic (NodePort, DSR)
  5. Maglev load balancer
  6. Connect-Time Load Balancer (CTLB)
  7. Host-networked workaround (bpfnat veth)
  8. VXLAN in eBPF mode
  9. Reverse-path filter (RPF)
  10. Conntrack & cleanup
  11. IP fragmentation
  12. Switching from *tables to eBPF
  13. 3rd-party DNAT on host traffic
  14. Debug log filters
  15. QoS
  16. Cross-cutting review notes

Notes

  • Opened as draft for review of structure, tone and content. Treat comments on specific sections as the primary review mechanism.
  • Uses *tables as shorthand for "iptables or nftables" throughout (defined in Conventions).
  • Uses the actual device names (bpfin.cali / bpfout.cali) rather than the reference-doc names (bpfnatin / bpfnatout), with a parenthetical for discoverability.

Test plan

  • All file paths referenced in the doc exist in the repo.
  • No TODOs remain in the file.
  • Human review of each section against current code.

Release Note

None (internal developer documentation, no runtime behaviour change).

Code-grounded companion to the internal "eBPF Dataplane Networking"
reference. Describes how the dataplane is organised (packet path, TC
program layout, service NAT, Maglev, CTLB, bpfnat workaround, VXLAN
flow mode, RPF, conntrack cleanup, IP fragmentation, *tables->BPF
switch, 3rd-party DNAT, log filters, QoS) with pointers into the
code, and a per-section "Review notes" block listing the invariants
a future PR in that area should respect.

Release Note: None

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@marvin-tigera added this to the Calico v3.33.0 milestone Apr 16, 2026
@marvin-tigera added the release-note-required ("Change has user-facing impact (no matter how small)") and docs-pr-required ("Change is not yet documented") labels Apr 16, 2026
Comment thread felix/bpf/DESIGN.md Outdated
Comment on lines +39 to +41
The reference for the _principles_ behind the design is the
"eBPF Dataplane Networking" internal document. This file is the
in-tree, code-grounded counterpart to that reference.
Contributor Author

Remove this, since this is a public-facing document in an open-source codebase.

Comment thread felix/bpf/DESIGN.md Outdated
**In scope:**

- Linux TC, XDP and cgroup BPF programs under `felix/bpf-gpl/` and
`felix/bpf-apache/`.
Contributor Author

bpf-apache is legacy code used only by iptables mode and is not part of the eBPF dataplane.

Comment thread felix/bpf/DESIGN.md Outdated
treats a small number of special interfaces (the main route
interface, tunnel devices, the `bpfnat` veth pair) as HEP-like even
when no HostEndpoint CRD exists.
- "Kernel ingress"/"kernel egress" refers to the direction the kernel
Contributor Author

Not quite the kernel, but the host namespace. The dataplane looks at everything from the host namespace's perspective.

Comment thread felix/bpf/DESIGN.md Outdated

The practical consequence is that BPF can bypass the host network
stack completely when it has enough information to forward directly.
Any packet it forwards via `bpf_redirect` or `bpf_redirect_peer`
Contributor Author

...and `bpf_redirect_neigh`. `bpf_redirect_peer` skips the program on the host side of the veth.

Comment thread felix/bpf/DESIGN.md Outdated
Comment on lines +154 to +156
- **FIB lookup miss.** If `bpf_fib_lookup` fails, BPF lets the kernel
route the packet so the kernel can populate its FIB and neigh caches.
Subsequent packets of the same flow can then be forwarded directly.
Contributor Author

True, but less frequent now that we have redirect neigh. We now support kernels 5.10+, so the helper is always available. Before that, the FIB lookup would fail when there was no neigh entry, and we would need to defer to the kernel.

Comment thread felix/bpf/DESIGN.md Outdated
node decapsulates and routes the packet to the client.

The asymmetry is deliberate: the return path follows the forward path
because the ingress node is the only place that can un-DNAT the source
Contributor Author

not true, DSR does it on the other host.

Comment thread felix/bpf/DESIGN.md Outdated

DSR requires a supporting underlay: the client must be willing to
accept a reply from a node that is not the one it originally talked
to. The cluster admin opts in via `DSRSupport` configuration or, more
Contributor Author

DSRSupport?

Note that it is not only the client that must be able to receive the response from a different node (it is probably far enough away not to notice); the local network must also allow packets from a different node.

Comment thread felix/bpf/DESIGN.md Outdated

### Conflicting nodeport connections

When many clients hit a node port from the same source-IP (for
Contributor Author

The key here is that the LB is connecting to different nodes, so it can reuse the source port.

Comment thread felix/bpf/DESIGN.md Outdated
Comment on lines +546 to +548
- A new encapsulation path (e.g. a new overlay protocol) must
preserve the "ingress node owns the CT reverse entry" invariant
for the non-DSR case, otherwise return traffic cannot be un-NATed.
Contributor Author

I do not understand this; it seems useless. Remove.

Comment thread felix/bpf/DESIGN.md

### What it is for

Ordinary services pick a backend per connection more-or-less at random
Contributor Author

Make a note that Maglev mostly reuses the NodePort VXLAN machinery: it mostly follows the NodePort way of forwarding, except for the extra backend selection and policy.

Addresses the 18 review comments on projectcalico#12511 plus two factual
corrections spotted while verifying them, and adds a new
"Fast-path performance discipline" section promoted from the
Cross-cutting notes.

Review comments:
- Drop the reference to the internal design document from a public
  open-source doc (L41).
- Move felix/bpf-apache/ out of scope; it is legacy *tables-mode
  code, not part of the eBPF dataplane (L48).
- Reframe "kernel ingress/egress" as "host-ingress/host-egress"
  throughout — the dataplane reasons from the host namespace (L80).
- List bpf_redirect_neigh alongside bpf_redirect/bpf_redirect_peer
  in §1 and note that redirect_peer skips the peer's host-side
  program (L145).
- Update the FIB-miss bullet: redirect_neigh (kernel 5.10+, always
  available) makes the host-stack detour rare now (L156).
- Scope bpfnat deferral to host→service traffic; host-local
  endpoints cannot skip the host stack (L168).
- Fix the jump-map sizing rationale: policy is shared across
  fast/debug paths, only generic sub-programs have separate
  variants (L270).
- Reframe skb->cb[0,1] — makes policy callable from generic
  programs and lets the debug path reuse the same policy (L306).
- Document loglevel=debug-with-no-user-filter: a match-all filter
  is installed so the preamble path is uniform (L326, §2 + §14).
- Add CTLB callout at the top of §3 "common case" — with CTLB
  on, the TC pod→service path is never taken (L380).
- Rewrite "Same host" to describe the bpf_redirect_peer shortcut
  as a deliberate optimisation over *tables (L410).
- Emphasise that with CTLB, pod-service-self traffic does not
  leave the pod at all — substantial deviation from *tables (L446).
- Flag the NodePort-forwarding-VXLAN vs pod-to-pod-overlay-VXLAN
  distinction; both use the same device (L478).
- Fix the "ingress node is the only place to un-DNAT" claim —
  DSR un-DNATs on the backend node (L498).
- Replace non-existent DSRSupport with BPFExternalServiceMode,
  BPFDSROptoutCIDRs; expand the asymmetric-path requirement
  to cover every hop, not just the client (L512).
- Rewrite the nodeport port-collision explanation: the LB
  reuses source ports legitimately; they collapse onto the
  ingress-node IP at the backend (L531).
- Remove the useless encapsulation review note (L548).
- Frame Maglev as layered on top of the NodePort VXLAN
  forwarding; the new bits are consistent-hash backend selection
  and mid-flow policy re-run (L561).
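
For readers unfamiliar with the "consistent-hash backend selection" mentioned in the last bullet, the following is a minimal sketch of generic Maglev-style table population. It is not taken from the Calico code: the names `build_table` and `lookup`, the table size of 251, and the use of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib
from typing import List, Optional


def _h(value: str, salt: str) -> int:
    """Hypothetical hash helper: a 64-bit integer from SHA-256 of salt+value."""
    digest = hashlib.sha256((salt + value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_table(backends: List[str], m: int = 251) -> List[str]:
    """Fill a lookup table of prime size m, Maglev style.

    Each backend walks its own permutation of the slots
    (offset + j * skip) mod m and, in round-robin order, claims the next
    slot on its walk that is still free.
    """
    if not backends:
        raise ValueError("need at least one backend")
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table: List[Optional[str]] = [None] * m
    filled = 0
    while filled < m:
        for i, backend in enumerate(backends):
            # m is prime and 1 <= skip < m, so the walk visits every slot
            # and must reach a free one while the table is not yet full.
            while True:
                slot = (offsets[i] + next_idx[i] * skips[i]) % m
                next_idx[i] += 1
                if table[slot] is None:
                    break
            table[slot] = backend
            filled += 1
            if filled == m:
                break
    return table  # type: ignore[return-value]


def lookup(table: List[str], flow_key: str) -> str:
    """Pick the backend for a flow by hashing its key into the table."""
    return table[_h(flow_key, "flow") % len(table)]
```

Because every node that builds the table from the same backend list gets the same table, any node can resolve a given flow key to the same backend without shared per-flow state. That property is what makes this style of selection attractive for traffic that may enter through different nodes.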

Fast-path performance discipline (new §16):
- Explicit rule: per-packet fast-path work needs justification.
- Cost tiers (cheap / borderline / expensive).
- Fast-path vs flow-creation vs slow-path guidance.
- Patterns to prefer (CT flags, skb marks, compile-time gates,
  dedicated sub-programs for slow work).
- Concrete review-time checks.

Cross-cutting notes (§17):
- Map-versioning rule loosened to "bump only when new programs
  are incompatible with the old map"; repurposing padding is
  explicitly OK.

Factual corrections found while verifying review comments:
- BPFHostNetworkedNAT → BPFHostNetworkedNATWithoutCTLB.
- BPFConnectTimeLB → BPFConnectTimeLoadBalancing.

Release Note: None

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread felix/bpf/DESIGN.md Outdated
Comment on lines +635 to +636
arrive over the VXLAN tunnel, and after decap they look like they
come from the same ingress-node tunnel IP. Source IPs that were
Contributor Author

after DNAT they come from the same client ip:port to the same destination. That was not true when the client made the connection

tomastigera and others added 2 commits April 16, 2026 14:48
Folds in the L636 review comment and adds five new top-level
sections that the gap review identified as missing.

L636 fix (§5 Conflicting nodeport connections): the collision
mechanism was described wrong. Sources don't collapse to the
ingress-node tunnel IP after decap (the inner packet still
carries the original client src). The collision happens after
DNAT: multiple formerly-distinct (node-IP, node-port) destinations
at different nodes can DNAT to the same backend pod, and the LB's
legitimate source-port reuse across different node destinations
collapses to an identical 5-tuple at the backend's conntrack.
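
The collision described above can be illustrated with a small sketch. All addresses, ports, and the `NAT` table below are invented for the example; only the mechanism (two distinct pre-DNAT tuples becoming identical post-DNAT) comes from the description.

```python
from typing import Dict, NamedTuple, Tuple


class FiveTuple(NamedTuple):
    proto: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# Hypothetical NodePort NAT table: (node IP, node port) -> (backend pod IP, pod port).
# Both nodes expose the same service, which happens to resolve to the same pod.
NAT: Dict[Tuple[str, int], Tuple[str, int]] = {
    ("10.0.0.1", 30080): ("192.168.5.7", 8080),
    ("10.0.0.2", 30080): ("192.168.5.7", 8080),
}


def dnat(t: FiveTuple) -> FiveTuple:
    """Rewrite the destination as the NodePort DNAT would."""
    backend_ip, backend_port = NAT[(t.dst_ip, t.dst_port)]
    return t._replace(dst_ip=backend_ip, dst_port=backend_port)


# The external LB reuses source port 40000, which is legitimate because the
# two connections go to different node IPs.
LB_IP = "172.16.0.9"
conn_a = FiveTuple("tcp", LB_IP, 40000, "10.0.0.1", 30080)
conn_b = FiveTuple("tcp", LB_IP, 40000, "10.0.0.2", 30080)

# Distinct on the wire, identical once both have been DNATed to the same pod.
distinct_before = conn_a != conn_b
identical_after = dnat(conn_a) == dnat(conn_b)
```

From the backend's conntrack point of view the two connections are now indistinguishable, even though the LB never reused a tuple from its own perspective.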

New §3 XDP programs and the XDP→TC handoff (folds in topic 10
"untracked/PreDNAT"): XDP's role as early-drop for untracked
policy, the xdp2tc metadata handoff via CALI_META_ACCEPTED_BY_XDP
and the CALI_SKB_MARK_BYPASS_XDP mark on the TC side, and the
BPFForceTrackPacketsFromIfaces per-interface opt-out for 3rd-party
DNAT interoperability.

New §7 Service session affinity: the cali_v4_nat_aff /
cali_v6_nat_aff map, how sessionAffinity=ClientIP is resolved
before the normal backend pick, interactions with Maglev and
CTLB, applicability to both TC and CTLB paths.
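
As a rough sketch of "affinity is resolved before the normal backend pick", the logic can be modelled as a map lookup that short-circuits the regular selection. This is an illustration only: the dictionary stands in for the affinity map, and the key layout, the timeout handling, the refresh-on-hit behaviour, and the name `pick_backend` are assumptions rather than the actual BPF implementation.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical affinity key and value: (client IP, service id) -> (backend, last seen).
AffinityMap = Dict[Tuple[str, str], Tuple[str, float]]


def pick_backend(
    aff: AffinityMap,
    client_ip: str,
    service: str,
    backends: List[str],
    now: float,
    timeout_s: float,
    rng: random.Random,
) -> str:
    """Return the remembered backend if it is still usable, else pick a new one."""
    key = (client_ip, service)
    entry = aff.get(key)
    if entry is not None:
        backend, last_seen = entry
        # Reuse only if the entry has not expired and the backend still exists.
        if now - last_seen <= timeout_s and backend in backends:
            aff[key] = (backend, now)  # assumed: a hit refreshes the timestamp
            return backend
    # Normal pick (random here, standing in for the regular selection).
    backend = rng.choice(backends)
    aff[key] = (backend, now)
    return backend
```

A caller would pass the same map for every connection of a service, so repeated connections from one client IP keep landing on the same backend until the entry expires or the backend goes away.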

New §8 Service syncing & BPF kube-proxy replacement: the role of
felix/bpf/proxy/, the maps it owns, the Kubernetes semantics it
enforces (externalTrafficPolicy, topology awareness, health
checks), and where to look when a feature is at the syncer level
rather than the BPF-program level.

New §15 BPF-synthesised ICMP errors: why BPF has to synthesise
errors (BPF-forwarded paths bypass the kernel stack that would
normally emit them), the sub-program calico_tc_skb_send_icmp_replies
and the icmp_v4_reply/icmp_v6_reply builders, the TTL-exceeded
and too-big cases, and how this sits on the slow path rather
than the fast path.

New §19 Flow logs & event ring buffer: distinct from the debug
log filters (§18) — flow logs are per-flow events emitted into a
BPF ring buffer and consumed by felix/bpf/events/ /
felix/bpf/ringbuf/. Gated by FLOWLOGS_ENABLED, emitted on flow
creation / closure, shipped to Calico's observability pipeline.

Sections 3 through 22 are renumbered; all §N cross-references
shifted accordingly (sed-driven renumber, ordered highest first
to avoid cascading).
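
The "ordered highest first" detail matters because each substitution's output is a valid input for the next one. The commit used sed; the sketch below shows the same idea in Python, with a hypothetical `renumber` helper and an invented three-section mapping.

```python
import re
from typing import Dict


def renumber(text: str, mapping: Dict[int, int], highest_first: bool = True) -> str:
    """Rewrite §N cross-references according to mapping (old -> new).

    Assumes every section shifts upward (new > old). Processing the highest
    old number first guarantees a freshly written §N is never matched again.
    """
    for old in sorted(mapping, reverse=highest_first):
        # (?!\d) stops §1 from matching the prefix of §14.
        text = re.sub(rf"§{old}(?!\d)", f"§{mapping[old]}", text)
    return text


shift = {3: 4, 4: 5, 5: 6}
good = renumber("see §3, §4 and §5", shift)
bad = renumber("see §3, §4 and §5", shift, highest_first=False)
```

Here `good` keeps the three references distinct, while `bad` cascades: §3 becomes §4, which the next pass turns into §5, and so on, so every reference ends up pointing at the same section.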

Release Note: None

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New .github/instructions/ebpf-dataplane.instructions.md with
  applyTo globs for felix/bpf-gpl/, felix/bpf/, and the BPF-side
  Go code under felix/dataplane/linux/. GitHub Copilot's
  code-review feature picks these up automatically for matching
  files. Content: condensed must-check principles drawn from
  DESIGN.md's Review notes sections, plus the "update DESIGN.md
  on dataplane-behaviour changes" rule with a concrete exemption
  list (bug-fix-that-restores-described-behaviour, mechanical
  refactor, comments/log edits, dependency bump).
- .github/copilot-instructions.md: new "eBPF Dataplane Review"
  subsection pointing at DESIGN.md and the path-specific
  instructions file, restating the update rule in the same
  words.
- felix/CLAUDE.md: new "eBPF Dataplane Design & Review Guide"
  subsection inside the BPF Dataplane block. /review and the
  general-purpose review skill read CLAUDE.md files, so adding
  it here makes the rules available for Claude-driven reviews.
- felix/bpf/DESIGN.md §22 Cross-cutting review notes: new
  "Keep this document in sync with the code" bullet that
  self-declares the update rule and lists the same exemptions.

Wording is identical across all four places so a reviewer (human
or AI) sees one rule, not four slightly-different ones. No CI
enforcement — review-time judgement is the right level.

Release Note: None

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tomastigera added the docs-not-required ("Docs not required for this change") and release-note-not-required ("Change has no user-facing impact") labels and removed the release-note-required ("Change has user-facing impact (no matter how small)") and docs-pr-required ("Change is not yet documented") labels Apr 16, 2026
Reviewers (human and AI) consistently miss the performance impact of
changes that suppress an existing fast-path shortcut for a class of
flows: no new code is added on the hot path, but more flows now pay
the cost the shortcut was designed to skip. The §21 framing was built
around adding per-packet work; the inverse case wasn't named.

- DESIGN.md §21 Review notes: add the inverse-case bullet — narrowing
  an existing fast-path shortcut needs the same justification as a
  new lookup (benchmark, scoping mechanism, or quantified flow-class
  argument).
- Copilot must-check: replace "new per-packet work" framing with a
  per-packet cost question that covers both directions.
- felix/CLAUDE.md review guidance: require the question to be
  answered in plain prose before signoff, with the suppress-bypass
  case named explicitly.

Labels

docs-not-required: Docs not required for this change
release-note-not-required: Change has no user-facing impact

2 participants