[BPF] Support for IPv4 fragmentation #10335

tomastigera · 2025-05-01T00:20:45Z

Description

[BPF] Support for IPv4 fragmentation

Incoming IP fragments are stored in an LRU hash map. They can arrive out
of order. After each fragment, we check whether we have all fragments.
If any fragment is missing, we drop the skb as we cannot let it through.
Once we have all fragments, we use the current skb to assemble the whole
packet, parse it again, and let the rest of the programs process it as
if the packet had arrived as a single chunk.

We need to defragment incoming packets because we would not be able to
pass them through policies that match on more than IP. Also, we would
not be able to match them to connections in conntrack. In fact, the
payload of the fragments would be wrongly treated as L4 headers and
misinterpreted.

After a packet is reassembled, its fragments are deleted. If for any
reason we never see all fragments, the LRU will eventually kick out the
stored fragments.
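
The completeness check described above can be illustrated in plain user-space C. This is a minimal sketch, not the PR's code: `struct frag`, `frag_lookup()` and `frags_complete()` are hypothetical names, and a small array stands in for the LRU hash map. It walks the stored fragments from offset 0, advancing by each fragment's length, until it either hits a gap or reaches the fragment with the "more fragments" flag cleared.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAGS 10 /* per-packet fragment limit from the description */

/* Hypothetical stand-in for one stored fragment; not the PR's actual struct. */
struct frag {
	uint32_t offset;     /* payload offset in bytes */
	uint32_t len;        /* payload length in bytes */
	bool     more_frags; /* MF flag from the IP header */
};

/* Linear scan standing in for a lookup in the LRU hash map. */
static const struct frag *frag_lookup(const struct frag *tbl, size_t n,
				      uint32_t offset)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].offset == offset)
			return &tbl[i];
	}
	return NULL;
}

/* Returns true once the stored fragments cover the payload contiguously
 * from offset 0 up to a fragment with MF clear. On success, *total holds
 * the reassembled payload length. Arrival order does not matter. */
static bool frags_complete(const struct frag *tbl, size_t n, uint32_t *total)
{
	uint32_t offset = 0;

	for (int i = 0; i < MAX_FRAGS; i++) {
		const struct frag *f = frag_lookup(tbl, n, offset);

		if (!f)
			return false; /* gap: a fragment is still missing */
		offset += f->len;
		if (!f->more_frags) {
			*total = offset;
			return true;
		}
	}
	return false; /* more than MAX_FRAGS fragments */
}
```

Because the walk is keyed by offset rather than by arrival order, out-of-order fragments need no special handling; the check simply fails until the last gap is filled.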

There are some limitations:

* a packet cannot have more than 10 fragments - 10 is an arbitrary number
  greater than a reasonable number of fragments in modern networks (2),
  plus we fragment the packet internally into 1500-byte chunks in case
  the fragments were bigger than this - unlikely, but not impossible.
  However, no RFC limits fragmentation other than through the smallest
  MTU of 576 bytes.
* we can store up to 10k fragments - 10k is again arbitrary. If the
  fragmentation rate is higher than this, the eBPF dataplane is probably
  not the right choice, as performance would suffer, and it is likely
  better to let generic Linux handle such cases.
* defragmentation is meant to handle corner cases and is not meant to be
  performant.
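
To put the 10-fragment limit in perspective, the sketch below works out how many fragments a datagram needs at a given MTU. It assumes standard IPv4 fragmentation arithmetic (a 20-byte header without options, and fragment payloads that are multiples of 8 bytes); the helper names are hypothetical and not taken from the PR.

```c
#include <stdint.h>

#define IPV4_HDR_LEN 20 /* assumption: IPv4 header without options */

/* Largest payload one fragment can carry at the given MTU. Every fragment
 * except the last must carry a multiple of 8 bytes. Returns 0 if the MTU
 * is too small to carry any payload. */
static uint32_t frag_payload_per_mtu(uint32_t mtu)
{
	if (mtu < IPV4_HDR_LEN + 8)
		return 0;
	return ((mtu - IPV4_HDR_LEN) / 8) * 8;
}

/* Number of fragments needed for payload_len bytes of IP payload,
 * or 0 if the MTU cannot carry any payload. */
static uint32_t frag_count(uint32_t payload_len, uint32_t mtu)
{
	uint32_t per_frag = frag_payload_per_mtu(mtu);

	if (per_frag == 0)
		return 0;
	return (payload_len + per_frag - 1) / per_frag; /* round up */
}
```

Under these assumptions, 10 fragments carry at most 14800 bytes of IP payload at an MTU of 1500 (1480 bytes per fragment) and 5520 bytes at an MTU of 576 (552 bytes per fragment).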

Related issues/PRs

fixes #8821

Todos

Tests
Documentation
Release note

Release Note

ebpf: handles fragmented IPv4 packets, some limitations apply

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

docs-pr-required: This change requires a change to the documentation that has not been completed yet.
docs-completed: This change has all necessary documentation completed.
docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

release-note-required: This PR has user-facing changes. Most PRs should have this label.
release-note-not-required: This PR has no user-facing changes.

Other optional labels:

cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

sridhartigera

LGTM. Thanks.

sridhartigera · 2025-06-09T20:19:52Z

felix/bpf-gpl/ip_v4_fragment.h

+	int r_off = skb_l4hdr_offset(ctx);
+	bool more_frags = bpf_ntohs(ip_hdr(ctx)->frag_off) & 0x2000;
+
+	for (i = 0; i < 10; i++) {


nit: Nice to have a comment explaining what we are doing in this block.
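
For readers of the snippet above, the `0x2000` mask selects the "more fragments" (MF) bit of the IPv4 `frag_off` field once it is in host byte order; the low 13 bits hold the fragment offset in 8-byte units. A minimal user-space sketch of that decoding follows. The macro and function names are hypothetical and are not the ones used in `ip_v4_fragment.h`.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAG_MF       0x2000 /* "more fragments" flag */
#define FRAG_OFF_MASK 0x1fff /* 13-bit fragment offset, in 8-byte units */

/* All helpers take frag_off already converted to host byte order. */

/* True if more fragments follow this one. */
static bool frag_more(uint16_t frag_off)
{
	return (frag_off & FRAG_MF) != 0;
}

/* Byte offset of this fragment's payload within the original datagram. */
static uint32_t frag_offset_bytes(uint16_t frag_off)
{
	return (uint32_t)(frag_off & FRAG_OFF_MASK) * 8;
}

/* A packet is a fragment if MF is set or its offset is non-zero. */
static bool is_fragment(uint16_t frag_off)
{
	return (frag_off & (FRAG_MF | FRAG_OFF_MASK)) != 0;
}
```

The last fragment of a datagram has MF clear but a non-zero offset, which is why `is_fragment()` tests both.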

sridhartigera · 2025-06-09T20:34:22Z

felix/bpf-gpl/ip_v4_fragment.h

+		k.offset += v->len;
+	}
+
+	goto out;


Better to just return false in place of goto out


We need to assemble the fragments towards the host either to deal with reordering (the first fragment, which carries the L4 headers, may not arrive first) or with NAT: it is easy to NAT a whole packet, but difficult to NAT the first fragment and then do only partial NAT on the rest without being able to find the CT entry due to missing L4 headers. We assume that the host does not reorder packets; therefore we police the first fragment, record that it is allowed, and let the subsequent fragments through. The last fragment removes the record; however, in case of any failure or missing fragments, the LRU will eventually clean it up.


marvin-tigera added this to the Calico v3.31.0 milestone May 1, 2025

marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels May 1, 2025

tomastigera changed the title ~~[BPF] drop unsupported IPv4 fragments~~ [BPF] Support for IPv4 fragmentation May 1, 2025

tomastigera force-pushed the tomas-bpf-ip-defrag branch 2 times, most recently from 32780c0 to ca4a2b4 Compare May 8, 2025 00:23

tomastigera added the cherry-pick-candidate label May 8, 2025

tomastigera force-pushed the tomas-bpf-ip-defrag branch 4 times, most recently from bb0d8a3 to 90c24fb Compare May 13, 2025 23:05

tomastigera force-pushed the tomas-bpf-ip-defrag branch 4 times, most recently from ee7f6f8 to 84148cd Compare May 23, 2025 18:38

tomastigera force-pushed the tomas-bpf-ip-defrag branch 2 times, most recently from 8d0c1f6 to 1fa7fb1 Compare May 28, 2025 23:20

tomastigera marked this pull request as ready for review May 29, 2025 21:13

tomastigera requested a review from a team as a code owner May 29, 2025 21:13

tomastigera force-pushed the tomas-bpf-ip-defrag branch from ef33e2c to 8aedc09 Compare May 30, 2025 21:07

sridhartigera approved these changes Jun 9, 2025


tomastigera added 9 commits June 10, 2025 16:14

[BPF] handle fragmentation over vxlan

fbbfd35

Forwarding would create a fragmented VXLAN packet. First let the packet be fragmented and then route it into VXLAN; this is easier to handle.

[BPF] add counters for dropped fragments

bec17b9

[BPF] move IP defrag to its own program

02ca707

[BPF] disable frag tests with vxlan on ubuntu 22.04

10dc03f

[BPF] don't defrag on WEP

58f4114

No need to defrag on WEP egress if we assume that the host does not reorder packets.

bpf_fib_lookup() may return BPF_FIB_LKUP_RET_FRAG_NEEDED

75a8905

Even when the MTU is OK, we may still get BPF_FIB_LKUP_RET_FRAG_NEEDED, even when we do not ask for an MTU check. The irony is that if we asked for an MTU check, we would not get BPF_FIB_LKUP_RET_FRAG_NEEDED and all would be good.

fix BPF defrag UTs

ca39258

tomastigera added 3 commits June 10, 2025 16:14

[BPF] improve failure reason reporting

8b48f60

fix BPF ut

296294f

fix code nits

071e846

tomastigera force-pushed the tomas-bpf-ip-defrag branch from 8aedc09 to 071e846 Compare June 10, 2025 23:14

tomastigera merged commit 1ed5738 into projectcalico:master Jun 11, 2025
3 checks passed

tomastigera deleted the tomas-bpf-ip-defrag branch June 11, 2025 17:53
