

@tomastigera
Contributor

@tomastigera tomastigera commented May 1, 2025

Description

[BPF] Support for IPv4 fragmentation

Incoming IP fragments are stored in an LRU hash map. They can arrive out
of order. After each fragment, we check whether we have all fragments.
If any fragment is missing, we drop the skb as we cannot let it through.
Once we have all fragments, we use the current skb to assemble the whole
packet, parse it again, and let the rest of the programs process it as
if the packet had arrived as a single chunk.
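
The completeness check can be sketched in plain userspace C. This is a
minimal illustration, not the code from this PR: `struct frag`,
`frag_lookup` and `frags_complete` are hypothetical names, and a linear
array search stands in for the LRU hash map lookup. It assumes fragments
are keyed by their byte offset and that the chain is walked from offset 0
until a fragment without the MF flag is found.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAGS 10 /* per-packet limit named in the description */

/* Hypothetical record for one stored fragment (not the PR's map layout). */
struct frag {
    uint16_t offset;     /* payload offset in bytes within the packet */
    uint16_t len;        /* payload length in bytes */
    bool     more_frags; /* IPv4 "more fragments" (MF) flag */
};

/* Linear search standing in for a hash map lookup keyed by offset. */
static const struct frag *frag_lookup(const struct frag *store, size_t n,
                                      uint16_t offset)
{
    for (size_t i = 0; i < n; i++) {
        if (store[i].offset == offset)
            return &store[i];
    }
    return NULL;
}

/* Returns true when the stored fragments form a gap-free chain from
 * offset 0 to a fragment without MF, using at most MAX_FRAGS fragments. */
static bool frags_complete(const struct frag *store, size_t n)
{
    uint16_t offset = 0;

    for (int i = 0; i < MAX_FRAGS; i++) {
        const struct frag *f = frag_lookup(store, n, offset);

        if (!f)
            return false; /* gap: a fragment has not arrived yet */
        if (!f->more_frags)
            return true;  /* last fragment reached, chain is complete */
        offset = (uint16_t)(offset + f->len);
    }
    return false; /* chain is longer than MAX_FRAGS */
}
```

Because the walk follows offsets rather than arrival order, it gives the
same answer for in-order and out-of-order arrival.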

We need to defragment incoming packets because we would not be able to
pass them through policies that match on more than IP. Also, we would
not be able to match them to connections in conntrack. In fact, the
payload of the non-first fragments would be wrongly treated as L4
headers and misinterpreted.
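
To illustrate the point: only the fragment at offset 0 carries the L4
header, so every other fragment starts with raw payload. Below is a small
sketch of decoding the IPv4 flags/offset field. It assumes the value has
already been converted to host byte order (as `bpf_ntohs` does in the
dataplane); the macro and helper names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAG_MF       0x2000 /* "more fragments" flag */
#define FRAG_OFF_MASK 0x1fff /* fragment offset, in 8-byte units */

/* A packet is a fragment if MF is set or its offset is non-zero. */
static bool ip_is_fragment(uint16_t frag_off)
{
    return (frag_off & (FRAG_MF | FRAG_OFF_MASK)) != 0;
}

/* Only the fragment at offset 0 starts with the L4 header. */
static bool ip_frag_has_l4_header(uint16_t frag_off)
{
    return (frag_off & FRAG_OFF_MASK) == 0;
}

/* The offset field counts 8-byte units; convert it to bytes. */
static uint32_t ip_frag_offset_bytes(uint16_t frag_off)
{
    return (uint32_t)(frag_off & FRAG_OFF_MASK) * 8;
}
```

For example, a middle fragment with `frag_off == 0x20b9` has MF set and
sits at byte offset 1480, so whatever follows its IP header is payload,
not a TCP or UDP header.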

After a packet is reassembled, its fragments are deleted. If for any
reason we never see all fragments, the LRU map will eventually evict the
stored fragments.

There are some limitations:

* a packet cannot have more than 10 fragments - 10 is an arbitrary number
  greater than a reasonable number of fragments in modern networks (2),
  plus we split fragments internally into 1500-byte chunks in case the
  fragments are bigger than this - unlikely, but not impossible.
  However, there is no limit on fragmentation in any RFC except the
  smallest MTU of 576 bytes.
* we can store up to 10k fragments - 10k is again arbitrary. If there is
  a higher fragmentation rate than this, eBPF dataplane is probably not
  the right choice as performance would suffer and it is likely better
  to let generic Linux handle such cases.
* defragmentation is meant to handle corner cases and is not meant to be
  performant.
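
A rough sketch of how the first limit plays out, under one possible
reading of it: each fragment is split into 1500-byte chunks and the 10
limit applies to the total number of chunks. That reading, and the names
`chunks_needed` and `fits_chunk_budget`, are assumptions made for this
example and are not taken from the PR.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHUNKS 10   /* assumed to bound the total number of chunks */
#define CHUNK_SIZE 1500 /* internal chunk size named in the description */

/* Number of CHUNK_SIZE pieces needed to hold frag_len bytes (rounds up). */
static size_t chunks_needed(size_t frag_len)
{
    return (frag_len + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

/* True if all fragments of one packet fit into MAX_CHUNKS chunks. */
static bool fits_chunk_budget(const size_t *frag_lens, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += chunks_needed(frag_lens[i]);
    return total <= MAX_CHUNKS;
}
```

Under this reading, two ordinary 1480-byte fragments use 2 of the 10
chunks, while three 6000-byte fragments would need 12 and exceed the
limit.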

Related issues/PRs

fixes #8821

Todos

  • Tests
  • Documentation
  • Release note

Release Note

ebpf: handles fragmented IPv4 packets, some limitations apply

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@marvin-tigera marvin-tigera added this to the Calico v3.31.0 milestone May 1, 2025
@marvin-tigera marvin-tigera added the release-note-required (Change has user-facing impact, no matter how small) and docs-pr-required (Change is not yet documented) labels May 1, 2025
@tomastigera tomastigera changed the title from "[BPF] drop unsupported IPv4 fragments" to "[BPF] Support for IPv4 fragmentation" May 1, 2025
@tomastigera tomastigera force-pushed the tomas-bpf-ip-defrag branch 2 times, most recently from 32780c0 to ca4a2b4 May 8, 2025 00:23
@tomastigera tomastigera force-pushed the tomas-bpf-ip-defrag branch 4 times, most recently from bb0d8a3 to 90c24fb May 13, 2025 23:05
@tomastigera tomastigera force-pushed the tomas-bpf-ip-defrag branch 4 times, most recently from ee7f6f8 to 84148cd May 23, 2025 18:38
@tomastigera tomastigera force-pushed the tomas-bpf-ip-defrag branch 2 times, most recently from 8d0c1f6 to 1fa7fb1 May 28, 2025 23:20
@tomastigera tomastigera marked this pull request as ready for review May 29, 2025 21:13
@tomastigera tomastigera requested a review from a team as a code owner May 29, 2025 21:13
@tomastigera tomastigera force-pushed the tomas-bpf-ip-defrag branch from ef33e2c to 8aedc09 May 30, 2025 21:07
Member

@sridhartigera sridhartigera left a comment

LGTM. Thanks.

int r_off = skb_l4hdr_offset(ctx);
bool more_frags = bpf_ntohs(ip_hdr(ctx)->frag_off) & 0x2000;

for (i = 0; i < 10; i++) {
Member

nit: Nice to have a comment explaining what we are doing in this block.

k.offset += v->len;
}

goto out;
Member

Better to just return false in place of goto out

We need to assemble the fragments towards the host either to deal with
reordering - the first fragment with L4 headers may not arrive first -
or with NATing, as it is easy to NAT a whole packet, but difficult to NAT
the first fragment and then only partially NAT the rest without being
able to find the CT entry due to missing L4 headers.

We assume that the host does not reorder packets and therefore we police
the first fragment, record that it is allowed and let the subsequent
fragments through. The last fragment removes the record; however, in case
of any failure or missing fragments, LRU will eventually clean it up.
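
The "police the first fragment, then let the rest through" scheme can be
sketched as a small userspace C model. This is only an illustration under
stated assumptions: the key fields, the fixed-size table (standing in for
an LRU map) and the names `fwd_key`, `fwd_record` and `handle_fragment`
are hypothetical and not taken from this PR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FWD_SLOTS 16 /* tiny table for the example; the real map is an LRU */

/* Fragments of one packet share these IPv4 header fields. */
struct fwd_key {
    uint32_t src;
    uint32_t dst;
    uint16_t id;
    uint8_t  proto;
};

struct fwd_slot {
    struct fwd_key key;
    bool used;
};

static struct fwd_slot fwd_table[FWD_SLOTS];

static bool key_eq(const struct fwd_key *a, const struct fwd_key *b)
{
    return a->src == b->src && a->dst == b->dst &&
           a->id == b->id && a->proto == b->proto;
}

static struct fwd_slot *fwd_find(const struct fwd_key *k)
{
    for (size_t i = 0; i < FWD_SLOTS; i++) {
        if (fwd_table[i].used && key_eq(&fwd_table[i].key, k))
            return &fwd_table[i];
    }
    return NULL;
}

/* Remember that the packet identified by k passed policy. */
static bool fwd_record(const struct fwd_key *k)
{
    if (fwd_find(k))
        return true;
    for (size_t i = 0; i < FWD_SLOTS; i++) {
        if (!fwd_table[i].used) {
            fwd_table[i].key = *k;
            fwd_table[i].used = true;
            return true;
        }
    }
    return false; /* table full; an LRU map would evict instead */
}

/* Returns true if the fragment may be forwarded. policy_allows is the
 * policy verdict and is only consulted for the first fragment. */
static bool handle_fragment(struct fwd_key k, uint16_t offset,
                            bool more_frags, bool policy_allows)
{
    struct fwd_slot *s;

    if (offset == 0) /* first fragment: carries the L4 header */
        return policy_allows && fwd_record(&k);

    s = fwd_find(&k);
    if (!s)
        return false; /* no allowed first fragment seen: drop */
    if (!more_frags)
        s->used = false; /* last fragment removes the record */
    return true;
}
```

Note that a non-first fragment arriving before the first one is dropped
in this model, which is why the scheme relies on the host not reordering
packets.
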

Forwarding would create a fragmented VXLAN packet. First let the packet
be fragmented and then route it into VXLAN. This is easier to handle.

No need to defrag on WEP egress if we assume that the host does not
reorder packets.

When the MTU is OK, we may still get BPF_FIB_LKUP_RET_FRAG_NEEDED even
when we do not ask for an MTU check. The irony is that if we asked for
the MTU check, we would not get BPF_FIB_LKUP_RET_FRAG_NEEDED and all
would be good.
@tomastigera tomastigera force-pushed the tomas-bpf-ip-defrag branch from 8aedc09 to 071e846 June 10, 2025 23:14
@tomastigera tomastigera merged commit 1ed5738 into projectcalico:master Jun 11, 2025
3 checks passed
@tomastigera tomastigera deleted the tomas-bpf-ip-defrag branch June 11, 2025 17:53

Labels

cherry-pick-candidate, docs-pr-required (Change is not yet documented), release-note-required (Change has user-facing impact, no matter how small)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support IP fragmentation in eBPF

3 participants