B4/rhash#11367

Draft
mykyta5 wants to merge 13 commits into kernel-patches:bpf-next_base from mykyta5:b4/rhash

Conversation

@mykyta5 mykyta5 commented Mar 11, 2026

No description provided.

mykyta5 added 13 commits March 11, 2026 16:22
This patch series introduces BPF_MAP_TYPE_RHASH, a new hash map type that
leverages the kernel's rhashtable to provide a resizable hash map for BPF.

The existing BPF_MAP_TYPE_HASH uses a fixed number of buckets determined at
map creation time. While this works well for many use cases, it presents
challenges when:

1. The number of elements is unknown at creation time
2. The element count varies significantly during runtime
3. Memory efficiency is important (over-provisioning wastes memory,
 under-provisioning hurts performance)

BPF_MAP_TYPE_RHASH addresses these issues by using rhashtable, which
automatically grows and shrinks based on load factor.

The implementation wraps the kernel's rhashtable with BPF map operations:

- Uses bpf_mem_alloc for RCU-safe memory management
- Supports all standard map operations (lookup, update, delete, get_next_key)
- Supports batch operations (lookup_batch, lookup_and_delete_batch)
- Supports BPF iterators for traversal
- Supports BPF_F_LOCK for spin locks in values
- Requires BPF_F_NO_PREALLOC flag (elements allocated on demand)
- max_entries serves as a hard limit, not bucket count
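From a BPF program, usage would presumably mirror the existing hash map declaration. This is a sketch only, assuming the new type is exposed to programs as BPF_MAP_TYPE_RHASH; the map name and sizes are illustrative:

```c
/* Hypothetical usage sketch in libbpf's BTF map declaration syntax. */
struct {
	__uint(type, BPF_MAP_TYPE_RHASH);
	__uint(map_flags, BPF_F_NO_PREALLOC);	/* required by this map type      */
	__uint(max_entries, 10000);		/* hard element limit, not buckets */
	__type(key, __u32);
	__type(value, __u64);
} resizable_map SEC(".maps");
```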

The series includes comprehensive tests:
- Basic operations in test_maps (lookup, update, delete, get_next_key)
- BPF program tests for lookup/update/delete semantics
- BPF_F_LOCK tests with concurrent access
- Stress tests for get_next_key during concurrent resize operations
- Seq file tests

Signed-off-by: Mykyta Yatsenko <[email protected]>

---
The current implementation of BPF_MAP_TYPE_RHASH does not provide
the same strong guarantees on value consistency under concurrent
reads/writes as BPF_MAP_TYPE_HASH.
BPF_MAP_TYPE_HASH allocates a new element and atomically swaps the
pointer, so RCU readers always see a complete value. BPF_MAP_TYPE_RHASH
does the memcpy in place with no lock held.
rhash trades consistency for speed (a 5x improvement in the update
benchmark): concurrent readers can observe partially updated data, and
two concurrent writers to the same key can interleave, producing mixed
values.
As a solution, users can pass BPF_F_LOCK to guarantee consistent reads
and serialized writes.
Summary of the read consistency guarantees:
  map type     |  write mechanism |  read consistency
  -------------+------------------+--------------------------
  htab         |  alloc, swap ptr |  always consistent (RCU)
  htab  F_LOCK |  in-place + lock |  consistent if reader locks
  -------------+------------------+--------------------------
  rhtab        |  in-place memcpy |  torn reads
  rhtab F_LOCK |  in-place + lock |  consistent if reader locks

Changes in v2:
- Added benchmarks
- Link to v1: https://lore.kernel.org/r/[email protected]

--- b4-submit-tracking ---
{
  "series": {
    "revision": 2,
    "change-id": "20251103-rhash-7b70069923d8",
    "prefixes": [
      "RFC bpf-next"
    ],
    "history": {
      "v1": [
        "[email protected]"
      ]
    }
  }
}
Add the resizable hash map to the enums where it is needed.

Signed-off-by: Mykyta Yatsenko <[email protected]>
Introduce basic operations for BPF_MAP_TYPE_RHASH, a new hash map type
built on top of the kernel's rhashtable.

Key implementation details:
- Uses rhashtable for automatic resizing with RCU-safe operations
- Elements allocated via bpf_mem_alloc for lock-free allocation
- Supports BPF_F_LOCK for spin_lock protected values
- Requires BPF_F_NO_PREALLOC

Implemented map operations:
 * map_alloc/map_free: Initialize and destroy the rhashtable
 * map_lookup_elem: RCU-protected lookup via rhashtable_lookup
 * map_update_elem: Insert or update with BPF_NOEXIST/EXIST/ANY
 * map_delete_elem: Remove element with RCU-deferred freeing
 * map_get_next_key: Returns the next key in the table
 * map_release_uref: Free internal structs (timers, workqueues)

Other operations (batch, seq file) are implemented in the next patch.

Signed-off-by: Mykyta Yatsenko <[email protected]>
Add batch operations and BPF iterator support for BPF_MAP_TYPE_RHASH.

Batch operations:
 * rhtab_map_lookup_batch: Bulk lookup of elements by bucket
 * rhtab_map_lookup_and_delete_batch: Atomic bulk lookup and delete

The batch implementation iterates through buckets under RCU protection,
copying keys and values to userspace buffers. When the buffer fills
mid-bucket, it rolls back to the bucket boundary so the next call can
retry that bucket completely.

BPF iterator:
 * Uses rhashtable_walk_* API for safe iteration
 * Handles -EAGAIN during table resize transparently
 * Tracks skip_elems to resume iteration across read() calls

Also implements rhtab_map_mem_usage() to report memory consumption.

Signed-off-by: Mykyta Yatsenko <[email protected]>
Test basic map operations (lookup, update, delete) for
BPF_MAP_TYPE_RHASH including boundary conditions like duplicate
key insertion and deletion of nonexistent keys.

Signed-off-by: Mykyta Yatsenko <[email protected]>
Add tests validating that the resizable hash map handles the BPF_F_LOCK
flag as expected.

Signed-off-by: Mykyta Yatsenko <[email protected]>
Test get_next_key behavior under concurrent modification:
 * Resize test: verify all elements visited after resize trigger
 * Stress test: concurrent iterators and modifiers to detect races

Signed-off-by: Mykyta Yatsenko <[email protected]>
Test BPF iterator functionality for BPF_MAP_TYPE_RHASH:
 * Basic iteration verifying all elements are visited
 * Overflow test triggering seq_file restart, validating correct
resume behavior via skip_elems tracking

Signed-off-by: Mykyta Yatsenko <[email protected]>
Make bpftool documentation aware of the resizable hash map.

Signed-off-by: Mykyta Yatsenko <[email protected]>
Support resizable hashmap in BPF map benchmarks.

Results:
$ sudo ./bench -w3 -d10 -a bpf-rhashmap-full-update
0:hash_map_full_perf 21641414 events per sec

$ sudo ./bench -w3 -d10 -a bpf-hashmap-full-update
0:hash_map_full_perf 4392758 events per sec

$ sudo ./bench -w3 -d10 -a -p8 htab-mem --use-case overwrite --value-size 8
Iter   0 (302.834us): per-prod-op   62.85k/s, memory usage    2.70MiB
Iter   1 (-44.810us): per-prod-op   62.81k/s, memory usage    2.70MiB
Iter   2 (-45.821us): per-prod-op   62.81k/s, memory usage    2.70MiB
Iter   3 (-63.658us): per-prod-op   62.92k/s, memory usage    2.70MiB
Iter   4 ( 32.887us): per-prod-op   62.85k/s, memory usage    2.70MiB
Iter   5 (-76.948us): per-prod-op   62.75k/s, memory usage    2.70MiB
Iter   6 (157.235us): per-prod-op   63.01k/s, memory usage    2.70MiB
Iter   7 (-118.761us): per-prod-op   62.85k/s, memory usage    2.70MiB
Iter   8 (127.139us): per-prod-op   62.92k/s, memory usage    2.70MiB
Iter   9 (-169.908us): per-prod-op   62.99k/s, memory usage    2.70MiB
Iter  10 (101.962us): per-prod-op   62.97k/s, memory usage    2.70MiB
Iter  11 (-64.330us): per-prod-op   63.05k/s, memory usage    2.70MiB
Iter  12 (-20.543us): per-prod-op   62.86k/s, memory usage    2.70MiB
Iter  13 ( 55.382us): per-prod-op   62.95k/s, memory usage    2.70MiB
Summary: per-prod-op   62.92 ±    0.09k/s, memory usage    2.70 ±    0.00MiB, peak memory usage    2.96MiB

$ sudo ./bench -w3 -d10 -a -p8 rhtab-mem --use-case overwrite --value-size 8
Iter   0 (316.805us): per-prod-op   96.40k/s, memory usage    2.71MiB
Iter   1 (-35.225us): per-prod-op   96.54k/s, memory usage    2.71MiB
Iter   2 (-12.431us): per-prod-op   96.54k/s, memory usage    2.71MiB
Iter   3 (-56.537us): per-prod-op   96.58k/s, memory usage    2.71MiB
Iter   4 ( 27.108us): per-prod-op   96.62k/s, memory usage    2.71MiB
Iter   5 (-52.491us): per-prod-op   96.57k/s, memory usage    2.71MiB
Iter   6 ( -2.777us): per-prod-op   96.52k/s, memory usage    2.71MiB
Iter   7 (108.963us): per-prod-op   96.45k/s, memory usage    2.71MiB
Iter   8 (-61.575us): per-prod-op   96.48k/s, memory usage    2.71MiB
Iter   9 (-21.595us): per-prod-op   96.14k/s, memory usage    2.71MiB
Iter  10 (  3.243us): per-prod-op   96.36k/s, memory usage    2.71MiB
Iter  11 (  3.102us): per-prod-op   94.70k/s, memory usage    2.71MiB
Iter  12 (109.102us): per-prod-op   95.77k/s, memory usage    2.71MiB
Iter  13 ( 16.153us): per-prod-op   95.91k/s, memory usage    2.71MiB
Summary: per-prod-op   96.19 ±    0.57k/s, memory usage    2.71 ±    0.00MiB, peak memory usage    2.71MiB

$ sudo ./bench -w3 -d10 -a bpf-hashmap-lookup --key_size 4\
  --max_entries 1000 --nr_entries 500 --nr_loops 1000000
cpu00: lookup 28.603M ± 0.536M events/sec (approximated from 32 samples of ~34ms)

$ sudo ./bench -w3 -d10 -a bpf-rhashmap-lookup --key_size 4\
  --max_entries 1000 --nr_entries 500 --nr_loops 1000000
cpu00: lookup 27.340M ± 0.864M events/sec (approximated from 32 samples of ~36ms)

Signed-off-by: Mykyta Yatsenko <[email protected]>
@kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 9 times, most recently from 6f71402 to 3aabcc8 on March 17, 2026 at 23:03