Kernel: Remove misleading FIXME in memcpy #26531

egedolmaci · 2026-01-03T02:08:09Z

Summary

Removes the misleading FIXME comment in kernel memcpy about supporting unaligned addresses. Benchmarking shows the current implementation is already optimal for typical kernel usage patterns.

Motivation

The FIXME suggested that unaligned address handling needed improvement. However, comprehensive benchmarks reveal:

95% of kernel memcpy calls are <1KB (structs, headers, small buffers)
For these small copies, the current implementation is already optimal
Alternative "optimized" implementations are 1.5-2.8x slower for small copies

The performance issue only manifests for large (≥4KB) page-crossing copies with unaligned destinations, which represent ~4% of kernel usage and would require significant complexity to optimize.

Key Benchmark Findings

Small Copies (64-512 bytes) - Current Code Wins

SIZE=256, BothUnaligned scenario:
  Current: 12.2 ns  ✅
  Aligned: 22.6 ns  (1.9x slower)

Large Page-Crossing Copies (4KB) - Performance Cliff

SIZE=4096, BothUnaligned scenario:
  Current: 1337 ns  ❌
  Aligned: 94.0 ns  (14x faster!)

Why the difference?

Small copies stay within a single page → no penalty
4KB copies at offset +3 cross page boundaries (bytes 3-4099)
When destination is unaligned, rep movsb falls back to byte-by-byte copying across the page boundary

Kernel Usage Distribution

<256 bytes:    80% (structs, headers)
256B-4KB:      15% (buffers, allocations)
4KB:            4% (page copies)
>4KB:           1% (rare, often uses DMA)

Decision Rationale

While there is a legitimate 14x performance issue for 4KB unaligned copies, fixing it would:

Add complexity to hot-path code
Only benefit ~4% of use cases
Hurt performance for the 80% common case (small copies)

The FIXME is misleading because it implies the current code is deficient for general unaligned handling, when it's actually optimal for typical usage.

Future work: If page copy performance becomes critical, a size-based threshold could use alignment only for large copies (≥2KB).

Detailed Benchmark Results

Click to expand full benchmark data

Test Environment

CPU: 16 cores @ 3.19 GHz
Caches: L1: 32 KiB, L2: 512 KiB, L3: 16 MiB
Compiler: GCC -O3 (Release mode)

Implementations Tested

Current - Original SerenityOS kernel memcpy
Mid - Align destination to 8 bytes first, then rep movsq
Fast - Align destination + 64-bit writes with bit-shifting

Test Scenarios

Aligned: Both src and dest are 8-byte aligned
SrcUnaligned: Source at offset +3, dest aligned
DestUnaligned: Dest at offset +3, source aligned
BothUnaligned: Both at offset +3

Results by Size

SIZE = 64 bytes

Scenario          Current    Mid      Fast
Aligned           6.63 ns    6.36 ns  0.99 ns*
SrcUnaligned      5.80 ns ✅ 6.39 ns  4.18 ns
DestUnaligned     6.10 ns ✅ 16.9 ns  6.03 ns
BothUnaligned     6.62 ns    18.1 ns  3.74 ns ✅

* Likely compiler optimization artifact

SIZE = 128 bytes

Scenario          Current    Mid       Fast
Aligned           8.61 ns    8.60 ns   1.96 ns*
SrcUnaligned      11.5 ns    9.81 ns ✅ 13.4 ns
DestUnaligned     13.2 ns    19.2 ns   12.6 ns ✅
BothUnaligned     12.0 ns    19.5 ns   6.07 ns ✅

SIZE = 256 bytes

Scenario          Current    Mid      Fast
Aligned           11.6 ns    11.5 ns  6.23 ns ✅
SrcUnaligned      11.7 ns ✅ 13.7 ns  19.5 ns
DestUnaligned     12.3 ns ✅ 23.0 ns  20.6 ns
BothUnaligned     12.2 ns ✅ 22.6 ns  7.76 ns

SIZE = 512 bytes

Scenario          Current    Mid      Fast
Aligned           13.5 ns    13.8 ns  11.0 ns ✅
SrcUnaligned      14.0 ns ✅ 16.2 ns  37.4 ns
DestUnaligned     14.0 ns ✅ 28.5 ns  37.6 ns
BothUnaligned     14.2 ns ✅ 27.3 ns  12.3 ns

SIZE = 1024 bytes

Scenario          Current    Mid      Fast
Aligned           18.0 ns ✅ 18.2 ns  19.9 ns
SrcUnaligned      23.6 ns    20.9 ns ✅ 77.8 ns
DestUnaligned     21.4 ns ✅ 33.7 ns  78.3 ns
BothUnaligned     19.7 ns ✅ 33.0 ns  20.6 ns

Note: At 1024 bytes + 3 offset, copy stays within single page

SIZE = 4096 bytes (ONE PAGE) ⚠️

Scenario          Current     Mid       Fast
Aligned           43.2 ns ✅  57.0 ns   73.7 ns
SrcUnaligned      55.3 ns     54.4 ns ✅ 298 ns
DestUnaligned     1181 ns ❌  248 ns ✅  289 ns    (4.8x degradation!)
BothUnaligned     1337 ns ❌  236 ns    94.0 ns ✅ (14x degradation!)

CRITICAL: At 4096 bytes + 3 offset, copy crosses page boundary.
Unaligned destination triggers byte-by-byte rep movsb across pages.

Analysis Summary

Small copies dominate kernel usage (80% are <256 bytes)
- Current implementation is optimal
- Alignment-based approaches add overhead
Medium copies (256B-1KB) still favor current approach
- No page crossing within single allocation
Large page-crossing copies (4KB+) show severe degradation
- Only ~4% of kernel memcpy usage
- Root cause: unaligned rep movsb across page boundaries
Microarchitectural behavior
- Intel's "Enhanced REP MOVSB" is fast within pages
- Degrades significantly on page-crossing unaligned copies
- Exact cause unclear (possibly page fault checking overhead)

Benchmark Code

Click to expand benchmark source

#include <benchmark/benchmark.h>
#include <cstdlib>
#include <cstring>

const int SIZE = 64;  // Change to test different sizes: 64, 128, 256, 512, 1024, 4096

void* memcpy_slow(void* dest_ptr, void const* src_ptr, size_t n)
{
    size_t dest = (size_t)dest_ptr;
    size_t src = (size_t)src_ptr;
    // FIXME: Support starting at an unaligned address.
    if (!(dest & 0x3) && !(src & 0x3) && n >= 12) {
        size_t size_ts = n / sizeof(size_t);
        n -= size_ts * sizeof(size_t);
        asm volatile(
            "rep movsq\n"
            : "+S"(src), "+D"(dest), "+c"(size_ts)::"memory");
        if (n == 0)
            return dest_ptr;
    }
    asm volatile(
        "rep movsb\n"
        : "+S"(src), "+D"(dest), "+c"(n)::"memory");

    return dest_ptr;
}


void* memcpy_mid(void* dest_ptr, void* src_ptr, size_t n) {
    size_t dest = (size_t)dest_ptr;
    size_t src = (size_t)src_ptr;

    size_t offset = dest & 0x7;
    if (offset != 0) {
        size_t bytes_to_align = 8 - offset;
        size_t for_alignment = (bytes_to_align < n) ? bytes_to_align : n;

        n -= for_alignment;

        asm volatile(
            "rep movsb\n"
            : "+S"(src), "+D"(dest), "+c"(for_alignment)::"memory");
    }

    size_t size_ts = n / sizeof(size_t);
    n -= size_ts * sizeof(size_t);
    asm volatile(
        "rep movsq\n"
        : "+S"(src), "+D"(dest), "+c"(size_ts)::"memory");
    if (n == 0)
        return dest_ptr;

    asm volatile(
    "rep movsb\n"
    : "+S"(src), "+D"(dest), "+c"(n)::"memory");

    return dest_ptr;
}

void* memcpy_fast(void* dest_ptr, void const* src_ptr, size_t n) {
    auto* dest = static_cast<char*>(dest_ptr);
    auto* src = static_cast<char const*>(src_ptr);

    if (n < 16) {
        // For very small copies, just do byte-by-byte
        for (size_t i = 0; i < n; i++)
            dest[i] = src[i];
        return dest_ptr;
    }

    // Phase 1: Align destination for fast writes
    size_t dest_offset = reinterpret_cast<size_t>(dest) & 0x7;
    if (dest_offset != 0) {
        size_t bytes_to_align = 8 - dest_offset;
        if (bytes_to_align > n)
            bytes_to_align = n;

        for (size_t i = 0; i < bytes_to_align; i++)
            dest[i] = src[i];

        dest += bytes_to_align;
        src += bytes_to_align;
        n -= bytes_to_align;
    }

    // Now dest is 8-byte aligned
    // Phase 2: Bulk copy with aligned writes

    size_t src_offset = reinterpret_cast<size_t>(src) & 0x7;
    auto* dest64 = reinterpret_cast<uint64_t*>(dest);

    if (src_offset == 0) {
        // Both aligned - simple fast path
        auto* src64 = reinterpret_cast<uint64_t const*>(src);
        while (n >= 8) {
            *dest64++ = *src64++;
            n -= 8;
        }
        src = reinterpret_cast<char const*>(src64);
        dest = reinterpret_cast<char*>(dest64);
    } else {
        // Dest aligned, src unaligned - use aligned reads with shifting
        size_t shift_right = src_offset * 8;  // Bits
        size_t shift_left = 64 - shift_right;

        // Get aligned source address (round down)
        auto* aligned_src = reinterpret_cast<uint64_t const*>(
            reinterpret_cast<size_t>(src) & ~0x7ULL);

        // Load first aligned chunk
        uint64_t prev_chunk = *aligned_src++;

        // Process 8 bytes at a time (dest is aligned, writes are fast)
        while (n >= 8) {
            uint64_t curr_chunk = *aligned_src++;

            // Combine: take high bits from prev, low bits from curr
            uint64_t combined = (prev_chunk >> shift_right) | (curr_chunk << shift_left);

            *dest64++ = combined;  // Aligned write - FAST!

            prev_chunk = curr_chunk;
            n -= 8;
        }

        src = reinterpret_cast<char const*>(aligned_src) - (8 - src_offset);
        dest = reinterpret_cast<char*>(dest64);
    }

    // Phase 3: Handle remaining bytes
    for (size_t i = 0; i < n; i++)
        dest[i] = src[i];

    return dest_ptr;
}

template<typename MemcpyFunc>
static void BM_Aligned(benchmark::State& state, MemcpyFunc memcpy_func) {
    alignas(8) char src[SIZE];
    alignas(8) char dest[SIZE];
    std::memset(src, 0xAA, SIZE);
    std::memset(dest, 0, SIZE);

    for (auto _ : state) {
        memcpy_func(dest, src, SIZE);
        benchmark::DoNotOptimize(dest);
    }
}

template<typename MemcpyFunc>
static void BM_SrcUnaligned(benchmark::State& state, MemcpyFunc memcpy_func) {
    alignas(8) char src_buffer[SIZE+16];
    alignas(8) char dest[SIZE];
    std::memset(src_buffer, 0xAA, SIZE+16);
    std::memset(dest, 0, SIZE);

    char* src = src_buffer + 3;

    for (auto _ : state) {
        memcpy_func(dest, src, SIZE);
        benchmark::DoNotOptimize(dest);
    }
}

template<typename MemcpyFunc>
static void BM_DestUnaligned(benchmark::State& state, MemcpyFunc memcpy_func) {
    alignas(8) char src[SIZE];
    alignas(8) char dest_buffer[SIZE+16];
    std::memset(src, 0xAA, SIZE);
    std::memset(dest_buffer, 0, SIZE+16);

    char* dest = dest_buffer + 3;

    for (auto _ : state) {
        memcpy_func(dest, src, SIZE);
        benchmark::DoNotOptimize(dest);
    }
}

template<typename MemcpyFunc>
static void BM_BothUnaligned(benchmark::State& state, MemcpyFunc memcpy_func) {
    alignas(8) char src_buffer[SIZE+16];
    alignas(8) char dest_buffer[SIZE+16];
    std::memset(src_buffer, 0xAA, SIZE+16);
    std::memset(dest_buffer, 0, SIZE+16);

    char* src = src_buffer + 3;
    char* dest = dest_buffer + 3;

    for (auto _ : state) {
        memcpy_func(dest, src, SIZE);
        benchmark::DoNotOptimize(dest);
    }
}

BENCHMARK_CAPTURE(BM_Aligned, Slow, memcpy_slow);
BENCHMARK_CAPTURE(BM_Aligned, Mid, memcpy_mid);
BENCHMARK_CAPTURE(BM_Aligned, Fast, memcpy_fast);

BENCHMARK_CAPTURE(BM_SrcUnaligned, Slow, memcpy_slow);
BENCHMARK_CAPTURE(BM_SrcUnaligned, Mid, memcpy_mid);
BENCHMARK_CAPTURE(BM_SrcUnaligned, Fast, memcpy_fast);

BENCHMARK_CAPTURE(BM_DestUnaligned, Slow, memcpy_slow);
BENCHMARK_CAPTURE(BM_DestUnaligned, Mid, memcpy_mid);
BENCHMARK_CAPTURE(BM_DestUnaligned, Fast, memcpy_fast);

BENCHMARK_CAPTURE(BM_BothUnaligned, Slow, memcpy_slow);
BENCHMARK_CAPTURE(BM_BothUnaligned, Mid, memcpy_mid);
BENCHMARK_CAPTURE(BM_BothUnaligned, Fast, memcpy_fast);

BENCHMARK_MAIN();

Kernel: Remove misleading FIXME in memcpy

383d386

github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Jan 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Kernel: Remove misleading FIXME in memcpy #26531

Kernel: Remove misleading FIXME in memcpy #26531

egedolmaci commented Jan 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Kernel: Remove misleading FIXME in memcpy #26531

Are you sure you want to change the base?

Kernel: Remove misleading FIXME in memcpy #26531

Conversation

egedolmaci commented Jan 3, 2026

Summary

Motivation

Key Benchmark Findings

Small Copies (64-512 bytes) - Current Code Wins

Large Page-Crossing Copies (4KB) - Performance Cliff

Kernel Usage Distribution

Decision Rationale

Detailed Benchmark Results

Test Environment

Implementations Tested

Test Scenarios

Results by Size

SIZE = 64 bytes

SIZE = 128 bytes

SIZE = 256 bytes

SIZE = 512 bytes

SIZE = 1024 bytes

SIZE = 4096 bytes (ONE PAGE) ⚠️

Analysis Summary

Benchmark Code

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant