@ThePseudo ThePseudo commented Sep 12, 2025

On the main branch, the encode and decode operations look at the file ahead-of-time to gather information about padding. However, padding only appears at the end, and the rest of the file can be encoded and decoded disregarding the padding.
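To illustrate why padding is confined to the end: in standard base64 the number of `=` characters depends only on the input length modulo 3, so every complete 3-byte group before the last one is encoded without any. The sketch below uses hypothetical helper names (`padding_len`, `encoded_len`) that are not part of this patch.

```rust
/// Number of '=' characters standard base64 appends for `input_len` bytes.
/// (Hypothetical helper, for illustration only.)
fn padding_len(input_len: usize) -> usize {
    (3 - input_len % 3) % 3
}

/// Length of the padded base64 encoding of `input_len` bytes,
/// ignoring any line wrapping.
fn encoded_len(input_len: usize) -> usize {
    (input_len + 2) / 3 * 4
}

fn main() {
    for n in 0..7 {
        println!(
            "{n} input bytes -> {} output chars, {} of them padding",
            encoded_len(n),
            padding_len(n)
        );
    }
}
```

Only one of the three possible remainders (length divisible by 3) yields no padding at all.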

The main issue with reading the file ahead of time is that the entire file must be available from the beginning. This conflicts with streaming use cases: imagine a web socket where the sender transmits base64-encoded data, but the receiver can only decode it once the whole message has arrived, which makes real-time communication impossible.

Moreover, reading the entire file up front means that it has to stay in RAM the whole time. This is not a problem for small files, but when base64-encoding a file of a few gigabytes it can easily saturate main memory.

This patch aims to remove the ahead-of-time read. First, we no longer check for padding ourselves but let the decoder do it for us: as noted earlier, most of the encoded file has no padding, and there is a 1/3 probability that there is no padding at the end either. The STANDARD_NO_PAD base64 decoder we use returns an error if padding is present; in that case, we fall back to the STANDARD base64 decoder. This removes the need to know about padding ahead of time.
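The fallback idea can be sketched with the standard library alone. This is not the code from the patch, which relies on the base64_simd crate; `decode_no_pad`, `decode_padded`, `decode_chunk` and `decode_stream` are hypothetical names, and the hand-written decoders only stand in for the STANDARD_NO_PAD and STANDARD decoders. Each chunk is first decoded as unpadded input, and the padded decoder is tried only if that fails.

```rust
use std::io::{self, Read, Write};

/// Map one base64 alphabet byte to its 6-bit value.
fn sextet(c: u8) -> Option<u8> {
    match c {
        b'A'..=b'Z' => Some(c - b'A'),
        b'a'..=b'z' => Some(c - b'a' + 26),
        b'0'..=b'9' => Some(c - b'0' + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

/// Stand-in for an unpadded decoder: any '=' makes it fail.
fn decode_no_pad(input: &[u8]) -> Option<Vec<u8>> {
    if input.len() % 4 == 1 {
        return None; // a single leftover character cannot encode a byte
    }
    let mut out = Vec::with_capacity(input.len() / 4 * 3 + 2);
    for group in input.chunks(4) {
        let mut acc: u32 = 0;
        for &c in group {
            acc = (acc << 6) | u32::from(sextet(c)?);
        }
        acc <<= 6 * (4 - group.len()); // left-align a short final group
        let n = group.len() * 6 / 8; // 4 chars -> 3 bytes, 3 -> 2, 2 -> 1
        out.extend_from_slice(&acc.to_be_bytes()[1..1 + n]);
    }
    Some(out)
}

/// Stand-in for a padded decoder: strips up to two trailing '='.
fn decode_padded(input: &[u8]) -> Option<Vec<u8>> {
    if input.len() % 4 != 0 {
        return None;
    }
    let pad = input.iter().rev().take(2).take_while(|&&c| c == b'=').count();
    decode_no_pad(&input[..input.len() - pad])
}

/// Try the unpadded decoder first; fall back to the padded one on error.
fn decode_chunk(chunk: &[u8]) -> Option<Vec<u8>> {
    decode_no_pad(chunk).or_else(|| decode_padded(chunk))
}

fn invalid() -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, "invalid base64 input")
}

/// Decode with a fixed-size buffer, without looking at the end of the input.
fn decode_stream<R: Read, W: Write>(mut input: R, mut output: W) -> io::Result<()> {
    let mut buf = [0u8; 8192];
    let mut pending: Vec<u8> = Vec::new();
    loop {
        let n = input.read(&mut buf)?;
        if n == 0 {
            break;
        }
        // Drop line breaks, then decode only complete 4-character groups.
        pending.extend(buf[..n].iter().copied().filter(|c| !c.is_ascii_whitespace()));
        let ready = pending.len() / 4 * 4;
        let decoded = decode_chunk(&pending[..ready]).ok_or_else(invalid)?;
        output.write_all(&decoded)?;
        pending.drain(..ready);
    }
    // Whatever is left is a short, unpadded tail (or nothing).
    let decoded = decode_chunk(&pending).ok_or_else(invalid)?;
    output.write_all(&decoded)
}

fn main() -> io::Result<()> {
    let mut out = Vec::new();
    decode_stream(&b"Zm9vYg=="[..], &mut out)?;
    println!("{}", String::from_utf8_lossy(&out));
    Ok(())
}
```

In this sketch the slower fallback path runs only for a chunk that actually contains `=`, which in well-formed input is the last one. One simplification to be aware of: it would also accept padding in the middle of the stream.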

Also, note that the encoder does not need any ahead-of-time knowledge of padding, since it is the encoder itself that generates it.
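A minimal standard-library sketch of that point, assuming a hypothetical `encode_stream` helper (the patch itself uses the base64_simd encoder, and this sketch omits line wrapping): complete 3-byte groups are encoded as they arrive, and `=` is produced only for the final short group.

```rust
use std::io::{self, Read, Write};

const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encode one group of 1 to 3 bytes into 4 characters, padding with '='.
fn encode_group(group: &[u8], out: &mut Vec<u8>) {
    let mut b = [0u8; 3];
    b[..group.len()].copy_from_slice(group);
    let acc = u32::from(b[0]) << 16 | u32::from(b[1]) << 8 | u32::from(b[2]);
    for i in 0..4 {
        if i <= group.len() {
            out.push(ALPHABET[(acc >> (18 - 6 * i)) as usize & 0x3F]);
        } else {
            out.push(b'=');
        }
    }
}

/// Encode with a fixed-size buffer; memory use does not depend on input size.
fn encode_stream<R: Read, W: Write>(mut input: R, mut output: W) -> io::Result<()> {
    let mut buf = [0u8; 8190];
    let mut pending: Vec<u8> = Vec::new();
    let mut encoded: Vec<u8> = Vec::new();
    loop {
        let n = input.read(&mut buf)?;
        if n == 0 {
            break;
        }
        pending.extend_from_slice(&buf[..n]);
        // Only complete 3-byte groups are encoded here, so no padding yet.
        let ready = pending.len() / 3 * 3;
        encoded.clear();
        for group in pending[..ready].chunks(3) {
            encode_group(group, &mut encoded);
        }
        output.write_all(&encoded)?;
        pending.drain(..ready);
    }
    // At end of input, at most two bytes remain; this is where '=' appears.
    if !pending.is_empty() {
        encoded.clear();
        encode_group(&pending, &mut encoded);
        output.write_all(&encoded)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut out = Vec::new();
    encode_stream(&b"foob"[..], &mut out)?;
    println!("{}", String::from_utf8_lossy(&out));
    Ok(())
}
```

Carrying over at most two leftover bytes between reads is what keeps the output identical regardless of how the input happens to be split into reads.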

For the benchmarking:
coreutils base64 refers to this PR version
coreutils_main_branch base64 refers to the version that is on the main branch
base64 refers to GNU Coreutils base64

As this is partially also a performance-related patch, I will paste the hyperfine analysis:

For encoding:

Benchmark 1: ./coreutils base64 model-00001-of-000163.safetensors
  Time (mean ± σ):      2.423 s ±  0.039 s    [User: 0.997 s, System: 1.424 s]
  Range (min … max):    2.393 s …  2.524 s    10 runs
 
Benchmark 2: ./coreutils_main_branch base64 model-00001-of-000163.safetensors
  Time (mean ± σ):      4.111 s ±  0.035 s    [User: 1.172 s, System: 2.937 s]
  Range (min … max):    4.052 s …  4.158 s    10 runs
 
Benchmark 3: base64 model-00001-of-000163.safetensors
  Time (mean ± σ):      4.000 s ±  0.016 s    [User: 3.054 s, System: 0.941 s]
  Range (min … max):    3.976 s …  4.033 s    10 runs
 
Summary
  ./coreutils base64 model-00001-of-000163.safetensors ran
    1.65 ± 0.03 times faster than base64 model-00001-of-000163.safetensors
    1.70 ± 0.03 times faster than ./coreutils_main_branch base64 model-00001-of-000163.safetensors

For decoding:

Benchmark 1: ./coreutils base64 -d base64.txt
  Time (mean ± σ):      9.442 s ±  0.060 s    [User: 7.622 s, System: 1.814 s]
  Range (min … max):    9.373 s …  9.580 s    10 runs
 
Benchmark 2: ./coreutils_main_branch base64 -d base64.txt
  Time (mean ± σ):      9.504 s ±  0.201 s    [User: 5.766 s, System: 3.727 s]
  Range (min … max):    9.309 s …  9.882 s    10 runs
 
Benchmark 3: base64 -d base64.txt
  Time (mean ± σ):      8.362 s ±  0.140 s    [User: 6.750 s, System: 1.605 s]
  Range (min … max):    8.155 s …  8.527 s    10 runs
 
Summary
  base64 -d base64.txt ran
    1.13 ± 0.02 times faster than ./coreutils base64 -d base64.txt
    1.14 ± 0.03 times faster than ./coreutils_main_branch base64 -d base64.txt

For memory consumption, I used ps and grep on the three implementations while they were working on the same file, and I put the three values next to each other for comparison. I report the entire line, since it contains no sensitive information.

This approach is feasible because the memory footprint remains stable during program execution: after the file is loaded or the memory is allocated, no further large allocations take place, except possibly inside the fast encoder/decoder of the base64_simd crate. That allocation shows up in the flamegraph (I used the flamegraph tool, which also generates an SVG to explore); the image is at the end of this PR.

For encoding:

andrea    167746  100  0.0  15880  6616 pts/6    R+   10:08   0:01 ./coreutils base64 model-00001-of-000163.safetensors
andrea    168813  102  6.1 5127348 1894336 pts/6 R+   10:10   0:00 ./coreutils_main_branch base64 model-00001-of-000163.safetensors
andrea    169415  100  0.0   8392  2272 pts/6    R+   10:11   0:02 base64 model-00001-of-000163.safetensors

For decoding:

andrea    164864  100  0.0  15876  6288 pts/6    R+   10:01   0:01 ./coreutils base64 -d base64.txt
andrea    165735  125  0.7 6920844 233384 pts/6  R+   10:03   0:00 ./coreutils_main_branch base64 -d base64.txt
andrea    166374  100  0.0   8388  2208 pts/6    R+   10:05   0:03 base64 -d base64.txt

The issue we still have is that memory usage is double that of the GNU Coreutils implementation, but it no longer grows with the size of the file.

Malloc inside base64_simd:
[image: flamegraph]

@ThePseudo ThePseudo marked this pull request as ready for review September 12, 2025 08:17

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@sylvestre
Contributor

Could you please share your example file? I don't get the same results

@ThePseudo
Author

ThePseudo commented Sep 12, 2025

Uhm, it is almost 5 GB... maybe I can try with a smaller one? What do you suggest?

Never mind, I found it online again. It is one of the DeepSeek models, which are available here: https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main

Probably a good option is selecting this one: https://huggingface.co/deepseek-ai/DeepSeek-V3/resolve/main/model-00001-of-000163.safetensors?download=true

It is roughly the same size

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from 8f5d7b1 to c38288b Compare September 15, 2025 07:18

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@ThePseudo
Author

I re-ran the tests with the file linked above:

For encoding:

Benchmark 1: ./coreutils base64 model-00001-of-000163.safetensors
  Time (mean ± σ):      2.152 s ±  0.066 s    [User: 0.952 s, System: 1.199 s]
  Range (min … max):    2.092 s …  2.301 s    10 runs
 
Benchmark 2: ./coreutils_main_branch base64 model-00001-of-000163.safetensors
  Time (mean ± σ):      3.759 s ±  0.119 s    [User: 1.140 s, System: 2.619 s]
  Range (min … max):    3.616 s …  3.976 s    10 runs
 
Benchmark 3: base64 model-00001-of-000163.safetensors
  Time (mean ± σ):      3.723 s ±  0.032 s    [User: 3.044 s, System: 0.679 s]
  Range (min … max):    3.687 s …  3.783 s    10 runs
 
Summary
  ./coreutils base64 model-00001-of-000163.safetensors ran
    1.73 ± 0.05 times faster than base64 model-00001-of-000163.safetensors
    1.75 ± 0.08 times faster than ./coreutils_main_branch base64 model-00001-of-000163.safetensors

For decoding:

Benchmark 1: ./coreutils base64 -d base64.txt
  Time (mean ± σ):      9.167 s ±  0.101 s    [User: 7.637 s, System: 1.499 s]
  Range (min … max):    9.063 s …  9.347 s    10 runs
 
Benchmark 2: ./coreutils_main_branch base64 -d base64.txt
  Time (mean ± σ):      9.329 s ±  0.020 s    [User: 5.620 s, System: 3.669 s]
  Range (min … max):    9.301 s …  9.380 s    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 3: base64 -d base64.txt
  Time (mean ± σ):      8.038 s ±  0.037 s    [User: 6.471 s, System: 1.536 s]
  Range (min … max):    7.991 s …  8.104 s    10 runs
 
Summary
  base64 -d base64.txt ran
    1.14 ± 0.01 times faster than ./coreutils base64 -d base64.txt
    1.16 ± 0.01 times faster than ./coreutils_main_branch base64 -d base64.txt

The system is also under some load, so it might be slower than usual, but the results stay more or less consistent with what I reported before. Please let me know if you see any difference.

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from c38288b to 8e0969e Compare September 16, 2025 06:07

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from 8e0969e to 44147d1 Compare September 17, 2025 07:11

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from 44147d1 to 05d7d9f Compare September 17, 2025 09:24

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from 05d7d9f to 1854b91 Compare September 18, 2025 07:27

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)


GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)


codspeed-hq bot commented Sep 20, 2025

CodSpeed Performance Report

Merging #8622 will not alter performance

Comparing ThePseudo:streamline_b64_decode (76cb7e6) with main (0258583)

Summary

✅ 106 untouched
⏩ 73 skipped [1]

Footnotes

  1. 73 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.


GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)


GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch 2 times, most recently from bcd2ec4 to 1854b91 Compare September 22, 2025 08:01

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch 2 times, most recently from 23bc39f to 1bc46e6 Compare September 22, 2025 12:35

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)


GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)


GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/tail/overlay-headers is no longer failing!

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from e9882d5 to f60b1b9 Compare September 24, 2025 13:36

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)

@ThePseudo
Author

@Nekrolm do you think it is ok to be merged?

@Nekrolm
Contributor

Nekrolm commented Sep 26, 2025

I'm ok with these changes. But I'm not a maintainer

@ThePseudo
Author

@sylvestre then, what do you think?

@aduskett

Any news on getting this merged? It looks great!

@sylvestre
Contributor

I would like to see a benchmark integrated in the repo before this is merged.

Is someone interested in doing that (in a different PR)?
We have examples here:
ls -d src/uu/*/benches

@ThePseudo
Author

I could do it in a different PR!

@sylvestre sylvestre force-pushed the streamline_b64_decode branch from f60b1b9 to 4e08bb6 Compare September 30, 2025 09:57

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from 4e08bb6 to c91947a Compare October 1, 2025 06:15
@ThePseudo
Author

@sylvestre I guess now it should work, hope it looks good! :D

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from c91947a to 88e3b2b Compare October 13, 2025 06:57

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from 88e3b2b to 9bdac31 Compare October 13, 2025 11:38

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)

@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from 9bdac31 to 87978e0 Compare October 16, 2025 06:16

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)

Andrea Calabrese added 3 commits October 17, 2025 11:44
This should remove the dependency we have on knowing whether the final
message has padding or not. This is the first step towards not loading
the entire message ahead of time to encode/decode it, and towards
allowing streaming.

Signed-off-by: Andrea Calabrese <[email protected]>

As per title, this is the main feature of this patch set. First, by
avoiding looking for the final padding, we gain the ability to read
streamed data before the stream has finished producing it. This also
enables the tool to work with much less memory, essentially
making it a fixed amount instead of depending on the file size.

Signed-off-by: Andrea Calabrese <[email protected]>

We read linearly, so we do not need to seek within a file.

Signed-off-by: Andrea Calabrese <[email protected]>
@ThePseudo ThePseudo force-pushed the streamline_b64_decode branch from 87978e0 to 76cb7e6 Compare October 17, 2025 09:45

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/tail/overlay-headers (passes in this run but fails in the 'main' branch)
