Streaming base64 encode/decode #8622
Conversation
Could you please share your example file? I don't get the same results.
Uhm, it is almost 5 GB in size... maybe I can try with a smaller one? What do you suggest? Never mind, I found it back online: it is one of the models for DeepSeek, available here: https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main. Probably a good option is selecting this one: https://huggingface.co/deepseek-ai/DeepSeek-V3/resolve/main/model-00001-of-000163.safetensors?download=true; it is roughly the same size.
Force-pushed from 8f5d7b1 to c38288b
I re-ran the tests with the file linked above: For encoding:
For decoding:
The system is also under some load, so it might be slower than usual, but more or less the results stay consistent with what was reported before. Please let me know if there is any difference.
Force-pushed from c38288b to 8e0969e
Force-pushed from 8e0969e to 44147d1
Force-pushed from 44147d1 to 05d7d9f
Force-pushed from 05d7d9f to 1854b91
CodSpeed Performance Report: merging #8622 will not alter performance.
Force-pushed from bcd2ec4 to 1854b91
Force-pushed from 23bc39f to 1bc46e6
Force-pushed from e9882d5 to f60b1b9
@Nekrolm do you think it is OK to be merged?
I'm OK with these changes. But I'm not a maintainer.
@sylvestre then, what do you think?
Any news on getting this merged? It looks great!
I would like to see benchmarks integrated in the repo before it is merged. Is someone interested in doing that? (In a different PR.)
I could do it in a different PR!
Force-pushed from f60b1b9 to 4e08bb6
Force-pushed from 4e08bb6 to c91947a
@sylvestre I guess it should work now, hope it looks good! :D
Force-pushed from c91947a to 88e3b2b
Force-pushed from 88e3b2b to 9bdac31
Force-pushed from 9bdac31 to 87978e0
This should remove the dependency we have on knowing whether the final message has padding or not. This is the first step toward removing the ahead-of-time loading of the entire message to encode/decode, and toward allowing streaming. Signed-off-by: Andrea Calabrese <[email protected]>
As per the title, this is the main feature of this patch set. By avoiding looking for the final padding, we gain the ability to read data as it streams in, before the stream has finished producing it. This also lets the tool work with much less memory, essentially a fixed amount instead of one depending on the file size. Signed-off-by: Andrea Calabrese <[email protected]>
We read linearly, so we do not need to seek within a file. Signed-off-by: Andrea Calabrese <[email protected]>
Force-pushed from 87978e0 to 76cb7e6
On the main branch, the encode and decode operations look at the file ahead of time to gather information about padding. However, padding only appears at the end, and the rest of the file can be encoded and decoded disregarding the padding.
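As a quick illustration of that property, here is a minimal sketch using the generic `base64` crate (not the `base64_simd` crate this PR actually uses):

```rust
use base64::engine::general_purpose::STANDARD;
use base64::Engine;

fn main() {
    // Padding ("=") only shows up in the final quantum, and only when the
    // input length is not a multiple of 3.
    assert_eq!(STANDARD.encode("abc"), "YWJj");      // 3 bytes: no padding
    assert_eq!(STANDARD.encode("abcd"), "YWJjZA=="); // 4 bytes: padded tail
}
```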
The main issue with the file being read ahead of time is that we need the entire file to be available from the beginning. This is in contrast with use cases that stream data: imagine a web socket where the sender sends base64-encoded data; the receiver could only translate it at the end, making real-time communication impossible.
Moreover, reading the entire file from the beginning means that it needs to stay in RAM the whole time. For smaller files that is not a problem, but when encoding a few gigabytes of data to base64 it can be an issue, as it could easily saturate main memory when reading the file.
This patch aims to solve the issue of the ahead-of-time reading. First, we do not check for padding, but let the decoder work for us: as said earlier, most of the encoded file has no padding, and there is a 1/3 probability that there is no padding at the end at all. The STANDARD_NO_PAD base64 decoder used produces an error if padding is present; if so, we fall back to the STANDARD base64 decoder. This is how the ahead-of-time padding problem is solved.
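A minimal sketch of that fallback, written here against the generic `base64` crate engines for illustration (the PR itself uses the `base64_simd` decoders, and the helper name below is hypothetical):

```rust
use base64::engine::general_purpose::{STANDARD, STANDARD_NO_PAD};
use base64::{DecodeError, Engine};

// Hypothetical helper: assume `chunk` is a whole number of 4-character
// base64 quanta (newlines already stripped). Try the no-padding decoder
// first; if the chunk turns out to be the padded final one, retry with
// the standard (padded) decoder.
fn decode_chunk(chunk: &[u8]) -> Result<Vec<u8>, DecodeError> {
    match STANDARD_NO_PAD.decode(chunk) {
        Ok(decoded) => Ok(decoded),
        Err(_) => STANDARD.decode(chunk),
    }
}
```

Since at most the last chunk of a valid stream can carry padding, the fallback path fires at most once per input.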
Also, please notice that the encoder does not need any ahead-of-time knowledge of padding, since it is the encoder itself that generates it.
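For completeness, here is a rough streaming-encode sketch (not the PR's code, and again using the generic `base64` crate): as long as every non-final chunk read from the input is a multiple of 3 bytes, each chunk can be encoded independently and padding can only appear on the last one.

```rust
use base64::engine::general_purpose::STANDARD;
use base64::Engine;
use std::io::{self, Read, Write};

// Hypothetical chunk size: a multiple of 3 so that intermediate chunks
// never produce padding.
const CHUNK: usize = 3 * 1024;

fn encode_stream<R: Read, W: Write>(mut input: R, mut output: W) -> io::Result<()> {
    let mut buf = vec![0u8; CHUNK];
    loop {
        // Fill the buffer completely unless the stream ends first.
        let mut filled = 0;
        while filled < buf.len() {
            let n = input.read(&mut buf[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }
        if filled == 0 {
            return Ok(()); // nothing left to encode
        }
        // Padding is generated here only if `filled % 3 != 0`, which can
        // only happen on the final chunk.
        output.write_all(STANDARD.encode(&buf[..filled]).as_bytes())?;
        if filled < buf.len() {
            return Ok(()); // reached end of input mid-chunk
        }
    }
}
```

Memory stays bounded by the chunk buffer plus its encoded output, independent of the input size.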
For the benchmarking:

- `coreutils base64` refers to this PR version
- `coreutils_main_branch base64` refers to the version that is on the main branch
- `base64` refers to GNU Coreutils base64

As this is partially also a performance-related patch, I will paste the hyperfine analysis:
For encoding:
For decoding:
For memory consumption, I used ps and grep on the 3 implementation variants working on the same file to gather the memory used, and I will put the 3 values near each other to compare. I will report the entire line, since it has no sensitive information for me.
This approach is feasible because the memory footprint remains stable during program execution: after the file is loaded and memory is allocated, no more large allocations take place (except, maybe, inside the fast encoder/decoder in the base64_simd crate, as shown by the flamegraph tool; I used `flamegraph`, which also generates an SVG to explore; image at the end of this PR).

For encoding:
For decoding:
The issue we still have is that memory usage is double that of the GNU Coreutils implementation, but it also does not increase with the size of the file.
Malloc inside base64_simd:
