Allow urlencoded data URLs #467


Draft: wants to merge 1 commit into master
Conversation

@TobiX TobiX commented May 2, 2025

This automatically selects urlencoded data URLs if that results in smaller output than base64-encoding them.

This is a kinda stupid idea I had when I saw the size of some really large dumps (for example, from comic pages on tapas.io)...

It seems to work... In a sample page from tapas.io, it reduces the final size from 213,233,452 bytes to 170,968,176 bytes, which is about 20% smaller.

The characters which are percent-encoded come from https://datatracker.ietf.org/doc/html/rfc3986#section-2.2, since that is referenced from https://developer.mozilla.org/en-US/docs/Web/URI/Reference/Schemes/data; I found no concrete list of characters which should be escaped in the data URL RFC itself (https://www.rfc-editor.org/rfc/rfc2397). Additionally, we escape % so that nested data: URLs (PNGs in CSS, anyone?) keep working, and " so that we don't accidentally close the quotes surrounding the data URL.
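To make the escape set concrete, here is a minimal std-only sketch of the idea: escape the RFC 3986 reserved characters plus % (so nested data: URLs survive) and " (so the surrounding quotes aren't closed early), along with controls, space, and non-ASCII bytes. The function name and exact set are illustrative, not Monolith's actual implementation (which uses the percent-encoding crate).

```rust
// Sketch only: percent-encode bytes for embedding in a data: URL.
// Escapes RFC 3986 gen-delims and sub-delims, plus '%', '"', space,
// control bytes, and non-ASCII bytes.
fn percent_encode_data(data: &[u8]) -> String {
    const RESERVED: &[u8] = b":/?#[]@!$&'()*+,;=%\" ";
    let mut out = String::with_capacity(data.len());
    for &b in data {
        if b < 0x20 || b > 0x7e || RESERVED.contains(&b) {
            // Expand escaped bytes to a three-character %XX sequence.
            out.push_str(&format!("%{:02X}", b));
        } else {
            out.push(b as char);
        }
    }
    out
}

fn main() {
    assert_eq!(percent_encode_data(b"a b"), "a%20b");
    assert_eq!(percent_encode_data(b"100%"), "100%25");
    println!("{}", percent_encode_data(b"body{color:red}"));
}
```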

I'm not sure if the encoding is correct for exotic (non-UTF-8) charsets... Please advise if I should add more tests covering such scenarios.

(PS: Feel free to close this PR if this all sounds too stupid/mad)

snshn (Member) commented May 5, 2025

Hello Tobias,

Thank you very much for this PR!

Not a stupid idea at all; base64 unnecessarily bloats plaintext, producing results roughly 30% longer on average than URL encoding.
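The overhead mentioned above follows directly from how base64 works: every 3 input bytes become 4 output characters (padded to a multiple of 4), a fixed ~33% expansion regardless of content. A tiny arithmetic sketch:

```rust
// Base64 output length for n input bytes: 4 characters per 3-byte
// group, with the final partial group padded out to 4 characters.
fn base64_len(n: usize) -> usize {
    4 * ((n + 2) / 3)
}

fn main() {
    assert_eq!(base64_len(3), 4);     // 3 bytes -> 4 chars (+33%)
    assert_eq!(base64_len(300), 400); // overhead stays fixed at scale
    // Percent-encoding, by contrast, keeps unreserved bytes at 1
    // character each and expands only escaped bytes to 3, so mostly
    // unreserved text stays close to its original size.
}
```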

Originally I created https://github.com/Y2Z/dataurl to move the code that parses and creates data URLs out of Monolith, and to make dataurl available as both a crate and a CLI tool. It's somewhere in my backlog to switch to using it, along with using base64 only for binary data.

I'll review your PR briefly and get back to you.

let base64 = BASE64_STANDARD.encode(data);
let urlenc = percent_encode(data, DATA_ESC).to_string();

if urlenc.len() < base64.len() {
snshn (Member) commented on this snippet:

I like your logic, but I worry that computing both base64 and percent-encoding for every single asset will cost both more CPU and RAM, and I don't see any benefit of using base64 for plaintext data anyway, even if it somehow manages to be a few bytes shorter.

I think the best way to go here is to default to percent_encode for plaintext data, and use base64 for non-printable data (fonts, non-SVG images, etc). There's a data type detector somewhere in this codebase, I think it's called "is_plaintext()"; that should be enough for this function to decide whether it needs base64 or not.

I also believe it's not necessarily about file size; it might be more about how much CPU time it takes to decode the data URL into a blob, and something tells me base64 takes more than percent-encoding, but I might be wrong. Last but not least, it's priceless for humans to be able to see what's in a data URL without having to decode it, so percent-encoding should be preferable here, not just because of the shorter output.
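The suggestion above could be sketched roughly like this. The is_plaintext() stand-in and the media types it checks are hypothetical here, used only to show the shape of a type-based decision instead of an encode-both-and-compare one:

```rust
// Hypothetical stand-in for Monolith's plaintext detector: decide by
// media type prefix plus a few known text-based formats.
fn is_plaintext(media_type: &str) -> bool {
    media_type.starts_with("text/")
        || media_type == "image/svg+xml"
        || media_type == "application/json"
        || media_type == "application/javascript"
}

// Pick the data URL encoding by media type, not by output size:
// percent-encoding stays human-readable and avoids encoding every
// asset twice; base64 handles binary payloads.
fn choose_encoding(media_type: &str) -> &'static str {
    if is_plaintext(media_type) {
        "percent-encode"
    } else {
        "base64" // fonts, raster images, other binary data
    }
}

fn main() {
    assert_eq!(choose_encoding("text/css"), "percent-encode");
    assert_eq!(choose_encoding("font/woff2"), "base64");
}
```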
