poc: what are bytes sequences #4305

Open · wants to merge 1 commit into main

Conversation

@Uzlopak (Contributor) commented Jun 27, 2025

Regarding #4295, where @the-sun-will-rise-tomorrow reported TextDecoder hitting its 512 MB limit, I want to discuss the approach in this PR.

In the spec, I actually can't find a strict definition of what a byte sequence is.

It just says:

A byte sequence is a sequence of bytes, represented as a space-separated sequence of bytes. Byte sequences with bytes in the range 0x20 (SP) to 0x7E (~), inclusive, can alternately be written as a string, but using backticks instead of quotation marks, to avoid confusion with an actual string.

https://infra.spec.whatwg.org/#byte-sequence

So maybe we have some wiggle room: we can define a byte sequence that matches that definition but use our own internal representation?

So instead of always passing around one big Uint8Array, we just pass around a ByteSequence object and, only when needed, transform it into a target representation.
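
A minimal sketch of what I have in mind (the ByteSequence name and its methods are just illustrative, not the actual code in this PR):

// Illustrative sketch only: collect chunks as-is and materialize them
// into a single Uint8Array or a string only on demand.
class ByteSequence {
  #chunks = []
  #byteLength = 0

  append (chunk) {
    this.#chunks.push(chunk)
    this.#byteLength += chunk.byteLength
  }

  get byteLength () {
    return this.#byteLength
  }

  // One big allocation only when a caller really needs contiguous bytes.
  toUint8Array () {
    const out = new Uint8Array(this.#byteLength)
    let offset = 0
    for (const chunk of this.#chunks) {
      out.set(chunk, offset)
      offset += chunk.byteLength
    }
    return out
  }

  // Decode chunk by chunk so TextDecoder never sees the whole payload at once.
  toText (decoder = new TextDecoder()) {
    let text = ''
    for (const chunk of this.#chunks) {
      text += decoder.decode(chunk, { stream: true })
    }
    return text + decoder.decode() // flush any buffered partial character
  }
}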

As @mcollina pointed out, the maximum length of a string in V8 can be 1 GB on x64. TextDecoder, however, seems to be limited to 512 MB rather than 1 GB. So we have a somewhat arbitrary restriction, even though strings could be 512 MB longer?

I did not test it with a 1 GB string, but by not forcing TextDecoder to process one huge Uint8Array it should work in that case as well. I expect the chunks to have the size of the specified highWaterMark, so I don't check whether a chunk is above the limit and split it on the fly. The only drawback is that the accumulated string keeps growing.
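
The chunk-wise decoding relies on TextDecoder's streaming mode: with { stream: true } an incomplete multi-byte character at a chunk boundary is buffered and completed by the next decode() call. A small self-contained example (the chunk contents are made up for illustration):

const decoder = new TextDecoder('utf-8')

// '€' is 0xE2 0x82 0xAC in UTF-8; split it across two chunks.
const chunk1 = new Uint8Array([0x61, 0xE2, 0x82]) // 'a' + first two bytes of '€'
const chunk2 = new Uint8Array([0xAC, 0x62])       // last byte of '€' + 'b'

let text = ''
text += decoder.decode(chunk1, { stream: true }) // 'a' (partial '€' is buffered)
text += decoder.decode(chunk2, { stream: true }) // '€b'
text += decoder.decode()                         // flush; empty here

console.log(text) // 'a€b'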

Well, this is just a PoC. Maybe you will say it is useless, or that it is an anti-pattern to load such huge payloads anyway, but we should at least discuss it.

I just wanted to get the tests to pass as much as possible. Of course there are still optimizations possible, and I may have overlooked some edge-case bugs.

E.g. a byte sequence object with only one element doesn't need to go through concat. And if byteLength is smaller than, say, 10 MB, we can just use the old logic, reuse a pre-instantiated TextDecoder instance and concat everything instead of decoding chunk-wise (see the sketch below).
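
Something along these lines, building on the sketch above (the 10 MB cutoff and the helper name are arbitrary, just to illustrate the fast path):

const SMALL_BODY_LIMIT = 10 * 1024 * 1024 // arbitrary 10 MB cutoff
const sharedDecoder = new TextDecoder('utf-8')

function byteSequenceToText (sequence) {
  // Small payload: keep the old logic, concat once and decode in a
  // single call with a reused decoder instance.
  if (sequence.byteLength < SMALL_BODY_LIMIT) {
    return sharedDecoder.decode(sequence.toUint8Array())
  }
  // Large payload: decode chunk-wise to stay below the TextDecoder limit.
  return sequence.toText()
}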

And of course this doesn't solve the limitation that we don't parse JSON in a streaming fashion, since we need the whole payload to pass it to JSON.parse.

Now I go to lunch ;)

@KhafraDev (Member) commented:

So we have maybe some wiggle room and can define byte sequence matching that definition but have our own internal representation?

Yes, this is something I struggled with before landing on either an array of bytes or a TypedArray (this could be anything holding a sequence of bytes - even a Blob in my mind).

the maximum length of a string in v8 can be 1 GB in x64 architecture. There seems to be a limitation in TextDecoder of only 512MB being allowed and not 1GB. So we have actually some kind of arbitrary restriction, despite that we could have 512 MB longer strings?

The limit is buffer.constants.MAX_STRING_LENGTH. This will not solve the problem.

import { constants } from 'node:buffer'

'a'.repeat(constants.MAX_STRING_LENGTH + 1) // RangeError: Invalid string length
