poc: what are bytes sequences #4305

Open · wants to merge 1 commit into main

Conversation

@Uzlopak (Contributor) commented Jun 27, 2025

Regarding #4295, where @the-sun-will-rise-tomorrow reported TextDecoder hitting its 512 MB limit, I want to discuss the approach in this PR.

In the spec, I actually can't find a strict definition of what a byte sequence is.

It just says:

A byte sequence is a sequence of bytes, represented as a space-separated sequence of bytes. Byte sequences with bytes in the range 0x20 (SP) to 0x7E (~), inclusive, can alternately be written as a string, but using backticks instead of quotation marks, to avoid confusion with an actual string.

https://infra.spec.whatwg.org/#byte-sequence

So maybe we have some wiggle room: we can define a byte sequence that matches that definition but use our own internal representation?

So instead of always passing around one big Uint8Array, we just pass around a ByteSequence object and, only when needed, transform it into a target representation.
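
A minimal sketch of what I have in mind (the ByteSequence name and its methods are just illustrative, not the actual code in this PR):

// Illustrative sketch only: collect chunks as-is and materialize them
// into a single Uint8Array or a string only on demand.
class ByteSequence {
  #chunks = []
  #byteLength = 0

  append (chunk) {
    this.#chunks.push(chunk)
    this.#byteLength += chunk.byteLength
  }

  get byteLength () {
    return this.#byteLength
  }

  // One big allocation only when a caller really needs contiguous bytes.
  toUint8Array () {
    const out = new Uint8Array(this.#byteLength)
    let offset = 0
    for (const chunk of this.#chunks) {
      out.set(chunk, offset)
      offset += chunk.byteLength
    }
    return out
  }

  // Decode chunk by chunk so TextDecoder never sees the whole payload at once.
  toText (decoder = new TextDecoder()) {
    let text = ''
    for (const chunk of this.#chunks) {
      text += decoder.decode(chunk, { stream: true })
    }
    return text + decoder.decode() // flush any buffered partial character
  }
}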

As @mcollina pointed out, the maximum length of a string in V8 can be 1 GB on x64. TextDecoder, however, seems to be limited to 512 MB rather than 1 GB. So we have a somewhat arbitrary restriction, even though strings could be 512 MB longer?

I did not test it with a 1 GB string, but by not forcing TextDecoder to process one huge Uint8Array it should work in that case as well. I expect the chunks to have the size of the specified highWaterMark, so I don't check whether a chunk is above the limit and split it on the fly. The only drawback is that the accumulated string keeps growing.
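
The chunk-wise decoding relies on TextDecoder's streaming mode: with { stream: true } an incomplete multi-byte character at a chunk boundary is buffered and completed by the next decode() call. A small self-contained example (the chunk contents are made up for illustration):

const decoder = new TextDecoder('utf-8')

// '€' is 0xE2 0x82 0xAC in UTF-8; split it across two chunks.
const chunk1 = new Uint8Array([0x61, 0xE2, 0x82]) // 'a' + first two bytes of '€'
const chunk2 = new Uint8Array([0xAC, 0x62])       // last byte of '€' + 'b'

let text = ''
text += decoder.decode(chunk1, { stream: true }) // 'a' (partial '€' is buffered)
text += decoder.decode(chunk2, { stream: true }) // '€b'
text += decoder.decode()                         // flush; empty here

console.log(text) // 'a€b'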

Well, this is just a PoC. Maybe you will say it is useless, or that it is an anti-pattern to load such huge payloads anyway, but we should at least discuss it.

I just wanted to get the tests to pass as much as possible. Of course there are still optimizations possible, and I may have overlooked some edge-case bugs.

E.g. a byte sequence object with only one element doesn't need to go through concat. And if byteLength is smaller than, say, 10 MB, we can just use the old logic, reuse a pre-instantiated TextDecoder instance and concat everything instead of decoding chunk-wise (see the sketch below).
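
Something along these lines, building on the sketch above (the 10 MB cutoff and the helper name are arbitrary, just to illustrate the fast path):

const SMALL_BODY_LIMIT = 10 * 1024 * 1024 // arbitrary 10 MB cutoff
const sharedDecoder = new TextDecoder('utf-8')

function byteSequenceToText (sequence) {
  // Small payload: keep the old logic, concat once and decode in a
  // single call with a reused decoder instance.
  if (sequence.byteLength < SMALL_BODY_LIMIT) {
    return sharedDecoder.decode(sequence.toUint8Array())
  }
  // Large payload: decode chunk-wise to stay below the TextDecoder limit.
  return sequence.toText()
}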

And of course this doesn't solve the limitation that we don't parse JSON in a streaming fashion, since we need the whole payload to pass it to JSON.parse.

Now I go to lunch ;)

@KhafraDev (Member) commented:

So we have maybe some wiggle room and can define byte sequence matching that definition but have our own internal representation?

Yes, this is something I struggled with before landing on either an array of bytes or a TypedArray (this could be anything holding a sequence of bytes - even a Blob in my mind).

the maximum length of a string in v8 can be 1 GB in x64 architecture. There seems to be a limitation in TextDecoder of only 512MB being allowed and not 1GB. So we have actually some kind of arbitrary restriction, despite that we could have 512 MB longer strings?

The limit is buffer.constants.MAX_STRING_LENGTH. This will not solve the problem.

import { constants } from 'node:buffer'

'a'.repeat(constants.MAX_STRING_LENGTH + 1) // RangeError: Invalid string length
