poc: what are byte sequences #4305
Open
Regarding #4295, the issue reported by @the-sun-will-rise-tomorrow about TextDecoder hitting its 512 MB limit, I want to discuss the approach in this PR.
In the spec, I actually can't find a strict definition of what a byte sequence is. It just says:
https://infra.spec.whatwg.org/#byte-sequence
So maybe we have some wiggle room: we can define a byte sequence that matches that definition but use our own internal representation?
Instead of always passing around one big Uint8Array, we would pass around a Byte Sequence object and, only where needed, transform it into the target representation.
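A minimal sketch of what I mean (the class and property names are illustrative, not what this PR actually ships): the object just collects the incoming Uint8Array chunks and only materializes a contiguous buffer when a consumer really needs one.

```js
// Illustrative sketch of a Byte Sequence wrapper; names are hypothetical.
class ByteSequence {
  constructor () {
    this.chunks = []     // Uint8Array chunks in arrival order
    this.byteLength = 0  // total length across all chunks
  }

  push (chunk) {
    this.chunks.push(chunk)
    this.byteLength += chunk.byteLength
  }

  // Materialize a single contiguous Uint8Array only when a consumer asks for it.
  toUint8Array () {
    const out = new Uint8Array(this.byteLength)
    let offset = 0
    for (const chunk of this.chunks) {
      out.set(chunk, offset)
      offset += chunk.byteLength
    }
    return out
  }
}
```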
As @mcollina pointed out, the maximum length of a string in V8 can be 1 GB on x64 architectures. TextDecoder, however, seems to be limited to 512 MB rather than 1 GB. So we currently have a somewhat arbitrary restriction, even though strings could be another 512 MB longer?
I did not test it with a 1 GB string, but by not forcing TextDecoder to process one huge Uint8Array, it should also work in that case. I expect the chunks to be sized according to the specified highWaterMark, so I don't check whether a chunk exceeds the limit and split it on the fly. The only drawback is that the resulting string keeps growing bigger and bigger.
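Roughly, the chunk-wise decoding looks like this (just a sketch, assuming the chunks arrive as Uint8Arrays): TextDecoder is fed one chunk at a time with `{ stream: true }`, so it never has to swallow the whole payload at once.

```js
// Sketch of chunk-wise decoding; chunk sizes simply follow whatever the
// source stream emits (e.g. the configured highWaterMark).
function decodeChunks (chunks) {
  const decoder = new TextDecoder('utf-8')
  let result = ''
  for (const chunk of chunks) {
    // stream: true keeps incomplete multi-byte sequences buffered between calls
    result += decoder.decode(chunk, { stream: true })
  }
  // Final call without stream flushes any remaining buffered bytes.
  return result + decoder.decode()
}
```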
Well, this is just a PoC. Maybe you'll say it is useless, or that loading such huge payloads is an anti-pattern anyway? But we should at least discuss it.
I just wanted to get as many tests passing as possible. Of course there are still optimizations possible, and I may have overlooked some edge-case bugs.
E.g. a Byte Sequence object with only one element doesn't need to go through concat at all. And if bytesLength is smaller than, say, 10 MB, we can just use the old logic: reuse a pre-instantiated TextDecoder instance and concat everything, instead of decoding chunk-wise.
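Building on the sketches above, such a fast path could look something like this (the 10 MB threshold and all names here are made up for illustration):

```js
const SMALL_BODY_LIMIT = 10 * 1024 * 1024 // ~10 MB, arbitrary cut-off
const sharedDecoder = new TextDecoder('utf-8') // pre-instantiated, reused

// Uses the ByteSequence and decodeChunks sketches from above.
function decodeBody (byteSequence) {
  if (byteSequence.chunks.length === 1) {
    // Single chunk: no concat needed at all.
    return sharedDecoder.decode(byteSequence.chunks[0])
  }
  if (byteSequence.byteLength < SMALL_BODY_LIMIT) {
    // Small body: old logic, concat everything and decode in one go.
    return sharedDecoder.decode(byteSequence.toUint8Array())
  }
  // Large body: decode chunk by chunk to stay below TextDecoder's limit.
  return decodeChunks(byteSequence.chunks)
}
```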
And of course this doesn't solve the limitation that we don't process JSON in a streaming fashion, as we need the whole payload to pass it to JSON.
Now I'm off to lunch ;)