
Conversation

@Rob-Hague
Contributor

closes #2088

This improves the ByteCount logic when consuming surrogate characters by using an Encoder, which maintains state across calls. Previously, the Encoding would return the byte count after replacing each lone surrogate character with U+FFFD REPLACEMENT CHARACTER, because a surrogate character is invalid on its own.
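
A standalone sketch of the System.Text behaviour involved (not CsvHelper's actual code, just the difference between stateless and stateful counting):

```csharp
using System;
using System.Text;

class SurrogateByteCount
{
    static void Main()
    {
        // U+1F617 is represented in UTF-16 as the surrogate pair \uD83D \uDE17.
        char high = '\uD83D';
        char low = '\uDE17';

        // Stateless counting, one char at a time: each lone surrogate is invalid,
        // so the default fallback counts it as U+FFFD (3 UTF-8 bytes) -> 3 + 3 = 6.
        int stateless = Encoding.UTF8.GetByteCount(new[] { high })
                      + Encoding.UTF8.GetByteCount(new[] { low });

        // Stateful counting: the Encoder buffers the trailing high surrogate across
        // calls (flush: false), and the completed pair encodes to 4 UTF-8 bytes.
        Encoder encoder = Encoding.UTF8.GetEncoder();
        byte[] buffer = new byte[8];
        int stateful = encoder.GetBytes(new[] { high }, 0, 1, buffer, 0, flush: false)  // 0 bytes
                     + encoder.GetBytes(new[] { low }, 0, 1, buffer, 0, flush: false);  // 4 bytes

        Console.WriteLine($"{stateless} vs {stateful}"); // 6 vs 4
    }
}
```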

For example, when reading the UTF-16 surrogate pair \uD83D\uDE17, corresponding to the Unicode character U+1F617, the (UTF-8) ByteCount would be 6 (the byte count of the sequence \uFFFD\uFFFD) instead of the correct value of 4 UTF-8 bytes.

An encoder with a custom EncoderReplacementFallback could easily require a larger buffer, for example if its replacement string is "{LONG REPLACEMENT STRING}", so an upper bound of 16 bytes is not correct. Note that in this case CsvParser.ByteCount would be nonsensical for input requiring the fallback, and configuration.Encoding probably does not match the actual encoding, e.g. we are reading from a UTF-16 byte stream containing non-ASCII characters and have set configuration.Encoding to ASCII.
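
A standalone sketch of that fallback scenario (not CsvHelper code; the replacement string and the ASCII mismatch are just the examples above):

```csharp
using System;
using System.Text;

class FallbackByteCount
{
    static void Main()
    {
        // ASCII encoding with a custom replacement fallback, as described above.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback("{LONG REPLACEMENT STRING}"),
            DecoderFallback.ReplacementFallback);

        // 'é' cannot be encoded as ASCII, so the fallback substitutes the
        // 25-character replacement string: 25 bytes for a single input char,
        // already past the 16-byte upper bound mentioned above.
        Console.WriteLine(ascii.GetByteCount("é")); // 25
    }
}
```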

Nevertheless the safeguard is more sensible to have than not.
@JoshClose
Owner

I'm going to wait on this for now. I believe the SIMD code will completely change how this works. I will be counting blocks of bytes at a time instead of single characters.

# Conflicts:
#	src/CsvHelper/CsvParser.cs
@JoshClose
Owner

I'm thinking of leaving counting bytes and chars up to the user. I believe the user should easily be able to do it themselves once the new parser is released. You will have a ReadOnlySpan<char> Row that contains the whole row of data that you can count.
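
For what that might look like on the user side, here is a hedged sketch: the ReadOnlySpan<char> Row is the hypothetical future API described above, and the helper below is not part of CsvHelper.

```csharp
using System.Text;

static class RowCounting
{
    // Given the characters of one row (e.g. the ReadOnlySpan<char> Row mentioned
    // above) and the encoding of the underlying stream, count chars and bytes.
    public static (int Chars, int Bytes) Count(ReadOnlySpan<char> row, Encoding encoding)
        => (row.Length, encoding.GetByteCount(row));
}
```

Because the whole row is available at once, any surrogate pairs in it are complete and no encoder state needs to be carried between calls; for example, counting "1,\uD83D\uDE17,3" with Encoding.UTF8 gives 6 chars and 8 bytes.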

Development

Successfully merging this pull request may close these issues.

ByteCount fails to count surrogate characters properly.
