
Conversation

@Rob-Hague
Contributor

closes #2088

This improves the ByteCount logic when consuming surrogate characters by using an Encoder, which maintains state across calls. Previously, the Encoding would return the byte count after replacing each lone surrogate character with U+FFFD REPLACEMENT CHARACTER, because a surrogate character is invalid on its own.
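
A standalone sketch of the System.Text behaviour involved (not CsvHelper's actual code, just the difference between stateless and stateful counting):

```csharp
using System;
using System.Text;

class SurrogateByteCount
{
    static void Main()
    {
        // U+1F617 is represented in UTF-16 as the surrogate pair \uD83D \uDE17.
        char high = '\uD83D';
        char low = '\uDE17';

        // Stateless counting, one char at a time: each lone surrogate is invalid,
        // so the default fallback counts it as U+FFFD (3 UTF-8 bytes) -> 3 + 3 = 6.
        int stateless = Encoding.UTF8.GetByteCount(new[] { high })
                      + Encoding.UTF8.GetByteCount(new[] { low });

        // Stateful counting: the Encoder buffers the trailing high surrogate across
        // calls (flush: false), and the completed pair encodes to 4 UTF-8 bytes.
        Encoder encoder = Encoding.UTF8.GetEncoder();
        byte[] buffer = new byte[8];
        int stateful = encoder.GetBytes(new[] { high }, 0, 1, buffer, 0, flush: false)  // 0 bytes
                     + encoder.GetBytes(new[] { low }, 0, 1, buffer, 0, flush: false);  // 4 bytes

        Console.WriteLine($"{stateless} vs {stateful}"); // 6 vs 4
    }
}
```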

For example, when reading the UTF-16 surrogate pair \uD83D\uDE17, corresponding to the Unicode character U+1F617, the (UTF-8) ByteCount would be 6 (the byte count of the sequence \uFFFD\uFFFD) instead of the correct value of 4 UTF-8 bytes.

An encoder with a custom EncoderReplacementFallback could easily require a larger buffer, for example if its replacement string is "{LONG REPLACEMENT STRING}", so an upper bound of 16 bytes is not correct. Note that in this case CsvParser.ByteCount would be nonsensical for input requiring the fallback, and configuration.Encoding probably does not match the actual encoding, e.g. we are reading from a UTF-16 byte stream containing non-ASCII characters and have set configuration.Encoding to ASCII.
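
A standalone sketch of that fallback scenario (not CsvHelper code; the replacement string and the ASCII mismatch are just the examples above):

```csharp
using System;
using System.Text;

class FallbackByteCount
{
    static void Main()
    {
        // ASCII encoding with a custom replacement fallback, as described above.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback("{LONG REPLACEMENT STRING}"),
            DecoderFallback.ReplacementFallback);

        // 'é' cannot be encoded as ASCII, so the fallback substitutes the
        // 25-character replacement string: 25 bytes for a single input char,
        // already past the 16-byte upper bound mentioned above.
        Console.WriteLine(ascii.GetByteCount("é")); // 25
    }
}
```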

Nevertheless the safeguard is more sensible to have than not.
@JoshClose
Owner

I'm going to wait on this for now. I believe the SIMD code will completely change how this works. I will be counting blocks of bytes at a time instead of single characters.

# Conflicts:
#	src/CsvHelper/CsvParser.cs
@JoshClose
Owner

I'm thinking of leaving counting bytes and chars up to the user. I believe the user should easily be able to do it themselves once the new parser is released. You will have a ReadOnlySpan<char> Row that contains the whole row of data that you can count.
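
For what that might look like on the user side, here is a hedged sketch: the ReadOnlySpan<char> Row is the hypothetical future API described above, and the helper below is not part of CsvHelper.

```csharp
using System.Text;

static class RowCounting
{
    // Given the characters of one row (e.g. the ReadOnlySpan<char> Row mentioned
    // above) and the encoding of the underlying stream, count chars and bytes.
    public static (int Chars, int Bytes) Count(ReadOnlySpan<char> row, Encoding encoding)
        => (row.Length, encoding.GetByteCount(row));
}
```

Because the whole row is available at once, any surrogate pairs in it are complete and no encoder state needs to be carried between calls; for example, counting "1,\uD83D\uDE17,3" with Encoding.UTF8 gives 6 chars and 8 bytes.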

Development

Successfully merging this pull request may close these issues.

ByteCount fails to count surrogate characters properly.
