Description
To Reproduce
Slurp a file that contains a multibyte character that crosses a 4095-byte boundary.
For example, a text encoded in UTF-8 that is 4096 bytes long and ends with the ü character (U+00FC, 0xc3 0xbc in UTF-8):
# Example text encoded in UTF-8, 4096 bytes long;
# ends with a multibyte character that crosses the 4095-byte boundary
{ yes '*' | tr -d $'\n' | head -c "$(( 4095 - 1 ))"; LC_ALL=en_US.UTF-8 printf '\u00fc'; } | xxd -g 1
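The same input can also be written to a regular file and checked before handing it to jq. The sketch below is one way to do that; `make_payload` is a hypothetical helper name, and it uses `printf` padding with octal escapes instead of `yes` and `\u00fc` so that it doesn't depend on the locale or on a pipeline being cut short.

```shell
#!/bin/sh
# Hypothetical helper (not part of jq): write (boundary - 1) asterisks
# followed by U+00FC (0xc3 0xbc in UTF-8) to the given file.
make_payload() {
  boundary=$1
  out=$2
  {
    # Pad with spaces to the requested width, then turn them into '*'.
    printf "%$(( boundary - 1 ))s" '' | tr ' ' '*'
    # Octal escapes for the two UTF-8 bytes of U+00FC.
    printf '\303\274'
  } > "$out"
}

payload=$(mktemp)
make_payload 4095 "$payload"

wc -c < "$payload"                   # total size in bytes
tail -c 2 "$payload" | od -An -tx1   # the last two bytes, in hex
```

The resulting file can then replace the process substitution, e.g. `jq -rsR . "$payload" | xxd -g 1`.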
When slurping such a file, jq replaces that character with two instances of the Unicode replacement character (U+FFFD, 0xef 0xbf 0xbd in UTF-8) as if it had broken up the character at the 4095-byte boundary:
# Feeding the example text into jq with slurping corrupts the `ü`
LC_ALL=en_US.UTF-8 jq -rsR . <(yes '*' | tr -d $'\n' | head -c "$(( 4095 - 1 ))"; LC_ALL=en_US.UTF-8 printf '\u00fc') | xxd -g 1
Expected behavior
I would expect that, when slurping, jq keeps multibyte characters intact, just like it does when I use --rawfile with no slurping:
# Feeding the same example text into jq using `--rawfile` with no slurping processes the `ü` correctly
LC_ALL=en_US.UTF-8 jq -nr --rawfile payload <(yes '*' | tr -d $'\n' | head -c "$(( 4095 - 1 ))"; LC_ALL=en_US.UTF-8 printf '\u00fc') '$payload' | xxd -g 1
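To compare the two code paths without reading hex dumps, the outputs can be diffed directly. This is only a sketch: `compare_modes` is a hypothetical helper name, it reuses the exact jq invocations shown above, and the result depends on the jq version installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of jq): run the same file through
# `-rsR .` and through `--rawfile`, then report whether the outputs match.
compare_modes() {
  file=$1
  command -v jq >/dev/null 2>&1 || { echo "jq not found"; return 0; }
  slurped=$(mktemp)
  rawfile=$(mktemp)
  jq -rsR . "$file" > "$slurped"
  jq -nr --rawfile payload "$file" '$payload' > "$rawfile"
  if cmp -s "$slurped" "$rawfile"; then
    echo "same"
  else
    echo "different"
  fi
  rm -f "$slurped" "$rawfile"
}

# 4094 asterisks followed by U+00FC, written with octal escapes.
payload=$(mktemp)
{ printf '%4094s' '' | tr ' ' '*'; printf '\303\274'; } > "$payload"
compare_modes "$payload"
rm -f "$payload"
```

On a jq build affected by this issue I would expect `different`; on a build that keeps the character intact in both modes, `same`.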
Environment (please complete the following information):
- OS and Version: Arch Linux
- jq v1.8.1
Additional context
- This behavior doesn’t depend on what I tell jq to do with the input once it’s been slurped. For example, I can omit `-r` and it still breaks; I can wrap it inside an object (e.g. `{ foo: . }`) and it still breaks.
- The same corruption occurs with any integer multiple of 4095. You can replace the `4095` in the above expression with e.g. `2 * 4095` (so that the expression evaluates to 8189), `3 * 4095` (12284), `4 * 4095` (16379), and so on.
- The command line in the example uses Bash’s process substitution just for brevity; you can use an actual file instead (or even a different shell) but the issue remains.
- I couldn’t reproduce the issue using `--rawfile`; it only seems to occur when slurping is involved (`-sR`).
- The issue no longer occurs once you increment or decrement the `- 1` offset in the command line so the character no longer straddles the 4095-byte boundary.
- I haven’t looked at the implementation of the slurping feature yet. From the outside, it smells a little like a buffer of 4096 bytes (including a null terminator?) might be involved, and the bytes from the buffer might be decoded into text without accounting for incomplete multibyte characters at either buffer boundary.
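To illustrate that guess (and only the guess, since it says nothing about jq's actual implementation), the sketch below builds the payload for the first few multiples of 4095 and prints the byte on each side of the assumed chunk boundary. `straddle` is a hypothetical helper name, and the 4095-byte chunk size is an assumption taken from the observations above.

```shell
#!/bin/sh
# Hypothetical helper: show the last byte of an assumed chunk and the
# first byte of the next one when U+00FC straddles the boundary.
straddle() {
  boundary=$1                  # assumed chunk end, in bytes
  pad=$(( boundary - 1 ))      # ASCII bytes before the two-byte character
  file=$(mktemp)
  { printf "%${pad}s" '' | tr ' ' '*'; printf '\303\274'; } > "$file"
  # Last byte of the first `boundary` bytes.
  before=$(head -c "$boundary" "$file" | tail -c 1 | od -An -tx1 | tr -d ' \n')
  # First byte after the boundary (tail -c +N counts from 1).
  after=$(tail -c +"$(( boundary + 1 ))" "$file" | head -c 1 | od -An -tx1 | tr -d ' \n')
  rm -f "$file"
  echo "$boundary: chunk ends with $before, next chunk starts with $after"
}

for k in 1 2 3 4; do
  straddle "$(( k * 4095 ))"
done
```

In each case the first chunk ends with `c3`, a lead byte with no continuation, and the next chunk starts with `bc`, a continuation byte with no lead. If each chunk were decoded on its own and each ill-formed sequence replaced with U+FFFD, that would produce exactly the two replacement characters seen in the output.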