Slurping corrupts multibyte characters that cross a 4095-byte boundary #3389

@claui

To Reproduce

Slurp a file that contains a multibyte character that crosses a 4095-byte boundary.

For example, take a UTF-8 encoded text that is 4096 bytes long and ends with the character ü (U+00FC, 0xc3 0xbc in UTF-8), so that its two bytes land at byte positions 4095 and 4096:

# Example text encoded in UTF-8, 4096 bytes long;
# ends with a multibyte character that crosses the 4095-byte boundary
{ yes '*' | tr -d $'\n' | head -c "$(( 4095 - 1 ))"; LC_ALL=en_US.UTF-8 printf '\u00fc'; } | xxd -g 1
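
For a reproduction that depends neither on process substitution nor on the locale-sensitive `\u` escape, the same payload can be written to a regular file. This is only a minimal sketch: `make_payload` is a hypothetical helper name, `/dev/zero` with `tr` stands in for the `yes | tr | head` pipeline, and the `ü` is emitted as raw UTF-8 bytes in octal.

```shell
#!/usr/bin/env bash
# make_payload is a hypothetical helper for illustration; it is not part of jq.
# $1 = boundary (e.g. 4095): writes (boundary - 1) asterisks followed by "ü".
make_payload() {
  head -c "$(( $1 - 1 ))" /dev/zero | tr '\0' '*'
  printf '\303\274'   # bytes 0xc3 0xbc = U+00FC in UTF-8, independent of locale
}

payload="$(mktemp)"
make_payload 4095 > "$payload"

# Total length in bytes; the arithmetic expansion strips any padding from wc.
size="$(( $(wc -c < "$payload") ))"
# The last two bytes of the file as hex, without separators.
tail_bytes="$(tail -c 2 "$payload" | od -An -v -tx1 | tr -d ' \n')"

echo "size=$size tail=$tail_bytes"
```

The resulting file can then be passed to jq directly, e.g. `jq -rsR . "$payload" | xxd -g 1`.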

When slurping such a file, jq replaces that character with two instances of the Unicode replacement character (U+FFFD, 0xef 0xbf 0xbd in UTF-8) as if it had broken up the character at the 4095-byte boundary:

# Feeding the example text into jq with slurping corrupts the `ü`
LC_ALL=en_US.UTF-8 jq -rsR . <(yes '*' | tr -d $'\n' | head -c "$(( 4095 - 1 ))"; LC_ALL=en_US.UTF-8 printf '\u00fc') | xxd -g 1

Expected behavior

I would expect that, when slurping, jq keeps multibyte characters intact, just like it does when I use --rawfile with no slurping:

# Feeding the same example text into jq using `--rawfile` with no slurping processes the `ü` correctly
LC_ALL=en_US.UTF-8 jq -nr --rawfile payload <(yes '*' | tr -d $'\n' | head -c "$(( 4095 - 1 ))"; LC_ALL=en_US.UTF-8 printf '\u00fc') '$payload' | xxd -g 1
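
To compare the two modes without reading hex dumps by eye, one option is to count the U+FFFD byte sequences in each output. The sketch below assumes a jq binary on `PATH` and skips the comparison otherwise; `count_fffd` is a hypothetical helper name. Going by the report above, an affected jq should print 2 for the slurp case and 0 for `--rawfile`, but the actual numbers depend on the installed version.

```shell
#!/usr/bin/env bash
# count_fffd is a hypothetical helper: it counts occurrences of the UTF-8
# encoding of U+FFFD (ef bf bd) in the bytes read from stdin.
count_fffd() {
  local n
  # Dump every byte as a space-separated hex token on a single line, then
  # count whole-token matches of "ef bf bd".
  n="$(od -An -v -tx1 | tr -s ' \n' ' ' | grep -o 'ef bf bd' | wc -l)"
  echo "$(( n ))"
}

# Same payload as in the issue: 4094 asterisks followed by "ü" (0xc3 0xbc).
payload="$(mktemp)"
{ head -c 4094 /dev/zero | tr '\0' '*'; printf '\303\274'; } > "$payload"

if command -v jq >/dev/null 2>&1; then
  echo "slurp (-sR): $(jq -rsR . "$payload" | count_fffd) replacement character(s)"
  echo "--rawfile:   $(jq -nr --rawfile p "$payload" '$p' | count_fffd) replacement character(s)"
else
  echo "jq not found; skipping the comparison" >&2
fi
```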

Environment

  • OS and Version: Arch Linux
  • jq v1.8.1

Additional context

  • This behavior doesn’t depend on what I tell jq to do with the input once it’s been slurped.
    For example, I can omit -r and it still breaks; I can wrap it inside an object (e.g. { foo: . }) and it still breaks.

  • The same corruption occurs with any integer multiple of 4095. You can replace the 4095 in the above expression with e.g. 2 * 4095 (= 8190), 3 * 4095 (= 12285), 4 * 4095 (= 16380), and so on.

  • The command line in the example uses Bash’s process substitution just for brevity; you can use an actual file instead (or even a different shell) but the issue remains.

  • I couldn’t reproduce the issue using --rawfile; it only seems to occur when slurping is involved (-sR).

  • The issue no longer occurs once you increment or decrement the - 1 offset in the command line so the character no longer straddles the 4095-byte boundary.

  • I haven’t looked at the implementation of the slurping feature yet. From the outside, it smells a little like a buffer of 4096 bytes (including a null terminator?) might be involved, and the bytes from the buffer might be decoded into text without accounting for incomplete multibyte characters at either buffer boundary.
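
The buffer hypothesis in the last bullet can be illustrated without jq at all. The sketch below is an assumption about what fixed 4095-byte chunking would do to the payload, not a description of jq's actual implementation: for the first few multiples of 4095, it shows which byte would end one chunk and which would start the next.

```shell
#!/usr/bin/env bash
# Illustration only: assumes input is consumed in fixed 4095-byte chunks.
for k in 1 2 3 4; do
  boundary="$(( k * 4095 ))"
  payload="$(mktemp)"
  # (boundary - 1) asterisks followed by "ü" (0xc3 0xbc), as in the issue.
  { head -c "$(( boundary - 1 ))" /dev/zero | tr '\0' '*'; printf '\303\274'; } > "$payload"

  # Last byte before the boundary: a UTF-8 lead byte with no continuation byte.
  before="$(head -c "$boundary" "$payload" | tail -c 1 | od -An -v -tx1 | tr -d ' \n')"
  # First byte after the boundary: a continuation byte with no lead byte.
  after="$(tail -c "+$(( boundary + 1 ))" "$payload" | head -c 1 | od -An -v -tx1 | tr -d ' \n')"

  echo "boundary=$boundary before=0x$before after=0x$after"
  rm -f "$payload"
done
```

Neither 0xc3 nor 0xbc is valid UTF-8 on its own, so a decoder that handled each chunk independently and substituted U+FFFD for every invalid sequence would plausibly emit two replacement characters, which matches the observed output.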
