Slurping corrupts multibyte characters that cross a 4095-byte boundary

**To Reproduce**

Slurp a file that contains a multibyte character that crosses a 4095-byte boundary.

For example, a text encoded in UTF-8 that is 4096 bytes long and ends with the `ü` character (U+00FC, `0xc3 0xbc` in UTF-8):

```bash
# Example text encoded in UTF-8, 4096 bytes long;
# ends with a multibyte character that crosses the 4095-byte boundary
{ yes '*' | tr -d $'\n' | head -c "$(( 4095 - 1 ))"; LC_ALL=en_US.UTF-8 printf '\u00fc'; } | xxd -g 1
```

<img width="1022" height="192" alt="Image" src="https://github.com/user-attachments/assets/195c6e23-e7b8-4c7f-83fa-7b1638d62300" />

When slurping such a file, jq replaces that character with two instances of the Unicode replacement character (U+FFFD, `0xef 0xbf 0xbd` in UTF-8) as if it had broken up the character at the 4095-byte boundary:

```bash
# Feeding the example text into jq with slurping corrupts the `ü`
LC_ALL=en_US.UTF-8 jq -rsR . <(yes '*' | tr -d $'\n' | head -c "$(( 4095 - 1 ))"; LC_ALL=en_US.UTF-8 printf '\u00fc') | xxd -g 1
```

<img width="1054" height="124" alt="Image" src="https://github.com/user-attachments/assets/ec607d7a-67cd-431f-8fdd-c8f0dad0d778" />

**Expected behavior**

I would expect that, when slurping, jq keeps multibyte characters intact, just like it does when I use `--rawfile` with no slurping:

```bash
# Feeding the same example text into jq using `--rawfile` with no slurping processes the `ü` correctly
LC_ALL=en_US.UTF-8 jq -nr --rawfile payload <(yes '*' | tr -d $'\n' | head -c "$(( 4095 - 1 ))"; LC_ALL=en_US.UTF-8 printf '\u00fc') '$payload' | xxd -g 1
```

<img width="1028" height="142" alt="Image" src="https://github.com/user-attachments/assets/62c0ab00-0a72-44da-8106-212b1d333bfd" />
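
The same comparison can be made without screenshots. Below is a minimal sketch, assuming `cmp`, `mktemp`, and a `jq` on `PATH`; the helper names `make_payload` and `roundtrips_ok` are made up for this example and are not part of jq. It writes the example text to a temporary file, runs both modes with `-j` (raw output without a trailing newline), and compares each result byte for byte with the input.

```bash
# make_payload N: N-1 asterisks followed by "ü" (0xc3 0xbc, written in octal),
# so the two-byte character occupies bytes N and N+1 and straddles the N-byte boundary.
make_payload() {
  yes '*' | tr -d '\n' | head -c "$(( $1 - 1 ))"
  printf '\303\274'
}

# roundtrips_ok FILE MODE: succeed if jq's raw output is byte-identical to FILE.
roundtrips_ok() {
  case $2 in
    slurp)   jq -jsR . "$1" | cmp -s - "$1" ;;
    rawfile) jq -nj --rawfile payload "$1" '$payload' | cmp -s - "$1" ;;
    *)       return 2 ;;
  esac
}

# Only run the comparison when jq is actually installed.
if command -v jq >/dev/null 2>&1; then
  tmp=$(mktemp)
  make_payload 4095 > "$tmp"
  for mode in slurp rawfile; do
    if roundtrips_ok "$tmp" "$mode"; then
      echo "$mode: intact"
    else
      echo "$mode: differs"
    fi
  done
  rm -f "$tmp"
fi
```

Going by the behavior described above, I would expect `slurp` to be reported as differing and `rawfile` as intact on an affected jq build.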

**Environment**

- OS and Version: Arch Linux
- jq v1.8.1

**Additional context**

- This behavior doesn’t depend on what I tell jq to do with the input once it’s been slurped.  
  For example, I can omit `-r` and it still breaks; I can wrap it inside an object (e.g. `{ foo: . }`) and it still breaks.

- The same corruption occurs at any integer multiple of 4095. You can replace the `4095` in the above expression with e.g. `2 * 4095`, `3 * 4095`, `4 * 4095`, and so on, so that `head -c` emits 8189, 12284, or 16379 bytes respectively.

- The command line in the example uses Bash’s [process substitution](https://tldp.org/LDP/abs/html/process-sub.html) just for brevity; you can use an actual file instead (or even a different shell) but the issue remains.

- I couldn’t reproduce the issue using `--rawfile`; it only seems to occur when slurping is involved (`-sR`).

- The issue no longer occurs once you increment or decrement the `- 1` offset in the command line so the character no longer straddles the 4095-byte boundary.

- I haven’t looked at the implementation of the slurping feature yet. From the outside, it smells a little like a buffer of 4096 bytes (including a null terminator?) might be involved, and the bytes from the buffer might be decoded into text without accounting for incomplete multibyte characters at either buffer boundary.
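
The hypothesis in the last bullet can be illustrated without jq at all. This is a minimal sketch, assuming an `iconv` that exits non-zero on invalid or truncated UTF-8 input (glibc's does); the helper names are made up for this example, and it says nothing about how jq actually reads its input. The example text is valid UTF-8 as a whole, but cutting it after byte 4095 leaves a lone lead byte (`0xc3`) at the end of the first piece and a lone continuation byte (`0xbc`) as the second piece.

```bash
# Same example text as above: 4094 asterisks followed by "ü" (0xc3 0xbc, written in octal).
payload() { yes '*' | tr -d '\n' | head -c 4094; printf '\303\274'; }

# is_valid_utf8: succeed if stdin is well-formed UTF-8 (relies on iconv's exit status).
is_valid_utf8() { iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; }

# report LABEL: print whether stdin is well-formed UTF-8.
report() {
  if is_valid_utf8; then
    echo "$1: valid"
  else
    echo "$1: invalid"
  fi
}

payload                | report "whole text"
payload | head -c 4095 | report "bytes 1-4095"
payload | tail -c 1    | report "byte 4096"
```

With glibc's `iconv`, only the whole text should be reported as valid. A decoder that handles each piece independently would have to replace both stray bytes, which would be consistent with the two U+FFFD characters in the slurped output.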


Slurping corrupts multibyte characters that cross a 4095-byte boundary #3389
