Fix slurping raw input not to corrupt multi-byte characters (fix #3389) #3421

itchyny · 2025-10-07T22:36:53Z

Fixes #3389. Also refactors the similar logic for reading jq filter files.

Copilot

Pull Request Overview

This PR fixes a bug where multi-byte UTF-8 characters could be corrupted when reading raw input or jq filter files, specifically addressing issue #3389. The changes ensure that when reading data in chunks, partial UTF-8 sequences are properly handled by backtracking and reading the remaining bytes of incomplete multi-byte characters.

Key changes:

Updated raw input reading logic to avoid truncating multi-byte UTF-8 characters
Refactored similar logic in jq filter file reading for consistency
Enhanced test coverage to verify UTF-8 handling in both scenarios
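
To illustrate the kind of check the overview describes, here is a minimal sketch of detecting an incomplete multi-byte sequence at the end of a chunk. The function `utf8_missing_tail` is a hypothetical helper written for this illustration. It is not jq's `jvp_utf8_backtrack`, and its signature and behavior are assumptions.

```c
#include <stddef.h>

/* Hypothetical helper (not jq's jvp_utf8_backtrack): returns how many bytes
 * are still missing to complete the UTF-8 sequence at the end of buf[0..n),
 * or 0 if the tail is complete or cannot be repaired by reading more. */
static size_t utf8_missing_tail(const unsigned char *buf, size_t n) {
  size_t i = n;
  size_t back = 0;
  /* Step back over at most 3 continuation bytes (bit pattern 10xxxxxx). */
  while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
    i--;
    back++;
  }
  if (i == 0)
    return 0; /* only continuation bytes: no lead byte to complete */

  unsigned char lead = buf[i - 1];
  size_t need; /* total sequence length announced by the lead byte */
  if ((lead & 0x80) == 0x00)
    need = 1;
  else if ((lead & 0xE0) == 0xC0)
    need = 2;
  else if ((lead & 0xF0) == 0xE0)
    need = 3;
  else if ((lead & 0xF8) == 0xF0)
    need = 4;
  else
    return 0; /* invalid lead byte: nothing sensible to wait for */

  size_t have = back + 1; /* lead byte plus continuation bytes seen */
  return have < need ? need - have : 0;
}
```

A reader that gets a non-zero result would fetch that many additional bytes before handing the chunk on, so that no character is split across two chunks.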

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


File	Description
tests/utf8test	Enhanced test script to cover both jq file reading and raw input slurping scenarios
tests/utf8-truncate.jq	Removed obsolete test file that was replaced by improved testing logic
src/util.c	Fixed raw input reading to prevent UTF-8 character corruption by implementing proper backtracking
src/jv_file.c	Refactored UTF-8 handling logic for consistency and clarity
Makefile.am	Removed reference to deleted test file from distribution list


Copilot · 2025-10-08T08:21:33Z

src/util.c

+    const int max_utf8_len = 4;
+    const int max_gets_len = sizeof(state->buf) - max_utf8_len;


The magic number 4 for UTF-8 maximum length should be defined as a named constant to improve maintainability and consistency across the codebase.


Copilot · 2025-10-08T08:21:34Z

src/jv_file.c

  // before appending.
  const int max_utf8_len = 4;
-  char buf[4096+max_utf8_len];
+  char buf[4096];


The buffer size calculation is incorrect. The code reads sizeof(buf)-max_utf8_len bytes but buf is now only 4096 bytes, so this reads 4096-4=4092 bytes. However, the buffer should accommodate the additional bytes that might be read for completing UTF-8 sequences.

Suggested change

-  char buf[4096];
+  char buf[4096 + 4];


wader · 2025-10-22T11:58:45Z

src/util.c

+    const int max_utf8_len = 4;
+    const int max_gets_len = sizeof(state->buf) - max_utf8_len;


Looks like a named constant to me

wader · 2025-10-22T12:02:16Z

src/jv_file.c

  // before appending.
  const int max_utf8_len = 4;
-  char buf[4096+max_utf8_len];
+  char buf[4096];


Looks fine to me. Before, I think we over-compensated the read limit by 4 bytes?
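
The size bound behind this exchange can be checked with a small sketch. It assumes the pattern visible in the diff: read at most `bufsize - 4` bytes, then read only the bytes missing from a trailing multi-byte character. Since at most 3 bytes can be missing, a chunk never exceeds `bufsize - 1` bytes (4092 + 3 = 4095 for a 4096-byte buffer). The names `missing_tail` and `read_chunk` are hypothetical and are not taken from jq.

```c
#include <stddef.h>
#include <stdio.h>

enum { MAX_UTF8_LEN = 4 }; /* longest UTF-8 encoding of one code point */

/* Hypothetical helper: bytes still missing from the UTF-8 sequence that
 * ends buf[0..n), or 0 if the tail is complete or malformed. */
static size_t missing_tail(const unsigned char *buf, size_t n) {
  size_t back = 0;
  while (back < n && back < 3 && (buf[n - 1 - back] & 0xC0) == 0x80)
    back++;
  if (back == n)
    return 0; /* empty input or continuation bytes only */
  unsigned char lead = buf[n - 1 - back];
  size_t need = (lead & 0xE0) == 0xC0 ? 2
              : (lead & 0xF0) == 0xE0 ? 3
              : (lead & 0xF8) == 0xF0 ? 4
              : 1;
  return back + 1 < need ? need - (back + 1) : 0;
}

/* Reads one chunk that never ends in the middle of a character.
 * Worst case: (bufsize - 4) bytes from the first read plus 3 completion
 * bytes, which is bufsize - 1, so no extra padding is needed in buf. */
static size_t read_chunk(FILE *f, unsigned char *buf, size_t bufsize) {
  if (bufsize <= MAX_UTF8_LEN)
    return 0; /* buffer too small for this scheme */
  size_t n = fread(buf, 1, bufsize - MAX_UTF8_LEN, f);
  size_t missing = missing_tail(buf, n);
  if (missing > 0 && !feof(f) && !ferror(f))
    n += fread(buf + n, 1, missing, f);
  return n;
}
```

Under these assumptions, declaring the buffer as `4096 + 4` bytes while also subtracting 4 from the read length would reserve the headroom twice.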

wader · 2025-10-22T12:15:13Z

src/jv_file.c

+    if (jvp_utf8_backtrack(end - 1, buf, &len) && len > 0 &&
        !feof(file) && !ferror(file)) {
-      n += fread(buf+n, 1, len, file);
+      n += fread(end, 1, len, file);


This is mostly a refactor?

Fix slurping raw input not to corrupt multi-byte characters (fix jqla…

04b1a61

…ng#3389)

itchyny force-pushed the fix-slurp-raw-input-utf8 branch from 1eb1925 to 04b1a61 Compare October 7, 2025 23:39

itchyny requested a review from Copilot October 8, 2025 08:20

Copilot AI reviewed Oct 8, 2025


wader reviewed Oct 22, 2025
