-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Fix slurping raw input not to corrupt multi-byte characters (fix #3389) #3421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
1eb1925 to
04b1a61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a bug where multi-byte UTF-8 characters could be corrupted when reading raw input or jq filter files, specifically addressing issue #3389. The changes ensure that when reading data in chunks, partial UTF-8 sequences are properly handled by backtracking and reading the remaining bytes of incomplete multi-byte characters.
Key changes:
- Updated raw input reading logic to avoid truncating multi-byte UTF-8 characters
- Refactored similar logic in jq filter file reading for consistency
- Enhanced test coverage to verify UTF-8 handling in both scenarios
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/utf8test | Enhanced test script to cover both jq file reading and raw input slurping scenarios |
| tests/utf8-truncate.jq | Removed obsolete test file that was replaced by improved testing logic |
| src/util.c | Fixed raw input reading to prevent UTF-8 character corruption by implementing proper backtracking |
| src/jv_file.c | Refactored UTF-8 handling logic for consistency and clarity |
| Makefile.am | Removed reference to deleted test file from distribution list |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| const int max_utf8_len = 4; | ||
| const int max_gets_len = sizeof(state->buf) - max_utf8_len; |
Copilot
AI
Oct 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The magic number 4 for UTF-8 maximum length should be defined as a named constant to improve maintainability and consistency across the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a named constant to me
| // before appending. | ||
| const int max_utf8_len = 4; | ||
| char buf[4096+max_utf8_len]; | ||
| char buf[4096]; |
Copilot
AI
Oct 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The buffer size calculation is incorrect. The code reads sizeof(buf)-max_utf8_len bytes but buf is now only 4096 bytes, so this reads 4096-4=4092 bytes. However, the buffer should accommodate the additional bytes that might be read for completing UTF-8 sequences.
| char buf[4096]; | |
| char buf[4096 + 4]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me. Before i think we over-compensated read limit by 4 bytes?
| const int max_utf8_len = 4; | ||
| const int max_gets_len = sizeof(state->buf) - max_utf8_len; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a named constant to me
| // before appending. | ||
| const int max_utf8_len = 4; | ||
| char buf[4096+max_utf8_len]; | ||
| char buf[4096]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me. Before i think we over-compensated read limit by 4 bytes?
| if (jvp_utf8_backtrack(end - 1, buf, &len) && len > 0 && | ||
| !feof(file) && !ferror(file)) { | ||
| n += fread(buf+n, 1, len, file); | ||
| n += fread(end, 1, len, file); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly a refactor?
Fixes #3389, also refactors the similar logic for reading jq filter files.