Implemented valid UTF8 character checks #180
Open
zbalkan wants to merge 5 commits into wazuh:master from zbalkan:fix/fix-utf8-validation
zbalkan changed the title from "Implemented valud UTF8 character checks" to "Implemented valid UTF8 character checks" on Sep 26, 2024
zbalkan force-pushed the fix/fix-utf8-validation branch from 31c4c73 to 48e2945 on September 26, 2024 12:11
…th 0xF4 is in the range 0x80 to 0x8F
Below is the breakdown of the UTF-8 characters based on acceptance, before and after the change. (Before/After table not recovered.)
Improved unit tests based on the edge cases.
UTF-8 Validation Test Coverage Matrix (table not recovered.)
While this PR attempts to improve the existing solution, it is better to use https://github.com/simdutf/is_utf8 for this task.
Description
This is a continuation of issue wazuh/wazuh#23354, about the fix PR wazuh/wazuh#23543.
This PR addresses an issue with the UTF-8 validation logic in the agent where valid UTF-8 multibyte characters were mistakenly being identified as invalid. The original implementation performed overly restrictive checks on sequences of bytes representing characters like Ü, ü, Õ, õ, Ö, ö, Ä, ä, Ş, ş, Ç, ç, causing the File Integrity Monitoring (FIM) module to incorrectly ignore file paths containing these characters.
Problem
The original validation logic checked for valid UTF-8 sequences but incorrectly marked certain valid multibyte characters as invalid due to overly restrictive rules on the leading byte of 2-, 3-, and 4-byte sequences. As a result, characters that are fully compliant with the UTF-8 standard were ignored, causing the FIM module to overlook legitimate file paths containing these characters. This led to unintended behavior in path validation and monitoring.
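To make the affected characters concrete, here is a minimal sketch of how a code point in the range U+0080 to U+07FF maps to a 2-byte UTF-8 sequence. The helper name `utf8_encode_2byte` is hypothetical and is not part of the Wazuh code base; it only illustrates that the characters listed in the description use lead bytes 0xC3 and 0xC5, which fall in the valid 2-byte lead range 0xC2 to 0xDF.

```c
/* Hypothetical helper for illustration only (not Wazuh code).
 * Encodes a code point in U+0080..U+07FF as two UTF-8 bytes and
 * returns them packed as (lead << 8) | continuation.
 * Returns 0 if the code point is outside the 2-byte range. */
static unsigned utf8_encode_2byte(unsigned cp) {
    if (cp < 0x80 || cp > 0x7FF) {
        return 0;
    }
    unsigned lead = 0xC0 | (cp >> 6);    /* 110xxxxx: top 5 bits */
    unsigned cont = 0x80 | (cp & 0x3F);  /* 10xxxxxx: low 6 bits */
    return (lead << 8) | cont;
}
```

For example, `utf8_encode_2byte(0x00DC)` (Ü) yields the bytes 0xC3 0x9C, and `utf8_encode_2byte(0x015E)` (Ş) yields 0xC5 0x9E.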
Solution
The macros for validating UTF-8 sequences have been updated to properly handle all valid UTF-8 byte ranges:
valid_2: Now properly validates 2-byte sequences, ensuring no overlong encodings occur and that valid 2-byte sequences are recognized.
valid_3: Correctly handles special cases where the leading byte is 0xE0 or 0xED. Overlong encodings starting with 0xE0 are excluded, and surrogate halves (reserved for UTF-16) starting with 0xED are correctly identified as invalid.
valid_4: Properly validates 4-byte sequences, ensuring sequences that start with 0xF0 are not overlong and that sequences do not exceed the Unicode limit (U+10FFFF).
With these fixes, the validation logic correctly identifies all valid UTF-8 sequences, including multibyte characters commonly used in various languages.
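The rules described above can be sketched as plain C functions. This is an illustrative sketch under the well-formed byte ranges of the Unicode standard, not the actual `valid_2`, `valid_3`, and `valid_4` macros from this PR; the names `utf8_seq_len` and `is_valid_utf8` are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if b is a continuation byte (10xxxxxx). */
static bool is_cont(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* Illustrative sketch, not the Wazuh macros.
 * Returns the length (1..4) of the well-formed UTF-8 sequence at s,
 * or 0 if the bytes at s do not start a well-formed sequence. */
static size_t utf8_seq_len(const unsigned char *s, size_t n) {
    if (n == 0) return 0;
    unsigned char b0 = s[0];

    if (b0 < 0x80) return 1;                       /* ASCII */

    if (b0 >= 0xC2 && b0 <= 0xDF) {                /* 2-byte; 0xC0/0xC1 are overlong */
        return (n >= 2 && is_cont(s[1])) ? 2 : 0;
    }

    if (b0 >= 0xE0 && b0 <= 0xEF) {                /* 3-byte */
        if (n < 3 || !is_cont(s[1]) || !is_cont(s[2])) return 0;
        if (b0 == 0xE0 && s[1] < 0xA0) return 0;   /* overlong encoding */
        if (b0 == 0xED && s[1] > 0x9F) return 0;   /* UTF-16 surrogate halves */
        return 3;
    }

    if (b0 >= 0xF0 && b0 <= 0xF4) {                /* 4-byte */
        if (n < 4 || !is_cont(s[1]) || !is_cont(s[2]) || !is_cont(s[3])) return 0;
        if (b0 == 0xF0 && s[1] < 0x90) return 0;   /* overlong encoding */
        if (b0 == 0xF4 && s[1] > 0x8F) return 0;   /* above U+10FFFF */
        return 4;
    }

    return 0;  /* stray continuation byte, or lead byte 0xC0, 0xC1, 0xF5..0xFF */
}

/* Validates a whole buffer by walking it one sequence at a time. */
static bool is_valid_utf8(const unsigned char *s, size_t n) {
    size_t i = 0;
    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0) return false;
        i += len;
    }
    return true;
}
```

With this sketch, the bytes 0xC3 0x9C (Ü) and 0xF4 0x8F 0xBF 0xBF (U+10FFFF) are accepted, while 0xC0 0x80 (overlong), 0xED 0xA0 0x80 (surrogate half), and 0xF4 0x90 0x80 0x80 (beyond U+10FFFF) are rejected.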
Configuration options
Logs/Alerts example
Tests