Implemented valid UTF8 character checks #180

zbalkan · 2024-09-26T12:05:43Z

Related issue
#179

Description

This PR follows up on issue wazuh/wazuh#23354 and its fix PR wazuh/wazuh#23543.

This PR addresses an issue with the UTF-8 validation logic in the agent, where valid UTF-8 multibyte characters were mistakenly flagged as invalid. The original implementation applied overly restrictive checks to the byte sequences that encode characters like Ü, ü, Õ, õ, Ö, ö, Ä, ä, Ş, ş, Ç, ç, causing the File Integrity Monitoring (FIM) module to incorrectly ignore file paths containing these characters.

Problem

The original validation logic checked for valid UTF-8 sequences but incorrectly marked certain valid multibyte characters as invalid due to overly restrictive rules on the leading byte of 2-, 3-, and 4-byte sequences. As a result, characters that are fully compliant with the UTF-8 standard were ignored, causing the FIM module to overlook legitimate file paths containing these characters. This led to unintended behavior in path validation and monitoring.

Solution

The macros for validating UTF-8 sequences have been updated to properly handle all valid UTF-8 byte ranges:

- valid_2: Now properly validates 2-byte sequences, ensuring no overlong encodings occur and that valid 2-byte sequences are recognized.
- valid_3: Correctly handles the special cases where the leading byte is 0xE0 or 0xED. Overlong encodings starting with 0xE0 are excluded, and surrogate halves (reserved for UTF-16) starting with 0xED are correctly identified as invalid.
- valid_4: Properly validates 4-byte sequences, ensuring that sequences starting with 0xF0 are not overlong and that sequences do not exceed the Unicode limit (U+10FFFF).

With these fixes, the validation logic correctly identifies all valid UTF-8 sequences, including multibyte characters commonly used in various languages.
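
For readers who want the byte ranges spelled out, here is a minimal sketch of range checks along the lines described above. The names `valid_2`, `valid_3`, and `valid_4` come from this PR, but the macro bodies below, the helpers `in_range` and `is_cont`, and the walker `utf8_is_valid` are assumptions made for illustration and may differ from the actual agent code. The sketch assumes a NUL-terminated input, so short-circuit evaluation never reads past the terminator.

```c
#include <stdbool.h>

typedef unsigned char u8;

/* Illustrative helpers (not taken from the PR). */
#define in_range(b, lo, hi) ((u8)(b) >= (lo) && (u8)(b) <= (hi))
#define is_cont(b)          in_range(b, 0x80, 0xBF)   /* 10xxxxxx */

/* 1 byte: plain ASCII. */
#define valid_1(s) ((u8)(s)[0] <= 0x7F)

/* 2 bytes: leads 0xC0 and 0xC1 would be overlong, so start at 0xC2. */
#define valid_2(s) (in_range((s)[0], 0xC2, 0xDF) && is_cont((s)[1]))

/* 3 bytes: 0xE0 needs a second byte >= 0xA0 (no overlong forms),
   0xED needs a second byte <= 0x9F (no UTF-16 surrogates). */
#define valid_3(s) ( \
    ((u8)(s)[0] == 0xE0 ? in_range((s)[1], 0xA0, 0xBF) : \
     (u8)(s)[0] == 0xED ? in_range((s)[1], 0x80, 0x9F) : \
     (in_range((s)[0], 0xE1, 0xEF) && is_cont((s)[1]))) \
    && is_cont((s)[2]))

/* 4 bytes: 0xF0 needs a second byte >= 0x90 (no overlong forms),
   0xF4 needs a second byte <= 0x8F (nothing above U+10FFFF). */
#define valid_4(s) ( \
    ((u8)(s)[0] == 0xF0 ? in_range((s)[1], 0x90, 0xBF) : \
     (u8)(s)[0] == 0xF4 ? in_range((s)[1], 0x80, 0x8F) : \
     (in_range((s)[0], 0xF1, 0xF3) && is_cont((s)[1]))) \
    && is_cont((s)[2]) && is_cont((s)[3]))

/* Walk a NUL-terminated string one sequence at a time. */
bool utf8_is_valid(const char *s) {
    while (*s) {
        if (valid_1(s))      s += 1;
        else if (valid_2(s)) s += 2;
        else if (valid_3(s)) s += 3;
        else if (valid_4(s)) s += 4;
        else return false;
    }
    return true;
}
```

With this sketch, `utf8_is_valid("\xC3\x9C")` (the character Ü) is accepted, while `utf8_is_valid("\xC0\x80")` (an overlong encoding) is rejected.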

Configuration options

Logs/Alerts example

Tests

Memory tests for Linux
- Scan-build report
- Coverity
- Valgrind (memcheck and descriptor leaks check)
- Dr. Memory
- AddressSanitizer
Memory tests for Windows
- Scan-build report
- Coverity
- Dr. Memory
Memory tests for macOS
- Scan-build report
- Leaks
- AddressSanitizer

Retrocompatibility with older Wazuh versions
Working on cluster environments
Configuration on demand reports new parameters
The data flow works as expected (agent-manager-api-app)
Added unit tests (for new features)
Stress test for affected components

Decoder/Rule tests
- Added unit testing files ".ini"
- runtests.py executed without errors


zbalkan · 2024-10-13T17:09:22Z

Below is a breakdown of which UTF-8 byte sequences are accepted or excluded, before and after this change.

Before

Byte Sequence	Valid Unicode Range	Byte Length	Status	Related Macro	Comments
`0x00` to `0x7F`	U+0000 to U+007F	1	Accepted	`valid_1`	Single-byte ASCII characters are accepted.
`0xC2 0x80` to `0xDF 0xBF`	U+0080 to U+07FF	2	Accepted	`valid_2`	Two-byte characters, including extended Latin, Greek, and Cyrillic.
`0xC0 0x80` to `0xC1 0xBF`	U+0000 to U+007F	2	Excluded	`valid_2`	Overlong encodings for ASCII characters are excluded.
`0xE1 0x80 0x80` to `0xEC 0xBF 0xBF`	U+1000 to U+CFFF	3	Accepted	`valid_3`	Valid three-byte characters, including many scripts, are accepted.
`0xEE 0x80 0x80` to `0xEF 0xBF 0xBF`	U+E000 to U+FFFF	3	Accepted	`valid_3`	Valid three-byte characters, excluding surrogate pairs, are accepted.
`0xE0 0xA0 0x80` to `0xE0 0xBF 0xBF`	U+0800 to U+0FFF	3	Excluded	`valid_3`	Excluded due to the macro rejecting valid three-byte characters with `0xE0`.
`0xE0 0x80 0x80` to `0xE0 0x9F 0xBF`	U+0000 to U+07FF	3	Excluded	`valid_3`	Overlong encodings using three bytes for the `U+0000` to `U+07FF` range.
`0xED 0xA0 0x80` to `0xED 0xBF 0xBF`	U+D800 to U+DFFF	3	Excluded	`valid_3`	Surrogate code points reserved for UTF-16 are excluded (correctly).
`0xF1 0x80 0x80 0x80` to `0xF3 0xBF 0xBF 0xBF`	U+40000 to U+FFFFF	4	Accepted	`valid_4`	Valid four-byte characters in a narrow range (higher supplementary planes).
`0xF0 0x90 0x80 0x80` to `0xF0 0xBF 0xBF 0xBF`	U+10000 to U+3FFFF	4	Excluded	`valid_4`	Excluded valid four-byte characters in the `U+10000` to `U+3FFFF` range.
`0xF0 0x80 0x80 0x80` to `0xF0 0x8F 0xBF 0xBF`	U+0000 to U+FFFF	4	Excluded	`valid_4`	Overlong encodings using four bytes for ranges that could be encoded with fewer bytes.

After

Byte Sequence	Valid Unicode Range	Byte Length	Status	Related Macro	Comments
`0x00` to `0x7F`	U+0000 to U+007F	1	Accepted	`valid_1`	Single-byte ASCII characters are accepted.
`0xC2 0x80` to `0xDF 0xBF`	U+0080 to U+07FF	2	Accepted	`valid_2`	Two-byte characters, including extended Latin, Greek, and Cyrillic.
`0xC0 0x80` to `0xC1 0xBF`	U+0000 to U+007F	2	Excluded	`valid_2`	Overlong encodings for ASCII characters are excluded.
`0xE1 0x80 0x80` to `0xEC 0xBF 0xBF`	U+1000 to U+CFFF	3	Accepted	`valid_3`	Valid three-byte characters, including many scripts, are accepted.
`0xEE 0x80 0x80` to `0xEF 0xBF 0xBF`	U+E000 to U+FFFF	3	Accepted	`valid_3`	Valid three-byte characters, excluding surrogate pairs, are accepted.
`0xE0 0xA0 0x80` to `0xE0 0xBF 0xBF`	U+0800 to U+0FFF	3	Accepted	`valid_3`	Valid three-byte characters in the range `U+0800` to `U+0FFF` are now accepted.
`0xE0 0x80 0x80` to `0xE0 0x9F 0xBF`	U+0000 to U+07FF	3	Excluded	`valid_3`	Overlong encodings using three bytes for the `U+0000` to `U+07FF` range.
`0xED 0xA0 0x80` to `0xED 0xBF 0xBF`	U+D800 to U+DFFF	3	Excluded	`valid_3`	Surrogate code points reserved for UTF-16 are excluded (correctly).
`0xF0 0x90 0x80 0x80` to `0xF4 0x8F 0xBF 0xBF`	U+10000 to U+10FFFF	4	Accepted	`valid_4`	Valid four-byte characters from `U+10000` to `U+10FFFF` are accepted.
`0xF0 0x80 0x80 0x80` to `0xF0 0x8F 0xBF 0xBF`	U+0000 to U+FFFF	4	Excluded	`valid_4`	Overlong encodings using four bytes for ranges that could be encoded with fewer bytes.
`0xF4 0x90 0x80 0x80` to `0xF7 0xBF 0xBF 0xBF`	Out of Unicode range	4	Excluded	`valid_4`	Invalid four-byte sequences that exceed Unicode limit (`U+10FFFF`).
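
The Unicode range endpoints in these tables can be checked by hand, or with a small decoder like the one below. This is only a sketch for verifying the arithmetic: `utf8_payload` is a hypothetical helper, not part of the PR, and it deliberately performs no validation so that overlong and out-of-range sequences can be decoded too.

```c
#include <stddef.h>
#include <stdint.h>

/* Extract the payload bits of a single UTF-8 sequence of length len (1..4).
   No validation is done; this exists only to check range arithmetic. */
uint32_t utf8_payload(const unsigned char *b, size_t len) {
    /* Mask for the data bits of the leading byte, indexed by length. */
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    if (len < 1 || len > 4) {
        return 0;
    }
    uint32_t cp = b[0] & lead_mask[len];
    for (size_t i = 1; i < len; i++) {
        /* Each continuation byte contributes its low six bits. */
        cp = (cp << 6) | (b[i] & 0x3F);
    }
    return cp;
}
```

For example, `0xF4 0x8F 0xBF 0xBF` decodes to 0x10FFFF (the Unicode limit), `0xF0 0x90 0x80 0x80` decodes to 0x10000, and `0xED 0xA0 0x80` decodes to 0xD800 (the first surrogate).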

zbalkan · 2024-10-13T17:20:52Z

Improved unit tests based on the edge cases.

UTF-8 Validation Test Coverage Matrix

Test Case	Valid/Invalid	Covered Case
`test_valid_utf8_sequences`	Valid	ASCII, 2-byte, 3-byte, 4-byte, complex scripts
`test_invalid_utf8_sequences`	Invalid	Overlong encodings, invalid sequences, surrogate halves
`test_utf8_random_replace`	Valid	Random byte stream with replacement, ensuring valid UTF-8
`test_utf8_random_not_replace`	N/A	Random byte stream without replacement
`test_utf8_edge_cases`	Valid/Invalid	Edge: U+10FFFF (valid), beyond U+10FFFF (invalid)
New: `test_empty_string`	Valid	Empty string (valid UTF-8)
New: `test_incomplete_utf8_sequences`	Invalid	Incomplete 2-byte, 3-byte, 4-byte sequences
New: `test_overlong_encodings`	Invalid	Overlong encodings with 2, 3, or 4 bytes
New: `test_surrogate_pair_boundary`	Valid/Invalid	Just below and just inside the surrogate range
New: `test_maximal_overhead_cases`	Valid	Maximal valid cases for each UTF-8 length
New: `test_continuation_without_leading`	Invalid	Continuation byte without a valid leading byte

zbalkan · 2024-10-13T17:26:18Z

UTF-8 Validation Test Coverage Matrix

Test Case	Valid/Invalid	Covered Case
`test_valid_utf8_sequences`	Valid	ASCII, 2-byte, 3-byte, 4-byte, complex scripts
`test_invalid_utf8_sequences`	Invalid	Overlong encodings, invalid sequences, surrogate halves
`test_utf8_random_replace`	Valid	Random byte stream with replacement, ensuring valid UTF-8
`test_utf8_random_not_replace`	N/A	Random byte stream without replacement
`test_utf8_edge_cases`	Valid/Invalid	Edge: U+10FFFF (valid), beyond U+10FFFF (invalid)
`test_empty_string`	Valid	Empty string (valid UTF-8)
`test_incomplete_utf8_sequences`	Invalid	Incomplete 2-byte, 3-byte, 4-byte sequences
`test_overlong_encodings`	Invalid	Overlong encodings with 2, 3, or 4 bytes
`test_surrogate_pair_boundary`	Valid/Invalid	Just below and just inside the surrogate range
`test_maximal_overhead_cases`	Valid	Maximal valid cases for each UTF-8 length
`test_continuation_without_leading`	Invalid	Continuation byte without a valid leading byte
New: `test_surrogate_pair_extended_boundary`	Valid/Invalid	U+D7FF (valid), U+DFFF (invalid, end of surrogate range)
New: `test_multilingual_plane_cases`	Valid	Characters from Supplementary Multilingual Plane (`U+10000`-`U+1FFFF`)
New: `test_mixed_valid_invalid_utf8`	Invalid	Mixed valid and invalid UTF-8 sequences in a single string
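
To give an idea of what the boundary cases in this matrix look like as byte strings, here is a rough, self-contained sketch. It does not reproduce the unit tests in this PR: `ref_is_utf8`, `CHECK`, and `run_boundary_checks` are hypothetical names, and the checker uses a decode-then-check approach (decode the code point, then reject overlong forms, surrogates, and values above U+10FFFF) rather than the macro-based range checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference checker over a byte buffer of length n. */
bool ref_is_utf8(const unsigned char *s, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        uint32_t cp, min;   /* min = smallest code point needing len bytes */
        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;              /* stray continuation or 0xF8..0xFF */
        if (len > n - i) return false;  /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return false;                      /* overlong */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        if (cp > 0x10FFFF) return false;                 /* past Unicode limit */
        i += len;
    }
    return true;
}

/* sizeof(literal) - 1 drops the implicit NUL terminator. */
#define CHECK(bytes, expected) \
    assert(ref_is_utf8((const unsigned char *)(bytes), sizeof(bytes) - 1) == (expected))

void run_boundary_checks(void) {
    CHECK("", true);                    /* empty string */
    CHECK("\xED\x9F\xBF", true);        /* U+D7FF, just below surrogates */
    CHECK("\xED\xA0\x80", false);       /* U+D800, first surrogate */
    CHECK("\xED\xBF\xBF", false);       /* U+DFFF, last surrogate */
    CHECK("\xEE\x80\x80", true);        /* U+E000, just above surrogates */
    CHECK("\xF0\x90\x80\x80", true);    /* U+10000 */
    CHECK("\xF4\x8F\xBF\xBF", true);    /* U+10FFFF */
    CHECK("\xF4\x90\x80\x80", false);   /* beyond U+10FFFF */
    CHECK("\xC3", false);               /* incomplete 2-byte sequence */
    CHECK("\xE2\x82", false);           /* incomplete 3-byte sequence */
    CHECK("\x80", false);               /* continuation without a lead */
    CHECK("\xC0\xAF", false);           /* overlong 2-byte encoding */
    CHECK("ok\xC3\x9C\xFFok", false);   /* valid text mixed with 0xFF */
}
```

Calling `run_boundary_checks()` from a test harness runs all the assertions. A length-based interface is used here so that truncated sequences can be tested without relying on a NUL terminator.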

zbalkan · 2024-10-14T19:40:10Z

While this PR attempts to improve the existing solution, it is better to use https://github.com/simdutf/is_utf8 for this task.

zbalkan changed the title ~~Implemented valud UTF8 character checks~~ Implemented valid UTF8 character checks Sep 26, 2024

Implemented valid UTF8 character checks

48e2945

zbalkan force-pushed the fix/fix-utf8-validation branch from 31c4c73 to 48e2945 Compare September 26, 2024 12:11

zbalkan mentioned this pull request Oct 11, 2024

Implemented valid UTF8 character checks wazuh/wazuh#26289

Open

28 tasks

Ensure that the second byte of a four-byte UTF-8 sequence starting with 0xF4 is in the range 0x80 to 0x8F

dcea21f

Added more unit tests for valid and invalid cases

fc7f68b

Improved unit tests

b91c14f

vikman90 linked an issue Oct 14, 2024 that may be closed by this pull request

The non-UTF8 character check excludes valid UTF8 characters #179

Open

Covered edge cases and improved unit tests

32a9f99

juljar mentioned this pull request Oct 21, 2024

Non-UTF8 filenames ignored by Wazuh agent on Windows platform wazuh/wazuh#25308

Open
