Fix: Adjust VARCHAR byte length calculation to fully support UTF-8 #621
Open
cs17899219 wants to merge 1 commit into apache:master
Force-pushed from ba5ce06 to 4586599.
Member
In most scenarios, 3 bytes per character is probably enough? If 4 bytes are only needed in specific scenarios, it might be better to add a configuration option to control the multiplier. See https://doris.apache.org/zh-CN/docs/dev/sql-manual/basic-element/sql-data-types/string-type/VARCHAR
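To make the configuration-option suggestion concrete, here is a minimal sketch of what a configurable multiplier might look like. It is not the connector's code: the class name, the `bytesPerChar` parameter, and the plain string constants standing in for `DorisType` are all assumptions made for illustration. Only the `65533` limit and the `%s(%s)` format come from the snippet quoted in this PR.

```java
/** Sketch only: VARCHAR sizing with a configurable bytes-per-character multiplier. */
final class ConfigurableVarcharSizing {
    // Byte limit taken from the check quoted in this PR.
    static final int MAX_VARCHAR_BYTES = 65533;

    // Plain strings stand in for the connector's DorisType constants (assumption).
    static final String VARCHAR = "VARCHAR";
    static final String STRING = "STRING";

    /**
     * Maps a source character length to a Doris type name.
     * bytesPerChar is a hypothetical setting: 3 keeps the current behavior,
     * 4 covers every UTF-8 code point.
     */
    static String toDorisType(int charLength, int bytesPerChar) {
        if (bytesPerChar < 1 || bytesPerChar > 4) {
            throw new IllegalArgumentException("bytesPerChar must be between 1 and 4");
        }
        // Use long arithmetic so very large source lengths cannot overflow int.
        long byteLength = (long) charLength * bytesPerChar;
        return byteLength > MAX_VARCHAR_BYTES
                ? STRING
                : String.format("%s(%s)", VARCHAR, byteLength);
    }

    public static void main(String[] args) {
        System.out.println(toDorisType(100, 3));   // VARCHAR(300)
        System.out.println(toDorisType(100, 4));   // VARCHAR(400)
        System.out.println(toDorisType(20000, 4)); // STRING
    }
}
```

With a default of 4 the mapping would be safe for any UTF-8 input, while users who know their data stays within 3-byte characters could opt back into the smaller columns.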
Author
In some countries, it is quite common to store information using characters like 𝓐𝓑𝓒𝓓𝓔𝓕𝓖𝓗𝓘𝓙 𝓚𝓛𝓜𝓝𝓞𝓟𝓠𝓡𝓢𝓣 𝓤𝓥𝓦𝓧𝓨𝓩𝓪𝓫𝓬𝓭 𝓮𝓯𝓰𝓱𝓲𝓳𝓴𝓵𝓶𝓷 𝓸𝓹𝓺𝓻𝓼𝓽𝓾𝓿𝔀𝔁. In addition, because utf8mb4 encodes characters of up to 4 bytes, sizing the column for 4 bytes per character ensures that 4-byte Unicode characters (such as 𝓐𝓑𝓒𝓓𝓔…) are stored without loss or truncation. Another common scenario involves storing Chinese text combined with emojis, such as: "一起喝咖啡吧☕️☕️☕️" ("Let's have coffee together").
Force-pushed from 4586599 to cc533e8.
…tf8mb4) (apache#620)

The current logic in `TypeConverter.java` uses a multiplier of `3` to calculate the required byte length for the Doris `VARCHAR` type:

```java
// Current implementation
return length * 3 > 65533 ? DorisType.STRING : String.format("%s(%s)", DorisType.VARCHAR, length * 3);
```

This assumes a maximum of 3 bytes per character, which is insufficient for the widely used utf8mb4 character set (common in MySQL/MariaDB and other sources). The utf8mb4 encoding supports the full range of Unicode characters (including emojis), requiring up to 4 bytes per character.

If a source column contains 4-byte characters, the calculated byte length may underestimate the required size, leading to:

- Data truncation or corruption during the synchronization process.
- Load failures with errors such as "data length exceeded" or "row size too large" when Doris enforces the byte limit.

Proposed Solution

This change updates the byte multiplier from 3 to 4 to safely accommodate the full utf8mb4 character set, ensuring the calculated byte length is always sufficient for the defined character length, thus guaranteeing data integrity and preventing sync failures.
Force-pushed from cc533e8 to 550b627.
Proposed changes
Issue Number: close #620
Problem Summary:
The current logic in `TypeConverter.java` uses a multiplier of `3` to calculate the required byte length for the Doris `VARCHAR` type. This assumes a maximum of 3 bytes per character, which is insufficient for the widely used utf8mb4 character set (common in MySQL/MariaDB and other sources). The utf8mb4 encoding supports the full range of Unicode characters (including emojis), requiring up to 4 bytes per character.
If a source column contains 4-byte characters, the calculated byte length may underestimate the required size, leading to:
Data truncation or corruption during the synchronization process.
Load failures with errors such as "data length exceeded" or "row size too large" when Doris enforces the byte limit.
Proposed Solution
This change updates the byte multiplier from 3 to 4 to accommodate the full utf8mb4 character set. Because no UTF-8 character needs more than 4 bytes, the calculated byte length is then always sufficient for the defined character length, which avoids truncation and length-related sync failures.
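One consequence that may be worth checking during review: a larger multiplier lowers the longest source column that still maps to `VARCHAR` rather than `STRING`. The sketch below derives that cut-off from the `65533` limit used in the existing check; the class and method names are illustrative only.

```java
/** Sketch only: largest character length that still maps to VARCHAR for a given multiplier. */
final class VarcharThresholds {
    // Byte limit from the existing check: length * multiplier > 65533 falls back to STRING.
    static final int MAX_VARCHAR_BYTES = 65533;

    // Largest length for which length * bytesPerChar does not exceed the limit.
    static int maxCharsForVarchar(int bytesPerChar) {
        return MAX_VARCHAR_BYTES / bytesPerChar;
    }

    public static void main(String[] args) {
        System.out.println(maxCharsForVarchar(3)); // 21844
        System.out.println(maxCharsForVarchar(4)); // 16383
    }
}
```

Under these assumptions, source columns declared with lengths from 16384 to 21844 characters would switch from `VARCHAR` to `STRING` once the multiplier becomes 4.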
Checklist(Required)