Fix: Adjust VARCHAR byte length calculation to fully support UTF-8 #621

Open
cs17899219 wants to merge 1 commit into apache:master from cs17899219:fix-varchar-length
@cs17899219
Proposed changes

Issue Number: close #620

Problem Summary:

The current logic in TypeConverter.java uses a multiplier of 3 to calculate the required byte length for the Doris VARCHAR type:

// Current implementation
return length * 3 > 65533
        ? DorisType.STRING
        : String.format("%s(%s)", DorisType.VARCHAR, length * 3);

This assumes a maximum of 3 bytes per character, which is insufficient for the widely used utf8mb4 character set (common in MySQL/MariaDB and other sources). The utf8mb4 encoding supports the full range of Unicode characters (including emojis), requiring up to 4 bytes per character.

If a source column contains 4-byte characters, the calculated byte length may underestimate the required size, leading to:

- Data truncation or corruption during the synchronization process.
- Load failures with errors such as "data length exceeded" or "row size too large" when Doris enforces the byte limit.
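
To make the byte arithmetic concrete, here is a minimal sketch (not part of the PR) that measures how many UTF-8 bytes individual code points occupy. The class and method names are hypothetical and exist only for illustration; the sketch uses only the standard `String.getBytes` API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of the connector.
final class Utf8WidthDemo {

    /** Returns the number of bytes UTF-8 needs to encode a single code point. */
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Returns the worst-case byte budget for a column of the given character length. */
    static long worstCaseBytes(int charLength, int bytesPerChar) {
        return (long) charLength * bytesPerChar;
    }

    public static void main(String[] args) {
        // U+0061 (Latin a), U+4E2D (a CJK ideograph), U+1F600 (an emoji),
        // U+1D4D0 (mathematical bold script capital A)
        int[] samples = {0x61, 0x4E2D, 0x1F600, 0x1D4D0};
        for (int cp : samples) {
            System.out.printf("U+%04X needs %d byte(s)%n", cp, utf8Bytes(cp));
        }
    }
}
```

Code points outside the Basic Multilingual Plane, such as emoji and mathematical script letters, take 4 bytes each, so a budget of 3 bytes per character can fall short for a column filled with them.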

Proposed Solution

This change updates the byte multiplier from 3 to 4 so that the calculated byte length covers the full utf8mb4 character set. The computed length is then always sufficient for the declared character length, which prevents the truncation and sync failures described above.
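
A rough sketch of what the adjusted calculation could look like is shown below. It is self-contained, so the `VARCHAR` and `STRING` constants are stand-ins for `DorisType.VARCHAR` and `DorisType.STRING`, and the class name, method name, and the `long` widening are assumptions for illustration rather than the actual diff.

```java
// Illustrative sketch only; the real change lives in TypeConverter.java.
final class VarcharLengthSketch {
    // Stand-ins for DorisType.VARCHAR and DorisType.STRING (values assumed).
    static final String VARCHAR = "VARCHAR";
    static final String STRING = "STRING";

    // Byte limit taken from the snippet in the PR description.
    static final int MAX_VARCHAR_BYTES = 65533;

    // utf8mb4 needs up to 4 bytes per character.
    static final int BYTES_PER_CHAR = 4;

    /** Maps a source character length to a Doris string type. */
    static String toDorisStringType(int length) {
        // Widen to long so very large source lengths cannot overflow int.
        long byteLength = (long) length * BYTES_PER_CHAR;
        return byteLength > MAX_VARCHAR_BYTES
                ? STRING
                : String.format("%s(%s)", VARCHAR, byteLength);
    }

    public static void main(String[] args) {
        System.out.println(toDorisStringType(10));
        System.out.println(toDorisStringType(16384));
    }
}
```

With a multiplier of 4, a source length of 16383 still fits in `VARCHAR` (65532 bytes), while 16384 and above falls back to `STRING`. With the multiplier of 3, that cutoff sat at 21844 characters.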

Checklist(Required)

  1. Does it affect the original behavior: No
  2. Has unit tests been added: No
  3. Has document been added or modified: No
  4. Does it need to update dependencies: No
  5. Are there any changes that cannot be rolled back: No

@JNSimba
Member

JNSimba commented Nov 19, 2025

In most scenarios, it's probably 3 bytes? If it's only for specific scenarios, it might be better to add a configuration option to control it. https://doris.apache.org/zh-CN/docs/dev/sql-manual/basic-element/sql-data-types/string-type/VARCHAR
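
One way the suggested configuration option might look is sketched below. Everything here is hypothetical: the class name, the `bytesPerChar` parameter, and the default of 3 are assumptions made for illustration, and no such option exists in the connector as far as this thread shows.

```java
// Hypothetical sketch of a configurable multiplier; not an existing connector API.
final class ConfigurableMultiplierSketch {
    static final int MAX_VARCHAR_BYTES = 65533;  // limit from the PR snippet
    static final int DEFAULT_BYTES_PER_CHAR = 3; // assumed default: current behavior

    /** Maps a character length to a Doris type using a caller-supplied multiplier. */
    static String toDorisStringType(int length, int bytesPerChar) {
        if (bytesPerChar < 1 || bytesPerChar > 4) {
            throw new IllegalArgumentException(
                    "bytesPerChar must be between 1 and 4, got " + bytesPerChar);
        }
        long byteLength = (long) length * bytesPerChar;
        return byteLength > MAX_VARCHAR_BYTES
                ? "STRING"
                : "VARCHAR(" + byteLength + ")";
    }

    /** Keeps the existing 3-byte behavior when no multiplier is configured. */
    static String toDorisStringType(int length) {
        return toDorisStringType(length, DEFAULT_BYTES_PER_CHAR);
    }
}
```

Under this design, existing jobs would keep their current column sizes, and users with utf8mb4 sources could opt in to a multiplier of 4.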

@cs17899219
Author

cs17899219 commented Nov 25, 2025

In some countries, it is quite common to store information using characters like 𝓐𝓑𝓒𝓓𝓔𝓕𝓖𝓗𝓘𝓙 𝓚𝓛𝓜𝓝𝓞𝓟𝓠𝓡𝓢𝓣 𝓤𝓥𝓦𝓧𝓨𝓩𝓪𝓫𝓬𝓭 𝓮𝓯𝓰𝓱𝓲𝓳𝓴𝓵𝓶𝓷 𝓸𝓹𝓺𝓻𝓼𝓽𝓾𝓿𝔀𝔁. Because utf8mb4 supports characters of up to 4 bytes, sizing columns for 4-byte Unicode characters (such as 𝓐𝓑𝓒𝓓𝓔…) ensures they are stored without loss or truncation. Another common scenario is Chinese text combined with emojis, such as: "一起喝咖啡吧☕️☕️☕️"

…tf8mb4) (apache#620)

cs17899219 marked this pull request as draft on December 1, 2025, 08:11
cs17899219 marked this pull request as ready for review on December 1, 2025, 08:12
Development

Successfully merging this pull request may close these issues.

[Enhancement] Adjust VARCHAR byte length calculation to fully support UTF-8 (utf8mb4)