Fix: Adjust VARCHAR byte length calculation to fully support UTF-8 by cs17899219 · Pull Request #621 · apache/doris-flink-connector

cs17899219 · 2025-11-18T08:58:10Z

Proposed changes

Issue Number: close #620

Problem Summary:

The current logic in TypeConverter.java uses a multiplier of 3 to calculate the required byte length for the Doris VARCHAR type:

// Current implementation
return length * 3 > 65533
        ? DorisType.STRING
        : String.format("%s(%s)", DorisType.VARCHAR, length * 3);

This assumes a maximum of 3 bytes per character, which is insufficient for the widely used utf8mb4 character set (common in MySQL/MariaDB and other sources). The utf8mb4 encoding supports the full range of Unicode characters (including emojis), requiring up to 4 bytes per character.

If a source column contains 4-byte characters, the calculated byte length may underestimate the required size, leading to:

Data truncation or corruption during the synchronization process.

Load failures with errors such as "data length exceeded" or "row size too large" when Doris enforces the byte limit.
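To make the underestimate concrete, here is a minimal sketch that compares the UTF-8 size of a value against a byte budget of `charLength * bytesPerChar`. The class and method names are hypothetical and are not part of the connector; the sketch only assumes that the target column counts bytes and stores UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class VarcharOverflowDemo {
    /** Bytes needed to store s in a byte-counted column, assuming UTF-8 storage. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if s fits in a column sized as charLength * bytesPerChar bytes. */
    static boolean fits(String s, int charLength, int bytesPerChar) {
        return utf8Bytes(s) <= (long) charLength * bytesPerChar;
    }

    public static void main(String[] args) {
        // Ten copies of U+1F600: 10 code points, 4 UTF-8 bytes each.
        String tenEmoji = "\uD83D\uDE00".repeat(10);
        System.out.println(utf8Bytes(tenEmoji));   // 40
        System.out.println(fits(tenEmoji, 10, 3)); // false
        System.out.println(fits(tenEmoji, 10, 4)); // true
    }
}
```

A 10-character source value made of 4-byte code points needs 40 bytes, so it does not fit in the 30 bytes that a multiplier of 3 would allocate.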

Proposed Solution

This change raises the byte multiplier from 3 to 4 so that the calculated byte length covers the full utf8mb4 character set. The resulting VARCHAR is then always large enough for the declared character length, which prevents truncation and sync failures caused by undersized columns.
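A self-contained sketch of the adjusted calculation might look as follows. The string constants stand in for the connector's `DorisType.STRING` and `DorisType.VARCHAR` (their values are assumed here), the method name is hypothetical, and the widening to `long` is an addition of this sketch to avoid `int` overflow, not something the PR states.

```java
final class VarcharLengthSketch {
    // Stand-ins for the connector's DorisType constants (assumed values).
    static final String STRING = "STRING";
    static final String VARCHAR = "VARCHAR";
    // Byte limit taken from the snippet in the problem summary.
    static final int MAX_VARCHAR_BYTES = 65533;
    // Proposed multiplier: up to 4 bytes per utf8mb4 character.
    static final int BYTES_PER_CHAR = 4;

    /** Maps a source character length to a Doris string type (hypothetical helper). */
    static String toDorisStringType(int charLength) {
        long byteLength = (long) charLength * BYTES_PER_CHAR;
        return byteLength > MAX_VARCHAR_BYTES
                ? STRING
                : String.format("%s(%s)", VARCHAR, byteLength);
    }
}
```

Under these assumptions a 255-character column maps to `VARCHAR(1020)`. One consequence worth noting: the largest character length that still maps to VARCHAR drops from 21844 (multiplier 3) to 16383 (multiplier 4), and longer columns fall through to STRING.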

Checklist(Required)

Does it affect the original behavior: No
Has unit tests been added: No
Has document been added or modified: No
Does it need to update dependencies: No
Are there any changes that cannot be rolled back: No

JNSimba · 2025-11-19T07:48:19Z

In most scenarios, 3 bytes per character is probably enough? If 4 bytes are only needed in specific scenarios, it might be better to add a configuration option to control the multiplier. https://doris.apache.org/zh-CN/docs/dev/sql-manual/basic-element/sql-data-types/string-type/VARCHAR
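One way the suggested configuration option could be shaped is sketched below. The option key `varchar.bytes-per-char` is purely hypothetical and is not an existing doris-flink-connector option; the default of 3 is assumed so that existing behavior stays unchanged unless a user opts in.

```java
import java.util.Map;

final class ConfigurableVarcharSketch {
    // Hypothetical option key; NOT an existing doris-flink-connector option.
    static final String OPTION_KEY = "varchar.bytes-per-char";
    // Assumed default that keeps the current behavior.
    static final int DEFAULT_BYTES_PER_CHAR = 3;
    static final int MAX_VARCHAR_BYTES = 65533;

    /** Reads the multiplier from the options and rejects values outside 1..4. */
    static int bytesPerChar(Map<String, String> options) {
        int value = Integer.parseInt(
                options.getOrDefault(OPTION_KEY, String.valueOf(DEFAULT_BYTES_PER_CHAR)));
        if (value < 1 || value > 4) {
            throw new IllegalArgumentException(
                    OPTION_KEY + " must be between 1 and 4, got " + value);
        }
        return value;
    }

    /** Maps a source character length to a Doris string type using the configured multiplier. */
    static String toDorisStringType(int charLength, Map<String, String> options) {
        long byteLength = (long) charLength * bytesPerChar(options);
        return byteLength > MAX_VARCHAR_BYTES
                ? "STRING"
                : "VARCHAR(" + byteLength + ")";
    }
}
```

With no option set, a 100-character column would still map to `VARCHAR(300)`; setting the hypothetical key to `4` would yield `VARCHAR(400)`.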

cs17899219 · 2025-11-25T08:16:40Z

In some countries, it is quite common to store information using characters like 𝓐𝓑𝓒𝓓𝓔𝓕𝓖𝓗𝓘𝓙 𝓚𝓛𝓜𝓝𝓞𝓟𝓠𝓡𝓢𝓣 𝓤𝓥𝓦𝓧𝓨𝓩𝓪𝓫𝓬𝓭 𝓮𝓯𝓰𝓱𝓲𝓳𝓴𝓵𝓶𝓷 𝓸𝓹𝓺𝓻𝓼𝓽𝓾𝓿𝔀𝔁. In addition, because utf8mb4 supports characters of up to 4 bytes, allowing 4 bytes per character covers the full encoding range, so 4-byte Unicode characters (such as 𝓐𝓑𝓒𝓓𝓔…) are stored without loss or truncation. Another common scenario is Chinese text combined with emojis, such as: "一起喝咖啡吧☕️☕️☕️"
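Which of these characters actually need 4 bytes can be checked per code point. The sketch below uses a hypothetical helper name and standard JDK calls only. It assumes the script letter 𝓐 is U+1D4D0 and that the coffee emoji is encoded as U+2615 followed by the variation selector U+FE0F.

```java
import java.nio.charset.StandardCharsets;

final class CodePointWidthDemo {
    /** Largest UTF-8 width, in bytes, of any single code point in s (0 for an empty string). */
    static int maxBytesPerCodePoint(String s) {
        return s.codePoints()
                .map(cp -> new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8).length)
                .max()
                .orElse(0);
    }

    public static void main(String[] args) {
        // Mathematical script capital A, U+1D4D0, written as a surrogate pair.
        System.out.println(maxBytesPerCodePoint("\uD835\uDCD0"));
        // Hot beverage U+2615 plus variation selector U+FE0F (assumed encoding of the emoji).
        System.out.println(maxBytesPerCodePoint("\u2615\uFE0F"));
    }
}
```

Under those assumptions the script letters are 4 bytes each, while the coffee emoji consists of two 3-byte code points. Emoji outside the Basic Multilingual Plane, such as U+1F600, do need 4 bytes.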


cs17899219 force-pushed the fix-varchar-length branch from ba5ce06 to 4586599 · November 18, 2025 09:02

cs17899219 force-pushed the fix-varchar-length branch from 4586599 to cc533e8 · November 25, 2025 08:52

cs17899219 force-pushed the fix-varchar-length branch from cc533e8 to 550b627 · November 25, 2025 09:04

cs17899219 marked this pull request as draft · December 1, 2025 08:11

cs17899219 marked this pull request as ready for review · December 1, 2025 08:12

cs17899219 wants to merge 1 commit into apache:master from cs17899219:fix-varchar-length
