HIVE-27370: support 4 bytes characters #5624

ryukobayashi · 2025-01-28T05:42:01Z

What changes were proposed in this pull request?

If the string argument of the SUBSTR UDF contains 4-byte characters, the vectorized and non-vectorized implementations behave differently. The vectorized version handles 4-byte characters correctly, but the non-vectorized version does not, so it needs similar logic.
This fix reuses the logic from the vectorized implementations:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/StringSubstrColStartLen.java#L89-L130
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/StringSubstrColStart.java#L78-L109
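
For readers unfamiliar with the underlying issue, here is a minimal sketch of why the two approaches can diverge. It is not the code in this PR or in the linked vectorized classes; the class and method names (SubstrSketch, utf8CharLength, substrUtf8, substrUtf16) are hypothetical, and the sketch assumes well-formed UTF-8 and a non-negative, 0-based start. Java's String indexes UTF-16 code units, so a 4-byte UTF-8 character (a surrogate pair in UTF-16) counts as two units and can be cut in half, while walking the UTF-8 bytes by lead byte counts it as one character.

```java
import java.nio.charset.StandardCharsets;

public class SubstrSketch {

    /** Byte length of the UTF-8 sequence that starts with the given lead byte. */
    public static int utf8CharLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;  // 0xxxxxxx: ASCII
        if (b < 0xE0) return 2;  // 110xxxxx
        if (b < 0xF0) return 3;  // 1110xxxx
        return 4;                // 11110xxx: supplementary characters such as emoji
    }

    /**
     * Substring counted in characters, computed by walking UTF-8 bytes.
     * Assumption: utf8[0..byteLen) is well-formed UTF-8, start is 0-based and >= 0.
     */
    public static String substrUtf8(byte[] utf8, int byteLen, int start, int len) {
        int offset = 0;
        // Skip 'start' characters.
        for (int i = 0; i < start && offset < byteLen; i++) {
            offset += utf8CharLength(utf8[offset]);
        }
        int end = offset;
        // Take up to 'len' characters.
        for (int i = 0; i < len && end < byteLen; i++) {
            end += utf8CharLength(utf8[end]);
        }
        return new String(utf8, offset, end - offset, StandardCharsets.UTF_8);
    }

    /** Naive variant: counts UTF-16 code units, so it can split a surrogate pair. */
    public static String substrUtf16(String s, int start, int len) {
        int end = Math.min(s.length(), start + len);
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String s = "あa🤎いiうu";  // same literal as in the test query of this PR
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(substrUtf8(bytes, bytes.length, 0, 3));
        // The naive result ends in an unpaired high surrogate, so it differs.
        System.out.println(substrUtf16(s, 0, 3).equals(substrUtf8(bytes, bytes.length, 0, 3)));
    }
}
```

The sketch takes the valid byte length as a separate parameter rather than using the array length, so it also works on a buffer that is larger than its content.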

Why are the changes needed?

The vectorized and non-vectorized execution paths return different results for the same SUBSTR call.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added test patterns to the itests to verify that these cases work correctly.

sonarqubecloud · 2025-01-28T11:20:04Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

deniskuzZ · 2025-02-01T18:30:12Z

@difin, please take a look when you have time

ql/src/java/org/apache/hadoop/hive/ql/udf/UDFSubstr.java

ryukobayashi · 2025-04-10T10:19:05Z

@deniskuzZ Sorry, I also overlooked these comments. I have fixed them.

okumin · 2025-06-27T11:06:41Z

Sure. I was not sure whether I should review a coworker's patch, since I wanted to avoid bias. I will check this anyway.

okumin · 2025-07-01T11:20:13Z

ql/src/test/queries/clientpositive/udf_substr.q

+  substr('あa🤎いiうu', 1, 3) as b1,
+  substr('あa🤎いiうu', 3) as b2,
+  substr('あa🤎いiうu', -5) as b3
+FROM src tablesample (1 rows);


I verified that the master branch fails this test case.

+POSTHOOK: query: SELECT
+  substr('あa🤎いiうu', 1, 3) as b1,
+  substr('あa🤎いiうu', 3) as b2,
+  substr('あa🤎いiうu', -5) as b3
+FROM src tablesample (1 rows)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src
+#### A masked pattern was here ####
+あa? 🤎いiうu ?いiうu

okumin · 2025-07-01T11:26:14Z

ql/src/java/org/apache/hadoop/hive/ql/udf/UDFSubstr.java

-    String s = t.toString();
-    int[] index = makeIndex(pos, len, s.length());
-    if (index == null) {
+    byte[] utf8String = t.toString().getBytes();


Why not use Text#getBytes or Text#copyBytes? Probably, getBytes is preferable because we don't mutate it. Note that we have to use Text#getLength instead of the size of the byte array.
https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Text.java#L116-L132
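
To illustrate the point about the length, here is a minimal sketch. It does not use Hadoop: ReusableBytes is a hypothetical stand-in that only mimics the contract described in the linked Text source, where getBytes returns the backing array and only the first getLength bytes are valid. The buffer is assumed to grow but never shrink when it is reused.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BackingBufferSketch {

    /** Hypothetical stand-in for a reusable byte container; this is not Hadoop's Text. */
    public static final class ReusableBytes {
        private byte[] bytes = new byte[0];
        private int length;

        /** Stores the UTF-8 encoding of s, reusing the backing array when it is large enough. */
        public void set(String s) {
            byte[] src = s.getBytes(StandardCharsets.UTF_8);
            if (bytes.length < src.length) {
                bytes = new byte[src.length];  // grow, never shrink
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length;
        }

        /** Backing array without copying; bytes past getLength() are stale. */
        public byte[] getBytes() { return bytes; }

        /** Number of valid bytes in the backing array. */
        public int getLength() { return length; }

        /** Fresh array holding exactly the valid bytes. */
        public byte[] copyBytes() { return Arrays.copyOf(bytes, length); }
    }

    public static void main(String[] args) {
        ReusableBytes t = new ReusableBytes();
        t.set("hello world");
        t.set("hi");  // reuses the 11-byte array
        System.out.println(t.getBytes().length);  // 11: size of the backing array
        System.out.println(t.getLength());        // 2: number of valid bytes
        System.out.println(new String(t.getBytes(), 0, t.getLength(), StandardCharsets.UTF_8));
    }
}
```

Separately, String#getBytes() without a charset argument encodes with the platform default charset, so reading the bytes directly also avoids depending on that setting.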

Ah, OK. I will fix it.

ql/src/java/org/apache/hadoop/hive/ql/udf/UDFSubstr.java

Co-authored-by: Shohei Okumiya <[email protected]>

okumin · 2025-07-04T09:46:07Z

Looks good to me, but we're running additional in-house integration tests just in case.

sonarqubecloud · 2025-07-15T07:07:04Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

HIVE-27370: support 4 bytes characters

a6a9ea8

asf-ci-hive added tests pending tests unstable and removed tests pending labels Jan 28, 2025

fixed test results

dcdf3e7

asf-ci-hive added tests pending and removed tests unstable labels Jan 28, 2025

asf-ci-hive added tests passed and removed tests pending labels Jan 28, 2025

deniskuzZ reviewed Mar 23, 2025

ql/src/java/org/apache/hadoop/hive/ql/udf/UDFSubstr.java (outdated)

deniskuzZ reviewed Mar 23, 2025

ql/src/java/org/apache/hadoop/hive/ql/udf/UDFSubstr.java (outdated)

fixed

a17e45b

asf-ci-hive added tests pending and removed tests passed labels Apr 10, 2025

asf-ci-hive added tests failed and removed tests pending labels Apr 10, 2025

Merge branch 'master' into HIVE-27370

d9c3197

asf-ci-hive added tests pending tests failed and removed tests failed tests pending labels Apr 11, 2025

Merge branch 'master' into HIVE-27370

d9ceff0

asf-ci-hive added tests pending tests failed and removed tests failed tests pending labels Apr 14, 2025

fixed javadoc

c5292c8

asf-ci-hive added tests passed and removed tests pending labels Jun 26, 2025

okumin reviewed Jul 1, 2025

ryukobayashi added 2 commits July 2, 2025 11:56

Merge branch 'master' into HIVE-27370

90b71aa

fixed

e75df14

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Jul 2, 2025

okumin reviewed Jul 2, 2025

ql/src/java/org/apache/hadoop/hive/ql/udf/UDFSubstr.java (outdated)

okumin reviewed Jul 2, 2025

ql/src/java/org/apache/hadoop/hive/ql/udf/UDFSubstr.java (outdated)

fixed

371b02b

asf-ci-hive added tests pending tests unstable and removed tests passed tests pending labels Jul 3, 2025

okumin reviewed Jul 3, 2025

ql/src/java/org/apache/hadoop/hive/ql/udf/UDFSubstr.java (outdated)

Update ql/src/java/org/apache/hadoop/hive/ql/udf/UDFSubstr.java

7e57af6

Co-authored-by: Shohei Okumiya <[email protected]>

asf-ci-hive added tests pending tests failed and removed tests unstable tests pending labels Jul 4, 2025

Merge branch 'master' into HIVE-27370

a314251

asf-ci-hive added tests pending and removed tests failed labels Jul 15, 2025

asf-ci-hive added tests unstable and removed tests pending labels Jul 15, 2025
