
[KYUUBI #7122] Support ORC hive table pushdown filter #7123

Closed

flaming-archer wants to merge 9 commits into apache:master from flaming-archer:master_scanbuilder_new

Conversation

flaming-archer (Contributor) commented Jun 30, 2025

Why are the changes needed?

Previously, the HiveScan class was used to read data. If a table is determined to be ORC, the OrcScan from Spark DataSource V2 can be used instead. OrcScan supports filter pushdown, while HiveScan does not yet.

In our testing, this yields approximately a 2x performance improvement.

The conversion is controlled by the setting spark.sql.kyuubi.hive.connector.read.convertMetastoreOrc. When enabled, the data source ORC reader is used to process ORC tables created with HiveQL syntax, instead of the Hive SerDe.
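The dispatch this describes can be sketched in plain Scala (a minimal sketch: `chooseScanBuilder` and the returned names are illustrative stand-ins, not the connector's real API; the actual logic lives in HiveTable.newScanBuilder):

```scala
// Minimal sketch of the scan-builder dispatch: use the DSv2 ORC reader
// only when the table is ORC and the conversion flag is enabled.
object ScanDispatchSketch {
  def chooseScanBuilder(
      convertedProvider: Option[String],
      convertMetastoreOrc: Boolean): String =
    convertedProvider match {
      // DSv2 OrcScan path: supports filter pushdown
      case Some("ORC") if convertMetastoreOrc => "OrcScanBuilder"
      // otherwise fall back to the Hive SerDe reader
      case _ => "HiveScanBuilder"
    }

  def main(args: Array[String]): Unit = {
    println(chooseScanBuilder(Some("ORC"), convertMetastoreOrc = true))  // OrcScanBuilder
    println(chooseScanBuilder(Some("ORC"), convertMetastoreOrc = false)) // HiveScanBuilder
    println(chooseScanBuilder(None, convertMetastoreOrc = true))         // HiveScanBuilder
  }
}
```
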

close #7122

How was this patch tested?

added unit test

Was this patch authored or co-authored using generative AI tooling?

No

support orc hive table pushdown filters
codecov-commenter commented Jun 30, 2025

Codecov Report

Attention: Patch coverage is 0% with 21 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (31bbb53) to head (c3f412f).
Report is 14 commits behind head on master.

Files with missing lines Patch % Lines
...apache/kyuubi/spark/connector/hive/HiveTable.scala 0.00% 14 Missing ⚠️
...spark/connector/hive/KyuubiHiveConnectorConf.scala 0.00% 5 Missing ⚠️
...uubi/spark/connector/hive/read/HiveFileIndex.scala 0.00% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #7123   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         700     700           
  Lines       43364   43397   +33     
  Branches     5869    5878    +9     
======================================
- Misses      43364   43397   +33     

☔ View full report in Codecov by Sentry.

@flaming-archer flaming-archer requested a review from pan3793 July 7, 2025 09:59
@pan3793 pan3793 requested a review from yikf July 7, 2025 13:32
pan3793 (Member) commented Jul 7, 2025

LGTM, the test code should be cleaned up before merging.

override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = {
HiveScanBuilder(sparkSession, fileIndex, dataSchema, catalogTable)
convertedProvider match {
case Some("ORC") if sparkSession.sessionState.conf.getConf(READ_CONVERT_METASTORE_ORC) =>
A reviewer (Contributor) commented:
Why not use the ConvertRelation implementation mechanism?

pan3793 (Member) replied Jul 8, 2025:

That requires more work and is more invasive than the current approach. Let's try this simple way first; the extended-rule approach can be evaluated later.

val provider = catalogTable.provider.map(_.toUpperCase(Locale.ROOT))
if (orc || provider.contains("ORC")) {
Some("ORC")
} else if (parquet || provider.contains("PARQUET")) {
A reviewer (Contributor) commented:

Could the provider possibly be parquet? It seems it can only be hive.

flaming-archer (Contributor, Author) replied:

Actually, I copied this from here; maybe it can be parquet?

catalogTable.provider.map(_.toUpperCase(Locale.ROOT)).exists {
  case "PARQUET" => true
  case "ORC" => true
  case _ => false
}
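The provider check under discussion can be exercised as a self-contained sketch (`detectProvider`, `serdeIsOrc`, and `serdeIsParquet` are hypothetical names; the real code reads catalogTable.provider and the table's serde):

```scala
import java.util.Locale

// Hypothetical helper mirroring the quoted provider check: prefer the
// serde-based signal, otherwise fall back to the catalog provider string.
object ProviderSketch {
  def detectProvider(
      provider: Option[String],
      serdeIsOrc: Boolean = false,
      serdeIsParquet: Boolean = false): Option[String] = {
    val upper = provider.map(_.toUpperCase(Locale.ROOT))
    if (serdeIsOrc || upper.contains("ORC")) Some("ORC")
    else if (serdeIsParquet || upper.contains("PARQUET")) Some("PARQUET")
    else None // e.g. provider "hive" with a text serde: no conversion
  }
}
```
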

yikf (Contributor) commented Jul 8, 2025

@flaming-archer Thanks for your effort, left some comments.

flaming-archer and others added 3 commits July 8, 2025 16:02
…g/apache/kyuubi/spark/connector/hive/KyuubiHiveConnectorConf.scala

Co-authored-by: Cheng Pan <pan3793@gmail.com>
@flaming-archer flaming-archer requested review from pan3793 and yikf July 8, 2025 08:55
pan3793 (Member) left a review comment:

LGTM, only a minor test issue

pan3793 (Member) commented Jul 8, 2025

@flaming-archer could you update the PR description to briefly describe how the code works, and mention the user-facing configuration?

@pan3793 pan3793 added this to the v1.11.0 milestone Jul 8, 2025
pan3793 (Member) commented Jul 8, 2025

also cc @cfmcgrady

pan3793 (Member) commented Jul 9, 2025

will merge it this afternoon if no further comments

@pan3793 pan3793 closed this in 60371b5 Jul 9, 2025
pan3793 (Member) commented Jul 9, 2025

Thanks, merged to master

yangyuxia pushed a commit to yangyuxia/kyuubi that referenced this pull request Sep 22, 2025
### Why are the changes needed?

Previously, the `HiveScan` class was used to read data. If a table is determined to be ORC, the `OrcScan` from Spark DataSource V2 can be used instead. `OrcScan` supports filter pushdown, while `HiveScan` does not yet.

In our testing, this yields approximately a 2x performance improvement.

The conversion is controlled by the setting `spark.sql.kyuubi.hive.connector.read.convertMetastoreOrc`. When enabled, the data source ORC reader is used to process ORC tables created with HiveQL syntax, instead of the Hive SerDe.

close apache#7122

### How was this patch tested?

added unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#7123 from flaming-archer/master_scanbuilder_new.

Closes apache#7122

c3f412f [tian bao] add case _
2be4890 [tian bao] Merge branch 'master_scanbuilder_new' of github.com:flaming-archer/kyuubi into master_scanbuilder_new
c825d0f [tian bao] review change
8a26d6a [tian bao] Update extensions/spark/kyuubi-spark-connector-hive/src/main/scala/org/apache/kyuubi/spark/connector/hive/KyuubiHiveConnectorConf.scala
68d4196 [tian bao] review change
bed007f [tian bao] review change
b89e6e6 [tian bao] Optimize UT
5a8941b [tian bao] fix failed ut
dc1ba47 [tian bao] orc pushdown version 0

Authored-by: tian bao <2011xuesong@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
yangyuxia pushed a commit to yangyuxia/kyuubi that referenced this pull request Nov 12, 2025
(same commit message as above)
yangyuxia pushed a commit to yangyuxia/kyuubi that referenced this pull request Nov 13, 2025
(same commit message as above)


Development

Successfully merging this pull request may close these issues.

[FEATURE] Spark Hive connector supports ORC hive table pushdown filter

4 participants