
[KYUUBI #7122] Support ORC hive table pushdown filter #7123

Closed

flaming-archer wants to merge 9 commits into apache:master from flaming-archer:master_scanbuilder_new

Conversation

flaming-archer (Contributor) commented Jun 30, 2025

Why are the changes needed?

Previously, the HiveScan class was used to read data. If a table is determined to be ORC, the OrcScan from Spark DataSource V2 can be used instead. OrcScan supports filter pushdown, while HiveScan does not yet.

In our testing, this yields approximately a 2x performance improvement.

The conversion is controlled by the setting spark.sql.kyuubi.hive.connector.read.convertMetastoreOrc. When enabled, the data source ORC reader is used to process ORC tables created with HiveQL syntax, instead of the Hive SerDe.
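The dispatch this describes can be sketched in plain Scala (a minimal sketch: `chooseScanBuilder` and the returned names are illustrative stand-ins, not the connector's real API; the actual logic lives in HiveTable.newScanBuilder):

```scala
// Minimal sketch of the scan-builder dispatch: use the DSv2 ORC reader
// only when the table is ORC and the conversion flag is enabled.
object ScanDispatchSketch {
  def chooseScanBuilder(
      convertedProvider: Option[String],
      convertMetastoreOrc: Boolean): String =
    convertedProvider match {
      // DSv2 OrcScan path: supports filter pushdown
      case Some("ORC") if convertMetastoreOrc => "OrcScanBuilder"
      // otherwise fall back to the Hive SerDe reader
      case _ => "HiveScanBuilder"
    }

  def main(args: Array[String]): Unit = {
    println(chooseScanBuilder(Some("ORC"), convertMetastoreOrc = true))  // OrcScanBuilder
    println(chooseScanBuilder(Some("ORC"), convertMetastoreOrc = false)) // HiveScanBuilder
    println(chooseScanBuilder(None, convertMetastoreOrc = true))         // HiveScanBuilder
  }
}
```
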

close #7122

How was this patch tested?

added unit test

Was this patch authored or co-authored using generative AI tooling?

No

support orc hive table pushdown filters
codecov-commenter commented Jun 30, 2025

Codecov Report

Attention: Patch coverage is 0% with 21 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (31bbb53) to head (c3f412f).
Report is 14 commits behind head on master.

Files with missing lines Patch % Lines
...apache/kyuubi/spark/connector/hive/HiveTable.scala 0.00% 14 Missing ⚠️
...spark/connector/hive/KyuubiHiveConnectorConf.scala 0.00% 5 Missing ⚠️
...uubi/spark/connector/hive/read/HiveFileIndex.scala 0.00% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #7123   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         700     700           
  Lines       43364   43397   +33     
  Branches     5869    5878    +9     
======================================
- Misses      43364   43397   +33     

☔ View full report in Codecov by Sentry.

@flaming-archer flaming-archer requested a review from pan3793 July 7, 2025 09:59
@pan3793 pan3793 requested a review from yikf July 7, 2025 13:32
pan3793 (Member) commented Jul 7, 2025

LGTM, the test code should be cleaned up before merging.

override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = {
HiveScanBuilder(sparkSession, fileIndex, dataSchema, catalogTable)
convertedProvider match {
case Some("ORC") if sparkSession.sessionState.conf.getConf(READ_CONVERT_METASTORE_ORC) =>
A reviewer (Contributor) commented:
Why not use the ConvertRelation implementation mechanism?

pan3793 (Member) replied Jul 8, 2025:

That requires more work and is more invasive than the current approach. Let's try this simple way first; the extended-rule approach can be evaluated later.

val provider = catalogTable.provider.map(_.toUpperCase(Locale.ROOT))
if (orc || provider.contains("ORC")) {
Some("ORC")
} else if (parquet || provider.contains("PARQUET")) {
A reviewer (Contributor) commented:

Could the provider possibly be parquet? It seems it can only be hive.

flaming-archer (Contributor, Author) replied:

Actually, I copied this from here; maybe it can be parquet?

catalogTable.provider.map(_.toUpperCase(Locale.ROOT)).exists {
  case "PARQUET" => true
  case "ORC" => true
  case _ => false
}
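The provider check under discussion can be exercised as a self-contained sketch (`detectProvider`, `serdeIsOrc`, and `serdeIsParquet` are hypothetical names; the real code reads catalogTable.provider and the table's serde):

```scala
import java.util.Locale

// Hypothetical helper mirroring the quoted provider check: prefer the
// serde-based signal, otherwise fall back to the catalog provider string.
object ProviderSketch {
  def detectProvider(
      provider: Option[String],
      serdeIsOrc: Boolean = false,
      serdeIsParquet: Boolean = false): Option[String] = {
    val upper = provider.map(_.toUpperCase(Locale.ROOT))
    if (serdeIsOrc || upper.contains("ORC")) Some("ORC")
    else if (serdeIsParquet || upper.contains("PARQUET")) Some("PARQUET")
    else None // e.g. provider "hive" with a text serde: no conversion
  }
}
```
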

yikf (Contributor) commented Jul 8, 2025

@flaming-archer Thanks for your effort, left some comments.

flaming-archer and others added 3 commits July 8, 2025 16:02
…g/apache/kyuubi/spark/connector/hive/KyuubiHiveConnectorConf.scala

Co-authored-by: Cheng Pan <pan3793@gmail.com>
@flaming-archer flaming-archer requested review from pan3793 and yikf July 8, 2025 08:55
pan3793 (Member) left a review comment:

LGTM, only a minor test issue

pan3793 (Member) commented Jul 8, 2025

@flaming-archer could you update the PR description to briefly describe how the code works, and mention the user-facing configuration?

@pan3793 pan3793 added this to the v1.11.0 milestone Jul 8, 2025
pan3793 (Member) commented Jul 8, 2025

also cc @cfmcgrady

pan3793 (Member) commented Jul 9, 2025

will merge it this afternoon if no further comments

@pan3793 pan3793 closed this in 60371b5 Jul 9, 2025
pan3793 (Member) commented Jul 9, 2025

Thanks, merged to master

yangyuxia pushed a commit to yangyuxia/kyuubi that referenced this pull request Sep 22, 2025
### Why are the changes needed?

Previously, the `HiveScan` class was used to read data. If a table is determined to be ORC, the `OrcScan` from Spark DataSource V2 can be used instead. `OrcScan` supports filter pushdown, while `HiveScan` does not yet.

In our testing, this yields approximately a 2x performance improvement.

The conversion is controlled by the setting `spark.sql.kyuubi.hive.connector.read.convertMetastoreOrc`. When enabled, the data source ORC reader is used to process ORC tables created with HiveQL syntax, instead of the Hive SerDe.

close apache#7122

### How was this patch tested?

added unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#7123 from flaming-archer/master_scanbuilder_new.

Closes apache#7122

c3f412f [tian bao] add case _
2be4890 [tian bao] Merge branch 'master_scanbuilder_new' of github.com:flaming-archer/kyuubi into master_scanbuilder_new
c825d0f [tian bao] review change
8a26d6a [tian bao] Update extensions/spark/kyuubi-spark-connector-hive/src/main/scala/org/apache/kyuubi/spark/connector/hive/KyuubiHiveConnectorConf.scala
68d4196 [tian bao] review change
bed007f [tian bao] review change
b89e6e6 [tian bao] Optimize UT
5a8941b [tian bao] fix failed ut
dc1ba47 [tian bao] orc pushdown version 0

Authored-by: tian bao <2011xuesong@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
yangyuxia pushed a commit to yangyuxia/kyuubi that referenced this pull request Nov 12, 2025
(same commit message as above)
yangyuxia pushed a commit to yangyuxia/kyuubi that referenced this pull request Nov 13, 2025
(same commit message as above)


Development

Successfully merging this pull request may close these issues.

[FEATURE] Spark Hive connector supports ORC hive table pushdown filter

4 participants