[KYUUBI #7122] Support ORC hive table pushdown filter #7123
flaming-archer wants to merge 9 commits into apache:master from flaming-archer:master_scanbuilder_new
Conversation
Support ORC hive table pushdown filters
Codecov Report

Attention: Patch coverage is 0.00% with 33 lines in your changes missing coverage.

```
@@ Coverage Diff @@
##           master   #7123     +/- ##
======================================
  Coverage    0.00%   0.00%
======================================
  Files         700     700
  Lines       43364   43397     +33
  Branches     5869    5878      +9
======================================
- Misses      43364   43397     +33
```
LGTM, the test code should be cleaned up before merging.
Reviewed code:

```scala
override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = {
  HiveScanBuilder(sparkSession, fileIndex, dataSchema, catalogTable)
  convertedProvider match {
    case Some("ORC") if sparkSession.sessionState.conf.getConf(READ_CONVERT_METASTORE_ORC) =>
```

Comment: Why not use the implementation mechanism of ConvertRelation?

Reply: That requires more work and is more invasive than the current approach. Let's try this simple way first; the extended-rule approach can be evaluated later.
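The dispatch being reviewed above can be sketched without Spark. This is a minimal, hypothetical simplification: the real method consults `sparkSession.sessionState.conf`, while here the flag is passed in directly and the builder types are stubs, so the logic runs standalone.

```scala
// Stub builder types standing in for the connector's real ScanBuilder classes.
sealed trait ScanBuilder
case object OrcScanBuilder extends ScanBuilder   // DSv2 ORC scan: supports filter pushdown
case object HiveScanBuilder extends ScanBuilder  // generic Hive scan: no pushdown yet

// If the table converts to ORC and the conversion flag is on, use the ORC
// builder; otherwise fall back to the generic Hive builder.
def newScanBuilder(convertedProvider: Option[String], convertMetastoreOrc: Boolean): ScanBuilder =
  convertedProvider match {
    case Some("ORC") if convertMetastoreOrc => OrcScanBuilder
    case _ => HiveScanBuilder
  }
```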
Reviewed code:

```scala
val provider = catalogTable.provider.map(_.toUpperCase(Locale.ROOT))
if (orc || provider.contains("ORC")) {
  Some("ORC")
} else if (parquet || provider.contains("PARQUET")) {
```

Comment: Could the provider possibly be parquet? It seems it can only be hive.

Reply: Actually, I copied this from here; maybe it can be parquet?
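The provider-detection snippet above can be exercised as plain Scala. In the PR, `orc` and `parquet` come from inspecting the table's input format / SerDe; in this hedged sketch they are plain booleans so the function is self-contained.

```scala
import java.util.Locale

// Normalize the provider string with Locale.ROOT (as the PR does) and map it,
// or the detected SerDe, to a convertible provider name.
def convertedProvider(provider: Option[String], orcSerde: Boolean, parquetSerde: Boolean): Option[String] = {
  val p = provider.map(_.toUpperCase(Locale.ROOT))
  if (orcSerde || p.contains("ORC")) Some("ORC")
  else if (parquetSerde || p.contains("PARQUET")) Some("PARQUET")
  else None  // e.g. a plain "hive" provider: not convertible
}
```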
@flaming-archer Thanks for your effort, left some comments.
pan3793 left a comment: LGTM, only a minor test issue.
@flaming-archer Could you update the PR description to briefly describe how the code works, and mention the user-facing configuration?

also cc @cfmcgrady

Will merge it this afternoon if no further comments.

Thanks, merged to master.
### Why are the changes needed?

Previously, the `HiveScan` class was used to read data. If a table is determined to be ORC, the `OrcScan` from Spark DataSource V2 can be used instead. `OrcScan` supports filter pushdown, but `HiveScan` does not yet. In our testing, this achieves approximately a 2x performance improvement.

The conversion can be controlled by setting `spark.sql.kyuubi.hive.connector.read.convertMetastoreOrc`. When enabled, the data source ORC reader is used to process ORC tables created using HiveQL syntax, instead of Hive SerDe.

close apache#7122

### How was this patch tested?

Added unit test.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#7123 from flaming-archer/master_scanbuilder_new.

Closes apache#7122

c3f412f [tian bao] add case _
2be4890 [tian bao] Merge branch 'master_scanbuilder_new' of github.com:flaming-archer/kyuubi into master_scanbuilder_new
c825d0f [tian bao] review change
8a26d6a [tian bao] Update extensions/spark/kyuubi-spark-connector-hive/src/main/scala/org/apache/kyuubi/spark/connector/hive/KyuubiHiveConnectorConf.scala
68d4196 [tian bao] review change
bed007f [tian bao] review change
b89e6e6 [tian bao] Optimize UT
5a8941b [tian bao] fix failed ut
dc1ba47 [tian bao] orc pushdown version 0

Authored-by: tian bao <2011xuesong@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
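As a usage illustration, the user-facing switch described above could be set like this; the configuration key is the one introduced by this PR, while the catalog, database, and table names are hypothetical:

```shell
# Enable conversion of Hive ORC tables to the DSv2 ORC reader (filter pushdown),
# then run a filtered query; table/catalog names here are made up for the example.
spark-sql \
  --conf spark.sql.kyuubi.hive.connector.read.convertMetastoreOrc=true \
  -e "SELECT COUNT(*) FROM hive.default.orc_events WHERE dt = '2024-01-01'"
```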