[SPARK-52638][SQL] Allow preserving Hive-style column order to be configurable #51342
Conversation
Is it safe? Do other places assume that partition columns always stay at the end?
buildConf("spark.sql.hive.preserveLegacyColumnOrder.enabled") | ||
.internal() | ||
.doc("When true, tables returned from HiveExternalCatalog preserve Hive-style column order " + | ||
"where the partition columns are at the end. Otherwise, the user-specified column order " + |
nit: `. O` -> `. O`
@@ -5989,6 +5989,16 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val HIVE_PRESERVE_LEGACY_COLUMN_ORDER =
+    buildConf("spark.sql.hive.preserveLegacyColumnOrder.enabled")
Although it's not mandatory, shall we have this in the HiveUtils.scala file, like the following?
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
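For reference, the full definition of that existing config in HiveUtils.scala looks roughly like this (quoted from memory of the Spark source; the exact doc wording may differ slightly):

```scala
// The existing pattern in HiveUtils.scala that the new config could follow.
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
  .doc("When set to true, the built-in Parquet reader and writer are used to process " +
    "parquet tables created by using the HiveQL syntax, instead of Hive serde.")
  .version("1.1.1")
  .booleanConf
  .createWithDefault(true)
```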
+1, LGTM with two minor comments. Thank you, @szehon-ho.
In general, I like this direction to make the behavior more reasonable, but let's also be careful about storage features. For existing tables, will we also change their column order?
@@ -818,16 +819,20 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
   // columns are not put at the end of schema. We need to reorder it when reading the schema
   // from the table properties.
   private def reorderSchema(schema: StructType, partColumnNames: Seq[String]): StructType = {
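A simplified, self-contained sketch of what this Hive-style reordering does (not the exact HiveExternalCatalog implementation): partition columns are pulled out of the user-specified schema and appended at the end, in partition-key order.

```scala
import org.apache.spark.sql.types.StructType

// Simplified sketch of the reordering behavior under discussion (not the
// exact Spark source; case sensitivity and error handling are omitted):
// move the partition columns to the end of the schema, in the order given
// by partColumnNames.
def reorderSchema(schema: StructType, partColumnNames: Seq[String]): StructType = {
  val partFields = partColumnNames.flatMap(name => schema.fields.find(_.name == name))
  val dataFields = schema.fields.filterNot(f => partColumnNames.contains(f.name))
  StructType(dataFields ++ partFields)
}
```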
This method is called when we read out the table metadata. I think it's risky to directly change it.
My proposal is to introduce a hidden table property to decide if we want to order columns or not. For existing tables, they don't have this property and we don't change behavior. For newly created tables, we only add this property if the config is turned on.
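A hypothetical sketch of that gating logic; the property key "legacyColumnOrder" follows the name floated later in this thread, and the helper itself is illustrative, not from the PR:

```scala
// Hypothetical helper illustrating the proposal (names are assumptions):
// existing tables have no property and keep the old behavior; newly created
// tables record the choice at creation time based on the config.
val LEGACY_COLUMN_ORDER_PROP = "legacyColumnOrder"

def shouldReorder(tableProps: Map[String, String]): Boolean =
  tableProps.getOrElse(LEGACY_COLUMN_ORDER_PROP, "true").toBoolean
```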
> None of the other catalogs re-order the columns so that the partition columns are at the end ...

Hmm, but this also only targets HiveExternalCatalog, not other catalogs. I wonder why the Hive catalog wouldn't want to follow Hive behavior? Shouldn't that be the normal behavior that users expect?
I'm also worried that, as others pointed out, some places may assume the expected order; will changing it be safe?
I think it's weird for a catalog to define its own column-order behavior. I'm fine with still keeping this behavior for Hive tables (provider == "hive").
So, IIUC, we will have this config, spark.sql.hive.preserveLegacyColumnOrder.enabled, and if it is false we save a table property like "legacyColumnOrder"="false"; only then is the table read without re-ordering. I guess this logic will live only in HiveExternalCatalog, since it is the only catalog that re-orders columns? Other catalogs should not be affected?
Updated as per the comments.
What changes were proposed in this pull request?
Add a flag, spark.sql.hive.preserveLegacyColumnOrder.enabled, to determine whether HiveExternalCatalog persists a table schema with the Hive-style column order (partition columns at the end) or with the original column order the user specified when creating the table.
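A hedged usage sketch (the config key is taken from this PR's diff; the column orders noted in the comments are the intended behavior, not a verified run):

```scala
// Opt out of Hive-style reordering before creating the table.
spark.conf.set("spark.sql.hive.preserveLegacyColumnOrder.enabled", "false")
spark.sql("CREATE TABLE t (p INT, a STRING, b INT) USING parquet PARTITIONED BY (p)")
// With the flag off, the persisted schema keeps the user order (p, a, b)
// rather than the Hive-style order (a, b, p).
spark.sql("DESCRIBE TABLE t").show()
```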
Why are the changes needed?
Previously, Spark relied heavily on Hive behavior. Nowadays there are more catalogs, especially with the DataSource V2 API, and none of the other catalogs re-order columns so that the partition columns are at the end. Many systems now expect to work against any catalog, and the Hive-specific reordering makes that expectation hard to meet.
This was from the discussion here: #51280 (comment)
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a test to HiveDDLSuite.
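A hypothetical test sketch in the HiveDDLSuite style (the test name and assertion are assumptions, not the PR's actual test):

```scala
test("SPARK-52638: keep user-specified column order when the flag is off") {
  // Sketch only: config key from this PR's diff; expected order is assumed.
  withSQLConf("spark.sql.hive.preserveLegacyColumnOrder.enabled" -> "false") {
    withTable("t") {
      sql("CREATE TABLE t (p INT, a STRING, b INT) USING parquet PARTITIONED BY (p)")
      // Partition column p should stay where the user declared it.
      assert(spark.table("t").schema.fieldNames.toSeq === Seq("p", "a", "b"))
    }
  }
}
```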
Was this patch authored or co-authored using generative AI tooling?
No