
Conversation

@viirya viirya (Member) commented Oct 12, 2025

What changes were proposed in this pull request?

This patch adds DSv2 support to the optimization rule PushVariantIntoScan. The rule currently only supports the DSv1 Parquet (ParquetFileFormat) source, which limits the effectiveness of the variant type on DSv2.

Why are the changes needed?

Although #52522 recently tried to add DSv2 support, that implementation implicitly bound pruneColumns to the variant access pushdown, which can cause unexpected errors in DSv2 datasources that don't support it, and it also breaks the API semantics. We need an explicit API between Spark and DSv2 datasources for this feature.

#52522 also tested the DSv2 variant pushdown feature only against InMemoryTable, not against the built-in DSv2 Parquet datasource. This patch reverts #52522 and proposes a new approach with comprehensive test coverage.

Does this PR introduce any user-facing change?

Yes. After this PR, if users enable spark.sql.variant.pushVariantIntoScan, variant column accesses can be pushed down into DSv2 datasources that support it.
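
For illustration, a minimal sketch of what this looks like from the user side, assuming a hypothetical DSv2 table named `events` with a variant column `v` (whether the pushdown actually fires depends on the source supporting it):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class VariantPushdownExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();
    // Enable the rewrite (config name taken from this PR's description).
    spark.conf().set("spark.sql.variant.pushVariantIntoScan", "true");
    // Hypothetical DSv2 table "events" with a variant column "v". If the source
    // supports the contract, the two variant_get accesses below can be pushed
    // into the scan instead of reading the full variant value and extracting later.
    Dataset<Row> df = spark.sql(
        "SELECT variant_get(v, '$.a', 'int') AS a, "
            + "variant_get(v, '$.b', 'string') AS b "
            + "FROM events");
    df.show();
  }
}
```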

How was this patch tested?

Added new unit test suites PushVariantIntoScanV2Suite and PushVariantIntoScanV2VectorizedSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.0.13

@github-actions github-actions bot added the SQL label Oct 12, 2025
@viirya viirya changed the title [SPARK-XXXXX][SQL] Support DSv2 in PushVariantIntoScan [SPARK-53880][SQL] Support DSv2 in PushVariantIntoScan Oct 12, 2025
@dongjoon-hyun dongjoon-hyun (Member) left a comment


Oh, thank you, @viirya.

@dongjoon-hyun dongjoon-hyun (Member)

BTW, maybe the test case is generated by Claude Code?

Generated-by: Claude Code v2.0.13

@dongjoon-hyun dongjoon-hyun (Member)

cc @huaxingao , @cloud-fan too from the previous approach.

@viirya viirya (Member Author) commented Oct 12, 2025

BTW, maybe the test case is generated by Claude Code?

Generated-by: Claude Code v2.0.13

Yes. This is co-authored using Claude Code.

@viirya viirya (Member Author) commented Oct 12, 2025

All tests passed in the last CI run except for two Java style issues. Fixed.

@viirya viirya (Member Author) commented Oct 12, 2025

Hmm, all tests are passing again, but what is this error in the documentation generation pipeline?

[error] java.lang.IllegalArgumentException: Illegal group reference
[error] 	at java.base/java.util.regex.Matcher.appendExpandedReplacement(Matcher.java:1067)
[error] 	at java.base/java.util.regex.Matcher.appendReplacement(Matcher.java:907)
...

@viirya viirya (Member Author) commented Oct 13, 2025

It seems to be due to the Javadoc. I'm looking into it.

* <p>
* For example, if a query accesses {@code variant_get(v, '$.a', 'int')} and
* {@code variant_get(v, '$.b', 'string')}, the extracted schema would be
* {@code struct<0:int, 1:string>} where field ordinals correspond to the access order.
Member

Ya, these three lines seem to be the root cause.

Member Author

Yea, the {@code } blocks are the issue. Not sure why this Javadoc tag cannot be parsed.

* <p>
* For example, if a query accesses `variant_get(v, '$.a', 'int')` and
* `variant_get(v, '$.b', 'string')`, the extracted schema would be
* `struct<0:int, 1:string>` where field ordinals correspond to the access order.
Contributor

Suggested change
* `struct<0:int, 1:string>` where field ordinals correspond to the access order.
* `struct<a:int, b:string>` where field ordinals correspond to the access order.

Member Author

It is actually the ordinals. That is how PushVariantIntoScan rewrites the type.

* @since 4.1.0
*/
@Evolving
public final class VariantAccessInfo implements Serializable {
Contributor

This can be a Java record?

Member Author

Looks okay. Is it necessary though?
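
For reference, a minimal sketch of the record form being suggested here; the component names are hypothetical since VariantAccessInfo's actual fields aren't shown in this thread:

```java
import java.io.Serializable;

import org.apache.spark.annotation.Evolving;
import org.apache.spark.sql.types.DataType;

// Hypothetical shape only. A record is implicitly final and generates the canonical
// constructor, accessors, equals/hashCode, and toString, which is what the record
// suggestion would buy over a hand-written final class.
@Evolving
public record VariantAccessInfo(String path, DataType targetType) implements Serializable {
}
```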

import org.apache.spark.sql.errors.QueryExecutionErrors
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
// BEGIN-V2-SUPPORT: DataSource V2 imports
Contributor

What does this special comment marker mean?

Member Author

I just added it at the beginning to clearly separate the V2 code. I will remove it.

@pan3793 pan3793 (Member) commented Oct 13, 2025

This patch reverts #52522 and proposes a new approach ...

I feel reverting the original PR first, then submitting a brand-new PR makes the git history clearer

@viirya viirya (Member Author) commented Oct 13, 2025

This patch reverts #52522 and proposes a new approach ...

I feel reverting the original PR first, then submitting a brand-new PR makes the git history clearer

Okay for me. @dongjoon-hyun @cloud-fan @huaxingao What do you think?

@viirya viirya (Member Author) commented Oct 13, 2025

I opened the revert PR #52590.

Once it is merged, I will rebase this PR.

@dongjoon-hyun dongjoon-hyun (Member)

I'm fine either way, @viirya. I guess it's more up to @huaxingao's preference.

@huaxingao huaxingao (Contributor)

Thanks @viirya for the PR! Appreciate you putting together an alternative. A couple of thoughts before we move forward:

  1. Variant + column pruning are not special-cased semantics
  • Variant is just another column in the projected schema from Spark’s perspective. When a source supports column pruning, Spark should prune all columns it doesn’t need, including Variant, exactly like it does for other types. I don't think we need to introduce a new interface to control if Variant should be pruned.

  • If a particular DSv2 implementation does not support column pruning, then pushing Variant access down into that source is not meaningful; the primary benefit of Variant pushdown is to reduce the read surface via pruning.

  2. On testing built-in Parquet vs. InMemoryTable
  • The InMemoryTable tests were designed to validate the planner and API plumbing, to test that Spark forms the right pruned schema and pushdown contracts and the source receives them. It should be sufficient for testing.

  • It's nice to add a smoke test with the built-in DSv2 Parquet to increase end-to-end confidence and catch regressions in common deployments. But I don't think it should be used as a judgement that a PR is not good.

@viirya viirya (Member Author) commented Oct 13, 2025

Thanks @viirya for the PR! Appreciate you putting together an alternative. A couple of thoughts before we move forward:

1. Variant + column pruning are not special-cased semantics
* Variant is just another column in the projected schema from Spark’s perspective. When a source supports column pruning, Spark should prune all columns it doesn’t need, including Variant, exactly like it does for other types. I don't think we need to introduce a new interface to control if Variant should be pruned.
* If a particular DSv2 implementation does not support column pruning, then pushing Variant access down into that source is not meaningful; the primary benefit of Variant pushdown is to reduce the read surface via pruning.

Column pruning is not the same as this variant scan pushdown. This feature changes the data type (variant type to struct type), not just prunes fields from it. Mixing column pruning with variant scan pushdown confuses the API semantics and causes unexpected errors in DSv2 datasource implementations. Actually, you can access all fields of a variant column and this variant scan pushdown still works effectively, so it is not about reducing the read surface but about optimizing variant access and usage in the read and in the query plan.
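
To make that distinction concrete, a rough sketch (illustrative only) of the kind of schema rewrite described above, using the ordinal-named struct fields from the Javadoc example earlier in this thread:

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RewrittenSchemaSketch {
  public static void main(String[] args) {
    // Before the rule, the scan schema exposes "v" with the variant data type.
    // After PushVariantIntoScan, "v" becomes a struct with one field per requested
    // path, so the column's data type itself changes, not just its set of fields:
    //   variant_get(v, '$.a', 'int')    -> field "0" of type int
    //   variant_get(v, '$.b', 'string') -> field "1" of type string
    StructType rewrittenV = new StructType()
        .add("0", DataTypes.IntegerType)
        .add("1", DataTypes.StringType);
    StructType scanSchema = new StructType().add("v", rewrittenV);
    System.out.println(scanSchema.treeString());
  }
}
```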

Not to mention that, with the current approach, if a variant scan is pushed into a DSv2 implementation that doesn't know about it, there will be unexpected errors.

2. On testing built-in Parquet vs. InMemoryTable
* The InMemoryTable tests were designed to validate the planner and API plumbing, to test that Spark forms the right pruned schema and pushdown contracts and the source receives them. It should be sufficient for testing.
* It's nice to add a smoke test with the built-in DSv2 Parquet to increase end-to-end confidence and catch regressions in common deployments. But I don't think it should be used as a judgement that a PR is not good.

No. The test just verifies that the schema was changed; InMemoryTable/Scan doesn't actually do the variant pushdown. In other words, it only makes sure that the variant type was changed to a struct type in the InMemoryTable. It doesn't actually test whether the variant scan happens correctly. If the table/scan doesn't support it, what will happen? For example, we could change a variant type to any other type and the test would still pass because the schema matches. Btw, there are actually issues in the test, but that is not the only reason for reverting the PR.

@huaxingao huaxingao (Contributor)

Thanks @viirya for the reply! If I understand correctly, the core concern is: “If Spark pushes a variant scan into a DSv2 source that doesn’t understand it, we’ll see unexpected errors.” I agree — that’s exactly the failure mode we should avoid.

Would it be acceptable if I add a planner-side guard so this only happens when the source explicitly opts in?

On tests: I understand my InMemoryTable test might have issues. Can we fix it to exercise the planner contract correctly, or do you consider a built-in DSv2 Parquet test required for DSv2 change?

@viirya viirya (Member Author) commented Oct 13, 2025

Thanks @viirya for the reply! If I understand correctly, the core concern is: “If Spark pushes a variant scan into a DSv2 source that doesn’t understand it, we’ll see unexpected errors.” I agree — that’s exactly the failure mode we should avoid.

Would it be acceptable if I add a planner-side guard so this only happens when the source explicitly opts in?

Thanks @huaxingao for the discussion and understanding.

I think we need an explicit DSv2 API that defines the contract between Spark and the datasource implementation for this variant pushdown feature. That is what this PR proposes to do.
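
As a rough illustration of what such an explicit opt-in contract could look like (the interface name and method signature below are hypothetical, not necessarily what this PR adds):

```java
import org.apache.spark.annotation.Evolving;
import org.apache.spark.sql.connector.read.ScanBuilder;

// Hypothetical mixin sketch: only sources whose ScanBuilder implements this would
// be asked to accept variant accesses, so the planner never rewrites the variant
// column (variant -> struct) for sources that don't understand the feature.
@Evolving
public interface SupportsPushDownVariantAccess extends ScanBuilder {
  // Returns true if the source accepted the requested variant accesses.
  boolean pushVariantAccesses(VariantAccessInfo[] accesses);
}
```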

On tests: I understand my InMemoryTable test might have issues. Can we fix it to exercise the planner contract correctly, or do you consider a built-in DSv2 Parquet test required for DSv2 change?

That is also what this PR proposes to do: adding dedicated DSv2 variant pushdown tests with good coverage for both row-based and vectorized readers.
