Skip to content

Conversation

pan3793
Copy link
Member

@pan3793 pan3793 commented Oct 13, 2025

What changes were proposed in this pull request?

This is the second try of #52474, following the suggestion from cloud-fan

This PR fixes a bug in plannedWrite, where the query has foldable orderings in the partition columns.

CREATE TABLE t (i INT, j INT, k STRING) USING PARQUET PARTITIONED BY (k);

INSERT OVERWRITE t SELECT j AS i, i AS j, '0' as k FROM t0 SORT BY k, i;

The evaluation of FileFormatWriter.orderingMatched fails because SortOrder(Literal) is eliminated by EliminateSorts.

Why are the changes needed?

V1Writes will override the custom sort order when the query output ordering does not satisfy the required ordering. Before SPARK-53707, when the query's output contains literals in partition columns, the judgment produces a false-negative result, thus causing the sort order not to take effect.

SPARK-53707 partially fixes the issue on the logical plan by adding a Project of query in V1Writes.

Before SPARK-53707

Sort [0 ASC NULLS FIRST, i#280 ASC NULLS FIRST], false
+- Project [j#287 AS i#280, i#286 AS j#281, 0 AS k#282]
   +- Relation spark_catalog.default.t0[i#286,j#287,k#288] parquet

After SPARK-53707

Project [i#284, j#285, 0 AS k#290]
+- Sort [0 ASC NULLS FIRST, i#284 ASC NULLS FIRST], false
   +- Project [i#284, j#285]
      +- Relation spark_catalog.default.t0[i#284,j#285,k#286] parquet

Note, note the issue still exists because there is another place to check the ordering match again in FileFormatWriter.

This PR fixes the issue thoroughly, with new UTs added.

Does this PR introduce any user-facing change?

Yes, it's a bug fix.

How was this patch tested?

New UTs are added.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Oct 13, 2025
WholeStageCodegenExec(insertInputAdapter(plan))(codegenStageCounter.incrementAndGet())
val newId = codegenStageCounter.incrementAndGet()
val newPlan = WholeStageCodegenExec(insertInputAdapter(plan))(newId)
plan.logicalLink.foreach(newPlan.setLogicalLink)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that WholeStageCodegenExec misses setting logicalLink, is it by design?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, and it never caused issue with AQE before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't seen the real issues in both production and existing UT.

plan.logicalLink match {
case Some(WriteFiles(query, _, _, _, _, _)) =>
V1WritesUtils.eliminateFoldableOrdering(ordering, query).outputOrdering
case Some(query) =>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the query can be WholeStageCodegenExec, that's why I set logicalLink on WholeStageCodegenExec


val listener = new QueryExecutionListener {
override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
val conf = qe.sparkSession.sessionState.conf
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bugfix, the listener runs in another thread, without this change, conf.getConf actually gets conf from the thread local, thus may cause issues on concurrency running tests

Comment on lines +470 to +475
def withOutput(newOutput: Seq[Attribute]): InMemoryRelation = {
val map = AttributeMap(output.zip(newOutput))
val newOutputOrdering = outputOrdering
.map(_.transform { case a: Attribute => map(a) })
.asInstanceOf[Seq[SortOrder]]
InMemoryRelation(newOutput, cacheBuilder, newOutputOrdering, statsOfPlanToCache)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue was identified in previous try, see #52474 (comment)


override def makeCopy(newArgs: Array[AnyRef]): LogicalPlan = {
val copied = super.makeCopy(newArgs).asInstanceOf[InMemoryRelation]
copied.statsOfPlanToCache = this.statsOfPlanToCache
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, issue was identified in previous try, see #52474 (comment)

@pan3793
Copy link
Member Author

pan3793 commented Oct 13, 2025

@pan3793
Copy link
Member Author

pan3793 commented Oct 13, 2025

@cloud-fan BTW, the "planned write" switch (an internal config) was added since 3.4, do we have a plan to remove it to simplify code, or tend to preserve it forever?

expressions.exists(_.exists(_.isInstanceOf[Empty2Null]))
}

def eliminateFoldableOrdering(ordering: Seq[SortOrder], query: LogicalPlan): LogicalPlan =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add comments to explain the reason behind it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

.getOrElse(materializeAdaptiveSparkPlan(plan))
.outputOrdering

val requiredOrdering = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the code path when planned write is disabled?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can leave it unfixed, as this code path is rarely reached and this fix is kind of an optimization: it's only about perf.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a necessary change for the "planned write" to make UT happy

if (Utils.isTesting) outputOrderingMatched = orderingMatched

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK this is a necessary for the current codebase, but do we really need to do it in theory? The planned write should have added the sort already, ideally we don't need to try to add sort again here.

Copy link
Member Author

@pan3793 pan3793 Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The planned write should have added the sort already, ideally we don't need to try to add sort again here.

yes, exactly

Copy link
Contributor

@peter-toth peter-toth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending CI.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants