-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-53738][SQL] Fix planned write when query output contains foldable orderings #52584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…n query output contains literal
WholeStageCodegenExec(insertInputAdapter(plan))(codegenStageCounter.incrementAndGet()) | ||
val newId = codegenStageCounter.incrementAndGet() | ||
val newPlan = WholeStageCodegenExec(insertInputAdapter(plan))(newId) | ||
plan.logicalLink.foreach(newPlan.setLogicalLink) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that WholeStageCodegenExec
misses setting logicalLink, is it by design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, and it never caused issue with AQE before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't seen the real issues in both production and existing UT.
plan.logicalLink match { | ||
case Some(WriteFiles(query, _, _, _, _, _)) => | ||
V1WritesUtils.eliminateFoldableOrdering(ordering, query).outputOrdering | ||
case Some(query) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the query can be WholeStageCodegenExec
, that's why I set logicalLink on WholeStageCodegenExec
|
||
val listener = new QueryExecutionListener { | ||
override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = { | ||
val conf = qe.sparkSession.sessionState.conf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bugfix, the listener runs in another thread, without this change, conf.getConf
actually gets conf from the thread local, thus may cause issues on concurrency running tests
def withOutput(newOutput: Seq[Attribute]): InMemoryRelation = { | ||
val map = AttributeMap(output.zip(newOutput)) | ||
val newOutputOrdering = outputOrdering | ||
.map(_.transform { case a: Attribute => map(a) }) | ||
.asInstanceOf[Seq[SortOrder]] | ||
InMemoryRelation(newOutput, cacheBuilder, newOutputOrdering, statsOfPlanToCache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue was identified in previous try, see #52474 (comment)
|
||
override def makeCopy(newArgs: Array[AnyRef]): LogicalPlan = { | ||
val copied = super.makeCopy(newArgs).asInstanceOf[InMemoryRelation] | ||
copied.statsOfPlanToCache = this.statsOfPlanToCache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, issue was identified in previous try, see #52474 (comment)
@cloud-fan BTW, the "planned write" switch (an internal config) was added since 3.4, do we have a plan to remove it to simplify code, or tend to preserve it forever? |
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala
Outdated
Show resolved
Hide resolved
expressions.exists(_.exists(_.isInstanceOf[Empty2Null])) | ||
} | ||
|
||
def eliminateFoldableOrdering(ordering: Seq[SortOrder], query: LogicalPlan): LogicalPlan = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add comments to explain the reason behind it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
.getOrElse(materializeAdaptiveSparkPlan(plan)) | ||
.outputOrdering | ||
|
||
val requiredOrdering = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the code path when planned write is disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can leave it unfixed, as this code path is rarely reached and this fix is kind of an optimization: it's only about perf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a necessary change for the "planned write" to make UT happy
if (Utils.isTesting) outputOrderingMatched = orderingMatched
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK this is a necessary for the current codebase, but do we really need to do it in theory? The planned write should have added the sort already, ideally we don't need to try to add sort again here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The planned write should have added the sort already, ideally we don't need to try to add sort again here.
yes, exactly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
What changes were proposed in this pull request?
This is the second try of #52474, following the suggestion from cloud-fan
This PR fixes a bug in
plannedWrite
, where thequery
has foldable orderings in the partition columns.The evaluation of
FileFormatWriter.orderingMatched
fails becauseSortOrder(Literal)
is eliminated byEliminateSorts
.Why are the changes needed?
V1Writes
will override the custom sort order when the query output ordering does not satisfy the required ordering. Before SPARK-53707, when the query's output contains literals in partition columns, the judgment produces a false-negative result, thus causing the sort order not to take effect.SPARK-53707 partially fixes the issue on the logical plan by adding a
Project
of query inV1Writes
.Before SPARK-53707
After SPARK-53707
Note, note the issue still exists because there is another place to check the ordering match again in
FileFormatWriter
.This PR fixes the issue thoroughly, with new UTs added.
Does this PR introduce any user-facing change?
Yes, it's a bug fix.
How was this patch tested?
New UTs are added.
Was this patch authored or co-authored using generative AI tooling?
No.