-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-53878][SQL][CONNECT] Fix race condition issue related to ObservedMetrics #52575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @sarutak . The JIRA ID looks wrong to me.
I changed the PR title to prevent accidental merging. Please let me know when you correct the PR title. |
@dongjoon-hyun Thank you for letting me know. I've modified it. |
Thank you, @sarutak . |
|
||
/** Get the metrics observed during the execution of the query plan. */ | ||
def observedMetrics: Map[String, Row] = CollectMetricsExec.collect(executedPlan) | ||
def observedMetrics: Map[String, Row] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this was designed as def observedMetrics
instead of lazy val
, this method can return different values at every invocation. Are we sure that this PR are not going to lose the AS-IS capability?
If the return value is static, why don't we use lazy val
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I walked through the related code and I confirmed observedMetrics
can return different value at every invocation so I'll consider it.
|
||
/** Get the metrics observed during the execution of the query plan. */ | ||
def observedMetrics: Map[String, Row] = CollectMetricsExec.collect(executedPlan) | ||
def observedMetrics: Map[String, Row] = observedMetricsLock.synchronized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @sarutak . Please add the following annotation, javax.annotation.concurrent.GuardedBy
.
@GuardedBy("observedMetricsLock")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM (except the above annotation comment.)
cc @peter-toth |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we revert #52566 now?
Otherwise, LGTM.
@ueshin Let's revert that PR separately to make git-history clean. |
What changes were proposed in this pull request?
In Spark Connect environment,
QueryExecution#observedMetrics
can be called by two threads concurrently.This can cause race condition issues. We can see CI failure caused by this issue.
https://github.com/apache/spark/actions/runs/18422173471/job/52497913985
This test failure can be reproduced by inserting sleep into
ArrayBasedMapBuilder.scala
like as follows.And then, run the test as follows.
To fix this issue, this PR proposes to protect
QueryExecution#observedMdetrics
using synchronized block.Why are the changes needed?
Bug fix.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Ran the problematic test with inserting sleep like as mentioned above, and confirmed the test passed.
Was this patch authored or co-authored using generative AI tooling?
No.