-
Notifications
You must be signed in to change notification settings - Fork 311
Fix data races found by ThreadSanitizer #7673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
test:mpich/ch4/most |
aa05bc8 to
382a90f
Compare
|
test:mpich/ch4/most |
382a90f to
0bc89e1
Compare
|
test:mpich/ch4/most |
0bc89e1 to
7518aa2
Compare
|
test:mpich/ch4/most |
|
There's one remaining race reported that I'm not quite sure how to solve: |
hzhou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. A few points for more accurate commit messages.
I am glad threadsanitizer become helpful. Is it clean after the fix? We should consider add to CI tests if it can be clean.
| MPL_DBG_MSG_P(MPIR_DBG_THREAD,VERBOSE,"exit MPIDU_Thread_mutex_lock %p", &mutex); \ | ||
| MPIR_Assert(err_ == 0); \ | ||
| MPIR_Assert(mutex.count == 0); \ | ||
| MPL_thread_self(&mutex.owner); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit message is inaccurate. It is removing the ownership check. The recursive lock check is still on. And the reason for that change is because we no longer support recursive locking and the owner checking is no longer needed. The race condition is a trigger for the removal, but not the reason. If it is the reason, we should fix it rather than removing it.
| MPIDI_global.avt_mgr.av_tables[*avtid] = new_av_table; | ||
|
|
||
| MPID_THREAD_CS_EXIT(VCI, MPIDIU_THREAD_DYNPROC_MUTEX); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar error in commit message.
We were reading the mutex owner value before acquiring the lock which could conflict with an update from the thread that holds the lock. Remove the ownership check since we do not allow recursive locking any longer. Resolves a data race detected by ThreadSanitizer.
To prevent recursive progress, use a thread-local variable. This avoids data races with other threads which might also be making progress. Race found with ThreadSanitizer.
vci_seq is a shared variable that could be accessed by multiple threads concurrently. Use atomics to prevent data races. Race found with ThreadSanitizer.
There is no use for these global variables tracking the number of communicators with striping or hashing enabled. ThreadSanitizer correctly flagged their accesses as racey. Remove them.
Several places where we access global variables were not protected by mutex as reported by ThreadSanitizer.
The volatile qualifier is not a thread safety mechanism. Accesses to these variables should already be protected by mutex.
a1e1971 to
d76eea9
Compare
There is one race left I am not sure how to solve #7673 (comment). There are also a few instances in the testsuite of using |
Multiple threads could call be creating dynamic processes concurrently, so this needs protection.
d76eea9 to
f682e40
Compare
Add CS in |
Added in f682e40 but that didn't seem to solve it. Let me double check to be sure. |
The values of the thread package and proc mutex package settings were not consistent and could lead to skipped checks for useful pthread attributes. Consistently use "posix" and adjust conditions to reflect the correct values.
|
test:mpich/ch4/most |
Pull Request Description
ThreadSanitizer flags some races when running the testsuite. Fix them.
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short descriptionCommit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.