Conversation

@raffenet
Contributor

@raffenet raffenet commented Nov 18, 2025

Pull Request Description

ThreadSanitizer flags some races when running the testsuite. Fix them.

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@raffenet
Contributor Author

test:mpich/ch4/most
test:mpich/ch3/most

@raffenet
Contributor Author

test:mpich/ch4/most
test:mpich/ch3/most

@raffenet raffenet changed the title from "mpid/thread: Fix thread unsafe recursive lock check" to "Fix data races found by ThreadSanitizer" Nov 18, 2025
@raffenet
Contributor Author

test:mpich/ch4/most
test:mpich/ch3/most

@raffenet
Contributor Author

test:mpich/ch4/most
test:mpich/ch3/most

@raffenet
Contributor Author

There's one remaining race reported that I'm not quite sure how to solve:

Unexpected output in th_taskmanager:
    ==================
    WARNING: ThreadSanitizer: data race (pid=53799)
      Write of size 8 at 0x000104cacca0 by thread T12 (mutexes: write M0, write M1):
        #0 get_next_avtid ch4_proc.c:78 (libpmpi.0.dylib:arm64+0x40335c)
        #1 MPIDIU_new_avt ch4_proc.c:96 (libpmpi.0.dylib:arm64+0x402fcc)
        #2 MPIDIU_lpid_to_av_slow ch4_proc.c:302 (libpmpi.0.dylib:arm64+0x403e5c)
        #3 MPIDI_OFI_insert_upid ofi_spawn.c:255 (libpmpi.0.dylib:arm64+0x4463a0)
        #4 MPID_Intercomm_exchange ch4_comm.c:420 (libpmpi.0.dylib:arm64+0x3fadf8)
        #5 MPIR_Intercomm_create_timeout comm_impl.c:949 (libpmpi.0.dylib:arm64+0x31d900)
        #6 dynamic_intercomm_create ch4_spawn.c:391 (libpmpi.0.dylib:arm64+0x3fda04)
    ... ...
Unexpected output in th_taskmanager:
    ==================
    WARNING: ThreadSanitizer: data race (pid=54395)
      Write of size 8 at 0x00010295cca0 by thread T12 (mutexes: write M0, write M1):
        #0 get_next_avtid ch4_proc.c:78 (libpmpi.0.dylib:arm64+0x40335c)
        #1 MPIDIU_new_avt ch4_proc.c:96 (libpmpi.0.dylib:arm64+0x402fcc)
        #2 MPIDIU_lpid_to_av_slow ch4_proc.c:302 (libpmpi.0.dylib:arm64+0x403e5c)
        #3 MPIDI_OFI_insert_upid ofi_spawn.c:255 (libpmpi.0.dylib:arm64+0x4463a0)
        #4 MPID_Intercomm_exchange ch4_comm.c:420 (libpmpi.0.dylib:arm64+0x3fadf8)
        #5 MPIR_Intercomm_create_timeout comm_impl.c:949 (libpmpi.0.dylib:arm64+0x31d900)
        #6 dynamic_intercomm_create ch4_spawn.c:391 (libpmpi.0.dylib:arm64+0x3fda04)
    ... ...
2 tests failed out of 3 (total runtime: 1 min 32 sec)

Contributor

@hzhou hzhou left a comment

Looks good. A few points for more accurate commit messages.

I am glad ThreadSanitizer has become helpful. Is it clean after the fix? We should consider adding it to the CI tests if it can be made clean.

MPL_DBG_MSG_P(MPIR_DBG_THREAD,VERBOSE,"exit MPIDU_Thread_mutex_lock %p", &mutex); \
MPIR_Assert(err_ == 0); \
MPIR_Assert(mutex.count == 0); \
MPL_thread_self(&mutex.owner); \
Contributor

The commit message is inaccurate. It removes the ownership check; the recursive lock check is still on. The reason for the change is that we no longer support recursive locking, so the owner check is no longer needed. The race condition is a trigger for the removal, but not the reason. If it were the reason, we should fix it rather than remove it.

MPIDI_global.avt_mgr.av_tables[*avtid] = new_av_table;

MPID_THREAD_CS_EXIT(VCI, MPIDIU_THREAD_DYNPROC_MUTEX);

Contributor

Grammar error in commit message.

We were reading the mutex owner value before acquiring the lock which
could conflict with an update from the thread that holds the
lock. Remove the ownership check since we do not allow recursive locking
any longer. Resolves a data race detected by ThreadSanitizer.
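
A minimal sketch of the pattern this commit message describes, assuming a hypothetical wrapper type `demo_mutex_t` (not the MPICH type): nothing is read from the wrapper before the lock is taken, and the hold count is only touched while the lock is held. All `demo_*` names are illustrative.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical wrapper: a mutex plus a hold count (names are illustrative). */
typedef struct {
    pthread_mutex_t lock;
    int count;                  /* only read or written while `lock` is held */
} demo_mutex_t;

static demo_mutex_t g_mutex;
static long g_total;

static void demo_mutex_lock(demo_mutex_t *m)
{
    /* No peek at an owner field here: reading it before the lock is taken
     * could race with a write from the thread that holds the lock. */
    int err = pthread_mutex_lock(&m->lock);
    assert(err == 0);
    assert(m->count == 0);      /* sanity check, evaluated under the lock */
    m->count++;
}

static void demo_mutex_unlock(demo_mutex_t *m)
{
    m->count--;
    assert(m->count == 0);
    int err = pthread_mutex_unlock(&m->lock);
    assert(err == 0);
}

static void *demo_worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < 1000; i++) {
        demo_mutex_lock(&g_mutex);
        g_total++;              /* shared state, protected by g_mutex */
        demo_mutex_unlock(&g_mutex);
    }
    return NULL;
}

/* Runs four workers and returns the final value of the shared counter. */
long demo_run(void)
{
    pthread_t tids[4];
    int err = pthread_mutex_init(&g_mutex.lock, NULL);
    assert(err == 0);
    g_mutex.count = 0;
    g_total = 0;
    for (int i = 0; i < 4; i++) {
        err = pthread_create(&tids[i], NULL, demo_worker, NULL);
        assert(err == 0);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(tids[i], NULL);
    pthread_mutex_destroy(&g_mutex.lock);
    return g_total;
}
```
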
To prevent recursive progress, use a thread-local variable. This avoids
data races with other threads which might also be making progress. Race
found with ThreadSanitizer.
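
As a rough illustration of the thread-local guard described here, assuming a hypothetical progress routine `demo_progress` (not the MPICH function): each thread gets its own copy of the flag, so checking it never touches memory that another thread writes.

```c
#include <stdbool.h>

/* Hypothetical per-thread flag; C11 _Thread_local gives every thread
 * its own instance, so no synchronization is needed to read it. */
static _Thread_local bool demo_in_progress = false;

/* Returns how many progress passes actually ran in this call chain.
 * `depth` simulates a callback that re-enters progress. */
int demo_progress(int depth)
{
    if (demo_in_progress)
        return 0;               /* nested call on the same thread: skip */

    demo_in_progress = true;
    int passes = 1;
    if (depth > 0)
        passes += demo_progress(depth - 1);     /* re-entry is rejected */
    demo_in_progress = false;
    return passes;
}
```

In this sketch a nested call returns immediately, while a second thread calling `demo_progress` at the same time is unaffected because it sees its own flag.
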
vci_seq is a shared variable that could be accessed by multiple threads
concurrently. Use atomics to prevent data races. Race found with
ThreadSanitizer.
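
A small sketch of the atomic-counter approach, assuming a hypothetical shared sequence `demo_seq` that is used to pick an index round-robin (how `vci_seq` is actually consumed is not shown in this thread):

```c
#include <stdatomic.h>

/* Hypothetical shared sequence counter, updated by many threads. */
static atomic_uint demo_seq = 0;

/* Returns an index in [0, n); assumes n > 0. The fetch-and-add is a single
 * atomic read-modify-write, so concurrent callers never lose an update.
 * Relaxed ordering is enough here because only the counter value matters. */
int demo_next_index(int n)
{
    unsigned int seq =
        atomic_fetch_add_explicit(&demo_seq, 1u, memory_order_relaxed);
    return (int) (seq % (unsigned int) n);
}
```
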
There is no use for these global variables tracking the number of
communicators with striping or hashing enabled. ThreadSanitizer
correctly flagged their accesses as racy. Remove them.
Several places where we access global variables were not protected by
mutex as reported by ThreadSanitizer.
The volatile qualifier is not a thread safety mechanism. Accesses to
these variables should already be protected by mutex.
@raffenet
Contributor Author

Looks good. A few points for more accurate commit messages.

I am glad ThreadSanitizer has become helpful. Is it clean after the fix? We should consider adding it to the CI tests if it can be made clean.

There is one race left that I am not sure how to solve: #7673 (comment). There are also a few instances in the testsuite of using volatile for thread safety that should be replaced by atomics. I will get to those when I have time.
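
For illustration only, the kind of volatile-to-atomic replacement meant here could look like the sketch below. The names `demo_ready`, `demo_payload`, and `demo_handoff` are hypothetical and not taken from the testsuite.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag that a test might once have declared `volatile int`. */
static atomic_bool demo_ready = false;
static int demo_payload;        /* plain data, published through the flag */

static void *demo_producer(void *arg)
{
    (void) arg;
    demo_payload = 42;
    /* Release store: the payload write becomes visible to any thread
     * that later observes the flag with an acquire load. */
    atomic_store_explicit(&demo_ready, true, memory_order_release);
    return NULL;
}

/* Starts a producer, waits for the flag, and returns the payload
 * (or -1 if the thread could not be created). */
int demo_handoff(void)
{
    pthread_t tid;
    atomic_store(&demo_ready, false);
    if (pthread_create(&tid, NULL, demo_producer, NULL) != 0)
        return -1;
    while (!atomic_load_explicit(&demo_ready, memory_order_acquire))
        ;                       /* spin; a real test would likely yield */
    int value = demo_payload;
    pthread_join(tid, NULL);
    return value;
}
```
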

Multiple threads could be creating dynamic processes concurrently,
so this needs protection.
@hzhou
Contributor

hzhou commented Nov 19, 2025

There's one remaining race reported that I'm not quite sure how to solve:

Add CS in MPIDIU_new_avt?

@raffenet
Contributor Author

raffenet commented Nov 19, 2025

There's one remaining race reported that I'm not quite sure how to solve:

Add CS in MPIDIU_new_avt?

Added in f682e40 but that didn't seem to solve it. Let me double check to be sure.
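
As a sketch of the critical-section idea under discussion, assuming a hypothetical table allocator `demo_new_table` with a fixed capacity (this is not the MPICH code): the id bump and the table store both happen under one mutex. A pattern like this only removes a race if every other reader and writer of the same globals takes the same lock, which may be why adding the critical section in one function did not appear sufficient.

```c
#include <pthread.h>
#include <stddef.h>

#define DEMO_MAX_TABLES 8

/* Hypothetical globals standing in for a table manager. */
static pthread_mutex_t demo_dynproc_mutex = PTHREAD_MUTEX_INITIALIZER;
static void *demo_tables[DEMO_MAX_TABLES];
static int demo_next_id = 0;

/* Registers `table` and returns its id, or -1 when the manager is full.
 * Allocation of the id and publication of the table are one critical
 * section, so two threads cannot claim the same slot. */
int demo_new_table(void *table)
{
    int id = -1;
    pthread_mutex_lock(&demo_dynproc_mutex);
    if (demo_next_id < DEMO_MAX_TABLES) {
        id = demo_next_id++;
        demo_tables[id] = table;
    }
    pthread_mutex_unlock(&demo_dynproc_mutex);
    return id;
}

/* Lookups take the same lock; an unlocked read here would still race. */
void *demo_get_table(int id)
{
    void *table = NULL;
    pthread_mutex_lock(&demo_dynproc_mutex);
    if (id >= 0 && id < demo_next_id)
        table = demo_tables[id];
    pthread_mutex_unlock(&demo_dynproc_mutex);
    return table;
}
```
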

The values of the thread package and proc mutex package settings were
not consistent and could lead to skipped checks for useful pthread
attributes. Consistently use "posix" and adjust conditions to reflect
the correct values.
@raffenet
Contributor Author

test:mpich/ch4/most
test:mpich/ch3/most

2 participants