#186 fixed several race conditions that happen during the "disconnecting and blocking" unit test, but a new one was reported in Sjors/bitcoin#90 (comment), with details in Sjors/bitcoin#90 (comment), and I want to write more notes about it here. There is also a branch with previously attempted fixes.
The problem is not too hard to reproduce locally: if the mptest binary is run in a loop in bash, the tsan error happens within a few minutes. With the attempted fixes so far, running the mptest binary in a loop instead results in a deadlock after about 20 minutes for me.
Background
If a Connection object is destroyed while a ProxyClient<Thread> object is still using it (due to a sudden disconnect), the ProxyClient<Thread>::m_disconnect_cb callback will be called from the ~Connection destructor, deleting the ProxyClient<Thread> object from whichever ConnThreads map it is in.
If a ProxyClient<Thread> object is destroyed before its Connection object (the more normal case), the ProxyClient<Thread> destructor will unregister m_disconnect_cb, so when the Connection is later destroyed it will not try to delete the ProxyClient<Thread> object.
If the two destructors are called in sequence in either order, everything is fine. But if the destructors are called from different threads in a race, numerous problems occur.
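The intended contract looks roughly like the following sketch. removeSyncCleanup matches the hook referred to later in these notes; addSyncCleanup and everything else here are simplified stand-ins, not the actual libmultiprocess code:

```cpp
#include <functional>
#include <list>
#include <mutex>

using CleanupList = std::list<std::function<void()>>;
using CleanupIt = CleanupList::iterator;

struct Connection {
    std::mutex m_mutex;
    CleanupList m_sync_cleanup_fns; // disconnect callbacks such as ProxyClient<Thread>::m_disconnect_cb

    // Register a callback to run if the connection goes away while the client still exists.
    CleanupIt addSyncCleanup(std::function<void()> fn)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_sync_cleanup_fns.insert(m_sync_cleanup_fns.end(), std::move(fn));
    }

    // Unregister the callback in the normal case where the client is destroyed first.
    void removeSyncCleanup(CleanupIt it)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_sync_cleanup_fns.erase(it);
    }

    ~Connection()
    {
        // Sudden-disconnect case: run every callback still registered, which
        // deletes the corresponding ProxyClient<Thread> from its ConnThreads map.
        for (auto& fn : m_sync_cleanup_fns) fn();
    }
};
```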
Problem 1: invalid m_disconnect_cb iterator erase
If the Connection is destroyed first and ~Connection is about to call the m_disconnect_cb callback, but the ProxyClient<Thread> destructor then runs in another thread, it will currently (as of a11e690) try to delete m_disconnect_cb using an iterator which is no longer valid. 79c09e4 fixes this by changing the ~ProxyClient<Thread> destructor to avoid deleting m_disconnect_cb once it has been invalidated.
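A rough sketch of that guard, reusing the Connection and CleanupIt stand-ins from the Background sketch. The optional-iterator member is an assumption about how the invalidation could be tracked, not the actual field layout in 79c09e4:

```cpp
#include <optional>

struct ProxyClientThread { // stand-in for ProxyClient<Thread>
    Connection& m_connection;
    std::optional<CleanupIt> m_disconnect_cb; // reset once ~Connection has invalidated it

    ~ProxyClientThread()
    {
        if (m_disconnect_cb) {
            // Normal shutdown order: the connection is still alive, the stored
            // iterator is valid, and the callback can be unregistered safely.
            m_connection.removeSyncCleanup(*m_disconnect_cb);
            m_disconnect_cb.reset();
        }
        // Otherwise ~Connection already took over the callback entry, and
        // erasing through the stale iterator would be undefined behavior.
    }
};
```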
Problem 2: ProxyClient<Thread> use-after-free
The second problem is if the Connection is destroyed first and invokes the m_disconnect_cb callback, but the ~ProxyClient<Thread> destructor runs and finishes before m_disconnect_cb has a chance to do anything. When the callback resumes, the ProxyClient<Thread> object will be gone and the callback's attempt to access and erase it will be invalid. 79c09e4 fixes this by making ~ProxyClient<Thread> set a destroyed bit when it runs and having the m_disconnect_cb callback just return early if the bit is set.
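The destroyed-bit idea could look roughly like the sketch below. The heap-allocated DisconnectState with its own mutex is an assumption made to keep the example self-contained; in the real code the check is done under ThreadContext::waiter.m_mutex, as discussed under Problem 3:

```cpp
#include <functional>
#include <memory>
#include <mutex>

struct DisconnectState {
    std::mutex m_mutex;
    bool m_destroyed = false;
};

struct ProxyClientThread { // stand-in for ProxyClient<Thread>
    std::shared_ptr<DisconnectState> m_state = std::make_shared<DisconnectState>();

    // Roughly what the registered m_disconnect_cb would capture and do.
    std::function<void()> makeDisconnectCb()
    {
        return [state = m_state] {
            std::lock_guard<std::mutex> lock(state->m_mutex);
            if (state->m_destroyed) return; // client already destroyed; nothing left to erase
            // ... otherwise erase this ProxyClient<Thread> from its ConnThreads map ...
        };
    }

    ~ProxyClientThread()
    {
        // Let an in-flight m_disconnect_cb bail out early instead of touching a
        // freed object; the shared state outlives both the client and the callback.
        std::lock_guard<std::mutex> lock(m_state->m_mutex);
        m_state->m_destroyed = true;
    }
};
```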
Problem 3: ThreadContext use-after-free
Just fixing the race between ~Connection and ~ProxyClient<Thread> is actually not enough, because after the ProxyClient<Thread> object is destroyed, the thread owning it can exit, destroying the thread_local ThreadContext struct that contained the ProxyClient<Thread> object, and with it the ThreadContext::waiter.m_mutex mutex the ProxyClient<Thread> code uses for synchronization. If this happens while an m_disconnect_cb callback is still in progress, that callback will fail when it tries to lock a mutex that no longer exists. (Commit 85df929 was a broken attempt to fix this, and commit 30431bc has another description of the problem.)
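Reduced to its shape with stand-in types (the real ThreadContext and Waiter have more members), the hazard is:

```cpp
#include <mutex>

struct Waiter {
    std::mutex m_mutex; // what the ProxyClient<Thread> code locks for synchronization
};

struct ThreadContext { // stand-in for the real thread_local ThreadContext
    Waiter waiter;
    // maps of ProxyClient<Thread> objects keyed by Connection* live here too
};

thread_local ThreadContext g_thread_context;

// Failure sequence:
//  1. ~Connection invokes m_disconnect_cb on the event loop thread.
//  2. ~ProxyClient<Thread> finishes on the owning thread, which then exits.
//  3. Thread exit destroys g_thread_context, including waiter.m_mutex.
//  4. The still-running callback locks waiter.m_mutex -> use after free.
```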
Possible Fixes
Adding EventLoop::sync call
The most direct fix for Problem 3 is probably to have the thread that destroyed the ProxyClient<Thread> object, and is about to exit and destroy ThreadContext::waiter.m_mutex, call EventLoop::sync to force m_disconnect_cb to finish running before that happens. This would work because m_disconnect_cb always runs on the event loop thread. It would require reverting 85df929 to avoid reintroducing Problem 2 (so the m_disconnect_cb callback checks the destroyed bit after it acquires ThreadContext::waiter.m_mutex instead of before).
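To make the ordering argument concrete, here is a self-contained toy event loop whose sync-style call blocks until everything queued before it has run. It is an illustration of the idea only, not the real EventLoop class or the exact EventLoop::sync signature:

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

class ToyEventLoop {
public:
    ToyEventLoop() : m_thread([this] { run(); }) {}
    ~ToyEventLoop() { post({}); m_thread.join(); }

    // Queue a callback to run on the loop thread (like m_disconnect_cb).
    void post(std::function<void()> fn)
    {
        { std::lock_guard<std::mutex> lock(m_mutex); m_queue.push(std::move(fn)); }
        m_cv.notify_one();
    }

    // Block until the loop thread has drained everything queued so far.
    void sync()
    {
        std::promise<void> done;
        post([&] { done.set_value(); });
        done.get_future().wait();
    }

private:
    void run()
    {
        while (true) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cv.wait(lock, [&] { return !m_queue.empty(); });
                fn = std::move(m_queue.front());
                m_queue.pop();
            }
            if (!fn) return; // empty function is the stop signal
            fn();
        }
    }
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::function<void()>> m_queue;
    std::thread m_thread;
};

// Usage in this context: the thread that destroyed a ProxyClient<Thread> calls
// loop.sync() before exiting, so any in-flight disconnect callback has finished
// before the thread_local ThreadContext (and waiter.m_mutex) is destroyed.
```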
Calling removeSyncCleanup from EventLoop thread
If removeSyncCleanup(m_disconnect_cb) could be called from the EventLoop thread, I think all previous fixes could be reverted and there would be no problem here, because m_disconnect_cb would only be accessed from one thread and couldn't be removed while it might be running. This would somehow have to be done without ThreadContext::waiter.m_mutex being held (to avoid deadlocks), so I'm not sure it is possible.
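One way that could look, reusing the Connection/CleanupIt stand-ins from the Background sketch and the ToyEventLoop above, purely for illustration. It deliberately sidesteps the waiter.m_mutex question and the question of how the caller knows the Connection is still alive, which is exactly the part I am unsure about:

```cpp
void destroyProxyClientThread(ToyEventLoop& loop, Connection& connection, CleanupIt disconnect_cb)
{
    loop.post([&connection, disconnect_cb] {
        // Runs on the event loop thread, the same thread that runs
        // m_disconnect_cb, so removal can never interleave with the callback.
        connection.removeSyncCleanup(disconnect_cb);
    });
}
```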
Switching mutexes
I am thinking that cleaner fixes for all the problems above could be possible by switching ProxyClient<Thread> and related code to use EventLoop::m_mutex for some of the locking instead of ThreadContext::waiter.m_mutex. This could simplify things by not having to juggle two mutexes and release and re-acquire them to keep a consistent lock order. I was even thinking of switching to EventLoop::m_mutex exclusively for all locking, but I don't think I can do that, because it is possible for a single thread to make IPC calls over multiple connections, and not all connections necessarily use the same event loop.
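For reference, the constraint at the end of that paragraph comes from the current shape of the per-thread state (stand-in types, not the real declarations):

```cpp
#include <map>

struct EventLoop { /* owns m_mutex and runs callbacks */ };
struct Connection { EventLoop* m_loop; };
struct ProxyClientThread {}; // stand-in for ProxyClient<Thread>

struct ThreadContext {
    // Keyed by Connection*, and different connections may belong to different
    // EventLoop instances, so no single EventLoop::m_mutex can guard these maps.
    std::map<Connection*, ProxyClientThread> callback_threads;
    std::map<Connection*, ProxyClientThread> request_threads;
};
```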
Refactoring ThreadContext
The ThreadContext callback_threads and request_threads maps, which are indexed by Connection*, could be replaced with per-Connection maps indexed by thread_id. That could make it possible to avoid ThreadContext::waiter.m_mutex entirely and use EventLoop::m_mutex everywhere.
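A sketch of what that inversion might look like, with stand-in types and an assumed integer thread_id key:

```cpp
#include <cstdint>
#include <map>
#include <mutex>

struct EventLoop { std::mutex m_mutex; };
struct ProxyClientThread {}; // stand-in for ProxyClient<Thread>

struct Connection {
    EventLoop& m_loop;
    // Guarded by m_loop.m_mutex rather than ThreadContext::waiter.m_mutex.
    std::map<std::uint64_t, ProxyClientThread> m_callback_threads; // keyed by thread_id
    std::map<std::uint64_t, ProxyClientThread> m_request_threads;
};
```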
Summary
There is a bug and I am not sure how to fix it yet.