#186 fixed several race conditions that happen during the "disconnecting and blocking" unit test, but a new one was reported in Sjors/bitcoin#90 (comment), with details in Sjors/bitcoin#90 (comment), and I want to write more notes about it here. There is also a branch with previously attempted fixes.
The problem is not too hard to reproduce locally: if the mptest binary is run in a loop in bash, the tsan error happens within a few minutes. With the attempted fixes so far, running the mptest binary in a loop instead results in a deadlock after about 20 minutes for me.
Background
If a Connection object is destroyed while a ProxyClient<Thread> object is still using it (due to a sudden disconnect), the ProxyClient<Thread>::m_disconnect_cb callback will be called from the ~Connection destructor, deleting the ProxyClient<Thread> object from whichever ConnThreads map it is in.
If a ProxyClient<Thread> object is destroyed before its Connection object (the more normal case), the ProxyClient<Thread> destructor will unregister m_disconnect_cb, so when the Connection is later destroyed it will not try to delete the ProxyClient<Thread> object.
If the two destructors are called in sequence in either order, everything is fine. But if the destructors are called from different threads in a race, numerous problems occur.
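The intended contract looks roughly like the following sketch. removeSyncCleanup matches the hook referred to later in these notes; addSyncCleanup and everything else here are simplified stand-ins, not the actual libmultiprocess code:

```cpp
#include <functional>
#include <list>
#include <mutex>

using CleanupList = std::list<std::function<void()>>;
using CleanupIt = CleanupList::iterator;

struct Connection {
    std::mutex m_mutex;
    CleanupList m_sync_cleanup_fns; // disconnect callbacks such as ProxyClient<Thread>::m_disconnect_cb

    // Register a callback to run if the connection goes away while the client still exists.
    CleanupIt addSyncCleanup(std::function<void()> fn)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_sync_cleanup_fns.insert(m_sync_cleanup_fns.end(), std::move(fn));
    }

    // Unregister the callback in the normal case where the client is destroyed first.
    void removeSyncCleanup(CleanupIt it)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_sync_cleanup_fns.erase(it);
    }

    ~Connection()
    {
        // Sudden-disconnect case: run every callback still registered, which
        // deletes the corresponding ProxyClient<Thread> from its ConnThreads map.
        for (auto& fn : m_sync_cleanup_fns) fn();
    }
};
```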
Problem 1: invalid m_disconnect_cb iterator erase
If the Connection is destroyed first and ~Connection is about to call the m_disconnect_cb callback, but the ProxyClient<Thread> destructor then runs in another thread, it will currently (as of a11e690) try to delete m_disconnect_cb using an iterator which is no longer valid. 79c09e4 fixes this by changing the ~ProxyClient<Thread> destructor to avoid deleting m_disconnect_cb once it has been invalidated.
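A rough sketch of that guard, reusing the Connection and CleanupIt stand-ins from the Background sketch. The optional-iterator member is an assumption about how the invalidation could be tracked, not the actual field layout in 79c09e4:

```cpp
#include <optional>

struct ProxyClientThread { // stand-in for ProxyClient<Thread>
    Connection& m_connection;
    std::optional<CleanupIt> m_disconnect_cb; // reset once ~Connection has invalidated it

    ~ProxyClientThread()
    {
        if (m_disconnect_cb) {
            // Normal shutdown order: the connection is still alive, the stored
            // iterator is valid, and the callback can be unregistered safely.
            m_connection.removeSyncCleanup(*m_disconnect_cb);
            m_disconnect_cb.reset();
        }
        // Otherwise ~Connection already took over the callback entry, and
        // erasing through the stale iterator would be undefined behavior.
    }
};
```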
Problem 2: ProxyClient<Thread> use-after-free
The second problem is if the Connection is destroyed first and invokes the m_disconnect_cb callback, but the ~ProxyClient<Thread> destructor runs and finishes before m_disconnect_cb has a chance to do anything. When the callback resumes, the ProxyClient<Thread> object will be gone and the callback's attempt to access and erase it will be invalid. 79c09e4 fixes this by making ~ProxyClient<Thread> set a destroyed bit when it runs and having the m_disconnect_cb callback just return early if the bit is set.
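The destroyed-bit idea could look roughly like the sketch below. The heap-allocated DisconnectState with its own mutex is an assumption made to keep the example self-contained; in the real code the check is done under ThreadContext::waiter.m_mutex, as discussed under Problem 3:

```cpp
#include <functional>
#include <memory>
#include <mutex>

struct DisconnectState {
    std::mutex m_mutex;
    bool m_destroyed = false;
};

struct ProxyClientThread { // stand-in for ProxyClient<Thread>
    std::shared_ptr<DisconnectState> m_state = std::make_shared<DisconnectState>();

    // Roughly what the registered m_disconnect_cb would capture and do.
    std::function<void()> makeDisconnectCb()
    {
        return [state = m_state] {
            std::lock_guard<std::mutex> lock(state->m_mutex);
            if (state->m_destroyed) return; // client already destroyed; nothing left to erase
            // ... otherwise erase this ProxyClient<Thread> from its ConnThreads map ...
        };
    }

    ~ProxyClientThread()
    {
        // Let an in-flight m_disconnect_cb bail out early instead of touching a
        // freed object; the shared state outlives both the client and the callback.
        std::lock_guard<std::mutex> lock(m_state->m_mutex);
        m_state->m_destroyed = true;
    }
};
```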
Problem 3: ThreadContext use-after-free
Just fixing the race between ~Connection and ~ProxyClient<Thread> is actually not enough, because after the ProxyClient<Thread> object is destroyed, the thread owning it can exit, destroying the thread_local ThreadContext struct that contained the ProxyClient<Thread> object, and with it the ThreadContext::waiter.m_mutex mutex the ProxyClient<Thread> code uses for synchronization. If this happens while an m_disconnect_cb callback is still in progress, that callback will fail when it tries to lock a mutex that no longer exists. (Commit 85df929 was a broken attempt to fix this, and commit 30431bc has another description of the problem.)
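Reduced to its shape with stand-in types (the real ThreadContext and Waiter have more members), the hazard is:

```cpp
#include <mutex>

struct Waiter {
    std::mutex m_mutex; // what the ProxyClient<Thread> code locks for synchronization
};

struct ThreadContext { // stand-in for the real thread_local ThreadContext
    Waiter waiter;
    // maps of ProxyClient<Thread> objects keyed by Connection* live here too
};

thread_local ThreadContext g_thread_context;

// Failure sequence:
//  1. ~Connection invokes m_disconnect_cb on the event loop thread.
//  2. ~ProxyClient<Thread> finishes on the owning thread, which then exits.
//  3. Thread exit destroys g_thread_context, including waiter.m_mutex.
//  4. The still-running callback locks waiter.m_mutex -> use after free.
```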
Possible Fixes
Adding EventLoop::sync call
The most direct fix for Problem 3 is probably to have the thread that destroyed the ProxyClient<Thread> object, and is about to exit and destroy ThreadContext::waiter.m_mutex, call EventLoop::sync to force m_disconnect_cb to finish running before that happens. This would work because m_disconnect_cb always runs on the event loop thread. It would require reverting 85df929 to avoid reintroducing Problem 2 (so the m_disconnect_cb callback checks the destroyed bit after it acquires ThreadContext::waiter.m_mutex instead of before).
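To make the ordering argument concrete, here is a self-contained toy event loop whose sync-style call blocks until everything queued before it has run. It is an illustration of the idea only, not the real EventLoop class or the exact EventLoop::sync signature:

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

class ToyEventLoop {
public:
    ToyEventLoop() : m_thread([this] { run(); }) {}
    ~ToyEventLoop() { post({}); m_thread.join(); }

    // Queue a callback to run on the loop thread (like m_disconnect_cb).
    void post(std::function<void()> fn)
    {
        { std::lock_guard<std::mutex> lock(m_mutex); m_queue.push(std::move(fn)); }
        m_cv.notify_one();
    }

    // Block until the loop thread has drained everything queued so far.
    void sync()
    {
        std::promise<void> done;
        post([&] { done.set_value(); });
        done.get_future().wait();
    }

private:
    void run()
    {
        while (true) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cv.wait(lock, [&] { return !m_queue.empty(); });
                fn = std::move(m_queue.front());
                m_queue.pop();
            }
            if (!fn) return; // empty function is the stop signal
            fn();
        }
    }
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::function<void()>> m_queue;
    std::thread m_thread;
};

// Usage in this context: the thread that destroyed a ProxyClient<Thread> calls
// loop.sync() before exiting, so any in-flight disconnect callback has finished
// before the thread_local ThreadContext (and waiter.m_mutex) is destroyed.
```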
Calling removeSyncCleanup from EventLoop thread
If removeSyncCleanup(m_disconnect_cb) could be called from the EventLoop thread, I think all previous fixes could be reverted and there would be no problem here, because m_disconnect_cb would only be accessed from one thread and couldn't be removed while it might be running. This would somehow have to be done without ThreadContext::waiter.m_mutex being held (to avoid deadlocks), so I'm not sure it is possible.
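One way that could look, reusing the Connection/CleanupIt stand-ins from the Background sketch and the ToyEventLoop above, purely for illustration. It deliberately sidesteps the waiter.m_mutex question and the question of how the caller knows the Connection is still alive, which is exactly the part I am unsure about:

```cpp
void destroyProxyClientThread(ToyEventLoop& loop, Connection& connection, CleanupIt disconnect_cb)
{
    loop.post([&connection, disconnect_cb] {
        // Runs on the event loop thread, the same thread that runs
        // m_disconnect_cb, so removal can never interleave with the callback.
        connection.removeSyncCleanup(disconnect_cb);
    });
}
```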
Switching mutexes
I am thinking that cleaner fixes for all the problems above could be possible by switching ProxyClient<Thread> and related code to use EventLoop::m_mutex for some of the locking instead of ThreadContext::waiter.m_mutex. This could simplify things by not having to juggle two mutexes and release and re-acquire them to keep a consistent lock order. I was even thinking of switching to EventLoop::m_mutex exclusively for all locking, but I don't think I can do that, because it is possible for a single thread to make IPC calls over multiple connections, and not all connections necessarily use the same event loop.
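For reference, the constraint at the end of that paragraph comes from the current shape of the per-thread state (stand-in types, not the real declarations):

```cpp
#include <map>

struct EventLoop { /* owns m_mutex and runs callbacks */ };
struct Connection { EventLoop* m_loop; };
struct ProxyClientThread {}; // stand-in for ProxyClient<Thread>

struct ThreadContext {
    // Keyed by Connection*, and different connections may belong to different
    // EventLoop instances, so no single EventLoop::m_mutex can guard these maps.
    std::map<Connection*, ProxyClientThread> callback_threads;
    std::map<Connection*, ProxyClientThread> request_threads;
};
```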
Refactoring ThreadContext
The ThreadContext callback_threads and request_threads maps, which are indexed by Connection*, could be replaced with per-Connection maps indexed by thread_id. That could make it possible to avoid ThreadContext::waiter.m_mutex entirely and use EventLoop::m_mutex everywhere.
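A sketch of what that inversion might look like, with stand-in types and an assumed integer thread_id key:

```cpp
#include <cstdint>
#include <map>
#include <mutex>

struct EventLoop { std::mutex m_mutex; };
struct ProxyClientThread {}; // stand-in for ProxyClient<Thread>

struct Connection {
    EventLoop& m_loop;
    // Guarded by m_loop.m_mutex rather than ThreadContext::waiter.m_mutex.
    std::map<std::uint64_t, ProxyClientThread> m_callback_threads; // keyed by thread_id
    std::map<std::uint64_t, ProxyClientThread> m_request_threads;
};
```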
Summary
There is a bug and I am not sure how to fix it yet.