-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-thread bug when using usrsctp in N Worker threads in Rust #1352
Comments
This was referenced Mar 5, 2024
Usage of
That's all. Then something similar should be done in |
ibc
added a commit
that referenced
this issue
Mar 5, 2024
**WIP** Fixes #1352 ### Details - Basically as described in the ticket. But not everything is done at all. - Also, I'm testing this in Node by using UV async stuff (which doesn't make sense in mediasoup for Node but anyway). ### TODO - None of these changes should take effect when in Node, so we need to pass (or to NOT pass) some `define` only from Rust to enable this in the C++ code. We don't want to deal with UV async stuff when in Node because it's not needed at all, so let's see how to do it. - Missing thread X to initialize usrsctp and run the `Checker` singleton. And many other things. - Crash when a `SctpAssociation` is closed. I think it's because somehow the `onAsync` callback is invoked asynchronously (of course) so when it calls `sctpAssociation->OnUsrSctpSendSctpData()` it happens that such a `SctpAssociation` has already been freed. Not sure how to resolve it. Here the logs: ``` mediasoup:Transport close() +18s mediasoup:Channel request() [method:ROUTER_CLOSE_TRANSPORT] +8s mediasoup:Producer transportClosed() +19s mediasoup:DataProducer transportClosed() +18s mediasoup:DataProducer transportClosed() +0ms mediasoup:Transport close() +1ms mediasoup:Channel request() [method:ROUTER_CLOSE_TRANSPORT] +1ms mediasoup:Consumer transportClosed() +19s mediasoup:DataConsumer transportClosed() +18s mediasoup:DataConsumer transportClosed() +1ms mediasoup:Channel [pid:98040] RTC::SctpAssociation::ResetSctpStream() | SCTP_RESET_STREAMS sent [streamId:1] +1ms mediasoup:Channel request succeeded [method:ROUTER_CLOSE_TRANSPORT, id:39] +0ms DepUsrSCTP::onAsync() | ---------- onAsync!! DepUsrSCTP::onAsync() | ---------- onAsync, sending SCTP data!! mediasoup:Channel Producer Channel ended by the worker process +1ms mediasoup:ERROR:Worker worker process died unexpectedly [pid:98040, code:null, signal:SIGSEGV] +0ms ```
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
After investigating it in depth we have found a bad design in
DepUsrSCTP
class that allows a thread A make calls directly in thread B. This explains the crash reported in issue #1352.Overview of the usrsctp library integration in mediasoup
usrsctp was integrated in mediasoup once it merged the "Single thread" feature (PR sctplab/usrsctp#339). That PR added the
usrsctp_init_nothreads()
API to run usrsctp in mono thread mode, meaning that usrsctp no longer runs its own timer into an internal thread but instead exposes an API for the application to be aware of when it should "ping" usrsctp so it can send pending SCTP messages (basically SCTP retransmissions due to non received SCTP ACK from the remote endpoint).The way we integrated it was by wrapping the usrsctp API within the
DepUsrSCTP
andSctpAssociation
class in mediasoup.It's important to notice that this was designed having mediasoup for Node usage, this is: each mediasoup worker runs in a separate process with a single thread. From now on, this overview section describes the behavior in mediasoup for Node:
Worker
so a new process is created with a single libuv loop.Worker
creation,DepUsrSCTP::ClassInit()
is created which callsusrsctp_init_nothreads()
by passing a staticonSendSctpData()
callback as argument, which will be later invoked by usrsctp when it needs to send SCTP retransmissions to remote endpoints.Worker
C++ constructor calls toDepUsrSctp::CreateChecker()
which creates a singleton instance of theDepUsrSCTP::Checker
class which is stored as astatic
member inDepUsrSCTP
. Such aChecker
inherits fromTimerHandler
since it's intended to "ping" usrsctp every 10ms so usrsctp can send pending SCTP retransmissions.XxxxTransport
with SCTP support is created in the worker, itsSctpAssociation
instance calls toDepUsrSCTP::RegisterSctpAssociation(this)
to register itself inDepUsrSCTP
. If it's the first SCTP association then theChecker
is started, meaning that it runs a periodic timer every 10ms.usrsctp_handle_timers()
is called and usrsctp may invoke theonSendSctpData
callback to send pending SCTP retransmissions. In such a callback, the address of the associatedSctpAssociation
is given so we callsctpAssociation->OnUsrSctpSendSctpData(data, len)
to send the SCTP data to the endpoint.There is a single thread in the whole process (this is mediasoup for Node, remember) so everything is good. And it doesn't matter that the Node app creates many
Workers
since eachWorker
is a separate process with a single thread.Problem in Rust with multi threaded Workers
When in mefiasoup for Rust, each
Worker
doesn't run in a separate process but in a separate native thread. However usrsctp has global state so problems start showing up. Here an example:Worker
1 which creates a new thread 1 with its own libuv loop 1.DepUsrSCTP::ClassInit()
is called. Here aGlobalInstances
static member was added which is incremented every timeClassInit()
is called. If it was 0 thenusrsctp_init_nothreads()
is called by passing the staticonSendSctpData
callback as argument.SctpAssociation
is created,DepUsrSCTP::RegisterSctpAssociation()
is called from it. It increments a staticnumSctpAssociations
member and, if it was 0, then theChecker
is started.Checker::Start()
method runs a periodic timer. In which thread? In the thread from where the method was called. In this case the thread 1 and its associated libuv loop 1.Workers
. It createsWorker
2 which creates a new thread 2 with its own libuv loop 2.DepUsrSCTP::ClassInit()
is called but sinceGlobalInstances
static member was not 0 it won't callusrsctp_init_nothreads()
again.SctpAssociation
inWorker
2 is created,DepUsrSCTP::RegisterSctpAssociation()
is called from it. It increments a staticnumSctpAssociations
member and, since it was not 0, theChecker
is not started again (it's already started).Worker
1 andWorker 2
. Let's imagine there is packet loss so some SCTP messages created by usrsctp didn't reach the remote endpoints.Worker
1) fires, sousrsctp_handle_timers()
is called.usrsctp_handle_timers()
checks pending SCTP messages to be sent or resent and invokes theonSendSctpData()
static callback. Again: such a callback is invoked within thread 1.StcpAssociation
to which the SCTP message should be sent.SctpAssociation
could belong to aWebRtcTransport
inWorker
2. In that case, it callssctpAssociation->OnUsrSctpSendSctpData(data, len)
from thread 1, but such aSctpAssociation
belongs to thread 2 (which runs the libuv loop 2). This is a violation of libuv design since no other threads should directly invoke libuv methods on another thread. And it's also a problem in mediasoup C++ code since now we have thread 1 and thread 2 invoking methods in parallel in mediasoup C++ class instances inWorker
2. Memory corruption or whatever can happen in this case.So this is the problem.
Solution
Worker
is created (and hence its thread 2)DepUsrSCTP::ClassInit()
its called and it must create a separate thread X, create a separate libuv loop X into it, callusrsctp_init_nothreads(onSendSctpData)
from that thread X and create aChecker
singleton in that thread X.Workers
could be created (so we have thread 2, 3, etc).SctpAssociation
is created in any of thoseWorkers
and it callsDepUsrSCTP::RegisterSctpAssociation()
, such a method callsDepUsrSCTP::checker->Start()
as usual, but it should useuv_async_send()
to make such aStart()
method run within the thread X (instead than running it within the callling thread 1 or 2 or whatever).onSendSctpData()
will be called by usrsctp (from different threads X, 1, 2, etc).onSendSctpData()
callback, we get the targetSctpAssociation
and from it we need to get its associated libuv loop, or perhaps create auv_async_t
handle within the previously calledDepUsrSCTP::RegisterSctpAssociation()
, and retrieve it among theSctpAssociation
instance itself.uv_async_send()
to callsctpAssociation->OnUsrSctpSendSctpData(data, len)
within the thread associated to thatuv_async_t
handle. This way, each call toSctpAssociation::OnUsrSctpSendSctpData()
is guaranteed to be called within the same thread in which such aSctpAssociation
lives.DepUsrSCTP::ClassDestroy()
is called by the last remainingWorker
(soGlobalInstances
was 1), we must again useuv_async_send()
to invokeusrsctp_finish()
and callchecker->Stop()
and should also destroy/close/free the thread X. All those things will be created again if a newWorker
is created later.The text was updated successfully, but these errors were encountered: