Skip to content

Fix race condition in WebSocketClient connection handling #781

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

sampepose
Copy link

@sampepose sampepose commented Apr 29, 2025

🔗 Issue Links

Provide all JIRA tickets and/or GitHub issues related to this PR, if applicable.

🎯 Goal

The current implementation uses shared mutable state (connected and timeout flags) that can be accessed concurrently from different threads, causing a data race when the WebSocket connects while the main task is checking connection status.

This can be seen from the logs:

WARNING: ThreadSanitizer: data race (pid=75749)
  Write of size 1 at 0x00010aad29f0 by thread T1:
    #0 closure #2 () -> () in StreamVideo.StreamVideo.(connectWebSocketClient in _57FBE95933FF925DEA7AF207F70F5588)() async throws -> () <null> (MyTestApp.debug.dylib:arm64+0x1554088)
    #1 partial apply forwarder for closure #2 () -> () in StreamVideo.StreamVideo.(connectWebSocketClient in _57FBE95933FF925DEA7AF207F70F5588)() async throws -> () <null> (MyTestApp.debug.dylib:arm64+0x15623e4)
    #2 StreamVideo.WebSocketClient.(handle in _EA224C36B0C3BE21B3EC405BB58853DD)(healthcheck: StreamVideo.WrappedEvent, info: StreamVideo.HealthCheckInfo) -> () <null> (MyTestApp.debug.dylib:arm64+0x18cff14)
    #3 StreamVideo.WebSocketClient.webSocketDidReceiveMessage(Foundation.Data) -> () <null> (MyTestApp.debug.dylib:arm64+0x18ceba8)
    #4 protocol witness for StreamVideo.WebSocketEngineDelegate.webSocketDidReceiveMessage(Foundation.Data) -> () in conformance StreamVideo.WebSocketClient : StreamVideo.WebSocketEngineDelegate in StreamVideo <null> (MyTestApp.debug.dylib:arm64+0x18d183c)
    #5 closure #2 @Sendable () -> () in closure #1 @Sendable (Swift.Result<(extension in Foundation):__C.NSURLSessionWebSocketTask.Message, Swift.Error>) -> () in StreamVideo.URLSessionWebSocketEngine.(doRead in _E102C83AE68006C0360AB6BB3CADE04A)() -> () <null> (MyTestApp.debug.dylib:arm64+0x18bfc14)
    #6 partial apply forwarder for closure #2 @Sendable () -> () in closure #1 @Sendable (Swift.Result<(extension in Foundation):__C.NSURLSessionWebSocketTask.Message, Swift.Error>) -> () in StreamVideo.URLSessionWebSocketEngine.(doRead in _E102C83AE68006C0360AB6BB3CADE04A)() -> () <null> (MyTestApp.debug.dylib:arm64+0x18c52a8)
    #7 reabstraction thunk helper from @escaping @callee_guaranteed @Sendable () -> () to @escaping @callee_unowned @convention(block) @Sendable () -> () <null> (MyTestApp.debug.dylib:arm64+0x10b740c)
    #8 __tsan::invoke_and_release_block(void*) <null> (libclang_rt.tsan_iossim_dynamic.dylib:arm64+0x7bccc)
    #9 _dispatch_client_callout <null> (libdispatch.dylib:arm64+0x5738)

  Previous read of size 1 at 0x00010aad29f0 by thread T5:
    #0 (3) suspend resume partial function for StreamVideo.StreamVideo.(connectWebSocketClient in _57FBE95933FF925DEA7AF207F70F5588)() async throws -> () <null> (MyTestApp.debug.dylib:arm64+0x15535cc)
    #1 swift::runJobInEstablishedExecutorContext(swift::Job*) <null> (libswift_Concurrency.dylib:arm64+0x38108)

  Location is heap block of size 17 at 0x00010aad29e0 allocated by thread T1:
    #0 malloc <null> (libclang_rt.tsan_iossim_dynamic.dylib:arm64+0x5397c)
    #1 _malloc_type_malloc_outlined <null> (libsystem_malloc.dylib:arm64+0xf3ec)
    #2 swift::runJobInEstablishedExecutorContext(swift::Job*) <null> (libswift_Concurrency.dylib:arm64+0x38108)

  Thread T1 (tid=35619471, running) is a GCD worker thread

  Thread T5 (tid=35619547, running) is a GCD worker thread

SUMMARY: ThreadSanitizer: data race (/Users/sampepose/Library/Developer/CoreSimulator/Devices/B6999A6E-7380-48B5-85B9-A4E835706191/data/Containers/Bundle/Application/C68BB6C7-D8BF-402F-927A-1187AA422791/MyTestApp.app/MyTestApp.debug.dylib:arm64+0x1554088) in closure #2 () -> () in StreamVideo.StreamVideo.(connectWebSocketClient in _57FBE95933FF925DEA7AF207F70F5588)() async throws -> ()+0x118

📝 Summary

I've refactored the connection handling to use Swift's modern concurrency features to eliminate the data race while maintaining the same timeout and connection logic behavior.

🛠 Implementation

  • Replaced the polling loop with a task group that properly coordinates timeouts and connections
  • Implemented a safe continuation pattern to prevent potential double-resume issues
  • Used a shared handler reference that can be nullified to ensure the continuation is resumed exactly once
  • Added proper weak self references to avoid potential retain cycles

🧪 Manual Testing Notes

  1. Join a call successfully
  2. Attempt to join a call, but time out (must wait the 30 seconds)

See no crashes or ThreadSanitizer complaints in either case.

☑️ Contributor Checklist

  • I have signed the Stream CLA (required)
  • This change follows zero ⚠️ policy (required)
  • This change should receive manual QA
  • Changelog is updated with client-facing changes
  • New code is covered by unit tests
  • Comparison screenshots added for visual changes
  • Affected documentation updated (tutorial, CMS)

@sampepose sampepose requested a review from a team as a code owner April 29, 2025 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant