@LevilkTheReal
Problem / Description

The Waku SDK and core protocols lack proper cleanup mechanisms, which leads to the following issues:

  1. Memory Leaks - Event listeners are never removed during stop operations, causing memory to accumulate over repeated start/stop cycles
  2. Resource Leaks - Stream managers, timers, and intervals are not properly cleaned up, leading to resource exhaustion
  3. Dangling Operations - Queries and background tasks continue running after stop() is called, preventing clean shutdown
  4. No Graceful Cancellation - Long-running operations cannot be cancelled, resulting in slow shutdowns and unpredictable behavior

Solution

This PR aims to implement thorough cleanup and resource management across all protocols and SDK layers.

Protocol Level Changes:

  • Added stop() methods to all protocols (Filter, LightPush, Store, Relay) to properly clean up stream managers and remove event listeners
  • Added abort signal support to Store queries via a new abortSignal parameter in the IStore interface, enabling graceful cancellation of long-running queries
  • Implemented proper disposal of protocol resources (streams, subscriptions, handlers)
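
To illustrate the abort signal idea from the list above, here is a minimal sketch of a paginated query that checks an AbortSignal between pages. The names `queryPages`, `QueryOptions`, and `collectUntilAborted` are hypothetical stand-ins written for this example; they are not the actual IStore API, and the real query path may check the signal at different points.

```typescript
// Hypothetical options type; only the abortSignal field mirrors the PR description.
type QueryOptions = { abortSignal?: AbortSignal };

// Hypothetical stand-in for a paginated Store query (not the real IStore API).
async function* queryPages(
  pages: string[][],
  options: QueryOptions = {}
): AsyncGenerator<string[]> {
  for (const page of pages) {
    // Check before fetching each page so an abort ends the query promptly.
    if (options.abortSignal?.aborted) {
      return;
    }
    // Simulate the network round trip for one page.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
    yield page;
  }
}

// Consumes pages and aborts after the first one has been received.
async function collectUntilAborted(): Promise<string[]> {
  const controller = new AbortController();
  const received: string[] = [];
  const query = queryPages([["a"], ["b"], ["c"]], {
    abortSignal: controller.signal,
  });
  for await (const page of query) {
    received.push(...page);
    controller.abort(); // request cancellation; remaining pages are skipped
  }
  return received;
}
```

In this sketch, `collectUntilAborted()` resolves with only the first page, because the generator sees the aborted signal before fetching the second one.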

SDK Level Changes:

  • WakuNode: Updated stop() to call protocol-level stop methods during shutdown
  • QueryOnConnect: Implemented active query tracking with proper cleanup; stop() now waits for running queries to complete and removes all event listeners
  • MissingMessageRetriever: Added query tracking with abort signal support, ensuring all active queries are properly cancelled and awaited during cleanup
  • RetryManager: Added stopAllRetries() method to cancel all pending retry operations and clear timeouts
  • ReliableChannel: Implemented a comprehensive stop() method that orchestrates cleanup of all child components, waits for pending tasks, unsubscribes from message streams, and removes all event listeners
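
The query tracking pattern described for QueryOnConnect and MissingMessageRetriever can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `ActiveQueryTracker`, `run`, `activeCount`, and `abortableDelay` are hypothetical names, and the real components may track queries differently.

```typescript
// Hypothetical helper: a delay that rejects as soon as the signal is aborted.
function abortableDelay(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const timer = setTimeout(() => resolve(), ms);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(timer); // release the timer so nothing outlives stop()
        reject(new Error("aborted"));
      },
      { once: true }
    );
  });
}

// Hypothetical tracker: remembers running queries so stop() can cancel and await them.
class ActiveQueryTracker {
  private readonly controller = new AbortController();
  private readonly active = new Set<Promise<void>>();

  public run(query: (signal: AbortSignal) => Promise<void>): void {
    if (this.controller.signal.aborted) {
      return; // refuse new work once stop() has been called
    }
    const tracked: Promise<void> = query(this.controller.signal)
      .catch(() => undefined) // swallow errors, including aborts, so stop() never rejects
      .then(() => {
        this.active.delete(tracked);
      });
    this.active.add(tracked);
  }

  public get activeCount(): number {
    return this.active.size;
  }

  public async stop(): Promise<void> {
    this.controller.abort(); // cancel everything still in flight
    await Promise.all(Array.from(this.active)); // wait until each query has settled
  }
}
```

One design choice worth noting: the tracker awaits the queries after aborting them instead of abandoning them, so no query can touch its owner after `stop()` has resolved.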

Breaking Changes:
The following methods are now async and must be awaited:

  • QueryOnConnect.stop()
  • MissingMessageRetriever.stop()
  • ReliableChannel.stop()

Migration is straightforward - add await when calling these methods.
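
A minimal sketch of the migration, using a hypothetical `ChannelSketch` class in place of the real ReliableChannel (whose constructor and internals are not reproduced here):

```typescript
// Hypothetical stand-in for a component whose stop() is now async.
class ChannelSketch {
  public stopped = false;

  public async stop(): Promise<void> {
    // Simulate asynchronous cleanup such as awaiting pending queries.
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
    this.stopped = true;
  }
}

async function shutdown(channel: ChannelSketch): Promise<boolean> {
  // Before: channel.stop();  (returned immediately)
  // After: awaiting ensures cleanup has finished before continuing.
  await channel.stop();
  return channel.stopped;
}
```

Without the await, code that runs right after the call could still observe the component in a partially stopped state.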


At SolarPunk, we are currently using ReliableChannels with all the changes on this branch.
Since adopting them, we have no longer seen the following errors:

@solarpunkltd_swarm-chat-js.js?v=43444ad2:33999 Uncaught (in promise) Error: Message Channel must be started
    at _ReliableChannel.assertStarted (@solarpunkltd_swarm-chat-js.js?v=43444ad2:33999:13)
    at _ReliableChannel.sendSyncMessage (@solarpunkltd_swarm-chat-js.js?v=43444ad2:34032:10)
    at @solarpunkltd_swarm-chat-js.js?v=43444ad2:34013:14

@waku_sdk.js?v=43444ad2:12365 Uncaught (in promise) Error: Store query failed with status code: undefined, description: undefined
    at StoreCore.queryPerPage (@waku_sdk.js?v=43444ad2:12365:15)
    at async Store.queryGenerator (@waku_sdk.js?v=43444ad2:21944:24)
    at async MissingMessageRetriever.retrieveMissingMessage (@solarpunkltd_swarm-chat-js.js?v=43444ad2:33764:24)
queryPerPage @ @waku_sdk.js?v=43444ad2:12365
await in queryPerPage
queryGenerator @ @waku_sdk.js?v=43444ad2:21944
await in queryGenerator
retrieveMissingMessage @ @solarpunkltd_swarm-chat-js.js?v=43444ad2:33764
(anonymous) @ @solarpunkltd_swarm-chat-js.js?v=43444ad2:33748

@waku_sdk.js?v=43444ad2:10653 Uncaught (in promise) AbortError: The operation was aborted
    at raceSignal (@waku_sdk.js?v=43444ad2:10653:17)
    at Object.read (@waku_sdk.js?v=43444ad2:22792:39)
    at Object.read (@waku_sdk.js?v=43444ad2:22868:41)
    at read3 (@waku_sdk.js?v=43444ad2:39226:28)
    at readString (@waku_sdk.js?v=43444ad2:39234:21)
    at select (@waku_sdk.js?v=43444ad2:39259:24)
    at async ConnectionImpl.newStream [as _newStream] (@waku_sdk.js?v=43444ad2:39964:40)
    at async ConnectionImpl.newStream (@waku_sdk.js?v=43444ad2:39598:20)
    at async @waku_sdk.js?v=43444ad2:38306:28

I have not written any tests yet. After review, if the changes are correct and acceptable, I will add the required tests.

Checklist

  • Code changes are covered by unit tests.
  • Code changes are covered by e2e tests, if applicable.
  • Dogfooding has been performed, if feasible.
  • A test version has been published, if required.
  • All CI checks pass successfully.
