Skip to content

[ZEPPELIN-6260] Fix memory leak in WebSocket watcher connections #5001

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

renechoi
Copy link
Contributor

What is this PR for?

This PR fixes a memory leak issue where WebSocket connections that are switched to watcher mode are never removed from the watcherSockets queue when the connection is closed. This causes the queue to grow indefinitely, leading to increased memory
usage over time and potential OutOfMemoryError in long-running Zeppelin servers.

What type of PR is it?

Bug Fix

What is the Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-6260

How should this be tested?

Steps to reproduce the issue:

  1. Open a WebSocket connection to Zeppelin
  2. Send a WATCHER message to switch the connection to watcher mode
  3. Close the connection
  4. The connection remains in watcherSockets queue indefinitely

Verification after fix:

  • Unit tests have been added to verify the fix:
    • removeWatcherConnectionCleansQueue: Basic functionality test
    • removeWatcherConnectionWithMultipleWatchers: Tests selective removal
    • removeWatcherConnectionConcurrentTest: Tests thread safety
    • switchConnectionToWatcherAndRemove: Tests complete lifecycle

Screenshots (if appropriate)

N/A

Questions:

  • Does the license files need update? No
  • Is there breaking changes for older versions? No
  • Does this needs documentation? No

Description of changes:

  1. Added removeWatcherConnection() method to ConnectionManager to safely remove connections from the watcherSockets queue
  2. Modified NotebookServer.removeConnection() to call removeWatcherConnection() when any connection is closed
  3. Added debug logging to track watcher connection removal
  4. Added comprehensive unit tests including concurrent access scenarios

The fix is minimal and safe:

  • The ConcurrentLinkedQueue.remove() operation is safe even if the element doesn't exist
  • The synchronization block is very short, minimizing performance impact
  • The approach ensures no watcher connections are leaked, regardless of how they were closed

Related observations:

During the investigation, I noticed that broadcastToWatchers() doesn't remove watchers when IOException occurs. This could cause performance degradation as closed connections would repeatedly fail. However, this is a separate issue and should be
addressed in a different PR to keep this fix focused.

Copy link
Contributor

@Reamer Reamer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Reamer Reamer merged commit 6edb9f1 into apache:master Jul 25, 2025
16 of 18 checks passed
asf-gitbox-commits pushed a commit that referenced this pull request Jul 25, 2025
###  What is this PR for?

  This PR fixes a memory leak issue where WebSocket connections that are switched to watcher mode are never removed from the watcherSockets queue when the connection is closed. This causes the queue to grow indefinitely, leading to increased memory
  usage over time and potential OutOfMemoryError in long-running Zeppelin servers.

###  What type of PR is it?

  Bug Fix

###  What is the Jira issue?

  https://issues.apache.org/jira/browse/ZEPPELIN-6260

###  How should this be tested?

  Steps to reproduce the issue:
  1. Open a WebSocket connection to Zeppelin
  2. Send a WATCHER message to switch the connection to watcher mode
  3. Close the connection
  4. The connection remains in watcherSockets queue indefinitely

  Verification after fix:
  - Unit tests have been added to verify the fix:
    - removeWatcherConnectionCleansQueue: Basic functionality test
    - removeWatcherConnectionWithMultipleWatchers: Tests selective removal
    - removeWatcherConnectionConcurrentTest: Tests thread safety
    - switchConnectionToWatcherAndRemove: Tests complete lifecycle

###  Screenshots (if appropriate)

  N/A

###  Questions:

  - Does the license files need update? No
  - Is there breaking changes for older versions? No
  - Does this needs documentation? No

###  Description of changes:

  1. Added removeWatcherConnection() method to ConnectionManager to safely remove connections from the watcherSockets queue
  2. Modified NotebookServer.removeConnection() to call removeWatcherConnection() when any connection is closed
  3. Added debug logging to track watcher connection removal
  4. Added comprehensive unit tests including concurrent access scenarios

  The fix is minimal and safe:
  - The ConcurrentLinkedQueue.remove() operation is safe even if the element doesn't exist
  - The synchronization block is very short, minimizing performance impact
  - The approach ensures no watcher connections are leaked, regardless of how they were closed

###  Related observations:

  During the investigation, I noticed that broadcastToWatchers() doesn't remove watchers when IOException occurs. This could cause performance degradation as closed connections would repeatedly fail. However, this is a separate issue and should be
  addressed in a different PR to keep this fix focused.

Closes #5001 from renechoi/ZEPPELIN-6260-fix-watcher-memory-leak.

Signed-off-by: Philipp Dallig <[email protected]>
(cherry picked from commit 6edb9f1)
Signed-off-by: Philipp Dallig <[email protected]>
@Reamer
Copy link
Contributor

Reamer commented Jul 25, 2025

Merged into master and branch-0.12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants