
Conversation

@gwbischof (Contributor) commented Nov 7, 2025

Fixes #1198
CI failures don't look related to this PR.

@danielballan (Member) left a comment

Good coverage!

@gwbischof gwbischof marked this pull request as ready for review November 12, 2025 18:42
if data is None:
    self.stream_closed.process(self)
    self._disconnect()
for attempt in self._websocket_retry_context():
@danielballan (Member) commented Nov 12, 2025

I see another structural issue here. We're using a single "retry context" for the entire lifecycle of the subscription. Imagine that we run for days and get disconnected once a day. We want a fresh "retry context" each time we get disconnected.

There are three loops:

  1. Loop forever (until self._disconnect_event.is_set()) where "disconnect" here refers to the Subscription, not to a particular websocket.
  2. Loop over retries when disconnected, and give up if retries are exhausted.
  3. Loop over polling recv() calls.

Here is a sketch that introduces one new method, _run, for the outer loop.

def _run(self):
    "This runs once for the lifecycle of the Subscription."
    while not self._disconnect_event.is_set():
        self._connect()
        try:
            self._receive()
        except (websockets.exceptions.ConnectionClosedError, OSError):
            logger.debug("Disconnected! Will attempt to reconnect")
            continue  # reconnect

def _receive(self):
    "Receive and process websocket messages."
    while not self._disconnect_event.is_set():
        try:
            data = self._websocket.recv(timeout=RECEIVE_TIMEOUT)
        except (TimeoutError, anyio.EndOfStream):
            continue
        ...

def _connect(self):
    try:
        for attempt in stamina.retry_context(
            on=(
                websockets.exceptions.ConnectionClosedError,
                OSError,
            ),
            ...
        ):
            with attempt:
                ...
    except (websockets.exceptions.ConnectionClosedError, OSError):
        # stamina re-raises the last exception once attempts are exhausted.
        logger.warning("exhausted attempts...")
        # This will break the _run loop.
        self._disconnect()

@gwbischof (Contributor, Author) commented

Isn't this the same as setting the _connect retries to unlimited?

@danielballan (Member) commented

No. The difference is, once you get a successful connection, the retry counter restarts fresh the next time the connection drops. But if you use up your N retries with no success, it (correctly) gives up.

@gwbischof (Contributor, Author) commented

Ohh, so for something like the end-of-run consumer we set the retries to unlimited? But for a typical client it retries N times and gives up if it doesn't connect? (And if it's able to connect, the number of retries gets reset?)

@danielballan (Member) commented

Based on my reading, I am coming to appreciate that unlimited retries is basically never the right solution. If something is persistently down, eventually the client should give up, and manual recovery becomes a feature rather than a bug. For a passive service like the end-of-run consumer, the retries might be more generous than for a client running in an interactive session like a Jupyter notebook, but still finite.

> And if it's able to connect, the number of retries gets reset?

Yes. The feature of the two loops (well, three loops if you count the recv loop) is riding out an unlimited number of dropped-connection events, but a limited number of re-connection attempts per dropped-connection event.

A service running for a very long time may see an unlimited number of disconnection events, and that's fine. But it should not hammer a distressed service indefinitely before giving up.
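This policy can be sketched in plain Python (a hypothetical simulation, not tiled code; `reconnect_with_budget`, `run`, and `MAX_RETRIES` are illustrative names): each disconnection event gets a fresh, finite attempt budget, and the client gives up only when a single event exhausts its budget.

```python
# Hypothetical sketch of the retry policy discussed above: unlimited
# disconnection events, but a fresh, finite retry budget per event.
MAX_RETRIES = 3  # illustrative; not tiled's actual default


def reconnect_with_budget(connect, max_retries=MAX_RETRIES):
    """Call `connect` up to `max_retries` times for ONE disconnection event.

    Returns True as soon as a call succeeds, False if the budget is used up.
    """
    for _ in range(max_retries):
        try:
            connect()
            return True
        except ConnectionError:
            continue  # retry within this event's budget
    return False


def run(events):
    """Handle a sequence of disconnection events, one budget each.

    A successful reconnection carries no debt into the next event: the
    counter restarts fresh. Exhausting one event's budget ends the run,
    analogous to calling self._disconnect() in the sketch above.
    """
    results = []
    for connect in events:
        ok = reconnect_with_budget(connect)
        results.append(ok)
        if not ok:
            break  # give up for good; manual recovery from here
    return results
```

An event whose `connect` fails twice still succeeds on the third attempt; an event that fails three times exhausts the budget and stops the run, no matter how many earlier events succeeded.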

@pytest.fixture(autouse=True)
def fast_retries(monkeypatch):
    """Set retry attempts to 2 for faster tests (down from default 10)."""
    monkeypatch.setattr("tiled.client.stream.TILED_RETRY_ATTEMPTS", 2)
@gwbischof (Contributor, Author) commented

oh nice :)
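For context, the autouse fixture quoted above relies on pytest's monkeypatch, which sets a module attribute for the duration of each test and restores the original afterward. Conceptually it is equivalent to this hand-rolled context manager (a sketch; the `stream` module and the default of 10 here are stand-ins, not tiled's actual values):

```python
import contextlib
import types

# Stand-in module mimicking tiled.client.stream with a retry constant.
stream = types.ModuleType("stream")
stream.TILED_RETRY_ATTEMPTS = 10  # placeholder default, for illustration


@contextlib.contextmanager
def patched_attr(obj, name, value):
    """Roughly what monkeypatch.setattr does: swap the attribute in,
    then restore the original value on exit, even if the test fails."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield
    finally:
        setattr(obj, name, original)
```

Inside `with patched_attr(stream, "TILED_RETRY_ATTEMPTS", 2):` the tests see 2 attempts and run quickly; outside it, the default is untouched. pytest's `autouse=True` simply applies this to every test in the module without an explicit fixture argument.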

self._disconnect_event = threading.Event()
self._thread = None
self._last_received_sequence = None  # Track last sequence for reconnection
self._connected = False  # Track connection state
@danielballan (Member) commented
Under the general rule of "state is the root of all evil" I wonder if we can avoid adding this state. I think that the control flow ensures that self._connect(...) only gets called (1) when self._run(...) starts and (2) after a connection error is raised. So the if self._connected: return cut-out is protecting a codepath that, in fact, you can never hit.



Successfully merging this pull request may close: Implement automatic reconnection for WS.

2 participants