Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle PONG messages before closing connection due to heartbeat timeout #10544

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

mstegmaier
Copy link

@mstegmaier mstegmaier commented Mar 13, 2025

What do these changes do?

Currently, if receive is never called (or not in time), PONG messages aren't handled and aiohttp closes the connection for missing PONG after half of the heartbeat time has passed.
This commit changes the code that closes the connection for missing PONG so that instead of immediately closing the connection, it checks whether a PONG has been sent. If a PONG has been sent, the connection doesn't get closed.

Are there changes in behavior for the user?

The heartbeat mechanism is supposed to help identify if a websocket connection is still alive. When a PING message isn't responded to by a PONG message, the connection is closed. Websocket clients such as Web Browsers implement the heartbeat mechanism as well and for that reason, enabling heartbeat when using aiohttp can also be used to keep a websocket connection alive.

The current implementation of the heartbeat mechanism in aiohttp relies on the user to regularly call the receive() function in order to handle incoming PONG messages. If the user fails to call receive() within the given time frame, the connection gets closed by aiohttp. So when heartbeat in aiohttp is enabled in order to keep a websocket alive, it can actually cause the opposite, i.e. closing the websocket. Specifically, this happens for unidirectional communication via websockets (in our case a video stream).

The user will not notice any different behavior for "regular" use of websockets which used to work without issues. The user will only notice a change in behavior if originally their websocket connection was closed by aiohttp prematurely due to missing PONG even though as PONG has been sent by the other side.

If the user is calling receive() too late, e.g., a long running task prevents receive() from being called in time, the original implementation closed the websocket. The new implementation from this PR won't close the websocket immediately but will log a warning that a non-PONG message was received.

Is it a substantial burden for the maintainers to support this?

It's not a very big change so the burden should be fairly small.

Related issue number

There's no specific issue for this PR but the following issues are somewhat related:
#7508
#7320

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new news fragment into the CHANGES/ folder
    • name it <issue_or_pr_num>.<type>.rst (e.g. 588.bugfix.rst)

    • if you don't have an issue number, change it to the pull request
      number after creating the PR

      • .bugfix: A bug fix for something the maintainers deemed an
        improper undesired behavior that got corrected to match
        pre-agreed expectations.
      • .feature: A new behavior, public APIs. That sort of stuff.
      • .deprecation: A declaration of future API removals and breaking
        changes in behavior.
      • .breaking: When something public is removed in a breaking way.
        Could be deprecated in an earlier release.
      • .doc: Notable updates to the documentation structure or build
        process.
      • .packaging: Notes for downstreams about unobvious side effects
        and tooling. Changes in the test invocation considerations and
        runtime assumptions.
      • .contrib: Stuff that affects the contributor experience. e.g.
        Running tests, building the docs, setting up the development
        environment.
      • .misc: Changes that are hard to assign to any of the above
        categories.
    • Make sure to use full sentences with correct case and punctuation,
      for example:

      Fixed issue with non-ascii contents in doctest text files
      -- by :user:`contributor-gh-handle`.

      Use the past tense or the present tense a non-imperative mood,
      referring to what's changed compared to the last released version
      of this project.

Michael Stegmaier and others added 6 commits March 7, 2025 10:16
Currently, if receive is never called, PONG messages aren't handled and aiohttp closes the connection for missing PONG after half of the heartbeat time has passed.
This commit changes the code that closes the connection for missing PONG so that instead of immediately closing the connection, it checks whether a PONG has been sent. If a PONG has been sent, the connection doesn't get closed.
Handle exceptions in pong_not_received_coro() analogous to receive()
@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Mar 13, 2025
@mstegmaier mstegmaier marked this pull request as ready for review March 13, 2025 09:35
Copy link

codspeed-hq bot commented Mar 13, 2025

CodSpeed Performance Report

Merging #10544 will not alter performance

Comparing mstegmaier:master (29a439e) with master (45b861f)

Summary

✅ 46 untouched benchmarks

Copy link

codecov bot commented Mar 13, 2025

Codecov Report

Attention: Patch coverage is 54.54545% with 15 lines in your changes missing coverage. Please review.

Project coverage is 98.65%. Comparing base (6d205da) to head (8f3de6f).
Report is 3 commits behind head on master.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
aiohttp/web_ws.py 54.54% 12 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10544      +/-   ##
==========================================
- Coverage   98.70%   98.65%   -0.05%     
==========================================
  Files         122      122              
  Lines       37230    37262      +32     
  Branches     2064     2068       +4     
==========================================
+ Hits        36748    36762      +14     
- Misses        335      350      +15     
- Partials      147      150       +3     
Flag Coverage Δ
CI-GHA 98.54% <54.54%> (-0.05%) ⬇️
OS-Linux 98.20% <54.54%> (-0.05%) ⬇️
OS-Windows 96.14% <54.54%> (-0.05%) ⬇️
OS-macOS 97.33% <54.54%> (-0.03%) ⬇️
Py-3.10.11 97.22% <54.54%> (-0.05%) ⬇️
Py-3.10.16 97.77% <54.54%> (-0.04%) ⬇️
Py-3.11.11 97.85% <54.54%> (-0.05%) ⬇️
Py-3.11.9 97.31% <54.54%> (-0.04%) ⬇️
Py-3.12.8 ?
Py-3.12.9 98.31% <51.51%> (-0.05%) ⬇️
Py-3.13.1 ?
Py-3.13.2 98.29% <51.51%> (-0.05%) ⬇️
Py-3.9.13 97.11% <54.54%> (-0.05%) ⬇️
Py-3.9.21 97.64% <54.54%> (-0.04%) ⬇️
Py-pypy7.3.16 86.63% <51.51%> (-4.98%) ⬇️
VM-macos 97.33% <54.54%> (-0.03%) ⬇️
VM-ubuntu 98.20% <54.54%> (-0.05%) ⬇️
VM-windows 96.14% <54.54%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +215 to +217
async with async_timeout.timeout(self._pong_heartbeat / 10.0):
msg = await reader.read()
self._reset_heartbeat()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is adding a race condition where we would violate the concurrent calls to reading requirement:

raise RuntimeError("Concurrent call to receive() is not allowed")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be solved by the latest commit.

mstegmaier and others added 2 commits March 14, 2025 17:01
If self._waiting is set, we already are in the receive loop reading from the reader and would have read a PONG if one was already there.

This avoids concurrent reads on the reader.
Copy link
Member

@bdraco bdraco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still the possibility of concurrent reads with this implementation.

However there is also another problem as reading here can consume any message type which means a call to receive after this may find their the message they are looking for missing because it was read and discarded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear this would have a noticeable benefit to the user. They are likely to still get the same result regardless, this just seems to catch a (probably rare) edge case that might reduce the frequency of connection closes very slightly.

assert reader is not None
try:
async with async_timeout.timeout(self._pong_heartbeat / 10.0):
msg = await reader.read()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for a user to catch the exception or anything and avoid the connection close? If it is, then this would break their code, as we've now read a message that they should be processing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the same point as #10544 (review)

self._reset_heartbeat()
if msg.type is not WSMsgType.PONG:
ws_logger.warning(
f"Received {msg} while waiting for PONG. It seems like you haven't called `receive` within {self._pong_heartbeat} seconds."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't exactly accurate. The user's code could be reading a message with receive() every second, but if there are more than pong_heartbeat messages between PONGs, then it would still get tripped up here.

@Dreamsorcerer
Copy link
Member

I suspect the complexity of this change and the likelihood of it breaking something, combined with it likely only making a very slight (if any) reduction in closes will probably mean this won't get merged.

As mentioned in the linked discussion, a doc PR better explaining these limitations and a code sample to avoid them if memory issues are not a concern would be appreciated instead.

@Dreamsorcerer
Copy link
Member

it checks whether a PONG has been sent. If a PONG has been sent, the connection doesn't get closed.

Maybe there's some confusion here, but this description is not correct. I would describe the change as:

If code has failed to call receive() in a timely manner and a PONG message is at the top of the buffer, the connection will no longer be closed.

If code is calling receive() normally, then this change likely has zero impact. Either the receive() call is happening frequently enough that it processes the PONG messages and this code is never run, or the receive() call is too slow in processing lots of messages in which case this code will likely receive another non-PONG message when running and still close the connection (the chances that the next message just happens to be a PONG when overloaded seems extremely slim unless the user is doing something odd).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bot:chronographer:provided There is a change note present in this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants