
Fix for unexpected socket closures and data leakage under heavy load #646

Open
wants to merge 1 commit into base: master

Conversation

todddialpad

This is to address issue #645, as well as aiohttp/aiohappyeyeballs#93 and aiohttp/aiohappyeyeballs#112.

Diff hunk under review:

      # libuv will make socket non-blocking
    - tr._open(sock.fileno())
    + tr._open(sockfd)
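For context, the hunk above appears to replace a call that handed the transport the caller's own descriptor (sock.fileno()) with one that hands it a separately prepared descriptor (sockfd). A minimal sketch of one way such a descriptor could be obtained, assuming a dup()-based ownership split; this is an illustration only, not necessarily what the PR actually does:

    import os
    import socket

    def fd_for_transport(sock: socket.socket) -> int:
        # Hypothetical helper: duplicate the caller's descriptor so the event
        # loop owns an fd whose lifetime is independent of the Python socket
        # object. The loop closing its copy cannot invalidate `sock`, and a
        # later sock.close() cannot clobber the loop's copy.
        return os.dup(sock.fileno())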
Member

The approach looks correct -- but I'm wondering how vanilla asyncio handles the same thing?

@todddialpad
Author

I think vanilla asyncio has an easier problem: it can keep Python socket objects "all the way down" and simply let reference counting take care of cleanup, whereas here we have to manage the disconnect between Python socket objects and libuv, which deals in raw file descriptors. I suspect there is some error-handling path where a file descriptor is closed while the Python socket object remains alive and not detached, so when that object is finally closed it messes up any new socket that happens to have been assigned the same file descriptor.
E.g. create socket s, call a loop method passing in the explicit socket, hit <a bad error path that ends with sock.close()> overlapping with an .accept(). I think the .accept() never results in a Python socket object being created.

So, between the methods that accept Python sockets and the methods that internally work directly with file descriptors, can there be a discrepancy?
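A minimal, hypothetical sketch (plain Python, not uvloop internals) of the fd-reuse hazard described above, under the assumption that an error path closes the raw descriptor while the socket object stays alive:

    import os
    import socket

    # 1. A socket object and its OS-level file descriptor.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    fd = s.fileno()

    # 2. Some error path closes the raw fd directly (as libuv might), while
    #    the Python object is neither closed nor detached.
    os.close(fd)

    # 3. A new socket is created; the kernel typically hands back the lowest
    #    free descriptor, i.e. the same number.
    s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("fd reused:", s2.fileno() == fd)

    # 4. Closing (or garbage-collecting) the stale object now closes s2's
    #    descriptor, silently breaking a healthy, unrelated socket.
    s.close()
    try:
        s2.getsockname()
    except OSError as exc:
        print("stale close clobbered the new socket:", exc)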

@MarkusSintonen

@todddialpad very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets?

Any possibility of adding a test here?

@todddialpad
Author

> @todddialpad very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets?
>
> Any possibility of adding a test here?

I am trying to get a stable test. It is tricky because, if my guess is correct, it is a race condition: TLS negotiation during a call to loop.create_connection with an explicit socket gets cancelled, and a subsequent incoming connection is accepted before the CancelledError is propagated. I think the underlying file descriptor then gets closed twice, first by libuv (or uvloop) and then by aiohttp.

So if this is the case, I don't think this will fix issue #506, which could have a similar but distinct root cause.
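A rough, purely illustrative sketch of the suspected scenario (host, port, and timeout are assumptions): loop.create_connection is given an explicit, already-connected socket plus TLS, and a short timeout cancels it while the handshake is still in flight.

    import asyncio
    import socket
    import ssl

    async def tls_connect_with_timeout(host: str, port: int = 443) -> None:
        loop = asyncio.get_running_loop()
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        await loop.sock_connect(sock, (host, port))
        ctx = ssl.create_default_context()
        try:
            # If this is cancelled mid-handshake, the suspicion is that the
            # underlying fd can end up closed twice: once by libuv/uvloop and
            # once by the caller's cleanup.
            await asyncio.wait_for(
                loop.create_connection(
                    asyncio.Protocol, sock=sock, ssl=ctx, server_hostname=host
                ),
                timeout=1.0,
            )
        except asyncio.TimeoutError:
            pass  # the interesting part is what happens to the fd afterwards

    # e.g. asyncio.run(tls_connect_with_timeout("example.com")) with uvloop installed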

@MarkusSintonen

> So if this is the case, I don't think this will fix issue #506, which could have a similar but distinct root cause.

Ok, I see. The linked issue was also concerning, as it looked like data was being written into the wrong socket. That error also showed up at around the same times we observed response data leaking into incorrect requests, but we don't know whether that issue is actually related to the data leakage or is something else entirely. (These RuntimeErrors don't happen with vanilla asyncio.)

@todddialpad
Author

> @todddialpad very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets?
>
> Any possibility of adding a test here?

> I am trying to get a stable test. It is tricky because, if my guess is correct, it is a race condition: TLS negotiation during a call to loop.create_connection with an explicit socket gets cancelled, and a subsequent incoming connection is accepted before the CancelledError is propagated. I think the underlying file descriptor then gets closed twice, first by libuv (or uvloop) and then by aiohttp.
>
> So if this is the case, I don't think this will fix issue #506, which could have a similar but distinct root cause.

I still haven't been able to isolate a standalone, self-contained test. The test environment in which I reproduced the same error we see in production involves 2 VMs with significant network latency between them. The first VM is just a web server; the second is a web server that accepts requests and then makes outgoing client requests (using aiohttp) to the first web server over TLS with a short timeout (around 1 second).

With this setup, I quite reliably get a failure within 250 connections. With this patch applied, I have never had a failure in 20,000 connections.

We have also run this in our production environment. When we first encountered this failure, we hit it within 1 hour of moving to aiohttp >= 3.10. Since deploying this patch we have been running for 5 days with no failures.
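For reference, a hypothetical sketch of the client side of that setup (URL, request count, and timeout are assumptions, not the actual test code): many TLS requests with a ~1 second total timeout against the high-latency server, so that handshakes frequently get cancelled mid-flight.

    import asyncio
    import aiohttp
    import uvloop

    TARGET = "https://first-vm.example.invalid/"  # hypothetical high-latency server

    async def hammer(n: int = 20_000) -> None:
        timeout = aiohttp.ClientTimeout(total=1)  # short timeout forces cancellations
        async with aiohttp.ClientSession(timeout=timeout) as session:
            for _ in range(n):
                try:
                    async with session.get(TARGET) as resp:
                        await resp.read()
                except (asyncio.TimeoutError, aiohttp.ClientError):
                    # Timeouts are expected here; the failure mode under
                    # investigation shows up as closed or cross-wired sockets.
                    pass

    if __name__ == "__main__":
        uvloop.install()
        asyncio.run(hammer())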

@todddialpad
Author

Is accepting this blocked on the tests that are failing? I don't think those failures are related to this change, as they are also failing for PR #644, which is solely a documentation change.

I looked at the test logs and I would guess that a dependency is causing the changed results. Related to this, I notice that the failing tests use an alpha release of Cython 3.1 (Using cached Cython-3.1.0a1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata). Is this intentional?
The last test run that passed used the release version (Using cached Cython-3.0.11-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata).
