Fix for unexpected socket closures and data leakage under heavy load #646
base: master
Conversation
```diff
  # libuv will make socket non-blocking
- tr._open(sock.fileno())
+ tr._open(sockfd)
```
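For context, the hunk stops passing `sock.fileno()` directly and instead hands the transport a separate `sockfd`. Below is a minimal sketch of why a separate descriptor matters, assuming `sockfd` is obtained with `os.dup()`; this is an illustration of the general pattern, not necessarily what the patch itself does.

```python
import os
import socket

# Sketch only (not uvloop's implementation): give the native layer its own
# duplicated descriptor, so closing the Python socket object cannot
# invalidate the descriptor libuv is polling, and vice versa.
sock = socket.socket()
sockfd = os.dup(sock.fileno())   # assumed origin of `sockfd`

# ... the transport would be opened on `sockfd` here ...

sock.close()      # the caller-owned socket object can go away safely
os.close(sockfd)  # the duplicate is released independently
```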
The approach looks correct -- but I'm wondering how vanilla asyncio handles the same thing?
I think vanilla asyncio has an easier problem in that it can have Python sockets "all the way down" and just let reference counting take care of cleanup, while here we need to manage the handoff to libuv, which deals in file descriptors. I suspect there is some error-handling path where a file descriptor is closed while the Python socket object remains alive and not detached, so when that object is finally closed, it messes up any new socket that happens to have been assigned the same file descriptor.
For example: create socket `s`, call a loop method passing in that explicit socket, and hit a bad error path that ends with `sock.close()`, overlapping with an `.accept`. I think the `.accept` never results in a Python socket object being created. So between the methods that accept sockets and the other methods that internally work directly with file descriptors, can there be a discrepancy?
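To make the suspected failure mode concrete, here is a small standalone sketch in plain Python (not uvloop code) of how closing a descriptor behind a live socket object can corrupt an unrelated socket that reuses the same fd number:

```python
import os
import socket

# Standalone sketch of the suspected failure mode: a descriptor is closed
# behind a live socket.socket object, the kernel reuses the same fd number
# for a new socket, and the stale object's close() then tears down the
# unrelated new socket.

s = socket.socket()          # Python socket object owning fd N
fd = s.fileno()

os.close(fd)                 # fd N closed directly (e.g. by native code),
                             # while `s` is neither closed nor detached

s2 = socket.socket()         # the kernel hands out the lowest free descriptor,
print(s2.fileno() == fd)     # so this typically prints True (fd N was reused)

s.close()                    # closing the stale object now closes s2's fd;
                             # further use of s2 fails or hits the wrong peer
```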
@todddialpad Very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets? Any possibility of adding a test here?
I am trying to get a stable test. It is tricky because, if my guess is correct, it is a race condition: I think it is a race involving TLS negotiation during one of the affected loop calls. So if this is the case, I don't think this will fix issue #506, which could be similar but have a different root cause.
OK, I see. The linked issue was also concerning because it looked as if data was being written into the wrong socket. We observed that error at roughly the same times as we observed response data being leaked to incorrect requests, but we don't know whether that issue is actually related to the data leakage or is something else entirely. (These RuntimeErrors don't happen with vanilla asyncio.)
I still haven't been able to isolate a standalone, self-contained test. The test environment in which I reproduced the same error we see in production involves two VMs with significant network latency between them. The first VM is just a web server; the second is a web server that accepts requests and then makes outgoing client requests (using aiohttp) to the first web server over TLS with a short timeout (around 1 second). With this setup, I quite reliably get a failure within 250 connections. With this patch applied, I have never had a failure in 20,000 connections. We have also run this in our production environment: when we first encountered this failure, we hit it within 1 hour of using aiohttp >= 3.10; since running with this patch, we have gone 5 days with no failures.
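For reference, a rough sketch of the kind of client loop described above. This is not the author's actual harness: the endpoint URL, request count, and exact timeout are placeholders taken from the description, and the server side is assumed to already exist.

```python
# Rough sketch of the reproduction loop described above.
import asyncio

import aiohttp
import uvloop


async def hammer(url: str, n: int = 250) -> None:
    timeout = aiohttp.ClientTimeout(total=1)          # short timeout, as described
    for _ in range(n):
        try:
            async with aiohttp.ClientSession(timeout=timeout) as session:
                async with session.get(url) as resp:
                    await resp.read()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            # Timeouts are expected with high latency; the bug manifests as
            # unexpected socket closures or responses crossing requests.
            pass


if __name__ == "__main__":
    uvloop.install()
    asyncio.run(hammer("https://first-vm.example.test/"))  # hypothetical endpoint
```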
Is accepting this blocked on the tests that are failing? I don't think those failures are related to this change, as they are also failing for PR #644, which is solely a documentation change. I looked at the test logs, and I would guess that a dependency is causing the changed results. Related to this, I notice that an alpha release of Cython 3.1 is being used in the failing tests.
This is to address issue #645 and the problems reported in aiohttp/aiohappyeyeballs#93 and aiohttp/aiohappyeyeballs#112.