
Fix for unexpected socket closures and data leakage under heavy load #646

Open
wants to merge 1 commit into base: master

Conversation

todddialpad

This is to address issue #645, as well as aiohttp/aiohappyeyeballs#93 and aiohttp/aiohappyeyeballs#112.

Diff hunk under review:

      # libuv will make socket non-blocking
    - tr._open(sock.fileno())
    + tr._open(sockfd)
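For context, the hunk above appears to replace a call that handed the transport the caller's own descriptor (sock.fileno()) with one that hands it a separately prepared descriptor (sockfd). A minimal sketch of one way such a descriptor could be obtained, assuming a dup()-based ownership split; this is an illustration only, not necessarily what the PR actually does:

    import os
    import socket

    def fd_for_transport(sock: socket.socket) -> int:
        # Hypothetical helper: duplicate the caller's descriptor so the event
        # loop owns an fd whose lifetime is independent of the Python socket
        # object. The loop closing its copy cannot invalidate `sock`, and a
        # later sock.close() cannot clobber the loop's copy.
        return os.dup(sock.fileno())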
Member

The approach looks correct -- but I'm wondering how vanilla asyncio handles the same thing?

@todddialpad
Author

I think vanilla asyncio has an easier problem: it can keep Python socket objects "all the way down" and simply let reference counting take care of cleanup, whereas here we have to manage the disconnect between Python socket objects and libuv, which deals in raw file descriptors. I suspect there is some error-handling path where a file descriptor is closed while the Python socket object remains alive and not detached, so when that object is finally closed it messes up any new socket that happens to have been assigned the same file descriptor.
E.g. create socket s, call a loop method passing in the explicit socket, hit <a bad error path that ends with sock.close()> overlapping with an .accept(). I think the .accept() never results in a Python socket object being created.

So, between the methods that accept Python sockets and the methods that internally work directly with file descriptors, can there be a discrepancy?
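A minimal, hypothetical sketch (plain Python, not uvloop internals) of the fd-reuse hazard described above, under the assumption that an error path closes the raw descriptor while the socket object stays alive:

    import os
    import socket

    # 1. A socket object and its OS-level file descriptor.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    fd = s.fileno()

    # 2. Some error path closes the raw fd directly (as libuv might), while
    #    the Python object is neither closed nor detached.
    os.close(fd)

    # 3. A new socket is created; the kernel typically hands back the lowest
    #    free descriptor, i.e. the same number.
    s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("fd reused:", s2.fileno() == fd)

    # 4. Closing (or garbage-collecting) the stale object now closes s2's
    #    descriptor, silently breaking a healthy, unrelated socket.
    s.close()
    try:
        s2.getsockname()
    except OSError as exc:
        print("stale close clobbered the new socket:", exc)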

@MarkusSintonen

@todddialpad very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets?

Any possibility of adding a test here?

@todddialpad
Author

> @todddialpad very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets?
>
> Any possibility of adding a test here?

I am trying to get a stable test. It is tricky because, if my guess is correct, it is a race condition: TLS negotiation during a call to loop.create_connection with an explicit socket gets cancelled, and a subsequent incoming connection is accepted before the CancelledError is propagated. I think the underlying file descriptor then gets closed twice, first by libuv (or uvloop) and then by aiohttp.

So if this is the case, I don't think this will fix issue #506, which could have a similar but distinct root cause.
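A rough, purely illustrative sketch of the suspected scenario (host, port, and timeout are assumptions): loop.create_connection is given an explicit, already-connected socket plus TLS, and a short timeout cancels it while the handshake is still in flight.

    import asyncio
    import socket
    import ssl

    async def tls_connect_with_timeout(host: str, port: int = 443) -> None:
        loop = asyncio.get_running_loop()
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        await loop.sock_connect(sock, (host, port))
        ctx = ssl.create_default_context()
        try:
            # If this is cancelled mid-handshake, the suspicion is that the
            # underlying fd can end up closed twice: once by libuv/uvloop and
            # once by the caller's cleanup.
            await asyncio.wait_for(
                loop.create_connection(
                    asyncio.Protocol, sock=sock, ssl=ctx, server_hostname=host
                ),
                timeout=1.0,
            )
        except asyncio.TimeoutError:
            pass  # the interesting part is what happens to the fd afterwards

    # e.g. asyncio.run(tls_connect_with_timeout("example.com")) with uvloop installed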

@MarkusSintonen

> So if this is the case, I don't think this will fix issue #506, which could have a similar but distinct root cause.

Ok, I see. The linked issue was also concerning, as it looked like data was being written into the wrong socket. That error also showed up at around the same times we observed response data leaking into incorrect requests, but we don't know whether that issue is actually related to the data leakage or is something else entirely. (These RuntimeErrors don't happen with vanilla asyncio.)

@todddialpad
Author

> @todddialpad very nice! Do you know whether it helps with the other issue, #506, which also seems to be related to incorrect sharing of sockets?
>
> Any possibility of adding a test here?

> I am trying to get a stable test. It is tricky because, if my guess is correct, it is a race condition: TLS negotiation during a call to loop.create_connection with an explicit socket gets cancelled, and a subsequent incoming connection is accepted before the CancelledError is propagated. I think the underlying file descriptor then gets closed twice, first by libuv (or uvloop) and then by aiohttp.
>
> So if this is the case, I don't think this will fix issue #506, which could have a similar but distinct root cause.

I still haven't been able to isolate a standalone, self-contained test. The test environment in which I reproduced the same error we see in production involves 2 VMs with significant network latency between them. The first VM is just a web server; the second is a web server that accepts requests and then makes outgoing client requests (using aiohttp) to the first web server over TLS with a short timeout (around 1 second).

With this setup, I quite reliably get a failure within 250 connections. With this patch applied, I have never had a failure in 20,000 connections.

We have also run this in our production environment. When we first encountered this failure, we hit it within 1 hour of moving to aiohttp >= 3.10. Since deploying this patch we have been running for 5 days with no failures.
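For reference, a hypothetical sketch of the client side of that setup (URL, request count, and timeout are assumptions, not the actual test code): many TLS requests with a ~1 second total timeout against the high-latency server, so that handshakes frequently get cancelled mid-flight.

    import asyncio
    import aiohttp
    import uvloop

    TARGET = "https://first-vm.example.invalid/"  # hypothetical high-latency server

    async def hammer(n: int = 20_000) -> None:
        timeout = aiohttp.ClientTimeout(total=1)  # short timeout forces cancellations
        async with aiohttp.ClientSession(timeout=timeout) as session:
            for _ in range(n):
                try:
                    async with session.get(TARGET) as resp:
                        await resp.read()
                except (asyncio.TimeoutError, aiohttp.ClientError):
                    # Timeouts are expected here; the failure mode under
                    # investigation shows up as closed or cross-wired sockets.
                    pass

    if __name__ == "__main__":
        uvloop.install()
        asyncio.run(hammer())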

@todddialpad
Author

Is accepting this blocked on the tests that are failing? I don't think those failures are related to this change, as they are also failing for PR #644, which is solely a documentation change.

I looked at the test logs and I would guess that a dependency is causing the changed results. Related to this, I notice that the failing tests use an alpha release of Cython 3.1 (Using cached Cython-3.1.0a1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata). Is this intentional?
The last test run that passed used the release version (Using cached Cython-3.0.11-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata).
