Performance issues caused by abnormal HTTP/2 SETTINGS frame exchange #11958
Comments
(2) is easy and can be done. (1) would involve introducing a new sub-channel connectivity state, since it is a non-transient failure, plus a way of marking pending transports while they are waiting for the SETTINGS frame and a timeout mechanism for when they don't receive one. It will need to be discussed.
ServerImpl has a handshakeTimeoutMillis that it uses to kill connections that take too long to go READY. For the client side we should have such a timeout as well, but it would need to grow as the backoff increases, and that got tied up in #1943, where people couldn't agree on how it was supposed to work. I feel confident C has a handshaking timeout on the client side. When such a timeout triggers, the connection is killed and goes TRANSIENT_FAILURE (because the connection never went READY).
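For reference, a minimal sketch of how the server-side handshake timeout mentioned above is configured through the public builder API, with per-RPC deadlines noted as the usual client-side stopgap; the port, address, and timeout values are illustrative, not from this issue:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Server;
import io.grpc.ServerBuilder;

public final class HandshakeTimeoutSketch {
  public static void main(String[] args) throws Exception {
    // Server side: the builder-level knob for the handshakeTimeoutMillis
    // mentioned above; connections whose handshake (including the HTTP/2
    // SETTINGS exchange) does not complete in time are killed.
    Server server = ServerBuilder.forPort(50051)
        .handshakeTimeout(30, TimeUnit.SECONDS)
        .build()
        .start();

    // Client side: there is no equivalent handshake timeout today, so
    // attaching deadlines to RPCs (stub.withDeadlineAfter(...)) is the usual
    // way to keep calls from waiting forever on a channel that never goes READY.
    ManagedChannel channel = ManagedChannelBuilder
        .forAddress("localhost", 50051)
        .usePlaintext()
        .build();

    channel.shutdownNow();
    server.shutdownNow();
  }
}
```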
Normally removeAll() on a LinkedHashSet would loop over the passed list and remove items from itself. But when the set is not larger than the passed collection, AbstractSet.removeAll instead iterates over the set and calls contains() on the ArrayList for each element, so it is worth writing the loop explicitly:

```java
for (PendingStream stream : toRemove) {
  pendingStreams.remove(stream);
}
```

(Really, in this case we should probably throw the old set away completely and replace it with a new set, because it will have a high water mark and waste memory.)
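As an illustration of that last suggestion, here is a rough standalone sketch of rebuilding the set instead of mutating it in place. It is not the real DelayedClientTransport code; the class, field, and method names are simplified stand-ins:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Stand-in for the real inner PendingStream class.
class PendingStream {}

class DelayedTransportSketch {
  private Collection<PendingStream> pendingStreams = new LinkedHashSet<>();

  // Called with the streams that reprocess() decided to hand off.
  void removeReassigned(Collection<PendingStream> toRemove) {
    // Copy toRemove into a HashSet so membership checks are O(1), then build
    // a fresh LinkedHashSet containing only the survivors. The old set, with
    // its high-water-mark capacity, becomes garbage.
    Set<PendingStream> removed = new HashSet<>(toRemove);
    Collection<PendingStream> remaining = new LinkedHashSet<>();
    for (PendingStream stream : pendingStreams) {
      if (!removed.contains(stream)) {
        remaining.add(stream);
      }
    }
    pendingStreams = remaining;
  }
}
```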
Title: Performance issues caused by abnormal HTTP/2 SETTINGS frame exchange
We experienced two problems in gRPC communication between our Java client and Python server:
1. HTTP/2 SETTINGS frame exchange issue
When creating the channel in the Java gRPC client, the client must wait for the server to return an HTTP/2 SETTINGS frame before updating the channel state to READY (this is specifically implemented in `io.grpc.netty.NettyClientHandler.FrameListener#onSettingsRead`). However, due to a version compatibility problem in our Python gRPC server, the server fails to correctly return the SETTINGS frame, causing the client's channel to remain in the CONNECTING state for a long time.
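As a side note, a small standalone sketch (not part of the issue report; the target address is illustrative) of how a channel stuck in CONNECTING can be observed from application code using the connectivity-state API:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class ChannelStateWatch {
  public static void main(String[] args) throws Exception {
    ManagedChannel channel = ManagedChannelBuilder
        .forAddress("localhost", 50051)   // illustrative target
        .usePlaintext()
        .build();

    // Ask the channel to start connecting and read its current state.
    ConnectivityState state = channel.getState(/* requestConnection= */ true);
    System.out.println("initial state: " + state);

    // Register a callback for the next state change; if the server never
    // sends its SETTINGS frame, the channel stays CONNECTING and never
    // reports READY here.
    channel.notifyWhenStateChanged(state,
        () -> System.out.println("state now: " + channel.getState(false)));

    TimeUnit.SECONDS.sleep(5);
    channel.shutdownNow();
  }
}
```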
2. Performance bottleneck analysis
In this state, if we send a large number of RPC requests without deadlines, these requests are accumulated as pending streams and stored in the `pendingStreams` property of `DelayedClientTransport`. In our environment, about one million such requests were queued. When the TCP connection fails to establish, `io.grpc.internal.DelayedClientTransport#reprocess` is triggered. Inside this method, the system calls `pendingStreams.removeAll(toRemove)`. Since `toRemove` is implemented as an ArrayList rather than a Set, this results in O(n²) complexity. When n is very large (e.g., around one million), it blocks the IO thread for around 30 minutes, causing severe IO thread stalls.
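To make the complexity difference concrete, here is a small self-contained Java demonstration, not gRPC code, of removeAll() on a LinkedHashSet with an ArrayList argument versus a HashSet argument; the element count is chosen so the quadratic case finishes in seconds rather than minutes:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class RemoveAllDemo {
  public static void main(String[] args) {
    int n = 50_000;

    Set<Integer> pending = new LinkedHashSet<>();
    List<Integer> toRemove = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      pending.add(i);
      toRemove.add(i);
    }

    // AbstractSet.removeAll iterates the set and calls contains() on the
    // argument when the set is not larger than it; contains() on an
    // ArrayList is O(n), so this path is O(n^2) overall.
    long start = System.nanoTime();
    pending.removeAll(toRemove);
    System.out.printf("ArrayList argument: %d ms%n",
        (System.nanoTime() - start) / 1_000_000);

    // Rebuild and retry with a Set argument: contains() is O(1), so the
    // same operation is roughly O(n).
    for (int i = 0; i < n; i++) {
      pending.add(i);
    }
    start = System.nanoTime();
    pending.removeAll(new HashSet<>(toRemove));
    System.out.printf("HashSet argument:   %d ms%n",
        (System.nanoTime() - start) / 1_000_000);
  }
}
```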
Suggestions for optimization
Based on these observations, we propose a couple of potential improvements for grpc-java:
1. Handle the case where the server never returns its SETTINGS frame, for example by timing out and treating it as a connection failure instead of leaving the channel in CONNECTING indefinitely.
2. Change the `toRemove` collection in `DelayedClientTransport#reprocess` to a Set implementation, reducing the complexity from O(n²) to O(n).