Performance issues caused by abnormal HTTP/2 SETTINGS frame exchange #11958


Open
lcz1998 opened this issue Mar 12, 2025 · 3 comments

Comments

@lcz1998

lcz1998 commented Mar 12, 2025

We experienced two problems in gRPC communication between our Java client and Python server:

  • Java client version: 1.51.0
  • Python server version: 1.70

1. HTTP/2 SETTINGS frame exchange issue

When creating the channel in the Java gRPC client, the client must wait for the server to return an HTTP/2 SETTINGS frame before updating the channel state to READY (this is specifically implemented in io.grpc.netty.NettyClientHandler.FrameListener#onSettingsRead). However, due to a version compatibility problem in our Python gRPC server, the server fails to correctly return the SETTINGS frame, causing the client's channel to remain in the CONNECTING state for a long time.

2. Performance bottleneck analysis

In this state, if we send a large number of RPC requests without deadlines, these requests are accumulated as pending streams and stored in the pendingStreams property of DelayedClientTransport. In our environment, about one million such requests were queued.

When the TCP connection fails to establish, io.grpc.internal.DelayedClientTransport#reprocess is triggered. Inside this method, the system calls pendingStreams.removeAll(toRemove). Since toRemove is an ArrayList rather than a Set, each contains() lookup during the removal scans the whole list, so the call degrades to O(n²). When n is very large (around one million in our case), this blocks the IO thread for roughly 30 minutes, causing severe IO thread stalls.
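The quadratic behavior can be reproduced with a minimal pure-JDK model (this is not grpc-java code; Integer stands in for PendingStream, and the names pendingStreams/toRemove mirror the issue). AbstractSet.removeAll iterates whichever side is smaller; when the set is not strictly larger than the argument, it calls contains() on the argument for every set element, which is O(n) on an ArrayList:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RemoveAllDemo {
    // Models DelayedClientTransport#reprocess: pendingStreams is a LinkedHashSet,
    // toRemove holds the streams to drop. Returns true if both removal variants
    // emptied the set.
    static boolean demo(int n) {
        Set<Integer> pendingStreams = new LinkedHashSet<>();
        List<Integer> toRemoveList = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            pendingStreams.add(i);
            toRemoveList.add(i);
        }

        // Since pendingStreams.size() <= toRemoveList.size(), AbstractSet.removeAll
        // iterates the set and calls toRemoveList.contains() for every element:
        // O(n) per lookup on an ArrayList, O(n^2) overall.
        long t0 = System.nanoTime();
        pendingStreams.removeAll(toRemoveList);
        long slowNs = System.nanoTime() - t0;

        // Same removal with a HashSet argument: contains() is O(1), overall O(n).
        Set<Integer> pendingStreams2 = new LinkedHashSet<>();
        for (int i = 0; i < n; i++) {
            pendingStreams2.add(i);
        }
        long t1 = System.nanoTime();
        pendingStreams2.removeAll(new HashSet<>(toRemoveList));
        long fastNs = System.nanoTime() - t1;

        System.out.printf("n=%d  ArrayList arg: %d ms  HashSet arg: %d ms%n",
            n, slowNs / 1_000_000, fastNs / 1_000_000);
        return pendingStreams.isEmpty() && pendingStreams2.isEmpty();
    }

    public static void main(String[] args) {
        demo(20_000);
    }
}
```

Even at n = 20,000 the gap between the two argument types is visible; at one million pending streams the ArrayList path becomes catastrophic.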

Suggestions for optimization

Based on these observations, we propose a couple of potential improvements for grpc-java:

  1. Introduce a maximum waiting time for the HTTP/2 SETTINGS frame exchange to avoid waiting indefinitely for incompatible server responses.
  2. Change the toRemove collection in DelayedClientTransport#reprocess to a Set implementation, reducing the complexity from O(n²) to O(n).
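Suggestion (1) could be sketched with a plain ScheduledExecutorService: arm a timer when the transport starts waiting for the server's SETTINGS frame, and fail the connection if the frame never arrives. This is a hypothetical illustration, not grpc-java code; the class, method names, and string-valued states are inventions for this sketch:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class SettingsTimeoutSketch {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    // Mirrors gRPC connectivity states by name only.
    final AtomicReference<String> state = new AtomicReference<>("CONNECTING");
    private volatile ScheduledFuture<?> timer;

    // Arm the timer once the TCP connection is up and the client starts
    // waiting for the server's SETTINGS frame.
    void startHandshakeTimer(long timeoutMillis) {
        timer = scheduler.schedule(() -> {
            // Still no SETTINGS frame: fail the connection instead of
            // leaving the channel in CONNECTING indefinitely.
            state.compareAndSet("CONNECTING", "TRANSIENT_FAILURE");
        }, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Analogue of NettyClientHandler.FrameListener#onSettingsRead.
    void onSettingsRead() {
        if (timer != null) {
            timer.cancel(false);
        }
        state.compareAndSet("CONNECTING", "READY");
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

The compareAndSet calls make the race between frame arrival and timer expiry benign: whichever transition happens first wins, and the loser is a no-op.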
@kannanjgithub
Contributor

(2) is easy and can be done. (1) would involve introducing a new sub-channel connectivity state, since this is a non-transient failure, plus a way of marking pending transports when they start waiting for the SETTINGS frame and a timeout mechanism for when they don't receive one. It will need to be discussed.

@kannanjgithub kannanjgithub added this to the Next milestone Mar 19, 2025
@ejona86
Member

ejona86 commented Mar 31, 2025

ServerImpl has a handshakeTimeoutMillis that it uses to kill connections that take too long to go READY. For the client side, we should have such a timeout as well, but it needed to grow as the backoff increases, and that got tied up in #1943, where people couldn't agree on how it was supposed to work. We knew that TryConnect() from connection-backoff.md includes the initial SETTINGS, but couldn't agree on other parts. With dual-stack and happy eyeballs, much of that has since been resolved.

I feel confident C has a handshaking timeout on the client side. When such a timeout triggers, the connection is killed and goes to TRANSIENT_FAILURE (because the connection never went READY).

@ejona86
Member

ejona86 commented Apr 3, 2025

Since toRemove is implemented as an ArrayList rather than a Set, it results in an O(n²) complexity.

Normally removeAll() on a LinkedHashSet would loop over the argument list and remove those items from itself. But when toRemove is equal in size to pendingStreams, it chooses the slow toRemove.contains() path instead. We can change the code to be:

for (PendingStream stream : toRemove) {
  pendingStreams.remove(stream);
}

(Really, in this case we should probably throw the old set away completely and replace it with a new set, because it will have a high water mark and waste memory. That would also let us use an ordinary HashMap instead of a LinkedHashMap, because we would be guaranteeing it isn't horribly over-sized. But that is more annoying than it should be; we either have to keep track of the maximum size manually or make the keys comparable to use TreeSet. Edit: We are doing that already, with the assumption that it will likely become completely empty.)
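A self-contained model of the per-element loop above (again with Integer standing in for PendingStream; this is an illustration, not the grpc-java source): removing each element individually performs n O(1) hash removals, so the cost stays O(n) regardless of what kind of collection toRemove is.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ReprocessLoopModel {
    // Per-element removal: never falls into the removeAll() branch that calls
    // contains() on the argument collection, so an ArrayList argument is fine.
    static Set<Integer> removeEach(Set<Integer> pendingStreams, List<Integer> toRemove) {
        for (Integer stream : toRemove) {
            pendingStreams.remove(stream); // O(1) hash removal per element
        }
        return pendingStreams;
    }

    public static void main(String[] args) {
        Set<Integer> pending = new LinkedHashSet<>(Arrays.asList(0, 1, 2, 3, 4, 5));
        removeEach(pending, Arrays.asList(0, 1, 2));
        System.out.println(pending); // streams that were not removed remain, in insertion order
    }
}
```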

ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 3, 2025