transport: log network reconnects with same peer process #128415
Conversation
ClusterConnectionManager now caches the previous ephemeralId (created at process start) of peer nodes in a connection history table when they disconnect. On reconnect, if a peer presents the same ephemeralId as before, this is logged to indicate a likely network failure, since the peer process did not restart. The connection history is trimmed to the current set of peers by NodeConnectionsService.
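A minimal sketch of the history entry this describes, assuming the shape implied by the snippets quoted later in this review (hist.ephemeralId, hist.disconnectCause); the PR's actual class may differ:

import org.elasticsearch.core.Nullable;

final class NodeConnectionHistory {
    final String ephemeralId;        // the peer's process-lifetime id at disconnect time
    @Nullable
    final Exception disconnectCause; // null when the disconnect was graceful

    NodeConnectionHistory(String ephemeralId, @Nullable Exception disconnectCause) {
        this.ephemeralId = ephemeralId;
        this.disconnectCause = disconnectCause;
    }
}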
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
I wasn't able to find a way to test the ClusterConnectionManager's connectionHistory table when integrated through the NodeConnectionsService.
Looking good, just a few questions and minor comments.
/**
 * Keep the connection history for the nodes listed
 */
void retainConnectionHistory(List<DiscoveryNode> nodes);
In the javadoc I think we should mention that we discard history for nodes not in the list? If you know the Set API then it's suggested by the name retain, but if you don't it might not be obvious.
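For instance, the javadoc could read (just a sketch of the suggestion):

/**
 * Keep the connection history for the nodes listed, discarding the history
 * of any node not in the list.
 */
void retainConnectionHistory(List<DiscoveryNode> nodes);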
@@ -120,6 +122,7 @@ public void connectToNodes(DiscoveryNodes discoveryNodes, Runnable onCompletion)
                 runnables.add(connectionTarget.connect(null));
             }
         }
+        transportService.retainConnectionHistory(nodes);
We might be able to use DiscoveryNodes#getAllNodes() rather than building up an auxiliary collection; that might be marginally more efficient? Set#retainAll seems to take a Collection, but we'd need to change the ConnectionManager#retainConnectionHistory interface to accommodate.
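A minimal sketch of that shape, assuming the history is kept in a map keyed by node id (nodeHistory here is an assumption based on the snippets in this review):

// Accept any Collection so DiscoveryNodes#getAllNodes() can be passed directly,
// then trim the map's key set in one call.
void retainConnectionHistory(Collection<DiscoveryNode> nodes) {
    final Set<String> retainedNodeIds = new HashSet<>();
    for (DiscoveryNode node : nodes) {
        retainedNodeIds.add(node.getId());
    }
    nodeHistory.keySet().retainAll(retainedNodeIds);
}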
Do we need a separate collection here at all? We could just pass discoveryNodes around I think.

But also, really this is cleaning out the nodes about which we no longer care, so I think we should be doing this in disconnectFromNodesExcept instead.
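For illustration, that might look roughly like this (the method shape is assumed, not taken from the PR):

// Trim the history in the same place we already disconnect from nodes that
// have left the cluster, passing the DiscoveryNodes-derived collection through.
public void disconnectFromNodesExcept(DiscoveryNodes discoveryNodes) {
    // ... existing logic disconnecting from nodes absent from discoveryNodes ...
    transportService.retainConnectionHistory(discoveryNodes.getAllNodes());
}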
public void onFailure(Exception e) {
    final NodeConnectionHistory hist = new NodeConnectionHistory(node.getEphemeralId(), e);
    nodeHistory.put(conn.getNode().getId(), hist);
}
Do we want to store the connection history even when conn.hasReferences() == false? I'm not 100% familiar with this code, but I wonder if we might get the occasional ungraceful disconnect after we've released all our references? I guess in that case we would eventually discard the entry via retainConnectionHistory anyway.

Do we need to be careful with the timing of calls to retainConnectionHistory versus these close handlers firing? I guess any entries that are added after a purge would not survive subsequent purges.
node.descriptionWithoutAttributes(),
e,
ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING
);
It looks like previously we would only have logged at debug level in this scenario, unless I'm reading it wrong? I'm not sure how interesting this case is (as we were disconnecting from the node anyway).
assertTrue("recent disconnects should be listed", connectionManager.connectionHistorySize() == 2); | ||
|
||
connectionManager.retainConnectionHistory(Collections.emptyList()); | ||
assertTrue("connection history should be emptied", connectionManager.connectionHistorySize() == 0); |
I wonder if it would be better to expose a read-only copy of the map for testing this; that would allow us to assert that the correct IDs were present?
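A sketch of such a test hook (the name and the Map.copyOf choice are assumptions):

// Return an immutable snapshot so tests can assert on the exact node ids
// present without being able to mutate the live map.
Map<String, NodeConnectionHistory> connectionHistoryForTesting() {
    return Map.copyOf(nodeHistory);
}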
I think ClusterConnectionManager isn't quite the right place to do this - the job of this connection manager is to look after all node-to-node connections, including ones used for discovery and remote cluster connections too. There are situations where we might close and re-establish these kinds of connections without either end restarting, and without that being a problem worthy of logging.

NodeConnectionsService is the class that knows about connections to nodes in the cluster. I'd rather we implemented the logging about unexpected reconnects there. That does raise some difficulties about how to expose the exception that closed the connection, if such an exception exists. I did say that this bit would be tricky 😁 Nonetheless I'd rather we got the logging to happen in the right place first and then we can think about the plumbing needed to achieve this extra detail.
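To illustrate the direction being suggested, a very rough sketch (every name beyond NodeConnectionsService is an assumption for illustration, and the exception-plumbing question above is left open):

// Track the last-seen ephemeralId per cluster node id inside
// NodeConnectionsService, which only deals with nodes in the cluster.
private final Map<String, String> lastEphemeralIdByNodeId = new ConcurrentHashMap<>();

private void onNodeDisconnected(DiscoveryNode node) {
    lastEphemeralIdByNodeId.put(node.getId(), node.getEphemeralId());
}

private void onNodeConnected(DiscoveryNode node) {
    final String previousEphemeralId = lastEphemeralIdByNodeId.remove(node.getId());
    if (node.getEphemeralId().equals(previousEphemeralId)) {
        // same ephemeralId: the peer process did not restart, so the earlier
        // disconnect was probably a network failure rather than a restart
        logger.warn("reopened transport connection to node [{}] which did not restart", node.descriptionWithoutAttributes());
    }
}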
value = "org.elasticsearch.transport.ClusterConnectionManager:WARN", | ||
reason = "to ensure we log cluster manager disconnect events on WARN level" | ||
) | ||
public void testExceptionalDisconnectLoggingInClusterConnectionManager() throws Exception { |
Could we put this into its own test suite? This suite is supposed to be about ESLoggingHandler which is unrelated to the logging in ClusterConnectionManager. I think this test should work fine in the :server test suite, no need to hide it in the transport-netty4 module.

Also could you open a separate PR to move testConnectionLogging and testExceptionalDisconnectLogging out of this test suite - they're testing the logging in TcpTransport which is similarly unrelated to ESLoggingHandler. IIRC they were added here for historical reasons, but these days we use the Netty transport everywhere so these should work in :server too.
NodeConnectionHistory hist = nodeHistory.remove(connNode.getId());
if (hist != null && hist.ephemeralId.equals(connNode.getEphemeralId())) {
Could we extract this to a separate method rather than adding to this already over-long and over-nested code directly?

Also I'd rather use nodeConnectionHistory instead of hist. Abbreviated variable names are a hindrance to readers, particularly if they don't have English as a first language, and there's no disadvantage to using the full type name here.

(nit: also it can be final)
if (hist.disconnectCause != null) {
    logger.warn(
        () -> format(
            "transport connection reopened to node with same ephemeralId [%s], close exception:",
Users don't really know what ephemeralId is so I think they will find this message confusing. Could we say something like reopened transport connection to node [%s] which disconnected exceptionally [%s/%dms] ago but did not restart, so the disconnection is unexpected? NB also tracking the disconnection duration here.

Similarly disconnected gracefully in the other branch.

Also can we link ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING?
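A sketch of how that might render, using the same format helper as the quoted snippet (the duration tracking via millisSinceDisconnect is an assumption):

logger.warn(
    () -> format(
        "reopened transport connection to node [%s] which disconnected exceptionally [%s/%dms] ago "
            + "but did not restart, so the disconnection is unexpected; see [%s] for troubleshooting",
        node.descriptionWithoutAttributes(),
        TimeValue.timeValueMillis(millisSinceDisconnect),
        millisSinceDisconnect,
        ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING
    ),
    nodeConnectionHistory.disconnectCause
);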
// that's a bug.
} else {
    logger.debug("closing unused transport connection to [{}]", node);
    conn.addCloseListener(new ActionListener<Void>() {
nit: reduce duplication a bit here:

conn.addCloseListener(new ActionListener<>() {
    @Override
    public void onResponse(Void ignored) {
        addNewNodeConnectionHistory(null);
    }

    @Override
    public void onFailure(Exception e) {
        addNewNodeConnectionHistory(e);
    }

    private void addNewNodeConnectionHistory(@Nullable Exception e) {
        nodeHistory.put(node.getId(), new NodeConnectionHistory(node.getEphemeralId(), e));
    }
});
Also consider extracting this out to the top level to try and keep this method's length and nesting depth from getting too much further out of hand.