
Refresh potential lost connections at query start for _search #130463


Conversation


@pawankartik-elastic pawankartik-elastic commented Jul 2, 2025

For CPS S2D9, we'd need _search to refresh potentially lost connections at query start by explicitly establishing a connection with a short timeout, so that we avoid waiting for a large duration.
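As a rough illustration of that shape, here is a plain-Java sketch (not the actual RemoteClusterService code; the helper name and the one-second value are assumptions):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class ForceConnectSketch {
    // At the start of a cross-cluster _search, explicitly kick off a connect attempt to the
    // remote and bound the wait with a short timeout, rather than waiting on the much longer
    // default transport timeout for a potentially stale connection.
    static CompletableFuture<Void> ensureConnected(Runnable connectToRemote, long timeoutMillis) {
        return CompletableFuture.runAsync(connectToRemote)
            .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        // Pretend connect attempt; in the real code this would (re-)establish the remote connection.
        ensureConnected(() -> { }, 1000).get();
        System.out.println("connected within the short timeout");
    }
}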

Some preliminary questions that we haven't covered so far:

  • Is this method of implementing the differential behaviour acceptable?
    • This is answered; registering a setting to toggle the differential behaviour is acceptable.
  • What setting should we use to decide if we want to refresh lost connections (general vs specific)?
    • This is answered; we'd be using search.ccs.force_connect_timeout.

 * @return boolean If we should always force reconnect.
 */
private static boolean shouldEstablishConnection(TimeValue forceConnectTimeoutSecs, boolean skipUnavailable) {
    return forceConnectTimeoutSecs != null || skipUnavailable == false;
}
Contributor

This use of the timeout setting as a boolean flag has some undesirable effects, I think.

As a user, I would assume that if I omit a timeout setting, that would have the same behaviour as a very long timeout setting. For example, I'd think that no timeout would have the same effect as a timeout of 1000 years.

Using the presence/absence of the timeout setting in this way breaks that mental model, and it means that the default behaviour without the setting does not correspond to any possible value of the setting.
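For illustration, a tiny standalone restatement of the helper makes the asymmetry concrete (Long stands in for TimeValue so the snippet runs on its own):

public class TimeoutFlagDemo {
    // Same logic as shouldEstablishConnection above, with Long standing in for TimeValue.
    private static boolean shouldEstablishConnection(Long forceConnectTimeoutMillis, boolean skipUnavailable) {
        return forceConnectTimeoutMillis != null || skipUnavailable == false;
    }

    public static void main(String[] args) {
        // Setting omitted: for a skip_unavailable=true remote, the forced connect is skipped...
        System.out.println(shouldEstablishConnection(null, true));                    // false
        // ...but no finite value reproduces that: even a ~1000-year timeout forces the connect.
        System.out.println(shouldEstablishConnection(31_536_000_000_000L, true));     // true
    }
}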

Contributor Author

My initial approach was to use a simple boolean setting and then attempt to ensure that we have a connection. But then it was decided that this setting should allow a configurable timeout value.

Regarding the no-timeout effect, I still think it aligns with some existing behaviour. If I'm not wrong, transport.connect_timeout controls how long to wait when opening a connection to a remote, and in its absence, I think we default to 30s. There could be other settings that behave similarly (let me check on this).
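(Purely for comparison, and not the actual transport code: the pattern described above is a default fallback, where an omitted value still behaves like some concrete timeout.)

import java.time.Duration;

public class DefaultTimeoutDemo {
    // An unset transport.connect_timeout-style value falls back to a default (30s here),
    // so "not configured" is equivalent to one concrete timeout value, whereas with
    // shouldEstablishConnection, "unset" selects a code path no concrete value reproduces.
    static Duration connectTimeout(Duration configured) {
        return configured != null ? configured : Duration.ofSeconds(30);
    }

    public static void main(String[] args) {
        System.out.println(connectTimeout(null));                    // PT30S
        System.out.println(connectTimeout(Duration.ofSeconds(5)));   // PT5S
    }
}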

Contributor

But that's different, right? There exists no value of the forceConnectTimeoutSecs setting that would give the default behaviour, because the mere existence of the setting affects shouldEstablishConnection.

Contributor

@quux00 quux00 Jul 14, 2025

IMO, this is not, in practice, a user/customer-settable setting. This is something we (Elasticsearch) will hardcode on the serverless side. It should never be null in serverless. And "null" in stateful means "I'm not using that feature".

Contributor

My feeling is, we should not be inferring whether a feature is in use from the existence/nonexistence of a timeout setting. The timeout setting should control only the timeout duration; whether a feature is in use or not should be dictated by some other mechanism clearly meant for that purpose.

Contributor

@quux00 quux00 Jul 15, 2025

I think I would need a clearer proposal of an alternative here. The original model in this PR was a boolean flag for whether we want this higher-level timeout, but then the timeout is hardcoded in code constants, and we may want to tweak that timeout. So to get both you would then need two settings: a boolean for whether it is on and a second setting for the timeout duration. That seems unnecessary to me.

The current setting handles both. If it is unset/null, then the environment is not requesting that feature.
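(As a rough standalone sketch of that single-setting model -- the map lookup is illustrative, not the actual Settings API:)

import java.time.Duration;
import java.util.Map;
import java.util.Optional;

public class SingleSettingSketch {
    // One nullable timeout setting: its presence turns the forced-connect behaviour on,
    // and its value controls how long to wait. Unset means the feature is not requested.
    static Optional<Duration> forceConnectTimeout(Map<String, Duration> settings) {
        return Optional.ofNullable(settings.get("search.ccs.force_connect_timeout"));
    }

    public static void main(String[] args) {
        // Serverless-style environment: hardcoded to a short value, so the feature is on.
        System.out.println(forceConnectTimeout(Map.of("search.ccs.force_connect_timeout", Duration.ofSeconds(1))));
        // Stateful default: unset, so the new code path is never taken.
        System.out.println(forceConnectTimeout(Map.of())); // Optional.empty
    }
}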

Another option is to check first, somehow, an isServerless() setting, but even if that weren't frowned upon, I would vote against it here since this feature might be used in stateful someday. In addition, for the stateful use case, we would likely want a longer timeout than in serverless. So having this one setting provide the timeout (or be null) handles all those scenarios in one place.

What alternatives can you propose?

Contributor

We could call the setting connect_with_retry_after, which makes it clearer that it does two things.

Contributor

@quux00 quux00 Jul 16, 2025

I hate to be a constant contrarian, but I don't think I like that for a couple of reasons:

  1. "Connect with retry" is misleading. There is another ticket to handle "retries" after a search has started. This ticket focuses on stale connections. I don't consider that a "retry". This name could easily be confusing with the other use case.

  2. There is no unit of time in the name. I think it needs to include "secs" or "seconds".

  3. It doesn't add anything that the current name doesn't already do: forceConnectTimeoutSecs indicates two things to me just as much as connect_with_retry_after does, and it doesn't have the two flaws above.

My vote is to keep the current name and model in this PR.

Contributor

Not a strong opinion on my part -- forceConnectTimeoutSecs sounds fine to me.

@pawankartik-elastic pawankartik-elastic added the Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch) and :Search Foundations/Search (Catch all for Search Foundations) labels Jul 14, 2025
@pawankartik-elastic pawankartik-elastic marked this pull request as ready for review July 14, 2025 15:09
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

Contributor

@quux00 quux00 left a comment

Two suggestions left.

Matchers.containsString("org.elasticsearch.ElasticsearchTimeoutException: timed out after [1s/1000ms]")
);
assertThat(failureReason, Matchers.containsString("SubscribableListener"));
latch.countDown();
Contributor

A more convincing and full-featured test would show two things:

  1. It times out when the re-connection cannot be completed
  2. It successfully reconnects (when not blocked).

After doing this countDown on the latch, would you be able to add a test for this second scenario?

Contributor Author

@pawankartik-elastic pawankartik-elastic Jul 16, 2025

It successfully reconnects

This would imply that I somehow disconnect the remote cluster. Unfortunately, that's not possible: connections are already established by the time the test runs. The only way to break the connection is to shut down the cluster, but that'd mean something entirely different -- the cluster is offline. If I don't disconnect the remote but still proceed with the test, there's no point, since the connection was established and remains established for the entirety of the test. By the time countDown() is hit, the connection goes through. I can stall a connection, but I can't break it so that the underlying networking code establishes it later.

Contributor

@quux00 quux00 Jul 16, 2025

Then I don't understand what cluster(REMOTE_CLUSTER_1).fullRestart(); is doing and thus I don't understand this test at all. Doesn't the restart of remote_cluster_1 break the connection and make it stale?

Contributor Author

@pawankartik-elastic pawankartik-elastic Jul 16, 2025

Because I cannot manually close any cluster connections, I add a connection behaviour that gets invoked when a "new" connection is established; it does not apply to pre-existing connections. So I restart the remote cluster, which then establishes new connection(s). When that happens, the connection behaviour I added earlier gets invoked and the connection is stalled until I call countDown(). Since countDown() is called at the end of the test, the subscribable listener has trouble talking to the remote; the timeout is 1s in the test, hence the timeout exception. Once countDown() is called, the connection goes through, so any search request after this point will not exercise maybeEnsureConnectedAndGetConnection(), since it'd see that the connection is active. All we'd be able to prove is that we can talk to the remote, not that a connection was established by maybeEnsureConnectedAndGetConnection(): the connection went through the moment the latch was released -- the same connection that was supposed to be established when the cluster was brought up, simply delayed by the latch to mimic a stalled cluster.
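(The stall-then-release mechanism in plain Java, stripped of the Elasticsearch test plumbing; names and the 1s value mirror the description above:)

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StalledConnectDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch latch = new CountDownLatch(1); // stands in for the injected connection behaviour
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // The "new" connection after the remote restart is stalled until the latch is released.
        Future<String> connect = pool.submit(() -> {
            latch.await();
            return "connected";
        });

        try {
            connect.get(1, TimeUnit.SECONDS); // the 1s force-connect timeout in the test
        } catch (TimeoutException e) {
            System.out.println("timed out after [1s] -- this is what the test asserts");
        }

        latch.countDown();                    // releasing the latch lets the stalled connect complete
        System.out.println(connect.get());    // "connected" -- the original connection, just delayed
        pool.shutdown();
    }
}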

@elasticsearchmachine
Collaborator

Hi @pawankartik-elastic, I've created a changelog YAML for you.

Contributor

@quux00 quux00 left a comment

LGTM

@pawankartik-elastic pawankartik-elastic merged commit c1c7218 into elastic:main Jul 21, 2025
34 checks passed
Labels
>enhancement, :Search Foundations/Search (Catch all for Search Foundations), Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch), v9.2.0