
Refresh potential lost connections at query start for _search #130463


Conversation


@pawankartik-elastic pawankartik-elastic commented Jul 2, 2025

For CPS S2D9, we'd need _search to refresh potentially lost connections at query start by explicitly establishing a connection with a short timeout, so that we avoid waiting for a large duration.
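As a rough illustration of that shape, here is a plain-Java sketch (not the actual RemoteClusterService code; the helper name and the one-second value are assumptions):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class ForceConnectSketch {
    // At the start of a cross-cluster _search, explicitly kick off a connect attempt to the
    // remote and bound the wait with a short timeout, rather than waiting on the much longer
    // default transport timeout for a potentially stale connection.
    static CompletableFuture<Void> ensureConnected(Runnable connectToRemote, long timeoutMillis) {
        return CompletableFuture.runAsync(connectToRemote)
            .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        // Pretend connect attempt; in the real code this would (re-)establish the remote connection.
        ensureConnected(() -> { }, 1000).get();
        System.out.println("connected within the short timeout");
    }
}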

Some preliminary questions that we haven't covered so far:

  • Is this method of implementing the differential behaviour acceptable?
    • This is answered; registering a setting to toggle the differential behaviour is acceptable.
  • What setting should we use to decide if we want to refresh lost connections (general vs specific)?
    • This is answered; we'd be using search.ccs.force_connect_timeout.

 * @return boolean If we should always force reconnect.
 */
private static boolean shouldEstablishConnection(TimeValue forceConnectTimeoutSecs, boolean skipUnavailable) {
    return forceConnectTimeoutSecs != null || skipUnavailable == false;
}
Contributor

This use of the timeout setting as a boolean flag has some undesirable effects, I think.

As a user, I would assume that if I omit a timeout setting, that would have the same behaviour as a very long timeout setting. For example, I'd think that no timeout would have the same effect as a timeout of 1000 years.

Using the presence/absence of the timeout setting in this way breaks that mental model, and it means that the default behaviour without the setting does not correspond to any possible value of the setting.
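For illustration, a tiny standalone restatement of the helper makes the asymmetry concrete (Long stands in for TimeValue so the snippet runs on its own):

public class TimeoutFlagDemo {
    // Same logic as shouldEstablishConnection above, with Long standing in for TimeValue.
    private static boolean shouldEstablishConnection(Long forceConnectTimeoutMillis, boolean skipUnavailable) {
        return forceConnectTimeoutMillis != null || skipUnavailable == false;
    }

    public static void main(String[] args) {
        // Setting omitted: for a skip_unavailable=true remote, the forced connect is skipped...
        System.out.println(shouldEstablishConnection(null, true));                    // false
        // ...but no finite value reproduces that: even a ~1000-year timeout forces the connect.
        System.out.println(shouldEstablishConnection(31_536_000_000_000L, true));     // true
    }
}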

Contributor Author

My initial approach was to use a simple boolean setting and then attempt to ensure that we have a connection. But then it was decided that this setting should allow a configurable timeout value.

Regarding the no-timeout effect, I still think it aligns with some existing behaviour. If I'm not wrong, transport.connect_timeout controls how long to wait when opening a connection to a remote, and in its absence, I think we default to 30s. There could be other settings that behave similarly (let me check on this).
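(Purely for comparison, and not the actual transport code: the pattern described above is a default fallback, where an omitted value still behaves like some concrete timeout.)

import java.time.Duration;

public class DefaultTimeoutDemo {
    // An unset transport.connect_timeout-style value falls back to a default (30s here),
    // so "not configured" is equivalent to one concrete timeout value, whereas with
    // shouldEstablishConnection, "unset" selects a code path no concrete value reproduces.
    static Duration connectTimeout(Duration configured) {
        return configured != null ? configured : Duration.ofSeconds(30);
    }

    public static void main(String[] args) {
        System.out.println(connectTimeout(null));                    // PT30S
        System.out.println(connectTimeout(Duration.ofSeconds(5)));   // PT5S
    }
}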

Contributor

But that's different, right? There exists no value of the forceConnectTimeoutSecs setting that would give the default behaviour, because the mere existence of the setting affects shouldEstablishConnection.

Contributor

@quux00 quux00 Jul 14, 2025

IMO, this is not, in practice, a user/customer-settable setting. This is something we (Elasticsearch) will hardcode on the serverless side. It should never be null in serverless. And "null" in stateful means "I'm not using that feature".

Contributor

My feeling is, we should not be inferring whether a feature is in use from the existence/nonexistence of a timeout setting. The timeout setting should control only the timeout duration; whether a feature is in use or not should be dictated by some other mechanism clearly meant for that purpose.

Contributor

@quux00 quux00 Jul 15, 2025

I think I would need a clearer proposal of an alternative here. The original model in this PR was a boolean flag for whether we want this higher-level timeout, but then the timeout is hardcoded in code constants, and we may want to tweak that timeout. So to get both you would then need two settings: a boolean for whether it is on and a second setting for the timeout duration. That seems unnecessary to me.

The current setting handles both. If it is unset/null, then the environment is not requesting that feature.
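(As a rough standalone sketch of that single-setting model -- the map lookup is illustrative, not the actual Settings API:)

import java.time.Duration;
import java.util.Map;
import java.util.Optional;

public class SingleSettingSketch {
    // One nullable timeout setting: its presence turns the forced-connect behaviour on,
    // and its value controls how long to wait. Unset means the feature is not requested.
    static Optional<Duration> forceConnectTimeout(Map<String, Duration> settings) {
        return Optional.ofNullable(settings.get("search.ccs.force_connect_timeout"));
    }

    public static void main(String[] args) {
        // Serverless-style environment: hardcoded to a short value, so the feature is on.
        System.out.println(forceConnectTimeout(Map.of("search.ccs.force_connect_timeout", Duration.ofSeconds(1))));
        // Stateful default: unset, so the new code path is never taken.
        System.out.println(forceConnectTimeout(Map.of())); // Optional.empty
    }
}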

Another option is to check first, somehow, an isServerless() setting, but even if that weren't frowned upon, I would vote against it here since this feature might be used in stateful someday. In addition, for the stateful use case, we would likely want a longer timeout than in serverless. So having this one setting provide the timeout (or be null) handles all those scenarios in one place.

What alternatives can you propose?

Contributor

We could call the setting connect_with_retry_after, which makes it clearer that it does two things.

Contributor

@quux00 quux00 Jul 16, 2025

I hate to be a constant contrarian, but I don't think I like that for a couple of reasons:

  1. "Connect with retry" is misleading. There is another ticket to handle "retries" after a search has started. This ticket focuses on stale connections. I don't consider that a "retry". This name could easily be confusing with the other use case.

  2. There is no unit of time in the name. I think it needs to include "secs" or "seconds".

  3. It doesn't add anything that the current name doesn't already do: forceConnectTimeoutSecs indicates two things to me just as much as connect_with_retry_after does, and it doesn't have the two flaws above.

My vote is to keep the current name and model in this PR.

Contributor

Not a strong opinion on my part -- forceConnectTimeoutSecs sounds fine to me.

@pawankartik-elastic pawankartik-elastic added the Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch) and :Search Foundations/Search (Catch all for Search Foundations) labels Jul 14, 2025
@pawankartik-elastic pawankartik-elastic marked this pull request as ready for review July 14, 2025 15:09
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

Contributor

@quux00 quux00 left a comment

Two suggestions left.

Matchers.containsString("org.elasticsearch.ElasticsearchTimeoutException: timed out after [1s/1000ms]")
);
assertThat(failureReason, Matchers.containsString("SubscribableListener"));
latch.countDown();
Contributor

A more convincing and full-featured test would show two things:

  1. It times out when the re-connection cannot be completed
  2. It successfully reconnects (when not blocked).

After doing this countDown on the latch, would you be able to add a test for this second scenario?

Contributor Author

@pawankartik-elastic pawankartik-elastic Jul 16, 2025

It successfully reconnects

This would imply that I somehow disconnect the remote cluster. Unfortunately, that's not possible: connections are already established by the time the test runs. The only way to break the connection is to shut down the cluster, but that'd mean something entirely different -- the cluster is offline. If I don't disconnect the remote but still proceed with the test, there's no point, since the connection was established and remains established for the entirety of the test. By the time countDown() is hit, the connection goes through. I can stall a connection, but I can't break it so that the underlying networking code establishes it later.

Contributor

@quux00 quux00 Jul 16, 2025

Then I don't understand what cluster(REMOTE_CLUSTER_1).fullRestart(); is doing and thus I don't understand this test at all. Doesn't the restart of remote_cluster_1 break the connection and make it stale?

Contributor Author

@pawankartik-elastic pawankartik-elastic Jul 16, 2025

Because I cannot manually close any cluster connections, I add a connection behaviour that gets invoked when a "new" connection is established; it does not apply to pre-existing connections. So I restart the remote cluster, which then establishes new connection(s). When that happens, the connection behaviour I added earlier gets invoked and the connection is stalled until I call countDown(). Since countDown() is called at the end of the test, the subscribable listener has trouble talking to the remote; the timeout is 1s in the test, hence the timeout exception. Once countDown() is called, the connection goes through, so any search request after this point will not exercise maybeEnsureConnectedAndGetConnection(), since it'd see that the connection is active. All we'd be able to prove is that we can talk to the remote, not that a connection was established by maybeEnsureConnectedAndGetConnection(): the connection went through the moment the latch was released -- the same connection that was supposed to be established when the cluster was brought up, simply delayed by the latch to mimic a stalled cluster.
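(The stall-then-release mechanism in plain Java, stripped of the Elasticsearch test plumbing; names and the 1s value mirror the description above:)

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StalledConnectDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch latch = new CountDownLatch(1); // stands in for the injected connection behaviour
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // The "new" connection after the remote restart is stalled until the latch is released.
        Future<String> connect = pool.submit(() -> {
            latch.await();
            return "connected";
        });

        try {
            connect.get(1, TimeUnit.SECONDS); // the 1s force-connect timeout in the test
        } catch (TimeoutException e) {
            System.out.println("timed out after [1s] -- this is what the test asserts");
        }

        latch.countDown();                    // releasing the latch lets the stalled connect complete
        System.out.println(connect.get());    // "connected" -- the original connection, just delayed
        pool.shutdown();
    }
}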

@elasticsearchmachine
Collaborator

Hi @pawankartik-elastic, I've created a changelog YAML for you.

Contributor

@quux00 quux00 left a comment

LGTM

@pawankartik-elastic pawankartik-elastic merged commit c1c7218 into elastic:main Jul 21, 2025
34 checks passed
Labels
>enhancement, :Search Foundations/Search (Catch all for Search Foundations), Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch), v9.2.0