-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Refresh potential lost connections at query start for _search
#130463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refresh potential lost connections at query start for _search
#130463
Conversation
server/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java
Outdated
Show resolved
Hide resolved
* @return boolean If we should always force reconnect. | ||
*/ | ||
private static boolean shouldEstablishConnection(TimeValue forceConnectTimeoutSecs, boolean skipUnavailable) { | ||
return forceConnectTimeoutSecs != null || skipUnavailable == false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This use of the timeout setting as a boolean flag has some undesirable effects, I think.
As a user, I would assume that if I omit a timeout setting, that would have the same behaviour as a very long timeout setting. For example, I'd think that no timeout would have the same effect as a timeout of 1000 years.
Using the presence/absence of the timeout setting in this way breaks that mental model, and it means that the default behaviour without the setting does not correspond to any possible value of the setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial approach was to use a simple boolean setting and then attempt to ensure that we have a connection. But then it was decided that this setting should allow a configurable timeout value.
Regarding the no timeout effect, I still think it aligns with some existing behaviour. If I'm not wrong, transport.connect_timeout
controls how long to wait when opening a connection to a remote and under its absence, I think we default to 30s
. There could be other settings that behave similarly (let me check on this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that's different, right? There exists no value of the forceConnectTimeoutSecs
setting that would give the default behaviour, because the existence of that setting is affecting this shouldEstablishConnection
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this is not, in practice, a user/customer settable setting. This is something we (Elasticsearch) will hardcode on the serverless side. It should never be null in serverless. And "null" in stateful means "I'm not using that feature".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My feeling is, we should not be inferring whether a feature is in use from the existence/nonexistence of a timeout setting. The timeout setting should control only the timeout duration; whether a feature is in use or not should be dictated by some other mechanism clearly meant for that purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would need a clearer proposal of an alternative here. The original model in this PR was to have a boolean flag of whether we want this higher level timeout. But then the timeout is hardcoded in code constants. We may want to tweak that timeout. So to get both you would then have to have two settings - a boolean as to whether it is on and a setting of what the timeout duration is. That seems unnecessary to me.
The current setting allows handling both. If it unset/null then the environment is not requesting that feature.
Another option is to check first, somehow, an isServerless()
setting, but even if that wasn't frowned up, I would vote agains that here since this feature might be used in stateful someday. In addition, for the use case in stateful, we would likely want a longer timeout that in serverless. So having this one setting provide the timeout (or be null) handles all those scenarios in place.
What alternatives can you propose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could call the setting connect_with_retry_after
which makes it clearer that it does two things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate to be a constant contrarian, but I don't think I like that for a couple of reasons:
-
"Connect with retry" is misleading. There is another ticket to handle "retries" after a search has started. This ticket focuses on stale connections. I don't consider that a "retry". This name could easily be confusing with the other use case.
-
there is no unit of time in the name. I think it needs to have "secs" or "seconds"
-
it doesn't add anything that the current name doesn't already do:
forceConnectTimeoutSecs
also indicates two things to me as much asconnect_with_retry_after
and it doesn't have the two flaws above.
My vote is to keep the current name and model in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a strong opinion on my part -- forceConnectTimeoutSecs
sounds fine to me
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two suggestions left.
server/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java
Outdated
Show resolved
Hide resolved
Matchers.containsString("org.elasticsearch.ElasticsearchTimeoutException: timed out after [1s/1000ms]") | ||
); | ||
assertThat(failureReason, Matchers.containsString("SubscribableListener")); | ||
latch.countDown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more convincing and full-featured test would show two things:
- It times out when the re-connection cannot be completed
- It successfully reconnects (when not blocked).
After doing this countDown on the latch, would you be able to add a test for this second scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It successfully reconnects
This would imply that I somehow disconnect the remote cluster. Unfortunately, that's not possible. Connections are already established by the time the test runs. The only way to break the connection is to shutdown the cluster but that'd mean something entirely different -- cluster is offline. If I don't disconnect the remote but still proceed with the test, there's no point since the connection was established and continues to be established for the entirety of the test. By the time countDown()
is hit, the connection goes through and is established. I can stall a connection but not break it for the underlying networking code to establish it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I don't understand what cluster(REMOTE_CLUSTER_1).fullRestart();
is doing and thus I don't understand this test at all. Doesn't the restart of remote_cluster_1 break the connection and make it stale?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I cannot manually close any cluster connections, I add a connection behaviour that gets invoked when a "new" connection is established. It does not apply to pre-existing connections. So I restart the remote cluster which then establishes new connection(s). When that happens, the connection behaviour I added previously gets invoked and the connection is stalled until I call countDown()
. And since the countDown()
is called at the end of the test, the subscribable listener has trouble talking to the remote. The timeout is 1s in the test and hence you get the timeout exception. Once countDown()
is called, the connection goes through. So any search request after this point will not exercise the maybeEnsureConnectedAndGetConnection()
since it'd realise that the connection is active. All we'd be able to prove is that we can talk to the remote and not that a connection was established by maybeEnsureConnectedAndGetConnection()
(because the connection went through the moment the latch was released -- the same connection that was supposed to be established when the cluster was brought up -- it was simply delayed by the latch, mimicking a stalled cluster).
Hi @pawankartik-elastic, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
For CPS S2D9, we'd need
_search
to refresh potentially lost connections at start: by explicitly establishing a connection with a short timeout to avoid waiting for large duration.Some preliminary questions that we haven't covered so far:
Is this method of implementing the differential behaviour acceptable?What setting should we use to decide if we want to refresh lost connections (general vs specific)?search.ccs.force_connect_timeout
.