
Clarify loadbalancer recommendation and keepalive sysctl settings #5324

Open
kunisen wants to merge 4 commits into main from kunisen-docpr-stl-1747


@kunisen kunisen commented Mar 2, 2026

Summary

Background ticket: https://github.com/elastic/support-tech-lead/issues/1747#issuecomment-3982265243, https://github.com/elastic/sdh-control-plane/issues/12444#issuecomment-3973989792


Doc update details

[1] ECE loadbalancer algorithm and responsibility

Discussion and background: https://github.com/elastic/sdh-control-plane/issues/12444#issuecomment-3974287615

TL;DR:

  • in an ECE environment, load balancer setup is the customer’s responsibility and is outside the scope of the ECE installation
  • we can recommend Round Robin (or any algorithm that evenly distributes traffic across all proxies).
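
As a rough illustration of what "evenly distributes traffic across all proxies" means for Round Robin, here is a minimal sketch. The proxy addresses and the `round_robin` helper are hypothetical and are not part of ECE; a real load balancer would also handle health checks and connection draining.

```python
from collections import Counter
from itertools import cycle

# Hypothetical proxy addresses, used only for illustration.
PROXIES = ["proxy-1:9243", "proxy-2:9243", "proxy-3:9243"]


def round_robin(backends):
    """Return an endless iterator that hands out backends in a fixed rotation."""
    return cycle(backends)


picker = round_robin(PROXIES)

# Simulate nine incoming requests and count how many land on each proxy.
assignments = Counter(next(picker) for _ in range(9))
print(assignments)
```

With a request count that is a multiple of the number of proxies, each proxy receives the same number of requests, which is the property the recommendation is after.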

@ChallenHB @bobbybho may I trouble you to please review the content from ECE perspective?


[2] ES container sysctl settings according to ES recommendation in ECE doc

Discussion: https://github.com/elastic/sdh-control-plane/issues/12444#issuecomment-3973989792

Background:

TL;DR:

  • The keepalive-related settings below are recommended by ES, but our ECE docs don't mention them.
  • We add them to the sysctl configuration section under ECE doc > OS preparation to make this clear.
net.ipv4.tcp_keepalive_time=180                         --> from https://github.com/elastic/elasticsearch/pull/59278
net.ipv4.tcp_keepalive_intvl=60                         --> from https://github.com/elastic/elasticsearch/pull/59278
net.ipv4.tcp_keepalive_probes=20                        --> from https://github.com/elastic/elasticsearch/pull/59278

@DaveCTurner may I trouble you to please review the content from ECE perspective? This is based on your clear suggestion in this comment - https://github.com/elastic/cloud/issues/68217#issuecomment-1847097504 back in 2023. Thank you!

@ChallenHB may I also trouble you to please review from the ECE perspective and check whether the way these keepalive-related settings are mentioned is appropriate?


Generative AI disclosure

  1. Did you use a generative AI (GenAI) tool to assist in creating this contribution?
  • Yes
  • No

Doc view / Preview

Before merge

After merge

@kunisen kunisen self-assigned this Mar 2, 2026
@kunisen kunisen requested a review from a team as a code owner March 2, 2026 05:49
@kunisen kunisen added the documentation, supportability, ece, and Team:Admin labels Mar 2, 2026

github-actions bot commented Mar 2, 2026

✅ Vale Linting Results

No issues found on modified lines!


The Vale linter checks documentation changes against the Elastic Docs style guide.

To use Vale locally or report issues, refer to Elastic style guide for Vale.


DaveCTurner commented Mar 2, 2026

@DaveCTurner may I trouble you to please review the content from ECE perspective? This is based on your clear suggestion in this comment - https://github.com/elastic/cloud/issues/68217#issuecomment-1847097504 back in 2023. Thank you!

Elasticsearch will override net.ipv4.tcp_keepalive_time and net.ipv4.tcp_keepalive_intvl if they are longer than 300s but we don't mind if they're shorter. net.ipv4.tcp_keepalive_probes=20 seems ludicrously high to me, can you explain how you've derived that from elastic/elasticsearch#59278?
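
The override behaviour described in this comment can be pictured with a small sketch. It assumes the override simply caps the value at 300s; the comment only says that longer values are overridden and shorter ones are left alone, so the exact replacement value and the function name here are assumptions.

```python
# Threshold quoted in the comment above; everything else is illustrative.
ES_MAX_KEEPALIVE_SECONDS = 300


def effective_keepalive(system_value_seconds: int) -> int:
    """Return the value a connection would use, assuming ES caps it at 300s."""
    return min(system_value_seconds, ES_MAX_KEEPALIVE_SECONDS)


# A host-level value of 180s is below the threshold, so it is left as-is.
print(effective_keepalive(180))
# A host-level value of 7200s is above the threshold, so it is capped.
print(effective_keepalive(7200))
```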

@yetanothertw yetanothertw left a comment


Thank you for working on this PR!

I've added a couple of minor suggestions, but looks good otherwise.

@kunisen kunisen closed this Mar 2, 2026

kunisen commented Mar 2, 2026

@DaveCTurner may I trouble you to please review the content from ECE perspective? This is based on your clear suggestion in this comment - https://github.com/elastic/cloud/issues/68217#issuecomment-1847097504 back in 2023. Thank you!

Elasticsearch will override net.ipv4.tcp_keepalive_time and net.ipv4.tcp_keepalive_intvl if they are longer than 300s but we don't mind if they're shorter. net.ipv4.tcp_keepalive_probes=20 seems ludicrously high to me, can you explain how you've derived that from elastic/elasticsearch#59278?

Thank you @DaveCTurner!

I am sorry if anything was missing in the communication, but I came here based on the two points below:

[1]
In the OP of this internal ticket - https://github.com/elastic/cloud/issues/68217, @alexsapran said:

The ES team suggests a certain system configuration be configured.
The ES suggests the following

net.ipv4.tcp_retries2=5                                 --> from https://github.com/elastic/elasticsearch/pull/59222
net.ipv4.tcp_keepalive_time=180                         --> from https://github.com/elastic/elasticsearch/pull/59278
net.ipv4.tcp_keepalive_intvl=60                         --> from https://github.com/elastic/elasticsearch/pull/59278
net.ipv4.tcp_keepalive_probes=20                        --> from https://github.com/elastic/elasticsearch/pull/59278
net.netfilter.nf_conntrack_tcp_timeout_established=7200 --> from https://www.elastic.co/guide/en/cloud-enterprise/current/ece-configure-hosts-sles12.html
net.netfilter.nf_conntrack_max=262140                   --> from https://www.elastic.co/guide/en/cloud-enterprise/current/ece-configure-hosts-sles12.html

We need to identify if we are missing some sysctl settings and if we need to adjust some others.

There he wrote the 3 lines below, indicating that the 3 values come from the linked PR:

net.ipv4.tcp_keepalive_time=180                         --> from https://github.com/elastic/elasticsearch/pull/59278
net.ipv4.tcp_keepalive_intvl=60                         --> from https://github.com/elastic/elasticsearch/pull/59278
net.ipv4.tcp_keepalive_probes=20                        --> from https://github.com/elastic/elasticsearch/pull/59278

[2]

That was in 2020. The later discussion was all based on it, and no one actually said the values were inappropriate.
Later, I found your 2023 comment - https://github.com/elastic/cloud/issues/68217#issuecomment-1847097504, saying

The ES team position on these settings remain unchanged. The default values for these timeouts and retries were set in stone in the late 1980s and make very little sense in a modern environment.

Which indicates we should change these settings from their defaults. I understand you didn't say yes or no to the original parameter values.


[1] explicitly mentioned non-default values (as "ES suggests"), and [2] led me to conclude that those values were the ones the ES team was suggesting...


Sorry for the back and forth, but if possible, may I trouble you to please share some insight on which values we should actually use?

We'd love to make sure we understand things correctly.


Thanks in advance!


DaveCTurner commented Mar 2, 2026

The really vital one is net.ipv4.tcp_retries2=5. The default on Linux is 15 (926s) which is ridiculous; RFC1122 says it should be at least 8 (i.e. 100s, still fairly daft) but I would argue that RFC1122 is inapplicable in our case.

I support reducing the timeouts for keepalives too although this is (a) less impactful and (b) overridden by ES in most cases anyway. I do not have a strong argument for reducing net.ipv4.tcp_keepalive_probes from its default of 9 but nor do I think there is any good reason to increase it to 20. I don't think we ever recommended that.
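
The timeout figures quoted for net.ipv4.tcp_retries2 can be approximated with a short calculation. This is a sketch under stated assumptions, not the kernel's actual logic: it assumes an initial retransmission timeout of 200ms that doubles on every retry and is capped at 120s per interval. Real connections derive the timeout from the measured round-trip time, so observed values will differ.

```python
RTO_MIN_MS = 200       # assumed initial retransmission timeout
RTO_MAX_MS = 120_000   # assumed cap on any single backoff interval


def hypothetical_timeout_ms(retries: int) -> int:
    """Sum the doubling, capped backoff intervals for the given retry count."""
    return sum(min(RTO_MIN_MS << k, RTO_MAX_MS) for k in range(retries + 1))


for retries in (15, 8, 5):
    seconds = hypothetical_timeout_ms(retries) / 1000
    print(f"tcp_retries2={retries}: about {seconds} s")
```

Under these assumptions, 15 retries come out at roughly 925s and 8 retries at roughly 102s, which is in line with the 926s and 100s figures in the comment, while 5 retries give a bound measured in seconds rather than minutes.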

eedugon added a commit that referenced this pull request Mar 2, 2026
Clarified the responsibility for load balancer provisioning and configuration, and updated language for better readability.

Including an interesting paragraph by @kunisen in the cancelled PR #5324

@kunisen , let me know your thoughts, I've added it on top of the page and not just on the algorithms paragraph.

I think it was a good addition.
@kunisen kunisen reopened this Mar 3, 2026

kunisen commented Mar 3, 2026

Thanks @DaveCTurner again!
@eedugon thanks for the follow up. I didn't mean to close it but I must have wrongly clicked the button 🤦 sorry.


Dave, regarding your comment:

The really vital one is net.ipv4.tcp_retries2=5. The default on Linux is 15 (926s) which is ridiculous; RFC1122 says it should be at least 8 (i.e. 100s, still fairly daft) but I would argue that RFC1122 is inapplicable in our case.

Thank you!
We already mention this in our ECE docs, so I think we are fine here.


I support reducing the timeouts for keepalives too although this is (a) less impactful and (b) overridden by ES in most cases anyway. I do not have a strong argument for reducing net.ipv4.tcp_keepalive_probes from its default of 9 but nor do I think there is any good reason to increase it to 20. I don't think we ever recommended that.

Noted.

Given you said I support reducing the timeouts for keepalives too, I will include the two parameters and values below in the sysctl settings description.

net.ipv4.tcp_keepalive_time=180
net.ipv4.tcp_keepalive_intvl=60

Note: I will not include net.ipv4.tcp_keepalive_probes=20, since you indicated that the default value of 9 is fine.
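
To see what these values mean in practice, the time needed to detect a dead peer on an idle connection can be estimated as the idle time plus one interval per unanswered probe. The sketch below uses that rough model; the function name is illustrative, and the 7200s and 75s figures assume the usual Linux defaults, which are not stated in this thread.

```python
def dead_peer_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Estimate idle time before a dead peer is detected via TCP keepalive."""
    return time_s + intvl_s * probes


# Assumed Linux defaults: 7200s idle, 75s interval, 9 probes.
linux_defaults = dead_peer_detection_seconds(7200, 75, 9)
# Values proposed in this comment, keeping the default of 9 probes.
proposed = dead_peer_detection_seconds(180, 60, 9)
# Same timeouts with the originally listed 20 probes, for comparison.
with_20_probes = dead_peer_detection_seconds(180, 60, 20)

print(linux_defaults, proposed, with_20_probes)
```

On this estimate, raising the probe count from 9 to 20 nearly doubles the detection time for the same timeouts, which works against the goal of shortening it.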


=> Dave, may I trouble you to kindly check again and see if this is still confusing or not please? 🙏


Side note: ES setting & ECE setting

Also, I understand you might ask why we want to add this to the ECE host-level sysctl configuration, where it takes effect on all containers rather than on ES containers only.

Our ECE dev verified this in https://github.com/elastic/sdh-control-plane/issues/12444#issuecomment-3961376807: "there doesn't seem to be an explicit reason for this setting to be excluded from the RHEL and Ubuntu docs", which is based on the original cloud ticket - https://github.com/elastic/cloud/issues/68217 - "The ES team suggests a certain system configuration be configured."

So the ECE dev agreed that we could use these settings and set them at the ECE host level.
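
For anyone who wants to check a host against the values discussed in this thread, here is a minimal sketch. It relies on the Linux convention that a sysctl name maps to a file under /proc/sys with dots replaced by slashes. The `RECOMMENDED` table and the helper names are illustrative, not part of ECE, and the values should follow whatever the docs finally state.

```python
from pathlib import Path

# Values discussed in this thread; adjust to match the final documentation.
RECOMMENDED = {
    "net.ipv4.tcp_retries2": 5,
    "net.ipv4.tcp_keepalive_time": 180,
    "net.ipv4.tcp_keepalive_intvl": 60,
}


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under the given root."""
    return Path(root, *name.split("."))


def find_mismatches(recommended, root="/proc/sys"):
    """Return {name: (current, wanted)} for settings that differ or are missing."""
    mismatches = {}
    for name, wanted in recommended.items():
        path = sysctl_path(name, root)
        # A missing file is reported as None rather than raising an error.
        current = int(path.read_text().split()[0]) if path.exists() else None
        if current != wanted:
            mismatches[name] = (current, wanted)
    return mismatches
```

On a Linux host, `find_mismatches(RECOMMENDED)` returns an empty dict when all three settings already match. The `root` parameter exists only so the check can be pointed at a test directory.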


Please let me know if anything is missing, and thanks again for your patience.

remove net.ipv4.tcp_keepalive_probes per ES dev's confirmation


3 participants