
Explain ignore_above better #129284


Merged: 9 commits, Jul 17, 2025

Conversation

@nik9000 (Member) commented Jun 11, 2025

This concept is complicated.

Closes #128991

@nik9000 requested a review from limotova June 11, 2025 18:42
@nik9000 added the `>docs`, `:Search Foundations/Mapping`, and `v9.1.0` labels Jun 11, 2025
@elasticsearchmachine added the `Team:Docs` and `Team:Search Foundations` labels Jun 11, 2025
@elasticsearchmachine (Collaborator): Pinging @elastic/es-docs (Team:Docs)

@elasticsearchmachine (Collaborator): Pinging @elastic/es-search-foundations (Team:Search Foundations)

If you need to never reject documents, this should have some value `<=8191`. All documents with
more characters will just skip building the index for this field.

The defaults are complicated. It's `2147483647` (effectively unbounded) in standard indices and

@leemthompo (Contributor) commented Jun 12, 2025

Consider using bullets for defaults/dynamic mapping info for readability

nik9000 and others added 2 commits June 12, 2025 13:38
@bmorelli25 (Member) left a comment

This looked like fun so I couldn't resist. Sorry if I'm wrong! Also, hi Nik 👋 .

Comment on lines 73 to 77
: Do not index any field containing a string with more characters than this value. This is important because {{es}}
will reject entire documents if they contain keyword fields that exceed `32766` bytes when UTF-8 encoded.

To avoid any risk of document rejection, set this value to `8191` or less. Fields with strings exceeding this
length will be excluded from indexing.

Member:

Does this work on text fields? Or only keyword fields?

Also further down you say:

`logsdb` indices: `8191`. `keyword` fields longer than `8191` characters won't be indexed, but the documents are
      accepted and the unindexed values are available from `_source`.

Does the previous statement only apply to logsdb indices? Or to standard indices as well? If both, that feels important.

What about this:

Skip indexing of a keyword value whose UTF-8–encoded size is larger than ignore_above. The value is still kept in _source, but the field won’t be searchable or aggregatable.

If you do not set ignore_above, {es} will reject entire documents if they contain one or more keyword fields exceeding a UTF-8–encoded size of 32766.

To avoid any risk of document rejection, set this value to 8191 or less.

Contributor:

> Does this work on text fields? Or only keyword fields?

This setting is only available on keyword fields. On text fields, some tokenizers have a `max_token_length` setting, which does not ignore tokens that exceed this length but splits them instead (so quite a bit different).

> What about this:

I think it might be a bit clearer to specify characters/bytes, like "UTF-8–encoded size of 32766 bytes." and "set this value to 8191 characters or less."
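
The two limits discussed here (`8191` characters versus `32766` bytes) are linked by UTF-8 arithmetic: a character needs at most 4 bytes, and 32766 // 4 = 8191. The following is a minimal Python sketch of that arithmetic, not anything taken from the PR. The helper names are hypothetical, and it assumes a "character" is a Unicode code point, which may not match exactly how {{es}} counts length.

```python
# Byte limit quoted in this thread for a single keyword value.
MAX_TERM_BYTES = 32766
# A Unicode code point needs at most 4 bytes in UTF-8.
MAX_UTF8_BYTES_PER_CHAR = 4


def utf8_size(value: str) -> int:
    """Return the size of value in bytes once UTF-8 encoded."""
    return len(value.encode("utf-8"))


def safe_ignore_above(max_term_bytes: int = MAX_TERM_BYTES) -> int:
    """Largest character count that cannot exceed max_term_bytes in UTF-8."""
    return max_term_bytes // MAX_UTF8_BYTES_PER_CHAR


limit = safe_ignore_above()
worst_case = "\U0001F50D" * limit  # a 4-byte character repeated `limit` times
print(limit, utf8_size(worst_case))  # prints: 8191 32764
```

Under these assumptions, a value of 8191 characters stays below the byte limit even in the worst case, while one more 4-byte character would push it over.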

Comment on lines 79 to 83
The defaults are complicated:
* Standard indices: `2147483647` (effectively unbounded). Documents containing `keyword` fields longer than `32766`
bytes will be rejected.
* `logsdb` indices: `8191`. `keyword` fields longer than `8191` characters won't be indexed, but the documents are
accepted and the unindexed values are available from `_source`.

@bmorelli25 (Member) commented Jun 13, 2025

What about a table for this information?

The defaults are complicated:

| Index type | Default | Effect |
| --- | --- | --- |
| Standard indices | `2147483647` (effectively unbounded) | Documents will be rejected if any keyword exceeds `32766` bytes. |
| `logsdb` indices | `8191` | Documents are never rejected. Keywords exceeding this limit are still kept in `_source`, but won’t be searchable or aggregatable. |

Member:

Ahh I see this is in definition list already, so maybe a table won't work. But if you like my wording you can update accordingly.

Contributor:

Me like that wording :)

Contributor:

I think "Documents are never rejected" might be a bit too strongly worded; maybe something like:

Documents won't be rejected if a keyword field exceeds this limit and the field will still be kept in _source, but it won’t be searchable or aggregatable.
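
The difference between the two defaults can be illustrated with a toy Python model. It is a sketch of the behavior as described in this thread, not of the actual {{es}} implementation. The function name and the outcome strings are hypothetical, and `len()` stands in for the character count.

```python
MAX_TERM_BYTES = 32766  # byte limit quoted in this thread


def keyword_outcome(value: str, ignore_above: int) -> str:
    """Toy model of what happens to one keyword value at index time."""
    # Values longer than ignore_above are skipped, not rejected.
    if len(value) > ignore_above:
        return "not indexed, kept in _source"
    # Values that get past ignore_above must still fit in the byte limit.
    if len(value.encode("utf-8")) > MAX_TERM_BYTES:
        return "document rejected"
    return "indexed"


long_value = "x" * 40_000
print(keyword_outcome(long_value, 2147483647))  # standard index default
print(keyword_outcome(long_value, 8191))        # logsdb default
```

In this model the same 40,000-character value causes a rejection under the standard default, but is only skipped under the `logsdb` default.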

Comment on lines 84 to 87
* The [dynamic mapping](docs-content://manage-data/data-store/mapping/dynamic-mapping.md) for string fields
defaults to a `text` field with a [sub](/reference/elasticsearch/mapping-reference/multi-fields.md)-`keyword`
field with an `ignore_above` of `256`. String fields longer than 256 characters are available for full text
search but won't have a value in their `.keyword` sub-field they can not do exact matching over _search.

Member:

This part I struggle to understand. But it feels separate from the defaults above? Maybe this can be in a new paragraph. I think you're saying that...

When ES finds a new string field without an explicit mapping, it automatically:

  1. Maps the field to a text field so the entire value is searchable with full-text search.
  2. Adds a sub keyword field with ignore_above set to 256 bytes. This means that values less than 256 bytes are available for exact matching over _search. Values longer than that are still searchable via the text field, but are not indexed as keywords.

Contributor:

I agree I am a bit confused by the very last sentence in this paragraph.
@bmorelli25 I like your suggested rewrite, but I believe it should be "256 characters" not bytes
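
For reference, the dynamic mapping described in this thread (a `text` field with a `keyword` sub-field and `ignore_above: 256`) can be written as a Python dict that mirrors the JSON mapping. The `searchable_via` helper is hypothetical and only models the description above, using `len()` as the character count.

```python
import json

# Shape of the mapping that dynamic mapping generates for a new string field.
dynamic_string_mapping = {
    "type": "text",
    "fields": {
        "keyword": {"type": "keyword", "ignore_above": 256},
    },
}


def searchable_via(value: str, mapping: dict) -> list[str]:
    """Hypothetical helper: list which (sub-)fields would index value."""
    fields = ["text"]  # the text field indexes the value regardless of length
    limit = mapping["fields"]["keyword"]["ignore_above"]
    if len(value) <= limit:
        fields.append("keyword")
    return fields


print(json.dumps(dynamic_string_mapping, indent=2))
print(searchable_via("a" * 256, dynamic_string_mapping))  # ['text', 'keyword']
print(searchable_via("a" * 257, dynamic_string_mapping))  # ['text']
```

A value of exactly 256 characters is still indexed in the `keyword` sub-field; only longer values are skipped there.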

@limotova (Contributor) left a comment

Looks good to me overall! Left a couple of comments.

I remember that when I was adjusting the docs for `ignore_above` I had to make some changes in docs-content as well. Are there going to be similar PRs for these changes?



@nik9000 (Member, Author) commented Jun 23, 2025

Thanks folks! I updated the wording and used a table. I like the table!

@limotova (Contributor) left a comment

Left a couple of nitpicks for sentences I had to reread a couple of times to understand, but overall LGTM!

@leemthompo (Contributor):

@nik9000 just happened on this PR again randomly, not sure if it fell off radar :)

@nik9000 enabled auto-merge (squash) July 17, 2025 19:25

@nik9000 merged commit 6ed50e1 into elastic:main Jul 17, 2025
10 checks passed
Labels
>docs General docs changes :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Docs Meta label for docs team Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.2.0
Development

Successfully merging this pull request may close these issues.

Clarify docs around keyword ignore_above setting.
5 participants