Explain ignore_above better #129284
Conversation
This concept is complicated. Closes elastic#128991
Pinging @elastic/es-docs (Team:Docs)
Pinging @elastic/es-search-foundations (Team:Search Foundations)
If you need to never reject documents, this should have some value `<=8191`. All documents with more characters will just skip building the index for this field.

The defaults are complicated. It's `2147483647` (effectively unbounded) in standard indices and
Consider using bullets for defaults/dynamic mapping info for readability
Co-authored-by: Liam Thompson <[email protected]>
This looked like fun so I couldn't resist. Sorry if I'm wrong! Also, hi Nik 👋 .
: Do not index any field containing a string with more characters than this value. This is important because {{es}} will reject entire documents if they contain keyword fields that exceed `32766` bytes when UTF-8 encoded.

To avoid any risk of document rejection, set this value to `8191` or less. Fields with strings exceeding this length will be excluded from indexing.
Does this work on text fields? Or only keyword fields?
Also further down you say:
> `logsdb` indices: `8191`. `keyword` fields longer than `8191` characters won't be indexed, but the documents are accepted and the values unindexed values are available from `_source.
Does the previous statement only apply to logsdb indices? Or to standard indices as well? If both, that feels important.
What about this:
> Skip indexing of a keyword value whose UTF-8–encoded size is larger than `ignore_above`. The value is still kept in `_source`, but the field won't be searchable or aggregatable.
>
> If you do not set `ignore_above`, {es} will reject entire documents if they contain one or more `keyword` fields exceeding a UTF-8–encoded size of `32766`.
>
> To avoid any risk of document rejection, set this value to `8191` or less.
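To make the behavior described above concrete, here is a minimal console sketch; the index name, field name, and the tiny `ignore_above` limit of `20` are made up for illustration:

```console
PUT ignore-above-sketch
{
  "mappings": {
    "properties": {
      "message": {
        "type": "keyword",
        "ignore_above": 20
      }
    }
  }
}

PUT ignore-above-sketch/_doc/1?refresh
{
  "message": "this value is well over twenty characters long"
}

# The over-long value was skipped at index time, so an exact-match query finds nothing...
GET ignore-above-sketch/_search
{
  "query": {
    "term": {
      "message": "this value is well over twenty characters long"
    }
  }
}

# ...but the original value is still returned from _source.
GET ignore-above-sketch/_doc/1
```

The skipped field name should also show up in the document's `_ignored` metadata field, which helps when debugging why an exact match returns nothing.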
> Does this work on text fields? Or only keyword fields?

This setting is only available on keyword fields. But on text fields some tokenizers can have a `max_token_length` setting which doesn't ignore but instead splits tokens that exceed this length (so quite a bit different).
> What about this:

I think it might be a bit clearer to specify characters/bytes, like "UTF-8–encoded size of `32766` bytes" and "set this value to `8191` characters or less."
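To illustrate the `max_token_length` contrast mentioned above, here is a rough console sketch of a `standard` tokenizer configured with a small limit; the index, tokenizer, and analyzer names are invented:

```console
PUT tokenizer-sketch
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "short_tokens": {
          "type": "standard",
          "max_token_length": 5
        }
      },
      "analyzer": {
        "short_token_analyzer": {
          "tokenizer": "short_tokens"
        }
      }
    }
  }
}

# Over-long tokens are split into 5-character pieces rather than being dropped:
POST tokenizer-sketch/_analyze
{
  "analyzer": "short_token_analyzer",
  "text": "elasticsearch tokenization"
}
```

So with `max_token_length` the data stays searchable in chopped-up form, whereas `ignore_above` on a `keyword` field simply leaves the value out of the index.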
The defaults are complicated:
* Standard indices: `2147483647` (effectively unbounded). Documents containing `keyword` fields longer than `32766` bytes will be rejected.
* `logsdb` indices: `8191`. `keyword` fields longer than `8191` characters won't be indexed, but the documents are accepted and the values unindexed values are available from `_source.
What about a table for this information?
The defaults are complicated:

| Index type | Default | Effect |
| --- | --- | --- |
| Standard indices | `2147483647` (effectively unbounded) | Documents will be rejected if any keyword exceeds `32766` bytes. |
| `logsdb` indices | `8191` | Documents are never rejected. Keywords exceeding this limit are still kept in `_source`, but won't be searchable or aggregatable. |
Ahh I see this is in definition list already, so maybe a table won't work. But if you like my wording you can update accordingly.
Me like that wording :)
I think "Documents are never rejected" might be a bit too strongly worded; maybe something like:

> Documents won't be rejected if a keyword field exceeds this limit and the field will still be kept in `_source`, but it won't be searchable or aggregatable.
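A rough sketch of the two default situations being discussed; the explicit `8191` on the standard index and the `index.mode: logsdb` setting are assumptions for illustration, not required configuration:

```console
# Standard index: ignore_above is effectively unbounded by default,
# so setting it explicitly avoids rejections from over-long keywords.
PUT standard-index-sketch
{
  "mappings": {
    "properties": {
      "message": {
        "type": "keyword",
        "ignore_above": 8191
      }
    }
  }
}

# LogsDB index: keyword fields pick up the 8191-character default,
# so over-long values are skipped instead of causing rejections.
PUT logsdb-index-sketch
{
  "settings": {
    "index": {
      "mode": "logsdb"
    }
  }
}
```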
* The [dynamic mapping](docs-content://manage-data/data-store/mapping/dynamic-mapping.md) for string fields defaults to a `text` field with a [sub](/reference/elasticsearch/mapping-reference/multi-fields.md)-`keyword` field with an `ignore_above` of `256`. String fields longer than 256 characters are available for full text search but won't have a value in their `.keyword` sub-field they can not do exact matching over _search.
This part I struggle to understand. But it feels separate from the defaults above? Maybe this can be in a new paragraph. I think you're saying that...
> When ES finds a new string field without an explicit mapping, it automatically:
>
> - Maps the field to a text field so the entire value is searchable with full-text search.
> - Adds a sub keyword field with `ignore_above` set to `256` bytes. This means that values less than 256 bytes are available for exact matching over `_search`. Values longer than that are still searchable via the `text` field, but are not indexed as keywords.
I agree I am a bit confused by the very last sentence in this paragraph.
@bmorelli25 I like your suggested rewrite, but I believe it should be "`256` characters" not bytes
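For reference, the dynamic-mapping behavior discussed here is easy to see directly; a minimal sketch with invented index and field names:

```console
# Index a document with a previously unmapped string field...
PUT dynamic-mapping-sketch/_doc/1
{
  "note": "a short string"
}

# ...then look at the mapping Elasticsearch generated for it.
GET dynamic-mapping-sketch/_mapping
```

The generated mapping should show `note` as a `text` field with a `keyword` sub-field (`note.keyword`) carrying `"ignore_above": 256`.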
Looks good to me overall! Left a couple of comments
I remember when I was adjusting the docs for `ignore_above` I had to make some changes in docs-content as well, are there going to be similar PRs for these changes?
🔍 Preview links for changed docs: 🔔 The preview site may take up to 3 minutes to finish building. These links will become live once it completes.
Thanks folks! I updated the wording and used a table. I like the table!
Left a couple of nitpicks for sentences I had to reread a couple of times to understand, but overall LGTM!
@nik9000 just happened on this PR again randomly, not sure if it fell off radar :)
Co-authored-by: Larisa Motova <[email protected]>
🔍 Preview links for changed docs