ESQL: Split large pages on load sometimes #131053

Conversation
This adds support for splitting `Page`s of large values when loading from single-segment, non-descending hits. This is the hottest code path because it's how we load data for aggregation. So! We had to make very, very, very sure this doesn't slow down the fast path of loading doc values.

Caveat - this only defends against loading large values via the row-by-row load mechanism that we use for stored fields and `_source`. That covers the most common kinds of large values - mostly `text` and geo fields. If we need to split further on doc values, we'll have to invent something for them specifically. For now, just row-by-row.

This works by flipping the order in which we load row-by-row and column-at-a-time values. Previously we loaded all column-at-a-time values first because that was simpler, and then we loaded all of the row-by-row values. Now we save the column-at-a-time values for later and instead load row-by-row until the `Page`'s estimated size is larger than a "jumbo" size, which defaults to a megabyte.

Once we load enough rows that we estimate the page is "jumbo", we stop loading rows. The Page will look like this:

```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        | <-- after loading this row
|      |     |      |      |        |     we crossed to "jumbo" size
|      |     |      |      |        |
|      |     |      |      |        |
|      |     |      |      |        | <-- these rows are entirely empty
|      |     |      |      |        |
|      |     |      |      |        |
```

Then we chop the page to the last loaded row:

```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
```

Then fill in the column-at-a-time columns:

```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX | 1   | XXXX | 11   | 1.0    |
| XXXX | 2   | XXXX | 22   | -2.0   |
| XXXX | 3   | XXXX | 33   | 1e9    |
| XXXX | 4   | XXXX | 44   | 913    |
| XXXX | 5   | XXXX | 55   | 0.1234 |
| XXXX | 6   | XXXX | 66   | 3.1415 |
```

And then we return *that* `Page`. On the next `Driver` iteration we start from where we left off.

Solves the most common case of #129192.
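In code, the control flow amounts to something like the sketch below. This is a minimal, hypothetical illustration - `JumboSplitSketch`, `RowLoader`, and `loadUntilJumbo` are made-up names, not the actual `ValuesFromSingleReader` implementation, which also has to build the `Block`s and run the column-at-a-time readers over the rows it kept.

```java
// A minimal sketch, assuming a hypothetical RowLoader in place of the real
// row-by-row readers. It only shows the control flow described above.
final class JumboSplitSketch {
    interface RowLoader {
        /** Loads one document's row-by-row values; returns the estimated bytes added. */
        long loadRow(int doc);
    }

    /**
     * Loads rows starting at {@code offset} until the estimated size crosses
     * {@code jumboBytes}. Returns the index of the first row NOT loaded, which
     * is where the next Driver iteration resumes.
     */
    static int loadUntilJumbo(RowLoader loader, int[] docs, int offset, long jumboBytes) {
        long estimatedBytes = 0;
        int row = offset;
        // The check runs after each row is loaded, so the page can overshoot the
        // threshold by one row's worth of data - it is a "time to finish up"
        // signal, not a hard maximum.
        while (row < docs.length && estimatedBytes < jumboBytes) {
            estimatedBytes += loader.loadRow(docs[row]);
            row++;
        }
        return row;
    }
}
```

The column-at-a-time readers then only run over the kept range `[offset, row)`, which is what the tables above show.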
Hi @nik9000, I've created a changelog YAML for you.
Open questions:
I think 1MB is quite small and may cause frequent chunking, especially with large mappings, even if the system has plenty of memory. Could we default this to
@@ -89,19 +81,20 @@ public int get(int i) {
        }
    }

-   private void loadFromSingleLeaf(Block[] target, BlockLoader.Docs docs) throws IOException {
+   private void loadFromSingleLeaf(long jumboBytes, Block[] target, ValuesReaderDocs docs, int offset) throws IOException {
nit: Instead of "jumboBytes", should this be a more explanatory name, like "maxBytesPerPage" (Or similar)? The name is too generic IMO, and someone may need to blame the line to find its meaning 👀
Same for other usages around the PR 😅
It's not really a max, it's more a "time to finish up!"
Originally I was calling it `dangerZone` - but it's not really dangerous.
I was thinking of something like a "page soft limit". The problem with "jumbo" and "danger" is that they don't convey the intent in any way. What it's for, only we know.
@@ -149,7 +149,7 @@ public String toString() {
     */
    class ConstantNullsReader implements AllReader {
        @Override
-       public Block read(BlockFactory factory, Docs docs) throws IOException {
+       public Block read(BlockFactory factory, Docs docs, int offset) throws IOException {
            return factory.constantNulls();
Are those `constantX()` methods ignoring the offset? Is that right?
They are ignoring it and it's right. Constants always have the same value.
The block has to have a specific position count, right? And given the other methods, the count is `docs.getCount() - offset`. As the offset is being ignored here, the total position count could be wrong (?).
I'm not sure what that factory does exactly, btw - maybe it already takes the offset into account?
Nevermind, I see it was changed already
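For illustration, the concern in this thread boils down to the sketch below. The `ConstantFactory` interface is hypothetical - it is not the real `BlockFactory` API - and it only shows why the offset changes the position count but never the values.

```java
// Hypothetical sketch, not the real BlockFactory API. A constant block holds the
// same value at every position, so the offset never changes what gets built,
// only how many positions the block needs.
interface ConstantFactory {
    /** Builds a block with {@code positions} null values. */
    Object constantNulls(int positions);
}

final class ConstantNullsSketch {
    static Object read(ConstantFactory factory, int docCount, int offset) {
        // One position per remaining document; the values themselves are offset-agnostic.
        return factory.constantNulls(docCount - offset);
    }
}
```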
Another thing to ask - do we like using the
Are we ok using the
I've left some comments. This looks good - thanks, Nik!
 * that was also slower.
 * </p>
 */
class ValuesReaderDocs implements BlockLoader.Docs {
Can we integrate the offset into this class? I feel that passing the offset to the BlockLoader requires more care.
I can make an `int offset()` method that we can use to init the loops.
I tried to put the offset in as part of the load and it cost a cycle on each load, which doesn't seem worth it. This is one of the hottest bits of ESQL.
@dnhatn do you think we should move the `offset` parameter into the docs thing? If I make it a member of the docs thing we don't have to pass it down. May be easier to explain it too. But we still need to init the loop at `offset` to get the performance.
    );

    public static final Setting<ByteSizeValue> VALUES_LOADING_JUMBO_SIZE = Setting.byteSizeSetting(
        "esql.values_loading_jumbo_size",
Can we occasionally use a small value for this setting with a large page_size in our tests to enable chunking?
Are you thinking I could randomize it on startup for, like, the single node tests? Or something else?
Yes - single-node tests and maybe our IT tests.
> Yes - single-node tests and maybe our IT tests.

I've modified the single node spec tests to randomly chunk at 1kb. It caught some bugs.
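Something along these lines, for illustration; the helper below is hypothetical, and only the setting name `esql.values_loading_jumbo_size` comes from the PR, not the test scaffolding.

```java
import java.util.Random;

// Hypothetical helper - not the actual test code from the PR. The idea is that a
// minority of runs use a tiny threshold so page splitting actually gets exercised.
final class JumboSizeRandomizer {
    private static final Random RANDOM = new Random();

    static String randomJumboSize() {
        // Roughly one run in five forces chunking with a 1kb threshold.
        return RANDOM.nextInt(5) == 0 ? "1kb" : "1mb";
    }
}

// e.g. settings.put("esql.values_loading_jumbo_size", JumboSizeRandomizer.randomJumboSize());
```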
I think it should be fine if we also account for this overestimate in the default jumbo size setting.
Pinging @elastic/es-analytical-engine (Team:Analytics)
LGTM, thanks for fixing this!
@@ -286,7 +286,7 @@ public int count() {
        public int get(int i) {
            return 0;
        }
-   });
+   }, randomInt());
nit: I think the offset should be 0.
Well, that test is failing in CI, so probably!
    // Doubles from doc values ensures that the values are in order
    try (BlockLoader.FloatBuilder builder = factory.denseVectors(docs.count(), dimensions)) {
-       for (int i = 0; i < docs.count(); i++) {
+       for (int i = offset; i < docs.count(); i++) {
nit: can we also minus the offset in the previous line?
yeah. Let me figure out why that didn't get caught by the tests.
There isn't a BlockLoaderTestCase for dense vectors. I think that's a "for later" problem.
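The suggested fix would look roughly like this - a sketch against the hunk above, not an actual follow-up commit - sizing the builder for only the rows the loop actually appends:

```
-   try (BlockLoader.FloatBuilder builder = factory.denseVectors(docs.count(), dimensions)) {
+   try (BlockLoader.FloatBuilder builder = factory.denseVectors(docs.count() - offset, dimensions)) {
        for (int i = offset; i < docs.count(); i++) {
```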
💔 Backport failed
You can use sqren/backport to manually backport by running
Backport tool, you tried valiantly. Let me see what I can do.
9.1: #131532
Abandoning the backport. All of the time series stuff breaks in 8.19. |
Sorry, we intentionally did not backport the time-series work to 9.0 and 8.19.
It's all good. I think it'd be pretty complex to get the 8.19 time series stuff working. It looks like I'd have to turn off page splitting, right? |
Since time-series doesn't work in 8.19, I can open a PR to remove it entirely if that would help with your backport. |