
ESQL: Split large pages on load sometimes #131053


Merged: 27 commits into elastic:main on Jul 18, 2025

Conversation

@nik9000 (Member) commented Jul 10, 2025

This adds support for splitting `Page`s of large values when loading from single-segment, non-descending hits. This is the hottest code path, as it's how we load data for aggregation. So! We had to make very, very, very sure this doesn't slow down the fast path of loading doc values.

Caveat - this only defends against loading large values via the row-by-row load mechanism that we use for stored fields and `_source`. That covers the most common kinds of large values - mostly `text` and geo fields. If we need to split further on doc values, we'll have to invent something for them specifically. For now, just row-by-row.

This works by flipping the order in which we load row-by-row and column-at-a-time values. Previously we loaded all column-at-a-time values first because that was simpler. Then we loaded all of the row-by-row values. Now we save the column-at-a-time values for later and instead load row-by-row until the `Page`'s estimated size is larger than a "jumbo" size, which defaults to a megabyte.

Once we load enough rows that we estimate the page is "jumbo", we then stop loading rows. The Page will look like this:

```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        | <-- after loading this row
|      |     |      |      |        |     we crossed to "jumbo" size
|      |     |      |      |        |
|      |     |      |      |        |
|      |     |      |      |        | <-- these rows are entirely empty
|      |     |      |      |        |
|      |     |      |      |        |
```

Then we chop the page to the last row:

```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
```

Then fill in the column-at-a-time columns:

```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |   1 | XXXX |   11 |    1.0 |
| XXXX |   2 | XXXX |   22 |   -2.0 |
| XXXX |   3 | XXXX |   33 |    1e9 |
| XXXX |   4 | XXXX |   44 |    913 |
| XXXX |   5 | XXXX |   55 | 0.1234 |
| XXXX |   6 | XXXX |   66 | 3.1415 |
```

And then we return *that* `Page`. On the next `Driver` iteration we start from where we left off.

Solves the most common case of #129192.
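
A minimal sketch of that control flow in plain Java. Everything here is hypothetical - `RowLoader`, `ColumnLoader`, and `loadPage` are illustrative stand-ins, not the actual `loadFromSingleLeaf` code - but it shows the order of operations: row-by-row until "jumbo", chop, fill column-at-a-time, then report where to resume.

```java
import java.util.List;

/** Sketch only: hypothetical types standing in for ESQL's real readers and Pages. */
final class JumboSplitSketch {
    /** Row-by-row loader (stored fields, _source): returns the estimated bytes added by row {@code doc}. */
    interface RowLoader {
        long loadRow(int doc);
    }

    /** Column-at-a-time loader (doc values): loads docs[offset .. offset + count) in one shot. */
    interface ColumnLoader {
        void loadColumn(int[] docs, int offset, int count);
    }

    /**
     * Loads rows starting at {@code offset}, stopping early once the estimated size crosses
     * {@code jumboBytes}. Returns the index one past the last loaded row so the next iteration
     * can resume from there.
     */
    static int loadPage(int[] docs, int offset, long jumboBytes,
                        List<RowLoader> rowLoaders, List<ColumnLoader> columnLoaders) {
        long estimatedBytes = 0;
        int doc = offset;
        // Row-by-row fields go first so we can stop as soon as the page gets big.
        while (doc < docs.length) {
            for (RowLoader loader : rowLoaders) {
                estimatedBytes += loader.loadRow(docs[doc]);
            }
            doc++;
            if (estimatedBytes >= jumboBytes) {
                break; // this row pushed us over the "jumbo" threshold: time to finish up
            }
        }
        // "Chop" to the rows actually loaded, then fill in the cheap column-at-a-time fields.
        int count = doc - offset;
        for (ColumnLoader loader : columnLoaders) {
            loader.loadColumn(docs, offset, count);
        }
        return doc;
    }
}
```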

@elasticsearchmachine (Collaborator):

Hi @nik9000, I've created a changelog YAML for you.

nik9000 requested a review from dnhatn on July 10, 2025 20:03
@nik9000 (Member, Author) commented Jul 10, 2025

Open questions:

  1. Is 1mb a good default for the jumbo size? We shoot for 256kb blocks. At 1mb we've already overshot that 4x, while still leaving a ton of space to grow.
  2. This can slice into single-row blocks, which really don't perform well. Are we ok with that? I think slow is better than broken, though.

@dnhatn (Member) commented Jul 11, 2025

> defaults to a megabyte.

I think 1MB is quite small and may cause frequent chunking, especially with large mappings, even if the system has plenty of memory. Could we default this to max(1/512 (or 1/1024) of the heap, 1MB) instead?
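
To make that suggestion concrete, here is a sketch using only the JDK (a hypothetical helper; the real default would be wired through Elasticsearch's settings infrastructure, and the 1/512 divisor is just the value floated above):

```java
/** Sketch of the suggested default: max(heap / 512, 1 MB). */
final class JumboDefaultSketch {
    private static final long ONE_MB = 1024L * 1024L;

    static long defaultJumboBytes() {
        long maxHeapBytes = Runtime.getRuntime().maxMemory(); // roughly -Xmx
        // Scale with the heap on big nodes, but never default below 1 MB.
        return Math.max(maxHeapBytes / 512, ONE_MB);
    }
}
```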

```diff
@@ -89,19 +81,20 @@ public int get(int i) {
         }
     }

-    private void loadFromSingleLeaf(Block[] target, BlockLoader.Docs docs) throws IOException {
+    private void loadFromSingleLeaf(long jumboBytes, Block[] target, ValuesReaderDocs docs, int offset) throws IOException {
```
Contributor:

nit: Instead of "jumboBytes", should this be a more explanatory name, like "maxBytesPerPage" (or similar)? The name is too generic IMO, and someone may need to blame the line to find its meaning 👀

Same for other usages around the PR 😅

nik9000 (Member, Author):

It's not really a max, it's more a "time to finish up!"

nik9000 (Member, Author):

Originally I was calling it dangerZone - but it's not really dangerous.

Contributor:

I was thinking of a "page soft limit". The problem with "jumbo" and "danger" is that they don't show "intent" in any way. Only we know what it's for.

```diff
@@ -149,7 +149,7 @@ public String toString() {
      */
     class ConstantNullsReader implements AllReader {
         @Override
-        public Block read(BlockFactory factory, Docs docs) throws IOException {
+        public Block read(BlockFactory factory, Docs docs, int offset) throws IOException {
             return factory.constantNulls();
```
Contributor:

Are those constantX() methods ignoring the offset? Is that right?

nik9000 (Member, Author):

They are ignoring it and it's right. Constants always have the same value.

Contributor:

The block has to have a specific position count, right? And given the other methods, the count is docs.getCount() - offset. Since the offset is being ignored here, the total position count could be wrong(?).

I'm not sure what that factory does exactly, btw - maybe it already takes the offset into account?

Contributor:

Nevermind, I see it was changed already

@nik9000 (Member, Author) commented Jul 11, 2025

Another thing to ask - do we like using the estimatedBytes for this? It's an overestimate, sometimes a big one. Maybe if we make the limit bigger it's just fine? Or maybe we want to track it better?

@nik9000 (Member, Author) commented Jul 11, 2025

Are we ok using the estimatedSize on the builders? It's going to be an overestimate.

@dnhatn (Member) left a comment:

I've left some comments. This looks good - thanks, Nik!

```java
 * that was also slower.
 * </p>
 */
class ValuesReaderDocs implements BlockLoader.Docs {
```
Member:

Can we integrate the offset into this class? I feel that passing the offset to the BlockLoader requires more care.

nik9000 (Member, Author):

I can make an int offset() method that we can use to init the loops.

I tried to put the offset in as part of the load and it cost a cycle on each load, which doesn't seem worth it. This is one of the hottest bits of ESQL.

nik9000 (Member, Author):

@dnhatn do you think we should move the offset parameter into the docs thing? If I make it a member of the docs thing we don't have to pass it down. It may be easier to explain too. But we still need to init the loop at offset to get the performance.
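
A rough sketch of that idea, assuming the offset becomes a field on the docs object with an accessor used to initialize the hot loops (hypothetical names; the real ValuesReaderDocs has more going on):

```java
/** Hypothetical sketch: a docs holder that carries the resume offset itself. */
final class OffsetDocsSketch {
    private final int[] docIds; // sorted, non-descending doc ids for one segment
    private int offset;         // first position the next load should emit

    OffsetDocsSketch(int[] docIds) {
        this.docIds = docIds;
    }

    int count() {
        return docIds.length;
    }

    int get(int i) {
        return docIds[i];
    }

    /** Loops still start here: for (int i = docs.offset(); i < docs.count(); i++) { ... } */
    int offset() {
        return offset;
    }

    void markLoadedThrough(int nextOffset) {
        this.offset = nextOffset;
    }
}
```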

```java
    );

    public static final Setting<ByteSizeValue> VALUES_LOADING_JUMBO_SIZE = Setting.byteSizeSetting(
        "esql.values_loading_jumbo_size",
```
Member:

Can we occasionally use a small value for this setting with a large page_size in our tests to enable chunking?

nik9000 (Member, Author):

Are you thinking I could randomize it on startup for, like, the single node tests? Or something else?

Member:

Yes - single-node tests and maybe our IT tests.

nik9000 (Member, Author):

> Yes - single-node tests and maybe our IT tests.

I've modified the single node spec tests to randomly chunk at 1kb. It caught some bugs.
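
For illustration, the randomization could look like the hypothetical helper below (not the actual EsqlSpecIT change): most runs keep the 1mb default, while some force a 1kb threshold so the splitting path gets exercised.

```java
import java.util.Random;

/** Hypothetical test helper: occasionally pick a tiny jumbo size to exercise page splitting. */
final class JumboSizeRandomizer {
    static long randomJumboBytes(Random random) {
        // Roughly one run in five chunks aggressively at 1kb; the rest use the 1mb default.
        return random.nextInt(5) == 0 ? 1024L : 1024L * 1024L;
    }
}
```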

dnhatn self-requested a review on July 11, 2025 17:38
@dnhatn (Member) commented Jul 11, 2025

> Are we ok using the estimatedSize on the builders? It's going to be an overestimate.

I think it should be fine if we also account for this overestimate in the default jumbo size setting.

nik9000 marked this pull request as ready for review on July 15, 2025 16:22
elasticsearchmachine added the Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)) label on Jul 15, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000 requested a review from ivancea on July 15, 2025 17:42
@nik9000 (Member, Author) commented Jul 15, 2025

@dnhatn and @ivancea, could you have another look? Have a think about the change I made to EsqlSpecIT, the other tests that I changed, and the passing of the offset parameter.

@dnhatn (Member) left a comment:

LGTM, thanks for fixing this!

```diff
@@ -286,7 +286,7 @@ public int count() {
            public int get(int i) {
                return 0;
            }
-        });
+        }, randomInt());
```
Member:

nit: I think the offset should be 0.

nik9000 (Member, Author):

Well, that test is failing in CI, so probably!

```diff
        // Doubles from doc values ensures that the values are in order
        try (BlockLoader.FloatBuilder builder = factory.denseVectors(docs.count(), dimensions)) {
-           for (int i = 0; i < docs.count(); i++) {
+           for (int i = offset; i < docs.count(); i++) {
```
Member:

nit: can we also subtract the offset in the previous line?

nik9000 (Member, Author):

yeah. Let me figure out why that didn't get caught by the tests.

nik9000 (Member, Author):

There isn't a BlockLoaderTestCase for dense vectors. I think that's a "for later" problem.
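
For reference, the adjustment discussed in this thread would look roughly like the fragment below. It is based only on the lines quoted above - the surrounding reader is elided, so `builder`, `factory`, `docs`, `offset`, and `dimensions` are context from that snippet rather than new API:

```java
// Size the builder by the rows actually being loaded, not the full doc count...
try (BlockLoader.FloatBuilder builder = factory.denseVectors(docs.count() - offset, dimensions)) {
    // ...and start the loop at the offset so it matches the builder's size.
    for (int i = offset; i < docs.count(); i++) {
        // read the dense vector for docs.get(i) into builder (elided)
    }
}
```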

nik9000 added the auto-backport (Automatically create backport pull requests when merged) label on Jul 16, 2025
nik9000 merged commit 439b8e6 into elastic:main on Jul 18, 2025
34 checks passed
@elasticsearchmachine (Collaborator):

💔 Backport failed

| Branch | Result |
|--------|--------|
| 9.1    | Commit could not be cherrypicked due to conflicts |
| 8.19   | Commit could not be cherrypicked due to conflicts |

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 131053`.

@nik9000 (Member, Author) commented Jul 18, 2025

Backport tool, you tried valiantly. Let me see what I can do.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jul 18, 2025
@nik9000 (Member, Author) commented Jul 18, 2025

9.1: #131532

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jul 18, 2025
@nik9000 (Member, Author) commented Jul 18, 2025

Abandoning the backport. All of the time series stuff breaks in 8.19.

@dnhatn (Member) commented Jul 18, 2025

Sorry, we intentionally did not backport the time-series work to 9.0 and 8.19.

@nik9000 (Member, Author) commented Jul 18, 2025

It's all good. I think it'd be pretty complex to get the 8.19 time series stuff working. It looks like I'd have to turn off page splitting, right?

@dnhatn (Member) commented Jul 18, 2025

> It looks like I'd have to turn off page splitting, right?

Since time-series doesn't work in 8.19, I can open a PR to remove it entirely if that would help with your backport.

Labels
:Analytics/ES|QL (AKA ESQL), auto-backport (Automatically create backport pull requests when merged), >bug, Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), v9.2.0