
ESQL: Split large pages on load sometimes #131053


Merged: 27 commits into elastic:main on Jul 18, 2025

Conversation

@nik9000 (Member) commented Jul 10, 2025

This adds support for splitting `Page`s of large values when loading from single-segment, non-descending hits. This is the hottest code path, as it's how we load data for aggregation. So! We had to make very, very, very sure this doesn't slow down the fast path of loading doc values.

Caveat - this only defends against loading large values via the row-by-row load mechanism that we use for stored fields and `_source`. That covers the most common kinds of large values - mostly `text` and geo fields. If we need to split further on doc values, we'll have to invent something for them specifically. For now, just row-by-row.

This works by flipping the order in which we load row-by-row and column-at-a-time values. Previously we loaded all column-at-a-time values first because that was simpler. Then we loaded all of the row-by-row values. Now we save the column-at-a-time values for later and instead load row-by-row until the `Page`'s estimated size is larger than a "jumbo" size, which defaults to a megabyte.

Once we load enough rows that we estimate the page is "jumbo", we then stop loading rows. The Page will look like this:

```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        | <-- after loading this row
|      |     |      |      |        |     we crossed to "jumbo" size
|      |     |      |      |        |
|      |     |      |      |        |
|      |     |      |      |        | <-- these rows are entirely empty
|      |     |      |      |        |
|      |     |      |      |        |
```

Then we chop the page to the last row:

```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
```

Then fill in the column-at-a-time columns:

```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |   1 | XXXX |   11 |    1.0 |
| XXXX |   2 | XXXX |   22 |   -2.0 |
| XXXX |   3 | XXXX |   33 |    1e9 |
| XXXX |   4 | XXXX |   44 |    913 |
| XXXX |   5 | XXXX |   55 | 0.1234 |
| XXXX |   6 | XXXX |   66 | 3.1415 |
```

And then we return *that* `Page`. On the next `Driver` iteration we start from where we left off.

Solves the most common case of #129192.
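
A minimal sketch of that control flow in plain Java. Everything here is hypothetical - `RowLoader`, `ColumnLoader`, and `loadPage` are illustrative stand-ins, not the actual `loadFromSingleLeaf` code - but it shows the order of operations: row-by-row until "jumbo", chop, fill column-at-a-time, then report where to resume.

```java
import java.util.List;

/** Sketch only: hypothetical types standing in for ESQL's real readers and Pages. */
final class JumboSplitSketch {
    /** Row-by-row loader (stored fields, _source): returns the estimated bytes added by row {@code doc}. */
    interface RowLoader {
        long loadRow(int doc);
    }

    /** Column-at-a-time loader (doc values): loads docs[offset .. offset + count) in one shot. */
    interface ColumnLoader {
        void loadColumn(int[] docs, int offset, int count);
    }

    /**
     * Loads rows starting at {@code offset}, stopping early once the estimated size crosses
     * {@code jumboBytes}. Returns the index one past the last loaded row so the next iteration
     * can resume from there.
     */
    static int loadPage(int[] docs, int offset, long jumboBytes,
                        List<RowLoader> rowLoaders, List<ColumnLoader> columnLoaders) {
        long estimatedBytes = 0;
        int doc = offset;
        // Row-by-row fields go first so we can stop as soon as the page gets big.
        while (doc < docs.length) {
            for (RowLoader loader : rowLoaders) {
                estimatedBytes += loader.loadRow(docs[doc]);
            }
            doc++;
            if (estimatedBytes >= jumboBytes) {
                break; // this row pushed us over the "jumbo" threshold: time to finish up
            }
        }
        // "Chop" to the rows actually loaded, then fill in the cheap column-at-a-time fields.
        int count = doc - offset;
        for (ColumnLoader loader : columnLoaders) {
            loader.loadColumn(docs, offset, count);
        }
        return doc;
    }
}
```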

@elasticsearchmachine (Collaborator):

Hi @nik9000, I've created a changelog YAML for you.

nik9000 requested a review from dnhatn on July 10, 2025 20:03
@nik9000 (Member, Author) commented Jul 10, 2025

Open questions:

  1. Is 1mb a good default for the jumbo size? We shoot for 256kb blocks. At 1mb we've already overshot that 4x, while still leaving a ton of space to grow.
  2. This can slice into single-row blocks, which really don't perform well. Are we ok with that? I think slow is better than broken, though.

@dnhatn (Member) commented Jul 11, 2025

> defaults to a megabyte.

I think 1MB is quite small and may cause frequent chunking, especially with large mappings, even if the system has plenty of memory. Could we default this to max(1/512 (or 1/1024) of the heap, 1MB) instead?
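
To make that suggestion concrete, here is a sketch using only the JDK (a hypothetical helper; the real default would be wired through Elasticsearch's settings infrastructure, and the 1/512 divisor is just the value floated above):

```java
/** Sketch of the suggested default: max(heap / 512, 1 MB). */
final class JumboDefaultSketch {
    private static final long ONE_MB = 1024L * 1024L;

    static long defaultJumboBytes() {
        long maxHeapBytes = Runtime.getRuntime().maxMemory(); // roughly -Xmx
        // Scale with the heap on big nodes, but never default below 1 MB.
        return Math.max(maxHeapBytes / 512, ONE_MB);
    }
}
```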

```diff
@@ -89,19 +81,20 @@ public int get(int i) {
         }
     }

-    private void loadFromSingleLeaf(Block[] target, BlockLoader.Docs docs) throws IOException {
+    private void loadFromSingleLeaf(long jumboBytes, Block[] target, ValuesReaderDocs docs, int offset) throws IOException {
```
Contributor:

nit: Instead of "jumboBytes", should this be a more explanatory name, like "maxBytesPerPage" (or similar)? The name is too generic IMO, and someone may need to blame the line to find its meaning 👀

Same for other usages around the PR 😅

nik9000 (Member, Author):

It's not really a max, it's more a "time to finish up!"

nik9000 (Member, Author):

Originally I was calling it dangerZone - but it's not really dangerous.

Contributor:

I was thinking of a "page soft limit". The problem with "jumbo" and "danger" is that they don't show "intent" in any way. Only we know what it's for.

```diff
@@ -149,7 +149,7 @@ public String toString() {
      */
     class ConstantNullsReader implements AllReader {
         @Override
-        public Block read(BlockFactory factory, Docs docs) throws IOException {
+        public Block read(BlockFactory factory, Docs docs, int offset) throws IOException {
             return factory.constantNulls();
```
Contributor:

Are those constantX() methods ignoring the offset? Is that right?

nik9000 (Member, Author):

They are ignoring it and it's right. Constants always have the same value.

Contributor:

The block has to have a specific position count, right? And given the other methods, the count is docs.getCount() - offset. Since the offset is being ignored here, the total position count could be wrong(?).

I'm not sure what that factory does exactly, btw - maybe it already takes the offset into account?

Contributor:

Nevermind, I see it was changed already

@nik9000 (Member, Author) commented Jul 11, 2025

Another thing to ask - do we like using the estimatedBytes for this? It's an overestimate, sometimes a big one. Maybe if we make the limit bigger it's just fine? Or maybe we want to track it better?

@nik9000 (Member, Author) commented Jul 11, 2025

Are we ok using the estimatedSize on the builders? It's going to be an overestimate.

@dnhatn (Member) left a comment:

I've left some comments. This looks good - thanks, Nik!

```java
 * that was also slower.
 * </p>
 */
class ValuesReaderDocs implements BlockLoader.Docs {
```
Member:

Can we integrate the offset into this class? I feel that passing the offset to the BlockLoader requires more care.

nik9000 (Member, Author):

I can make an int offset() method that we can use to init the loops.

I tried to put the offset in as part of the load and it cost a cycle on each load, which doesn't seem worth it. This is one of the hottest bits of ESQL.

nik9000 (Member, Author):

@dnhatn do you think we should move the offset parameter into the docs thing? If I make it a member of the docs thing we don't have to pass it down. It may be easier to explain too. But we still need to init the loop at offset to get the performance.
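
A rough sketch of that idea, assuming the offset becomes a field on the docs object with an accessor used to initialize the hot loops (hypothetical names; the real ValuesReaderDocs has more going on):

```java
/** Hypothetical sketch: a docs holder that carries the resume offset itself. */
final class OffsetDocsSketch {
    private final int[] docIds; // sorted, non-descending doc ids for one segment
    private int offset;         // first position the next load should emit

    OffsetDocsSketch(int[] docIds) {
        this.docIds = docIds;
    }

    int count() {
        return docIds.length;
    }

    int get(int i) {
        return docIds[i];
    }

    /** Loops still start here: for (int i = docs.offset(); i < docs.count(); i++) { ... } */
    int offset() {
        return offset;
    }

    void markLoadedThrough(int nextOffset) {
        this.offset = nextOffset;
    }
}
```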

```java
    );

    public static final Setting<ByteSizeValue> VALUES_LOADING_JUMBO_SIZE = Setting.byteSizeSetting(
        "esql.values_loading_jumbo_size",
```
Member:

Can we occasionally use a small value for this setting with a large page_size in our tests to enable chunking?

nik9000 (Member, Author):

Are you thinking I could randomize it on startup for, like, the single node tests? Or something else?

Member:

Yes - single-node tests and maybe our IT tests.

nik9000 (Member, Author):

> Yes - single-node tests and maybe our IT tests.

I've modified the single node spec tests to randomly chunk at 1kb. It caught some bugs.
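
For illustration, the randomization could look like the hypothetical helper below (not the actual EsqlSpecIT change): most runs keep the 1mb default, while some force a 1kb threshold so the splitting path gets exercised.

```java
import java.util.Random;

/** Hypothetical test helper: occasionally pick a tiny jumbo size to exercise page splitting. */
final class JumboSizeRandomizer {
    static long randomJumboBytes(Random random) {
        // Roughly one run in five chunks aggressively at 1kb; the rest use the 1mb default.
        return random.nextInt(5) == 0 ? 1024L : 1024L * 1024L;
    }
}
```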

dnhatn self-requested a review on July 11, 2025 17:38
@dnhatn (Member) commented Jul 11, 2025

> Are we ok using the estimatedSize on the builders? It's going to be an overestimate.

I think it should be fine if we also account for this overestimate in the default jumbo size setting.

nik9000 marked this pull request as ready for review on July 15, 2025 16:22
elasticsearchmachine added the Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)) label on Jul 15, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000 requested a review from ivancea on July 15, 2025 17:42
@nik9000 (Member, Author) commented Jul 15, 2025

@dnhatn and @ivancea, could you have another look? Have a think about the change I made to EsqlSpecIT, the other tests that I changed, and the passing of the offset parameter.

@dnhatn (Member) left a comment:

LGTM, thanks for fixing this!

```diff
@@ -286,7 +286,7 @@ public int count() {
            public int get(int i) {
                return 0;
            }
-        });
+        }, randomInt());
```
Member:

nit: I think the offset should be 0.

nik9000 (Member, Author):

Well, that test is failing in CI, so probably!

```diff
        // Doubles from doc values ensures that the values are in order
        try (BlockLoader.FloatBuilder builder = factory.denseVectors(docs.count(), dimensions)) {
-           for (int i = 0; i < docs.count(); i++) {
+           for (int i = offset; i < docs.count(); i++) {
```
Member:

nit: can we also subtract the offset in the previous line?

nik9000 (Member, Author):

yeah. Let me figure out why that didn't get caught by the tests.

nik9000 (Member, Author):

There isn't a BlockLoaderTestCase for dense vectors. I think that's a "for later" problem.
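
For reference, the adjustment discussed in this thread would look roughly like the fragment below. It is based only on the lines quoted above - the surrounding reader is elided, so `builder`, `factory`, `docs`, `offset`, and `dimensions` are context from that snippet rather than new API:

```java
// Size the builder by the rows actually being loaded, not the full doc count...
try (BlockLoader.FloatBuilder builder = factory.denseVectors(docs.count() - offset, dimensions)) {
    // ...and start the loop at the offset so it matches the builder's size.
    for (int i = offset; i < docs.count(); i++) {
        // read the dense vector for docs.get(i) into builder (elided)
    }
}
```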

nik9000 added the auto-backport (Automatically create backport pull requests when merged) label on Jul 16, 2025
nik9000 merged commit 439b8e6 into elastic:main on Jul 18, 2025
34 checks passed
@elasticsearchmachine (Collaborator):

💔 Backport failed

| Branch | Result |
|--------|--------|
| 9.1    | Commit could not be cherrypicked due to conflicts |
| 8.19   | Commit could not be cherrypicked due to conflicts |

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 131053`.

@nik9000 (Member, Author) commented Jul 18, 2025

Backport tool, you tried valiantly. Let me see what I can do.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jul 18, 2025
@nik9000 (Member, Author) commented Jul 18, 2025

9.1: #131532

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jul 18, 2025
@nik9000 (Member, Author) commented Jul 18, 2025

Abandoning the backport. All of the time series stuff breaks in 8.19.

@dnhatn (Member) commented Jul 18, 2025

Sorry, we intentionally did not backport the time-series work to 9.0 and 8.19.

@nik9000 (Member, Author) commented Jul 18, 2025

It's all good. I think it'd be pretty complex to get the 8.19 time series stuff working. It looks like I'd have to turn off page splitting, right?

@dnhatn (Member) commented Jul 18, 2025

> It looks like I'd have to turn off page splitting, right?

Since time-series doesn't work in 8.19, I can open a PR to remove it entirely if that would help with your backport.

Labels
:Analytics/ES|QL (AKA ESQL), auto-backport (Automatically create backport pull requests when merged), >bug, Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), v9.2.0