Skip to content

Commit

Permalink
Use SearchStats instead of field.isAggregatable in data node planning (
Browse files Browse the repository at this point in the history
…#115744)

Since ES|QL makes use of field-caps and only considers `isAggregatable` during Lucene pushdown, turning off doc-values disables Lucene pushdown. This is incorrect. The physical planning decision for Lucene pushdown is made during local planning on the data node, at which point `SearchStats` are known, and both `isIndexed` and `hasDocValues` are separately knowable. The Lucene pushdown should happen for `isIndexed` and not consider `hasDocValues` at all.

This PR adds hasDocValues to SearchStats and the uses isIndexed and hasDocValue separately during local physical planning on the data nodes. This immediately cleared up one issue for spatial data, which could not push down a lucene query when doc-values was disabled.

Summary of what `isAggregatable` means for different implementations of `MappedFieldType`:
* Default implementation of `isAggregatable` in `MappedFieldType` is `hasDocValues`, and does not consider `isIndexed`
* All classes that extend `AbstractScriptFieldType` (eg. `LongScriptFieldType`) hard coded `isAggregatable` to `true`. This presumably means Lucene is happy to mimic having doc-values
* `TestFieldType`, and classes that extend it, return the value of `fielddata`, so consider the field aggregatable if there is field-data.
* `AggregateDoubleMetricFieldType` and `ConstantFieldType` hard coded to `true`
* `DenseVectorFieldType` hard coded to `false`
* `IdFieldType` return the value of `fieldDataEnabled.getAsBoolean()`

In no case is `isIndexed` used for `isAggregatable`. However, for our Lucene pushdown of filters, `isIndexed` would make a lot more sense. But for pushdown of TopN, `hasDocValues` makes more sense.

Summarising the results of the various options for the various field types, where `?` means configrable:

| Class | isAggregatable | isIndexed | isStored | hasDocValues |
| --- | --- | --- | --- | --- |
| AbstractScriptFieldType                 | true  | false | false | false |
| AggregateDoubleMetricFieldType | true  | true  | false | false |
| DenseVectorFieldType                    | false | ?       | false | !indexed |
| IdFieldType                                      | fieldData | true | true | false |
| TsidExtractingIdField                       | false | true | true | false |
| TextFieldType                                   | fieldData | ? | ? | false |
| ? (the rest)                                        | hasDocValues | ? | ? | ? |

It has also been observed that we cannot push filters to source without checking `hasDocValues` when we use the `SingleValueQuery`. So this leads to three groups of conditions:

| Category | require `indexed` | require `docValues` |
| --- | --- | --- |
| Filters(single-value) | true | true |
| Filters(multi-value) | true | false |
| TopN | true | true |

And for all cases we will also consider `isAggregatable` as a disjunction to cover the script field types, leading to two possible combinations:

* `fa.isAggregatable() || searchStats.isIndexed(fa.name()) && searchStats.hasDocValues(fa.name())`
* `fa.isAggregatable() || searchStats.isIndexed(fa.name())`
  • Loading branch information
craigtaverner authored Nov 8, 2024
1 parent f7ee3fb commit 0f9ac9d
Show file tree
Hide file tree
Showing 26 changed files with 1,049 additions and 565 deletions.
6 changes: 6 additions & 0 deletions docs/changelog/115744.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
pr: 115744
summary: Use `SearchStats` instead of field.isAggregatable in data node planning
area: ES|QL
type: bug
issues:
- 115737
Original file line number Diff line number Diff line change
Expand Up @@ -968,15 +968,27 @@ public boolean isAggregatable() {
return fielddata;
}

public boolean canUseSyntheticSourceDelegateForQuerying() {
/**
* Returns true if the delegate sub-field can be used for loading and querying (ie. either isIndexed or isStored is true)
*/
public boolean canUseSyntheticSourceDelegateForLoading() {
return syntheticSourceDelegate != null
&& syntheticSourceDelegate.ignoreAbove() == Integer.MAX_VALUE
&& (syntheticSourceDelegate.isIndexed() || syntheticSourceDelegate.isStored());
}

/**
* Returns true if the delegate sub-field can be used for querying only (ie. isIndexed must be true)
*/
public boolean canUseSyntheticSourceDelegateForQuerying() {
return syntheticSourceDelegate != null
&& syntheticSourceDelegate.ignoreAbove() == Integer.MAX_VALUE
&& syntheticSourceDelegate.isIndexed();
}

@Override
public BlockLoader blockLoader(BlockLoaderContext blContext) {
if (canUseSyntheticSourceDelegateForQuerying()) {
if (canUseSyntheticSourceDelegateForLoading()) {
return new BlockLoader.Delegating(syntheticSourceDelegate.blockLoader(blContext)) {
@Override
protected String delegatingTo() {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ public static MapperTestCase.BlockReaderSupport getSupportedReaders(MapperServic
TextFieldMapper.TextFieldType text = (TextFieldMapper.TextFieldType) ft;
boolean supportsColumnAtATimeReader = text.syntheticSourceDelegate() != null
&& text.syntheticSourceDelegate().hasDocValues()
&& text.canUseSyntheticSourceDelegateForQuerying();
&& text.canUseSyntheticSourceDelegateForLoading();
return new MapperTestCase.BlockReaderSupport(supportsColumnAtATimeReader, mapper, loaderFieldName);
}
MappedFieldType parent = mapper.fieldType(parentName);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,10 @@ public class CsvTestsDataLoader {
private static final TestsDataset DECADES = new TestsDataset("decades");
private static final TestsDataset AIRPORTS = new TestsDataset("airports");
private static final TestsDataset AIRPORTS_MP = AIRPORTS.withIndex("airports_mp").withData("airports_mp.csv");
private static final TestsDataset AIRPORTS_NO_DOC_VALUES = new TestsDataset("airports_no_doc_values").withData("airports.csv");
private static final TestsDataset AIRPORTS_NOT_INDEXED = new TestsDataset("airports_not_indexed").withData("airports.csv");
private static final TestsDataset AIRPORTS_NOT_INDEXED_NOR_DOC_VALUES = new TestsDataset("airports_not_indexed_nor_doc_values")
.withData("airports.csv");
private static final TestsDataset AIRPORTS_WEB = new TestsDataset("airports_web");
private static final TestsDataset DATE_NANOS = new TestsDataset("date_nanos");
private static final TestsDataset COUNTRIES_BBOX = new TestsDataset("countries_bbox");
Expand Down Expand Up @@ -105,6 +109,9 @@ public class CsvTestsDataLoader {
Map.entry(DECADES.indexName, DECADES),
Map.entry(AIRPORTS.indexName, AIRPORTS),
Map.entry(AIRPORTS_MP.indexName, AIRPORTS_MP),
Map.entry(AIRPORTS_NO_DOC_VALUES.indexName, AIRPORTS_NO_DOC_VALUES),
Map.entry(AIRPORTS_NOT_INDEXED.indexName, AIRPORTS_NOT_INDEXED),
Map.entry(AIRPORTS_NOT_INDEXED_NOR_DOC_VALUES.indexName, AIRPORTS_NOT_INDEXED_NOR_DOC_VALUES),
Map.entry(AIRPORTS_WEB.indexName, AIRPORTS_WEB),
Map.entry(COUNTRIES_BBOX.indexName, COUNTRIES_BBOX),
Map.entry(COUNTRIES_BBOX_WEB.indexName, COUNTRIES_BBOX_WEB),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -89,6 +89,8 @@
import java.time.Period;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
Expand Down Expand Up @@ -206,9 +208,30 @@ public static EsRelation relation() {
return new EsRelation(EMPTY, new EsIndex(randomAlphaOfLength(8), emptyMap()), IndexMode.STANDARD, randomBoolean());
}

public static class TestSearchStats extends SearchStats {
public TestSearchStats() {
super(emptyList());
/**
* This version of SearchStats always returns true for all fields for all boolean methods.
* For custom behaviour either use {@link TestConfigurableSearchStats} or override the specific methods.
*/
public static class TestSearchStats implements SearchStats {

@Override
public boolean exists(String field) {
return true;
}

@Override
public boolean isIndexed(String field) {
return exists(field);
}

@Override
public boolean hasDocValues(String field) {
return exists(field);
}

@Override
public boolean hasExactSubfield(String field) {
return exists(field);
}

@Override
Expand All @@ -226,11 +249,6 @@ public long count(String field, BytesRef value) {
return exists(field) ? -1 : 0;
}

@Override
public boolean exists(String field) {
return true;
}

@Override
public byte[] min(String field, DataType dataType) {
return null;
Expand All @@ -245,10 +263,76 @@ public byte[] max(String field, DataType dataType) {
public boolean isSingleValue(String field) {
return false;
}
}

/**
* This version of SearchStats can be preconfigured to return true/false for various combinations of the four field settings:
* <ol>
* <li>exists</li>
* <li>isIndexed</li>
* <li>hasDocValues</li>
* <li>hasExactSubfield</li>
* </ol>
* The default will return true for all fields. The include/exclude methods can be used to configure the settings for specific fields.
* If you call 'include' with no fields, it will switch to return false for all fields.
*/
public static class TestConfigurableSearchStats extends TestSearchStats {
public enum Config {
EXISTS,
INDEXED,
DOC_VALUES,
EXACT_SUBFIELD
}

private final Map<Config, Set<String>> includes = new HashMap<>();
private final Map<Config, Set<String>> excludes = new HashMap<>();

public TestConfigurableSearchStats include(Config key, String... fields) {
// If this method is called with no fields, it is interpreted to mean include none, so we include a dummy field
for (String field : fields.length == 0 ? new String[] { "-" } : fields) {
includes.computeIfAbsent(key, k -> new HashSet<>()).add(field);
excludes.computeIfAbsent(key, k -> new HashSet<>()).remove(field);
}
return this;
}

public TestConfigurableSearchStats exclude(Config key, String... fields) {
for (String field : fields) {
includes.computeIfAbsent(key, k -> new HashSet<>()).remove(field);
excludes.computeIfAbsent(key, k -> new HashSet<>()).add(field);
}
return this;
}

private boolean isConfigationSet(Config config, String field) {
Set<String> in = includes.getOrDefault(config, Set.of());
Set<String> ex = excludes.getOrDefault(config, Set.of());
return (in.isEmpty() || in.contains(field)) && ex.contains(field) == false;
}

@Override
public boolean exists(String field) {
return isConfigationSet(Config.EXISTS, field);
}

@Override
public boolean isIndexed(String field) {
return exists(field);
return isConfigationSet(Config.INDEXED, field);
}

@Override
public boolean hasDocValues(String field) {
return isConfigationSet(Config.DOC_VALUES, field);
}

@Override
public boolean hasExactSubfield(String field) {
return isConfigationSet(Config.EXACT_SUBFIELD, field);
}

@Override
public String toString() {
return "TestConfigurableSearchStats{" + "includes=" + includes + ", excludes=" + excludes + '}';
}
}

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
{
"properties": {
"abbrev": {
"type": "keyword"
},
"name": {
"type": "text"
},
"scalerank": {
"type": "integer"
},
"type": {
"type": "keyword"
},
"location": {
"type": "geo_point",
"index": true,
"doc_values": false
},
"country": {
"type": "keyword"
},
"city": {
"type": "keyword"
},
"city_location": {
"type": "geo_point"
}
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
{
"properties": {
"abbrev": {
"type": "keyword"
},
"name": {
"type": "text"
},
"scalerank": {
"type": "integer"
},
"type": {
"type": "keyword"
},
"location": {
"type": "geo_point",
"index": false,
"doc_values": true
},
"country": {
"type": "keyword"
},
"city": {
"type": "keyword"
},
"city_location": {
"type": "geo_point"
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -484,6 +484,42 @@ centroid:geo_point | count:long
POINT (42.97109629958868 14.7552534006536) | 1
;

centroidFromAirportsAfterIntersectsCompoundPredicateNoDocValues
required_capability: st_intersects

FROM airports_no_doc_values
| WHERE scalerank == 9 AND ST_INTERSECTS(location, TO_GEOSHAPE("POLYGON((42 14, 43 14, 43 15, 42 15, 42 14))")) AND country == "Yemen"
| STATS centroid=ST_CENTROID_AGG(location), count=COUNT()
;

centroid:geo_point | count:long
POINT (42.97109629958868 14.7552534006536) | 1
;

centroidFromAirportsAfterIntersectsCompoundPredicateNotIndexedNorDocValues
required_capability: st_intersects

FROM airports_not_indexed_nor_doc_values
| WHERE scalerank == 9 AND ST_INTERSECTS(location, TO_GEOSHAPE("POLYGON((42 14, 43 14, 43 15, 42 15, 42 14))")) AND country == "Yemen"
| STATS centroid=ST_CENTROID_AGG(location), count=COUNT()
;

centroid:geo_point | count:long
POINT (42.97109629958868 14.7552534006536) | 1
;

centroidFromAirportsAfterIntersectsCompoundPredicateNotIndexed
required_capability: st_intersects

FROM airports_not_indexed
| WHERE scalerank == 9 AND ST_INTERSECTS(location, TO_GEOSHAPE("POLYGON((42 14, 43 14, 43 15, 42 15, 42 14))")) AND country == "Yemen"
| STATS centroid=ST_CENTROID_AGG(location), count=COUNT()
;

centroid:geo_point | count:long
POINT (42.97109629958868 14.7552534006536) | 1
;

###############################################
# Tests for ST_INTERSECTS on GEO_POINT type

Expand Down
Loading

0 comments on commit 0f9ac9d

Please sign in to comment.