Use SearchStats instead of field.isAggregatable in data node planning (…

…#115744) Since ES|QL makes use of field-caps and only considers `isAggregatable` during Lucene pushdown, turning off doc-values disables Lucene pushdown. This is incorrect. The physical planning decision for Lucene pushdown is made during local planning on the data node, at which point `SearchStats` are known, and both `isIndexed` and `hasDocValues` are separately knowable. The Lucene pushdown should happen for `isIndexed` and not consider `hasDocValues` at all. This PR adds hasDocValues to SearchStats and the uses isIndexed and hasDocValue separately during local physical planning on the data nodes. This immediately cleared up one issue for spatial data, which could not push down a lucene query when doc-values was disabled. Summary of what `isAggregatable` means for different implementations of `MappedFieldType`: * Default implementation of `isAggregatable` in `MappedFieldType` is `hasDocValues`, and does not consider `isIndexed` * All classes that extend `AbstractScriptFieldType` (eg. `LongScriptFieldType`) hard coded `isAggregatable` to `true`. This presumably means Lucene is happy to mimic having doc-values * `TestFieldType`, and classes that extend it, return the value of `fielddata`, so consider the field aggregatable if there is field-data. * `AggregateDoubleMetricFieldType` and `ConstantFieldType` hard coded to `true` * `DenseVectorFieldType` hard coded to `false` * `IdFieldType` return the value of `fieldDataEnabled.getAsBoolean()` In no case is `isIndexed` used for `isAggregatable`. However, for our Lucene pushdown of filters, `isIndexed` would make a lot more sense. But for pushdown of TopN, `hasDocValues` makes more sense. Summarising the results of the various options for the various field types, where `?` means configrable: | Class | isAggregatable | isIndexed | isStored | hasDocValues | | --- | --- | --- | --- | --- | | AbstractScriptFieldType | true | false | false | false | | AggregateDoubleMetricFieldType | true | true | false | false | | DenseVectorFieldType | false | ? | false | !indexed | | IdFieldType | fieldData | true | true | false | | TsidExtractingIdField | false | true | true | false | | TextFieldType | fieldData | ? | ? | false | | ? (the rest) | hasDocValues | ? | ? | ? | It has also been observed that we cannot push filters to source without checking `hasDocValues` when we use the `SingleValueQuery`. So this leads to three groups of conditions: | Category | require `indexed` | require `docValues` | | --- | --- | --- | | Filters(single-value) | true | true | | Filters(multi-value) | true | false | | TopN | true | true | And for all cases we will also consider `isAggregatable` as a disjunction to cover the script field types, leading to two possible combinations: * `fa.isAggregatable() || searchStats.isIndexed(fa.name()) && searchStats.hasDocValues(fa.name())` * `fa.isAggregatable() || searchStats.isIndexed(fa.name())`
elastic · Nov 8, 2024 · 0f9ac9d · 0f9ac9d
1 parent f7ee3fb
commit 0f9ac9d
Show file tree

Hide file tree

Showing 26 changed files with 1,049 additions and 565 deletions.
diff --git a/docs/changelog/115744.yaml b/docs/changelog/115744.yaml
@@ -0,0 +1,6 @@
+pr: 115744
+summary: Use `SearchStats` instead of field.isAggregatable in data node planning
+area: ES|QL
+type: bug
+issues:
+ - 115737
diff --git a/server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java b/server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java
@@ -968,15 +968,27 @@ public boolean isAggregatable() {
             return fielddata;
         }
 
-        public boolean canUseSyntheticSourceDelegateForQuerying() {
+        /**
+         * Returns true if the delegate sub-field can be used for loading and querying (ie. either isIndexed or isStored is true)
+         */
+        public boolean canUseSyntheticSourceDelegateForLoading() {
             return syntheticSourceDelegate != null
                 && syntheticSourceDelegate.ignoreAbove() == Integer.MAX_VALUE
                 && (syntheticSourceDelegate.isIndexed() || syntheticSourceDelegate.isStored());
         }
 
+        /**
+         * Returns true if the delegate sub-field can be used for querying only (ie. isIndexed must be true)
+         */
+        public boolean canUseSyntheticSourceDelegateForQuerying() {
+            return syntheticSourceDelegate != null
+                && syntheticSourceDelegate.ignoreAbove() == Integer.MAX_VALUE
+                && syntheticSourceDelegate.isIndexed();
+        }
+
         @Override
         public BlockLoader blockLoader(BlockLoaderContext blContext) {
-            if (canUseSyntheticSourceDelegateForQuerying()) {
+            if (canUseSyntheticSourceDelegateForLoading()) {
                 return new BlockLoader.Delegating(syntheticSourceDelegate.blockLoader(blContext)) {
                     @Override
                     protected String delegatingTo() {

diff --git a/...src/main/java/org/elasticsearch/index/mapper/TextFieldFamilySyntheticSourceTestSetup.java b/...src/main/java/org/elasticsearch/index/mapper/TextFieldFamilySyntheticSourceTestSetup.java
@@ -39,7 +39,7 @@ public static MapperTestCase.BlockReaderSupport getSupportedReaders(MapperServic
             TextFieldMapper.TextFieldType text = (TextFieldMapper.TextFieldType) ft;
             boolean supportsColumnAtATimeReader = text.syntheticSourceDelegate() != null
                 && text.syntheticSourceDelegate().hasDocValues()
-                && text.canUseSyntheticSourceDelegateForQuerying();
+                && text.canUseSyntheticSourceDelegateForLoading();
             return new MapperTestCase.BlockReaderSupport(supportsColumnAtATimeReader, mapper, loaderFieldName);
         }
         MappedFieldType parent = mapper.fieldType(parentName);

diff --git a/...n/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/CsvTestsDataLoader.java b/...n/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/CsvTestsDataLoader.java
@@ -72,6 +72,10 @@ public class CsvTestsDataLoader {
     private static final TestsDataset DECADES = new TestsDataset("decades");
     private static final TestsDataset AIRPORTS = new TestsDataset("airports");
     private static final TestsDataset AIRPORTS_MP = AIRPORTS.withIndex("airports_mp").withData("airports_mp.csv");
+    private static final TestsDataset AIRPORTS_NO_DOC_VALUES = new TestsDataset("airports_no_doc_values").withData("airports.csv");
+    private static final TestsDataset AIRPORTS_NOT_INDEXED = new TestsDataset("airports_not_indexed").withData("airports.csv");
+    private static final TestsDataset AIRPORTS_NOT_INDEXED_NOR_DOC_VALUES = new TestsDataset("airports_not_indexed_nor_doc_values")
+        .withData("airports.csv");
     private static final TestsDataset AIRPORTS_WEB = new TestsDataset("airports_web");
     private static final TestsDataset DATE_NANOS = new TestsDataset("date_nanos");
     private static final TestsDataset COUNTRIES_BBOX = new TestsDataset("countries_bbox");
@@ -105,6 +109,9 @@ public class CsvTestsDataLoader {
         Map.entry(DECADES.indexName, DECADES),
         Map.entry(AIRPORTS.indexName, AIRPORTS),
         Map.entry(AIRPORTS_MP.indexName, AIRPORTS_MP),
+        Map.entry(AIRPORTS_NO_DOC_VALUES.indexName, AIRPORTS_NO_DOC_VALUES),
+        Map.entry(AIRPORTS_NOT_INDEXED.indexName, AIRPORTS_NOT_INDEXED),
+        Map.entry(AIRPORTS_NOT_INDEXED_NOR_DOC_VALUES.indexName, AIRPORTS_NOT_INDEXED_NOR_DOC_VALUES),
         Map.entry(AIRPORTS_WEB.indexName, AIRPORTS_WEB),
         Map.entry(COUNTRIES_BBOX.indexName, COUNTRIES_BBOX),
         Map.entry(COUNTRIES_BBOX_WEB.indexName, COUNTRIES_BBOX_WEB),

diff --git a/...plugin/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/EsqlTestUtils.java b/...plugin/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/EsqlTestUtils.java
@@ -89,6 +89,8 @@
 import java.time.Period;
 import java.util.ArrayList;
 import java.util.EnumSet;
+import java.util.HashMap;
+import java.util.HashSet;
 import java.util.Iterator;
 import java.util.LinkedHashMap;
 import java.util.List;
@@ -206,9 +208,30 @@ public static EsRelation relation() {
         return new EsRelation(EMPTY, new EsIndex(randomAlphaOfLength(8), emptyMap()), IndexMode.STANDARD, randomBoolean());
     }
 
-    public static class TestSearchStats extends SearchStats {
-        public TestSearchStats() {
-            super(emptyList());
+    /**
+     * This version of SearchStats always returns true for all fields for all boolean methods.
+     * For custom behaviour either use {@link TestConfigurableSearchStats} or override the specific methods.
+     */
+    public static class TestSearchStats implements SearchStats {
+
+        @Override
+        public boolean exists(String field) {
+            return true;
+        }
+
+        @Override
+        public boolean isIndexed(String field) {
+            return exists(field);
+        }
+
+        @Override
+        public boolean hasDocValues(String field) {
+            return exists(field);
+        }
+
+        @Override
+        public boolean hasExactSubfield(String field) {
+            return exists(field);
         }
 
         @Override
@@ -226,11 +249,6 @@ public long count(String field, BytesRef value) {
             return exists(field) ? -1 : 0;
         }
 
-        @Override
-        public boolean exists(String field) {
-            return true;
-        }
-
         @Override
         public byte[] min(String field, DataType dataType) {
             return null;
@@ -245,10 +263,76 @@ public byte[] max(String field, DataType dataType) {
         public boolean isSingleValue(String field) {
             return false;
         }
+    }
+
+    /**
+     * This version of SearchStats can be preconfigured to return true/false for various combinations of the four field settings:
+     * <ol>
+     *     <li>exists</li>
+     *     <li>isIndexed</li>
+     *     <li>hasDocValues</li>
+     *     <li>hasExactSubfield</li>
+     * </ol>
+     * The default will return true for all fields. The include/exclude methods can be used to configure the settings for specific fields.
+     * If you call 'include' with no fields, it will switch to return false for all fields.
+     */
+    public static class TestConfigurableSearchStats extends TestSearchStats {
+        public enum Config {
+            EXISTS,
+            INDEXED,
+            DOC_VALUES,
+            EXACT_SUBFIELD
+        }
+
+        private final Map<Config, Set<String>> includes = new HashMap<>();
+        private final Map<Config, Set<String>> excludes = new HashMap<>();
+
+        public TestConfigurableSearchStats include(Config key, String... fields) {
+            // If this method is called with no fields, it is interpreted to mean include none, so we include a dummy field
+            for (String field : fields.length == 0 ? new String[] { "-" } : fields) {
+                includes.computeIfAbsent(key, k -> new HashSet<>()).add(field);
+                excludes.computeIfAbsent(key, k -> new HashSet<>()).remove(field);
+            }
+            return this;
+        }
+
+        public TestConfigurableSearchStats exclude(Config key, String... fields) {
+            for (String field : fields) {
+                includes.computeIfAbsent(key, k -> new HashSet<>()).remove(field);
+                excludes.computeIfAbsent(key, k -> new HashSet<>()).add(field);
+            }
+            return this;
+        }
+
+        private boolean isConfigationSet(Config config, String field) {
+            Set<String> in = includes.getOrDefault(config, Set.of());
+            Set<String> ex = excludes.getOrDefault(config, Set.of());
+            return (in.isEmpty() || in.contains(field)) && ex.contains(field) == false;
+        }
+
+        @Override
+        public boolean exists(String field) {
+            return isConfigationSet(Config.EXISTS, field);
+        }
 
         @Override
         public boolean isIndexed(String field) {
-            return exists(field);
+            return isConfigationSet(Config.INDEXED, field);
+        }
+
+        @Override
+        public boolean hasDocValues(String field) {
+            return isConfigationSet(Config.DOC_VALUES, field);
+        }
+
+        @Override
+        public boolean hasExactSubfield(String field) {
+            return isConfigationSet(Config.EXACT_SUBFIELD, field);
+        }
+
+        @Override
+        public String toString() {
+            return "TestConfigurableSearchStats{" + "includes=" + includes + ", excludes=" + excludes + '}';
         }
     }
 

diff --git a/x-pack/plugin/esql/qa/testFixtures/src/main/resources/mapping-airports_no_doc_values.json b/x-pack/plugin/esql/qa/testFixtures/src/main/resources/mapping-airports_no_doc_values.json
@@ -0,0 +1,30 @@
+{
+  "properties": {
+    "abbrev": {
+      "type": "keyword"
+    },
+    "name": {
+      "type": "text"
+    },
+    "scalerank": {
+      "type": "integer"
+    },
+    "type": {
+      "type": "keyword"
+    },
+    "location": {
+      "type": "geo_point",
+      "index": true,
+      "doc_values": false
+    },
+    "country": {
+      "type": "keyword"
+    },
+    "city": {
+      "type": "keyword"
+    },
+    "city_location": {
+      "type": "geo_point"
+    }
+  }
+}
diff --git a/x-pack/plugin/esql/qa/testFixtures/src/main/resources/mapping-airports_not_indexed.json b/x-pack/plugin/esql/qa/testFixtures/src/main/resources/mapping-airports_not_indexed.json
@@ -0,0 +1,30 @@
+{
+  "properties": {
+    "abbrev": {
+      "type": "keyword"
+    },
+    "name": {
+      "type": "text"
+    },
+    "scalerank": {
+      "type": "integer"
+    },
+    "type": {
+      "type": "keyword"
+    },
+    "location": {
+      "type": "geo_point",
+      "index": false,
+      "doc_values": true
+    },
+    "country": {
+      "type": "keyword"
+    },
+    "city": {
+      "type": "keyword"
+    },
+    "city_location": {
+      "type": "geo_point"
+    }
+  }
+}
diff --git a/...urces/mapping-airports-no-doc-values.json → ...-airports_not_indexed_nor_doc_values.json b/...urces/mapping-airports-no-doc-values.json → ...-airports_not_indexed_nor_doc_values.json
diff --git a/x-pack/plugin/esql/qa/testFixtures/src/main/resources/spatial.csv-spec b/x-pack/plugin/esql/qa/testFixtures/src/main/resources/spatial.csv-spec
@@ -484,6 +484,42 @@ centroid:geo_point                            | count:long
 POINT (42.97109629958868 14.7552534006536)    | 1
 ;
 
+centroidFromAirportsAfterIntersectsCompoundPredicateNoDocValues
+required_capability: st_intersects
+
+FROM airports_no_doc_values
+| WHERE scalerank == 9 AND ST_INTERSECTS(location, TO_GEOSHAPE("POLYGON((42 14, 43 14, 43 15, 42 15, 42 14))")) AND country == "Yemen"
+| STATS centroid=ST_CENTROID_AGG(location), count=COUNT()
+;
+
+centroid:geo_point                            | count:long
+POINT (42.97109629958868 14.7552534006536)    | 1
+;
+
+centroidFromAirportsAfterIntersectsCompoundPredicateNotIndexedNorDocValues
+required_capability: st_intersects
+
+FROM airports_not_indexed_nor_doc_values
+| WHERE scalerank == 9 AND ST_INTERSECTS(location, TO_GEOSHAPE("POLYGON((42 14, 43 14, 43 15, 42 15, 42 14))")) AND country == "Yemen"
+| STATS centroid=ST_CENTROID_AGG(location), count=COUNT()
+;
+
+centroid:geo_point                            | count:long
+POINT (42.97109629958868 14.7552534006536)    | 1
+;
+
+centroidFromAirportsAfterIntersectsCompoundPredicateNotIndexed
+required_capability: st_intersects
+
+FROM airports_not_indexed
+| WHERE scalerank == 9 AND ST_INTERSECTS(location, TO_GEOSHAPE("POLYGON((42 14, 43 14, 43 15, 42 15, 42 14))")) AND country == "Yemen"
+| STATS centroid=ST_CENTROID_AGG(location), count=COUNT()
+;
+
+centroid:geo_point                            | count:long
+POINT (42.97109629958868 14.7552534006536)    | 1
+;
+
 ###############################################
 # Tests for ST_INTERSECTS on GEO_POINT type