Improve filter push-down #19929

@askalt

This issue covers two related filter push-down improvements.

Pass previously pushed filters to supports_filters_pushdown

Currently, the filter push-down optimization does not pass filters that were pushed in a previous run (TableScan::filters) to TableProvider::supports_filters_pushdown(...).

If the optimizer runs multiple times, it may try to push filters into the table provider multiple times. In our DataFusion-based project, supports_filters_pushdown(...) has context-dependent behavior: the provider supports any single filter like column = value, but not multiple such filters at the same time.

Consider the following optimizer pipeline pattern:

  1. Try to push a = 1, b = 1.
    supports_filters_pushdown returns [Exact, Inexact]
    OK: the optimizer records that a = 1 is pushed and creates a filter node for b = 1.

...
Another optimization iteration.

  1. Try to push b = 1.
    supports_filters_pushdown returns [Exact]. Of course, the table provider can’t remember
    all previously pushed filters, so it has no choice but to answer Exact.
    Now, the optimizer thinks the conjunction a = 1 AND b = 1 is supported exactly, but it is not.

To prevent this problem, I suggest passing filters that were already pushed into the scan earlier to supports_filters_pushdown(...).
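
To make the proposal concrete, here is a minimal sketch of a provider that can answer correctly once it sees the already pushed filters. It is a toy model, not DataFusion's API: `PushDown`, `EqFilter`, and the two-argument `supports_filters_pushdown` are hypothetical stand-ins, and `already_pushed` is assumed to hold only the filters that earlier passes accepted as exact (a simplification of TableScan::filters).

```rust
/// Toy stand-in for the provider's per-filter answer (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PushDown {
    Exact,
    Inexact,
}

/// Toy filter of the form `column = value`; real filters are expression trees.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EqFilter {
    pub column: &'static str,
    pub value: i64,
}

/// Hypothetical context-aware variant of `supports_filters_pushdown`.
/// This provider can evaluate exactly one equality filter; any further
/// filter is only a hint that the engine must re-check.
pub fn supports_filters_pushdown(
    already_pushed: &[EqFilter],
    candidates: &[EqFilter],
) -> Vec<PushDown> {
    // The single "exact" slot is taken if an earlier pass already used it.
    let mut exact_slot_free = already_pushed.is_empty();
    candidates
        .iter()
        .map(|_| {
            if exact_slot_free {
                exact_slot_free = false;
                PushDown::Exact
            } else {
                PushDown::Inexact
            }
        })
        .collect()
}
```

With an empty `already_pushed`, the candidates a = 1, b = 1 get [Exact, Inexact], as in the first iteration above. On the next iteration, passing a = 1 as already pushed makes the answer for b = 1 Inexact, so the optimizer keeps the filter node for b = 1.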

Do not assume that the filter support decision is stable

Consider the following scenario:

  1. supports_filters_pushdown returns Exact on some filter, e.g. "a = 1", where column "a" is not
    required by the query projection.

  2. "a" is removed from the table provider projection by the "optimize projection" rule.

  3. supports_filters_pushdown changes its decision and returns Inexact for this filter the next time.
    For example, the input filters have changed and the provider prefers to use a new one.

  4. "a" is not added back to the table provider projection, so the resulting filter node references
    a column that is not part of the scan's output schema.

I suggest extending the logic with the following steps:

  1. Collect the columns that are not in the current table provider projection but are required by the
    filter expressions. Call this set additional_projection.

  2. If additional_projection is empty, leave the plan as is.

  3. Otherwise, extend the table provider projection and wrap the plan in an additional projection node
    to preserve the schema that existed before this rule ran.
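
The three steps above can be sketched on plain column indices. This is an illustration under stated assumptions, not the optimizer's implementation: `ExtendedScan` and `extend_projection` are hypothetical names, the projection is modeled as a list of column indices into the table schema, and `filter_columns` is assumed to be already extracted from the filter expressions.

```rust
/// Result of reconciling a scan projection with the columns its filters need
/// (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub struct ExtendedScan {
    /// Table column indices the scan must read: original projection plus extras.
    pub scan_projection: Vec<usize>,
    /// Positions of the scan output to keep in a wrapping projection node,
    /// or `None` when nothing was added and no wrapper is needed.
    pub restore_projection: Option<Vec<usize>>,
}

pub fn extend_projection(projection: &[usize], filter_columns: &[usize]) -> ExtendedScan {
    // Step 1: columns required by the filters but missing from the projection.
    let mut additional_projection: Vec<usize> = Vec::new();
    for &col in filter_columns {
        if !projection.contains(&col) && !additional_projection.contains(&col) {
            additional_projection.push(col);
        }
    }

    // Step 2: nothing is missing, so the plan stays as it is.
    if additional_projection.is_empty() {
        return ExtendedScan {
            scan_projection: projection.to_vec(),
            restore_projection: None,
        };
    }

    // Step 3: append the extra columns to the scan projection, then keep only
    // the leading original columns in the wrapper to restore the old schema.
    let mut scan_projection = projection.to_vec();
    scan_projection.extend(additional_projection);
    let restore_projection: Vec<usize> = (0..projection.len()).collect();
    ExtendedScan {
        scan_projection,
        restore_projection: Some(restore_projection),
    }
}
```

For example, with a projection of columns [1, 2] and a filter on column 0, the scan would read [1, 2, 0], and the wrapping projection would keep output positions [0, 1], so parent nodes see the same schema as before.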
