Adding predicate to protocol and metadata log replay query #336

OussamaSaoudi-db · 2024-09-13T22:51:01Z

Pass a predicate hint to Engine in read_metadata so that log files are filtered to contain either metadata or protocol columns.

The motivation of this change is to filter log data at the engine level instead of needlessly passing EngineData back to the kernel.

In order to verify that an engine gets the predicate passed to log replay, I added a debug print statement in the sync engine that logs the predicate. The scan::tests::test_scan_data uses the sync engine and you can observe that the predicate is received. For example, running the test with RUST_LOG=DEBUG yields the following log:

 Reading json files: 
     ...
    with predicate Some(
        VariadicOperation {
            op: Or,
            exprs: [
                UnaryOperation {
                    op: Not,
                    expr: UnaryOperation {
                        op: IsNull,
                        expr: Column(
                            "metaData.id",
                        ),
                    },
                },
                UnaryOperation {
                    op: Not,
                    expr: UnaryOperation {
                        op: IsNull,
                        expr: Column(
                            "protocol.min_reader_version",
                        ),
                    },
                },
            ],
        },
    )

The predicate checks protocol and metadata leaf fields metaData.id and protocol.min_reader_version instead of checking for metaData and protocol. This is done to leverage row group skipping, since statistics are kept for leaf fields, and not internal nodes. This is especially useful for large checkpoint files which will have exactly one row group containing protocol and metadata info.

Closes: #74

Passes a predicate hint to `Engine` in `read_metadata` so that batches are filtered to contain either metadata or protocol columns.

codecov · 2024-09-13T22:54:47Z

Codecov Report

Attention: Patch coverage is 75.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 73.88%. Comparing base (ed1919a) to head (beb9850).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/engine/sync/json.rs	50.00%	0 Missing and 1 partial ⚠️
kernel/src/engine/sync/parquet.rs	50.00%	0 Missing and 1 partial ⚠️
kernel/src/snapshot.rs	80.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #336      +/-   ##
==========================================
+ Coverage   73.86%   73.88%   +0.02%     
==========================================
  Files          43       43              
  Lines        8078     8085       +7     
  Branches     8078     8085       +7     
==========================================
+ Hits         5967     5974       +7     
  Misses       1732     1732              
  Partials      379      379

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

zachschuermann

awesome this is a great start :) one nit: we can use Expression::{whatever} and reduce the need for prepending Expression:: to everything. and we can loop back on testing!

OussamaSaoudi-db · 2024-09-17T20:28:35Z

In order to verify that an engine gets the predicate passed to log replay, I added a debug print statement in the default engine that logs the predicate. The reader acceptance tests use the default engine and you can observe that the predicate is received.

For example, running the reader_test::basic_append test yields the following print statement:

[2024-09-17T20:20:02Z DEBUG delta_kernel::engine::default::json] Reading json files with predicate: Some(VariadicOperation { op: Or, exprs: [UnaryOperation { op: Not, expr: UnaryOperation { op: IsNull, expr: Column("metaData.id") } }, UnaryOperation { op: Not, expr: UnaryOperation { op: IsNull, expr: Column("protocol.min_reader_version") } }] })
[2024-09-17T20:20:02Z DEBUG delta_kernel::engine::default::parquet] Reading parquet files with predicate: Some(VariadicOperation { op: Or, exprs: [UnaryOperation { op: Not, expr: UnaryOperation { op: IsNull, expr: Column("metaData.id") } }, UnaryOperation { op: Not, expr: UnaryOperation { op: IsNull, expr: Column("protocol.min_reader_version") } }] })

OussamaSaoudi-db · 2024-09-17T20:40:40Z

Code coverage is not happy with the debug print lines. This is likely because debug! is conditionally compiled depending on the RUST_LOG env variable.

nicklan

just a couple of small nits

kernel/src/engine/default/json.rs

kernel/src/snapshot.rs

kernel/src/engine/sync/json.rs

kernel/Cargo.toml

kernel/src/snapshot.rs

zachschuermann

lgtm!

zachschuermann · 2024-09-17T22:35:41Z

@OussamaSaoudi-db can you make a follow-up issue to actually implement the predicate pushdown and to test this more thoroughly (maybe by metrics or some other way)

nicklan

lgtm! thanks!

This is a followup to [this PR](#336) which patches a mistake in the filter. Protocol has a `protocol.minReaderVersion`, and no `protocol.min_reader_version` field. This PR also fixes an intermittent test failure caused by repeat initialization of tracing. This PR changes the `test_scan_data` test to instead use the test_log crate for initializing logs.

OussamaSaoudi-db added 3 commits September 13, 2024 15:23

Pass predicate in snapshot replay to filter metadata or protocol

dfdd756

Passes a predicate hint to `Engine` in `read_metadata` so that batches are filtered to contain either metadata or protocol columns.

Add clarifying comment for filter, rename filter variable

cf95385

Fix column names for filter

0dc7fb2

Change unary() call to just use not()

c31a716

zachschuermann reviewed Sep 14, 2024

View reviewed changes

Use Expr associated type instead of Expression

59c4988

zachschuermann requested a review from nicklan September 16, 2024 21:29

Validate that predicate is correctly passed to engine

a9b2375

OussamaSaoudi-db force-pushed the protocol-metadata-pushdown branch from 144a8d8 to a9b2375 Compare September 17, 2024 20:13

OussamaSaoudi-db changed the title ~~[WIP] Implement query pushdown for protocol and snapshot queries~~ Implement query pushdown for protocol and snapshot queries Sep 17, 2024

OussamaSaoudi-db marked this pull request as ready for review September 17, 2024 20:35

OussamaSaoudi-db changed the title ~~Implement query pushdown for protocol and snapshot queries~~ Adding predicate to protocol and metadata log replay query Sep 17, 2024

OussamaSaoudi-db added 3 commits September 17, 2024 14:05

Amend acceptance tests

2d763b5

Move print test to non-acceptance test

911437e

Remove unnecessary change

fe77903

OussamaSaoudi-db force-pushed the protocol-metadata-pushdown branch from 3dbdefd to fe77903 Compare September 17, 2024 21:20

nicklan requested changes Sep 17, 2024

View reviewed changes

kernel/src/engine/default/json.rs Outdated Show resolved Hide resolved

kernel/src/snapshot.rs Outdated Show resolved Hide resolved

nicklan reviewed Sep 17, 2024

View reviewed changes

kernel/src/engine/sync/json.rs Outdated Show resolved Hide resolved

OussamaSaoudi-db added 3 commits September 17, 2024 14:32

Address PR comments

49889e7

Remove unecessary changes in default engine

e461695

Remove unused imports

7868f58

zachschuermann reviewed Sep 17, 2024

View reviewed changes

kernel/Cargo.toml Outdated Show resolved Hide resolved

kernel/src/snapshot.rs Show resolved Hide resolved

Move to tracing_subscriber

beb9850

OussamaSaoudi-db requested a review from nicklan September 17, 2024 22:31

zachschuermann approved these changes Sep 17, 2024

View reviewed changes

OussamaSaoudi-db mentioned this pull request Sep 17, 2024

Implement and test predicate pushdown for parquet in engines #341

Open

nicklan approved these changes Sep 18, 2024

View reviewed changes

OussamaSaoudi-db merged commit f6b8942 into delta-io:main Sep 18, 2024
12 checks passed

OussamaSaoudi-db deleted the protocol-metadata-pushdown branch September 18, 2024 19:03

OussamaSaoudi-db mentioned this pull request Sep 18, 2024

Fix column name for protocol in P&M query #343

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adding predicate to protocol and metadata log replay query #336

Adding predicate to protocol and metadata log replay query #336

Uh oh!

OussamaSaoudi-db commented Sep 13, 2024 •

edited

Loading

Uh oh!

codecov bot commented Sep 13, 2024 •

edited

Loading

Uh oh!

zachschuermann left a comment

Uh oh!

OussamaSaoudi-db commented Sep 17, 2024

Uh oh!

OussamaSaoudi-db commented Sep 17, 2024

Uh oh!

nicklan left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zachschuermann left a comment

Uh oh!

zachschuermann commented Sep 17, 2024

Uh oh!

nicklan left a comment

Uh oh!

Uh oh!

Uh oh!

Adding predicate to protocol and metadata log replay query #336

Adding predicate to protocol and metadata log replay query #336

Uh oh!

Conversation

OussamaSaoudi-db commented Sep 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Sep 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

zachschuermann left a comment

Choose a reason for hiding this comment

Uh oh!

OussamaSaoudi-db commented Sep 17, 2024

Uh oh!

OussamaSaoudi-db commented Sep 17, 2024

Uh oh!

nicklan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zachschuermann left a comment

Choose a reason for hiding this comment

Uh oh!

zachschuermann commented Sep 17, 2024

Uh oh!

nicklan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

OussamaSaoudi-db commented Sep 13, 2024 •

edited

Loading

codecov bot commented Sep 13, 2024 •

edited

Loading