Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding predicate to protocol and metadata log replay query #336

Merged

Conversation

OussamaSaoudi-db
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Sep 13, 2024

Pass a predicate hint to Engine in read_metadata so that log files are filtered to contain either metadata or protocol columns.

The motivation of this change is to filter log data at the engine level instead of needlessly passing EngineData back to the kernel.

In order to verify that an engine gets the predicate passed to log replay, I added a debug print statement in the sync engine that logs the predicate. The scan::tests::test_scan_data uses the sync engine and you can observe that the predicate is received. For example, running the test with RUST_LOG=DEBUG yields the following log:

 Reading json files: 
     ...
    with predicate Some(
        VariadicOperation {
            op: Or,
            exprs: [
                UnaryOperation {
                    op: Not,
                    expr: UnaryOperation {
                        op: IsNull,
                        expr: Column(
                            "metaData.id",
                        ),
                    },
                },
                UnaryOperation {
                    op: Not,
                    expr: UnaryOperation {
                        op: IsNull,
                        expr: Column(
                            "protocol.min_reader_version",
                        ),
                    },
                },
            ],
        },
    )

The predicate checks protocol and metadata leaf fields metaData.id and protocol.min_reader_version instead of checking for metaData and protocol. This is done to leverage row group skipping, since statistics are kept for leaf fields, and not internal nodes. This is especially useful for large checkpoint files which will have exactly one row group containing protocol and metadata info.

Closes: #74

Passes a predicate hint to `Engine` in `read_metadata` so that
batches are filtered to contain either metadata or protocol
columns.
Copy link

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 73.88%. Comparing base (ed1919a) to head (beb9850).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/engine/sync/json.rs 50.00% 0 Missing and 1 partial ⚠️
kernel/src/engine/sync/parquet.rs 50.00% 0 Missing and 1 partial ⚠️
kernel/src/snapshot.rs 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #336      +/-   ##
==========================================
+ Coverage   73.86%   73.88%   +0.02%     
==========================================
  Files          43       43              
  Lines        8078     8085       +7     
  Branches     8078     8085       +7     
==========================================
+ Hits         5967     5974       +7     
  Misses       1732     1732              
  Partials      379      379              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome this is a great start :) one nit: we can use Expression::{whatever} and reduce the need for prepending Expression:: to everything. and we can loop back on testing!

@OussamaSaoudi-db OussamaSaoudi-db force-pushed the protocol-metadata-pushdown branch from 144a8d8 to a9b2375 Compare September 17, 2024 20:13
@OussamaSaoudi-db
Copy link
Collaborator Author

In order to verify that an engine gets the predicate passed to log replay, I added a debug print statement in the default engine that logs the predicate. The reader acceptance tests use the default engine and you can observe that the predicate is received.

For example, running the reader_test::basic_append test yields the following print statement:

[2024-09-17T20:20:02Z DEBUG delta_kernel::engine::default::json] Reading json files with predicate: Some(VariadicOperation { op: Or, exprs: [UnaryOperation { op: Not, expr: UnaryOperation { op: IsNull, expr: Column("metaData.id") } }, UnaryOperation { op: Not, expr: UnaryOperation { op: IsNull, expr: Column("protocol.min_reader_version") } }] })
[2024-09-17T20:20:02Z DEBUG delta_kernel::engine::default::parquet] Reading parquet files with predicate: Some(VariadicOperation { op: Or, exprs: [UnaryOperation { op: Not, expr: UnaryOperation { op: IsNull, expr: Column("metaData.id") } }, UnaryOperation { op: Not, expr: UnaryOperation { op: IsNull, expr: Column("protocol.min_reader_version") } }] })

@OussamaSaoudi-db OussamaSaoudi-db changed the title [WIP] Implement query pushdown for protocol and snapshot queries Implement query pushdown for protocol and snapshot queries Sep 17, 2024
@OussamaSaoudi-db OussamaSaoudi-db marked this pull request as ready for review September 17, 2024 20:35
@OussamaSaoudi-db
Copy link
Collaborator Author

Code coverage is not happy with the debug print lines. This is likely because debug! is conditionally compiled depending on the RUST_LOG env variable.

@OussamaSaoudi-db OussamaSaoudi-db changed the title Implement query pushdown for protocol and snapshot queries Adding predicate to protocol and metadata log replay query Sep 17, 2024
@OussamaSaoudi-db OussamaSaoudi-db force-pushed the protocol-metadata-pushdown branch from 3dbdefd to fe77903 Compare September 17, 2024 21:20
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple of small nits

kernel/src/engine/default/json.rs Outdated Show resolved Hide resolved
kernel/src/snapshot.rs Outdated Show resolved Hide resolved
kernel/Cargo.toml Outdated Show resolved Hide resolved
kernel/src/snapshot.rs Show resolved Hide resolved
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@zachschuermann
Copy link
Collaborator

@OussamaSaoudi-db can you make a follow-up issue to actually implement the predicate pushdown and to test this more thoroughly (maybe by metrics or some other way)

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! thanks!

@OussamaSaoudi-db OussamaSaoudi-db merged commit f6b8942 into delta-io:main Sep 18, 2024
12 checks passed
@OussamaSaoudi-db OussamaSaoudi-db deleted the protocol-metadata-pushdown branch September 18, 2024 19:03
OussamaSaoudi-db added a commit that referenced this pull request Sep 19, 2024
This is a followup to [this
PR](#336) which
patches a mistake in the filter. Protocol has a
`protocol.minReaderVersion`, and no `protocol.min_reader_version` field.

This PR also fixes an intermittent test failure caused by repeat
initialization of tracing. This PR changes the `test_scan_data` test to
instead use the test_log crate for initializing logs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

P&M query should leverage predicate pushdown
3 participants