Adding predicate to protocol and metadata log replay query #336
Passes a predicate hint to `Engine` in `read_metadata` so that batches are filtered to contain either metadata or protocol columns.
Codecov Report

Attention: Patch coverage is

Coverage Diff

| | main | #336 | +/- |
|---|---|---|---|
| Coverage | 73.86% | 73.88% | +0.02% |
| Files | 43 | 43 | |
| Lines | 8078 | 8085 | +7 |
| Branches | 8078 | 8085 | +7 |
| Hits | 5967 | 5974 | +7 |
| Misses | 1732 | 1732 | |
| Partials | 379 | 379 | |
Awesome, this is a great start :) One nit: we can `use Expression::{whatever}` and reduce the need for prepending `Expression::` to everything. And we can loop back on testing!
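To illustrate the nit, here is a minimal sketch of what importing enum variants buys you. The `Expression` enum below is a hypothetical, simplified stand-in and not the kernel's actual type; the variant names and the "is not null" shape of the predicate are assumptions made for illustration.

```rust
// Hypothetical, simplified expression type for illustration only.
// This is NOT the kernel's real `Expression` definition.
#[derive(Debug, Clone, PartialEq)]
pub enum Expression {
    Column(String),
    IsNotNull(Box<Expression>),
    Or(Box<Expression>, Box<Expression>),
}

// Without an import, every variant needs the `Expression::` prefix.
pub fn predicate_verbose() -> Expression {
    Expression::Or(
        Box::new(Expression::IsNotNull(Box::new(Expression::Column(
            "metaData.id".to_string(),
        )))),
        Box::new(Expression::IsNotNull(Box::new(Expression::Column(
            "protocol.minReaderVersion".to_string(),
        )))),
    )
}

// With a scoped `use`, the variants can be named directly.
pub fn predicate_concise() -> Expression {
    use Expression::{Column, IsNotNull, Or};
    Or(
        Box::new(IsNotNull(Box::new(Column("metaData.id".to_string())))),
        Box::new(IsNotNull(Box::new(Column(
            "protocol.minReaderVersion".to_string(),
        )))),
    )
}
```

Both functions build the same value; keeping the `use` inside the function limits the shortened names to the place where they are needed.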
In order to verify that an engine gets the predicate passed to log replay, I added a debug print statement in the default engine that logs the predicate. The reader acceptance tests use the default engine and you can observe that the predicate is received. For example, running the

Code coverage is not happy with the debug print lines. This is likely because
just a couple of small nits
lgtm!
@OussamaSaoudi-db can you make a follow-up issue to actually implement the predicate pushdown and to test this more thoroughly (maybe via metrics or some other way)?
lgtm! thanks!
This is a follow-up to [this PR](#336) that patches a mistake in the filter: the protocol action has a `protocol.minReaderVersion` field, not a `protocol.min_reader_version` field. This PR also fixes an intermittent test failure caused by repeated initialization of tracing, by changing the `test_scan_data` test to use the `test_log` crate for initializing logs instead.
Pass a predicate hint to `Engine` in `read_metadata` so that log files are filtered to contain either metadata or protocol columns.

The motivation of this change is to filter log data at the engine level instead of needlessly passing `EngineData` back to the kernel.

In order to verify that an engine gets the predicate passed to log replay, I added a debug print statement in the sync engine that logs the predicate. The `scan::tests::test_scan_data` test uses the sync engine and you can observe that the predicate is received. For example, running the test with `RUST_LOG=DEBUG` yields the following log:

The predicate checks protocol and metadata leaf fields `metaData.id` and `protocol.min_reader_version` instead of checking for `metaData` and `protocol`. This is done to leverage row group skipping, since statistics are kept for leaf fields, and not internal nodes. This is especially useful for large checkpoint files which will have exactly one row group containing protocol and metadata info.

Closes: #74
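As a rough sketch of why leaf fields matter for row group skipping, the following models per-row-group null counts keyed by leaf column name. The `RowGroupStats` type and both functions are hypothetical and do not reflect any engine's actual API; the sketch also assumes the predicate has an "is not null" shape, which the description does not spell out.

```rust
use std::collections::HashMap;

/// Hypothetical per-row-group statistics, keyed by leaf column path.
/// Internal nodes such as `metaData` or `protocol` have no entry.
pub struct RowGroupStats {
    pub num_rows: u64,
    pub null_counts: HashMap<String, u64>,
}

/// Returns true if the row group might contain a non-null value for `column`.
/// When no statistics exist for the column, skipping is unsafe, so this
/// conservatively returns true.
pub fn may_contain_non_null(stats: &RowGroupStats, column: &str) -> bool {
    match stats.null_counts.get(column) {
        Some(&nulls) => nulls < stats.num_rows,
        None => true,
    }
}

/// A row group must be read if any of the predicate's columns might be non-null.
pub fn should_read(stats: &RowGroupStats, columns: &[&str]) -> bool {
    columns.iter().any(|c| may_contain_non_null(stats, c))
}
```

Under this model, a column name with no statistics (an internal node, or a misspelled leaf such as `protocol.min_reader_version`) can never cause a row group to be skipped, so the filter stays correct but loses its benefit.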