Parquet and JSON readers use Arc<Expression> to avoid deep copies #364

Merged: scovich merged 19 commits into delta-io:main from read-predicate-arc on Oct 18, 2024

Collaborator

@scovich scovich commented Sep 28, 2024

Today, the engine's Parquet and JSON file handler APIs take an Expression argument for predicate pushdown. They cannot take a reference, because the iterator they return will likely depend on (but outlive) that reference. Worse, they need to do so for every file the query reads. Unfortunately, data skipping predicates can be arbitrarily large, so deep-copying one per file is both annoying and expensive. We already use Arc to share schemas (some of the time, at least), and we can start using Arc to share expressions as well.
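To illustrate the idea in the description, here is a minimal sketch of a reader that hands each file a shared handle to the predicate instead of a deep copy. The `Expression` enum, `read_files`, and `shared_handles` are hypothetical stand-ins invented for this example, not the kernel's actual types or handler APIs.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the kernel's expression type. A real predicate
// tree can be arbitrarily deep, so `Clone` on it would be a deep copy.
#[derive(Debug, Clone, PartialEq)]
enum Expression {
    Column(String),
    Literal(i64),
    And(Vec<Expression>),
}

// Hypothetical reader: the returned iterator owns a handle to the predicate,
// so it can outlive the caller's borrow without copying the tree.
fn read_files(
    files: Vec<String>,
    predicate: Option<Arc<Expression>>,
) -> impl Iterator<Item = (String, Option<Arc<Expression>>)> {
    // Cloning an Option<Arc<_>> only bumps a reference count.
    files.into_iter().map(move |f| (f, predicate.clone()))
}

// Builds one predicate, "reads" two files, and reports how many handles
// to the single shared predicate are alive afterwards.
fn shared_handles() -> usize {
    let predicate = Arc::new(Expression::And(vec![
        Expression::Column("x".to_string()),
        Expression::Literal(10),
    ]));
    let files = vec!["a.parquet".to_string(), "b.parquet".to_string()];
    let per_file: Vec<_> = read_files(files, Some(predicate.clone())).collect();
    // Every per-file handle points at the same allocation as the original.
    assert!(per_file
        .iter()
        .all(|(_, p)| Arc::ptr_eq(p.as_ref().unwrap(), &predicate)));
    Arc::strong_count(&predicate)
}
```

In this sketch the predicate tree is allocated once; each file costs one reference-count increment regardless of how large the predicate is.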

@scovich scovich requested review from nicklan and hntd187 September 28, 2024 03:49
@@ -79,7 +80,7 @@ fn as_inverted_data_skipping_predicate(expr: &Expr) -> Option<Expr> {
             as_data_skipping_predicate(&expr)
         }
         VariadicOperation { op, exprs } => {
-            let expr = Expr::variadic(op.invert(), exprs.iter().cloned().map(Expr::not));
+            let expr = Expr::variadic(op.invert(), exprs.iter().cloned().map(|e| !e));
Collaborator Author

Given that we actually bothered to impl std::ops::Not for Expression, we may as well use it!

(Expr::not is really just shorthand for std::ops::Not::not, and requires the appropriate import. By contrast, ! is compiler magic and does not require any import.)
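A small sketch of the difference between the two spellings, using a hypothetical miniature `Expr` enum (not the kernel's definition). The operator form needs no import, while the `Expr::not` path form only resolves when the `Not` trait is in scope.

```rust
// Hypothetical miniature of an expression type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Not(Box<Expr>),
}

impl std::ops::Not for Expr {
    type Output = Expr;
    fn not(self) -> Expr {
        Expr::Not(Box::new(self))
    }
}

// Operator form: `!e` dispatches to the Not impl with no `use` required.
fn negate_all(exprs: &[Expr]) -> Vec<Expr> {
    exprs.iter().cloned().map(|e| !e).collect()
}

// Path form: `Expr::not` only compiles with the trait imported.
fn negate_all_by_path(exprs: &[Expr]) -> Vec<Expr> {
    use std::ops::Not;
    exprs.iter().cloned().map(Expr::not).collect()
}
```

Both functions produce the same result; the operator form simply avoids the extra import.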

None => return None,
};

let predicate = predicate.as_deref()?;
Collaborator Author

Deref is black magic, but it's sure nice when it works!

(trying just predicate? gives a compilation error)

Collaborator

That's because Try isn't implemented on &Option. as_deref borrows through the Option (and through the Arc), converting &Option<T> to Option<&T::Target>: https://doc.rust-lang.org/std/option/enum.Option.html#method.as_deref
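A minimal sketch of the pattern discussed in this thread, assuming the predicate is held as a `&Option<Arc<Expression>>`. The `Expression` struct and `predicate_name` function are hypothetical and exist only for this example.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a predicate expression.
#[derive(Debug, PartialEq)]
struct Expression(String);

// Writing `predicate?` here would not compile, because `?` cannot be
// applied to a `&Option<_>`. `as_deref` first produces an
// `Option<&Expression>` by borrowing through the Arc, and `?` works on that.
fn predicate_name(predicate: &Option<Arc<Expression>>) -> Option<&str> {
    let expr: &Expression = predicate.as_deref()?;
    Some(expr.0.as_str())
}
```

Calling `predicate_name(&None)` returns `None` early through the `?`, and no reference count is touched in either case.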

kernel/src/scan/mod.rs (resolved)
@@ -467,7 +464,7 @@ fn transform_to_logical_internal(
.get_expression_handler()
.get_evaluator(
read_schema,
read_expression.clone(),
Collaborator Author

One copy saved

Collaborator

nice!

kernel/src/snapshot.rs (outdated, resolved)
Collaborator

@nicklan nicklan left a comment

given that this matches what arrow does with their various XRef types, I'm inclined to think yeah this is a good idea.
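For readers unfamiliar with the convention mentioned here: Arrow exposes aliases such as `SchemaRef`, which is `Arc<Schema>`. A sketch of the same pattern applied to expressions might look like the following, where the `ExpressionRef` alias name, the `Expression` struct, and this `get_evaluator` signature are assumptions for illustration rather than the kernel's API.

```rust
use std::sync::Arc;

// Hypothetical stand-in type, not the kernel's definition.
#[derive(Debug, PartialEq)]
struct Expression(String);

// Alias in the style of Arrow's `SchemaRef` (`Arc<Schema>`).
// The name `ExpressionRef` is an assumption made for this sketch.
type ExpressionRef = Arc<Expression>;

// Hypothetical API that takes the shared handle by value and keeps it
// alive inside the returned closure.
fn get_evaluator(expression: ExpressionRef) -> impl Fn() -> ExpressionRef {
    move || Arc::clone(&expression)
}
```

The alias keeps signatures short while making it explicit that callers pass a cheap, shareable handle.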

kernel/src/scan/mod.rs (resolved)
@scovich scovich marked this pull request as ready for review October 10, 2024 13:42
@scovich scovich requested a review from nicklan October 10, 2024 13:43
codecov bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 83.33333% with 8 lines in your changes missing coverage. Please review.

Project coverage is 78.22%. Comparing base (edc85e5) to head (a4c7693).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/engine/default/parquet.rs | 28.57% | 5 Missing ⚠️ |
| ffi/src/scan.rs | 0.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #364      +/-   ##
==========================================
- Coverage   78.25%   78.22%   -0.04%     
==========================================
  Files          49       49              
  Lines       10256    10253       -3     
  Branches    10256    10253       -3     
==========================================
- Hits         8026     8020       -6     
- Misses       1780     1783       +3     
  Partials      450      450              


@github-actions github-actions bot added the breaking-change Change that will require a version bump label Oct 10, 2024
@scovich scovich changed the title from "[WIP] Parquet and JSON readers use Arc<Expression> to avoid deep copies" to "Parquet and JSON readers use Arc<Expression> to avoid deep copies" Oct 10, 2024
Collaborator

@zachschuermann zachschuermann left a comment

Looks great, just added a couple of questions of my own :)

kernel/src/engine/sync/mod.rs (resolved)
kernel/src/scan/mod.rs (outdated, resolved)

Collaborator

@nicklan nicklan left a comment

lgtm, one question that zach had too

kernel/src/scan/mod.rs (outdated, resolved)
@scovich scovich merged commit 9b2e7e3 into delta-io:main Oct 18, 2024
14 checks passed
@scovich scovich deleted the read-predicate-arc branch November 8, 2024 21:00