@scovich scovich commented Sep 28, 2024

Today, the engine parquet/json file handler APIs take an Expression arg for predicate pushdown. They cannot take a reference, because the iterator they return will likely depend on (but outlive) that reference. Worse, callers need to copy the expression for every file the query reads. Unfortunately, data skipping predicates can be arbitrarily large, which makes all that copying annoying and expensive. We already use Arc to protect schemas (some of the time, at least), and we can start using Arc to protect expressions as well.
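
To illustrate the idea, here is a minimal sketch, not the kernel's actual API: `Expression` is a toy stand-in and `read_files` is a hypothetical handler. It shows an iterator that owns an `Arc<Expression>` handle, so it can outlive the caller's borrow while every file shares one allocation instead of receiving a deep copy.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the kernel's Expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expression {
    Column(String),
    Literal(i64),
    And(Vec<Expression>),
}

// Hypothetical file handler. The returned iterator owns its own handle to
// the predicate, so it does not borrow from the caller. Cloning the
// Option<Arc<_>> only bumps a reference count; the tree is never copied.
fn read_files(
    files: Vec<String>,
    predicate: Option<Arc<Expression>>,
) -> impl Iterator<Item = (String, Option<Arc<Expression>>)> {
    files.into_iter().map(move |file| (file, predicate.clone()))
}

fn main() {
    let predicate = Arc::new(Expression::And(vec![
        Expression::Column("x".to_string()),
        Expression::Literal(10),
    ]));

    let files = vec!["a.parquet".to_string(), "b.parquet".to_string()];
    let results: Vec<_> = read_files(files, Some(Arc::clone(&predicate))).collect();

    for (file, pred) in &results {
        let pred = pred.as_ref().expect("predicate was supplied");
        // Every file sees the same allocation as the caller's predicate.
        assert!(Arc::ptr_eq(pred, &predicate));
        println!("{file}: {pred:?}");
    }
}
```

With a plain `Expression` argument, the `clone()` inside the closure would copy the whole expression tree once per file.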

@scovich scovich requested review from hntd187 and nicklan September 28, 2024 03:49
  }
  VariadicOperation { op, exprs } => {
-     let expr = Expr::variadic(op.invert(), exprs.iter().cloned().map(Expr::not));
+     let expr = Expr::variadic(op.invert(), exprs.iter().cloned().map(|e| !e));
Collaborator Author

Given that we actually bothered to impl std::ops::Not for Expression, we may as well use it!

(Expr::not is really just shorthand for std::ops::Not::not, and requires the appropriate import. By contrast, the ! operator is compiler magic and does not require any import.)
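
A small sketch of the difference, using a toy `Expr` enum rather than the kernel's real type (the `Negated` variant and both helper functions are made up for illustration). The path form `Expr::not` only resolves when the `Not` trait is in scope, while the `!` operator works either way.

```rust
use std::ops::Not; // needed for the `Expr::not` path form, not for `!`

// Toy expression type, assumed for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Negated(Box<Expr>),
}

impl Not for Expr {
    type Output = Expr;

    fn not(self) -> Expr {
        Expr::Negated(Box::new(self))
    }
}

// Path form: resolves to <Expr as Not>::not, so the trait must be imported.
fn invert_all_with_path(exprs: &[Expr]) -> Vec<Expr> {
    exprs.iter().cloned().map(Expr::not).collect()
}

// Operator form: `!e` is desugared by the compiler, no import required.
fn invert_all_with_operator(exprs: &[Expr]) -> Vec<Expr> {
    exprs.iter().cloned().map(|e| !e).collect()
}

fn main() {
    let exprs = vec![Expr::Column("a".into()), Expr::Column("b".into())];
    assert_eq!(invert_all_with_path(&exprs), invert_all_with_operator(&exprs));
    println!("{:?}", invert_all_with_operator(&exprs));
}
```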

None => return None,
};

let predicate = predicate.as_deref()?;
Collaborator Author

Deref is black magic, but it's sure nice when it works!

(trying just predicate? gives a compilation error)

Collaborator

That's because Try isn't implemented for &Option. as_deref converts &Option<T> to Option<&T::Target>, which does support the ? operator: https://doc.rust-lang.org/std/option/enum.Option.html#method.as_deref
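
A minimal sketch of that behavior. The diff does not show the declared type of `predicate`, so this assumes it is a `&Option<Arc<Expression>>`; `Expression` and `describe` are hypothetical names for the example.

```rust
use std::sync::Arc;

// Toy stand-in for the kernel's Expression type.
#[derive(Debug, PartialEq)]
struct Expression(String);

// Hypothetical helper that bails out early when no predicate is present.
fn describe(predicate: &Option<Arc<Expression>>) -> Option<String> {
    // `predicate?` would not compile here: `?` is not supported on `&Option<_>`.
    // `as_deref` yields Option<&Expression>, which `?` can unwrap.
    let predicate: &Expression = predicate.as_deref()?;
    Some(format!("filter on {}", predicate.0))
}

fn main() {
    let some = Some(Arc::new(Expression("x > 1".to_string())));
    assert_eq!(describe(&some), Some("filter on x > 1".to_string()));
    assert_eq!(describe(&None), None);
}
```

Note that `as_deref` goes one step further than `as_ref`: it also dereferences through the `Arc`, so the caller gets a plain `&Expression`.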

.get_expression_handler()
.get_evaluator(
read_schema,
read_expression.clone(),
Collaborator Author

One copy saved

Member

nice!

@nicklan nicklan left a comment

given that this matches what arrow does with their various XRef types, I'm inclined to think yeah this is a good idea.

@scovich scovich marked this pull request as ready for review October 10, 2024 13:42
@scovich scovich requested a review from nicklan October 10, 2024 13:43
codecov bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 83.33333% with 8 lines in your changes missing coverage. Please review.

Project coverage is 78.22%. Comparing base (edc85e5) to head (a4c7693).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/engine/default/parquet.rs | 28.57% | 5 Missing ⚠️ |
| ffi/src/scan.rs | 0.00% | 3 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #364      +/-   ##
==========================================
- Coverage   78.25%   78.22%   -0.04%     
==========================================
  Files          49       49              
  Lines       10256    10253       -3     
  Branches    10256    10253       -3     
==========================================
- Hits         8026     8020       -6     
- Misses       1780     1783       +3     
  Partials      450      450              

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Oct 10, 2024
@scovich scovich changed the title from "[WIP] Parquet and JSON readers use Arc<Expression> to avoid deep copies" to "Parquet and JSON readers use Arc<Expression> to avoid deep copies" on Oct 10, 2024
@zachschuermann zachschuermann left a comment

looks great just added a couple questions of my own :)

@nicklan nicklan left a comment

lgtm, one question that zach had too

@scovich scovich merged commit 9b2e7e3 into delta-io:main Oct 18, 2024
14 checks passed
@scovich scovich deleted the read-predicate-arc branch November 8, 2024 21:00