-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: add metadata to literal expressions #16170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add metadata to literal expressions #16170
Conversation
273038e
to
124ea41
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I have a few questions but in general I like this approach. I'm relatively new to the internals, but elevating the metadata into the enum
seems to have done a better job of exposing the places it might have otherwise been ignored.
/// A constant value along with associated metadata | ||
Literal(ScalarValue, Option<BTreeMap<String, String>>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious on the choice of a BTreeMap
here (isn't Field
using a HashMap
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for programmer's convenience. BTreeMap
will derive things like Debug
and Hash
which aren't implemented on HashMap
since HashMap
is not guaranteed to have specific ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was personally confused by it (and I believe it's why the conversions I pointed out below can't just be .clone()
). If the HashMap is not convenient for the metadata perhaps that's better done consistently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see...if you make this a HashMap the #[derive(Debug)]
on the enum no longer works out of the box. I don't have a strong opinion about this one!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend:
- Adding a comment about why BTree is used to capture the context of this PR
- Change the metadata to be Arc'd so that
Clone
ing literals with metadata is not expensive.
I don't think this is a blocker for this PR, but do think it is important for the long term. Thus if we merge this API in for 48.0.0 I think we should be prepared to make a breaking API change for 49.0.0
I will try and whip up a prototype of what I am talking about
metadata | ||
.iter() | ||
.map(|(k, v)| (k.clone(), v.clone())) | ||
.collect(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does metadata.clone()
or metadata.into()
not work here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Described in below comment
pub fn lit_with_metadata<T: Literal>( | ||
n: T, | ||
metadata: impl Into<Option<HashMap<String, String>>>, | ||
) -> Expr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙏
@@ -342,15 +342,15 @@ mod test { | |||
) | |||
.unwrap(); | |||
let snap = dynamic_filter_1.snapshot().unwrap().unwrap(); | |||
insta::assert_snapshot!(format!("{snap:?}"), @r#"BinaryExpr { left: Column { name: "a", index: 0 }, op: Eq, right: Literal { value: Int32(42) }, fail_on_overflow: false }"#); | |||
insta::assert_snapshot!(format!("{snap:?}"), @r#"BinaryExpr { left: Column { name: "a", index: 0 }, op: Eq, right: Literal { value: Int32(42), field: Field { name: "42", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, fail_on_overflow: false }"#); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is Display
or Debug
, but would it be possible to omit the Field
if it isn't carrying extra metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is Debug
output from format!("{snap:?}")
let mut new_metadata = prior_metadata | ||
.as_ref() | ||
.map(|m| { | ||
m.iter() | ||
.map(|(k, v)| (k.clone(), v.clone())) | ||
.collect::<HashMap<String, String>>() | ||
}) | ||
.unwrap_or_default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume there's some reason that .cloned()
doesn't work here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you try doing that there is some trait bound that is unsatisfied that I didn't spend any more time trying to dig through.
@@ -6061,7 +6061,7 @@ physical_plan | |||
04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))] | |||
05)--------ProjectionExec: expr=[] | |||
06)----------CoalesceBatchesExec: target_batch_size=8192 | |||
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278") }, Literal { value: Utf8View("a") }, Literal { value: Utf8View("b") }, Literal { value: Utf8View("c") }]) | |||
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "7f4b18de3cfeb9b4ac78c381ee2ad278", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "a", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "b", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "c", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For those of us looking at physical plans all the time, I do think it might be worth the effort to make the Literal
output more succinct (perhaps just the metadata part?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want this on Display
or on Debug
? My gut reaction is that we just want Debug
as is. Right now on Display
I didn't change anything, but I could enhance it to include metadata when it exists. Since you're the first user, what are you looking for here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure which one shows up when you EXPLAIN
, but this does seem to affect the readability of that output? (Totally optional, can leave as a follow up)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add this in right now. How would you like this to look? Suppose I have
let metadata: HashMap<_, _> = [("key1", "value1")]
.iter()
.map(|(k, v)| (k.to_string(), v.to_string()))
.collect();
let expr = lit_with_metadata(4.0, Some(metadata));
This expression shows up in the logical plan of explain as Float64(4)
and in the physical plan as 4 as Float64(4)
. How would you like that metadata shown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this look ok?
Float64(4) {"key1": "value1"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may play havoc with conversion to SQL but honestly I don't know how we'd handle metadata as SQL myself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks great! (I don't know what the implications are here, just a passing comment on impacting readability for the ubiquitous non-metadata case)
I don't know how we'd handle metadata as SQL myself
Not today, but rendering it as a function call may work (i.e., WithMetadata('...')
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a check locally to make sure that EXPLAIN output for normal queries isn't affected by this change and it isn't. Thanks!
124ea41
to
fb1e0ac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @timsaucer.
Note for other reviewers - although this looks like a large PR, most of the changes are one of:
- update pattern matching to accommodate extra field
- propagating the new metadata
- updating tests assertions due to the Debug output including the new field
@alamb @xudong963 I think that we can include this in the next DF 48 rc |
🤖 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as this change doesn't cause any performance regressions (I started some benchmarks running) I am also ok with merging this PR
If we want to include it in the 48.0.0 I recommend we:
- merge this PR to main
- Create a backport PR that cherry-pick's the change to the
branch-48
branch
/// A constant value along with associated metadata | ||
Literal(ScalarValue, Option<BTreeMap<String, String>>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend:
- Adding a comment about why BTree is used to capture the context of this PR
- Change the metadata to be Arc'd so that
Clone
ing literals with metadata is not expensive.
I don't think this is a blocker for this PR, but do think it is important for the long term. Thus if we merge this API in for 48.0.0 I think we should be prepared to make a breaking API change for 49.0.0
I will try and whip up a prototype of what I am talking about
🤖: Benchmark completed Details
|
🤖 |
Here is a proposed PR: |
🤖: Benchmark completed Details
|
I'm a bit under the weather at the moment, but it sounds like people are leaning towards
Is that right? I took a quick look at Andrew's proposal and it looks like a good idea. I just wanted to avoid too much API churn. |
That is also my understanding. I will go ahead and merge this PR and create a PR to cherry pick to branch-48. I hope you feel better soon! |
* Adding metadata to literal enum type * Correct processing of metadata during simplificiation * Switch to btreemap instead of hashmap for carrying around metadata * Documentation update * Update unit test * Additional unit tests needed fields in the debug output * Updates after rebase * Add metadata to display when it exists
Thanks @andygrove and @timsaucer -- the plan of merging this PR and then a quick follow on in 49 seems great to me The planning benchmarks also looked good (no regressions) Thanks for helping make this happen |
I made a PR to main here (no rush on review): |
Which issue does this PR close?
Rationale for this change
This is an alternative to #16053
In this version, we change the enum variant of
Literal
in both logical and physical expressions to contain the optional metadata. This is a deeper change than #16053 but it does make an easier to follow data flow path. Rather than relying on complex checking on the schema during simplification and creation of the physical plans, it adds the schema to the logical expression.What changes are included in this PR?
Are these changes tested?
Tested with unit tests and add additional unit test that specifically checks for metadata on the input and output of scalar values.
Are there any user-facing changes?
The biggest change is that users who check literal values will need to update their enum variants to take and pass the additional metadata field.