-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support annotation for nullable values in a container #342
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #342 +/- ##
==========================================
+ Coverage 73.86% 74.00% +0.14%
==========================================
Files 43 43
Lines 8078 8130 +52
Branches 8078 8130 +52
==========================================
+ Hits 5967 6017 +50
- Misses 1732 1734 +2
Partials 379 379 ☔ View full report in Codecov by Sentry. |
derive-macros/src/lib.rs
Outdated
let have_schema_null = field.attrs.iter().any(|attr| { | ||
// check if we have schema_map_values_null attr | ||
match &attr.meta { | ||
Meta::Path(path) => path.get_ident().map(|ident| ident == "schema_container_values_null").unwrap_or(false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meta::Path(path) => path.get_ident().map(|ident| ident == "schema_container_values_null").unwrap_or(false), | |
Meta::Path(path) => path.get_ident() | |
.is_some_and(|ident| ident == "schema_container_values_null"), |
derive-macros/src/lib.rs
Outdated
@@ -10,7 +10,7 @@ use syn::{parse_macro_input, Data, DataStruct, DeriveInput, Fields, PathArgument | |||
/// Change Metadata](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#change-metadata) | |||
/// action (this macro allows the use of standard rust snake_case, and will convert to the correct | |||
/// delta schema camelCase version). | |||
#[proc_macro_derive(Schema)] | |||
#[proc_macro_derive(Schema, attributes(schema_container_values_null))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we doc-comment what schema_container_values_null
means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, added
derive-macros/src/lib.rs
Outdated
if have_schema_null { | ||
if let Some(first_ident) = type_path.path.segments.first().map(|seg| &seg.ident) { | ||
if first_ident != "HashMap" && first_ident != "Vec" { | ||
panic!("Can only use schema_container_values_null on HashMap or Vec fields, not {first_ident:?}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would it be correct to delete null entries from a Vec? It would change the indexes of all entries after it, which could cause a problem if the indexes were somehow meaningful?
(I also wonder what potential use cases that even serves?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. We don't actually have a use case for it yet, so I've removed it and only support HashMap
for now. If we later find a use case we can add it back when we understand the semantics better.
kernel/src/actions/mod.rs
Outdated
@@ -141,6 +141,7 @@ pub struct Add { | |||
pub path: String, | |||
|
|||
/// A map from partition column to value for this logical file. | |||
#[schema_container_values_null] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider //
commenting why the attribute is needed?
(if I understand correctly, partition values can be physically null, and the attribute causes the read to correctly handle that even tho the resulting HashMap will simply not contain them)
However, the above explanation is a bit hard to parse, and doesn't immediately follow from the attribute. That makes me wonder if we need to take a different approach to avoid trouble.
For example -- should the attribute be called drop_null_container_values
? Then we could document it as "I know the underlying data for this non-nullable map/vec is nullable -- drop null values instead of erroring out"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I've renamed it to drop_null_container_values
and updated the doc comment to hopefully clarify things.
if first_ident != "HashMap" && first_ident != "Vec" { | ||
panic!("Can only use schema_container_values_null on HashMap or Vec fields, not {first_ident:?}"); | ||
if first_ident != "HashMap" { | ||
panic!("Can only use drop_null_container_values on HashMap fields, not {first_ident:?}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC panic
technically works, because it just kills the macro invocation and produces a compiler error... but IIRC there's a better way, emitting an error to the output token stream so the span info can give a more precise error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we could use to_compile_error()
I've created #348 to convert to this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
if first_ident != "HashMap" && first_ident != "Vec" { | ||
panic!("Can only use schema_container_values_null on HashMap or Vec fields, not {first_ident:?}"); | ||
if first_ident != "HashMap" { | ||
panic!("Can only use drop_null_container_values on HashMap fields, not {first_ident:?}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL!
When arrow fixes apache/arrow-rs#6391, our code will start to fail since we don't allow for nulls in partition values, but they can be null.
This adds an annotation that will cause the generated schema to allow null values. We already have the semantics that if value is null on the arrow side, then
materialize
will simply not include it in the final map, so nothing else needs to change.Tested by:
add
schema that ensures it sets thevalue_contains_null
field to true on thepartitionValues
map.