Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support annotation for nullable values in a container #342

Merged
merged 6 commits into from
Sep 24, 2024

Conversation

nicklan
Copy link
Collaborator

@nicklan nicklan commented Sep 18, 2024

When arrow fixes apache/arrow-rs#6391, our code will start to fail since we don't allow for nulls in partition values, but they can be null.

This adds an annotation that will cause the generated schema to allow null values. We already have the semantics that if value is null on the arrow side, then materialize will simply not include it in the final map, so nothing else needs to change.

Tested by:

  1. All tests still pass
  2. Added a new test for the add schema that ensures it sets the value_contains_null field to true on the partitionValues map.
  3. Looking at generated code and noting we get
HashMap::<
  String,
  String,
>::get_nullable_container_struct_field("partitionValues"),

@zachschuermann zachschuermann self-requested a review September 18, 2024 01:06
@nicklan nicklan requested a review from scovich September 18, 2024 01:07
Copy link

codecov bot commented Sep 18, 2024

Codecov Report

Attention: Patch coverage is 96.29630% with 2 lines in your changes missing coverage. Please review.

Project coverage is 74.00%. Comparing base (ed1919a) to head (ac87235).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
derive-macros/src/lib.rs 87.50% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #342      +/-   ##
==========================================
+ Coverage   73.86%   74.00%   +0.14%     
==========================================
  Files          43       43              
  Lines        8078     8130      +52     
  Branches     8078     8130      +52     
==========================================
+ Hits         5967     6017      +50     
- Misses       1732     1734       +2     
  Partials      379      379              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

let have_schema_null = field.attrs.iter().any(|attr| {
// check if we have schema_map_values_null attr
match &attr.meta {
Meta::Path(path) => path.get_ident().map(|ident| ident == "schema_container_values_null").unwrap_or(false),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Meta::Path(path) => path.get_ident().map(|ident| ident == "schema_container_values_null").unwrap_or(false),
Meta::Path(path) => path.get_ident()
.is_some_and(|ident| ident == "schema_container_values_null"),

@@ -10,7 +10,7 @@ use syn::{parse_macro_input, Data, DataStruct, DeriveInput, Fields, PathArgument
/// Change Metadata](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#change-metadata)
/// action (this macro allows the use of standard rust snake_case, and will convert to the correct
/// delta schema camelCase version).
#[proc_macro_derive(Schema)]
#[proc_macro_derive(Schema, attributes(schema_container_values_null))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we doc-comment what schema_container_values_null means?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, added

if have_schema_null {
if let Some(first_ident) = type_path.path.segments.first().map(|seg| &seg.ident) {
if first_ident != "HashMap" && first_ident != "Vec" {
panic!("Can only use schema_container_values_null on HashMap or Vec fields, not {first_ident:?}");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it be correct to delete null entries from a Vec? It would change the indexes of all entries after it, which could cause a problem if the indexes were somehow meaningful?

(I also wonder what potential use cases that even serves?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. We don't actually have a use case for it yet, so I've removed it and only support HashMap for now. If we later find a use case we can add it back when we understand the semantics better.

@@ -141,6 +141,7 @@ pub struct Add {
pub path: String,

/// A map from partition column to value for this logical file.
#[schema_container_values_null]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider // commenting why the attribute is needed?
(if I understand correctly, partition values can be physically null, and the attribute causes the read to correctly handle that even tho the resulting HashMap will simply not contain them)

However, the above explanation is a bit hard to parse, and doesn't immediately follow from the attribute. That makes me wonder if we need to take a different approach to avoid trouble.

For example -- should the attribute be called drop_null_container_values? Then we could document it as "I know the underlying data for this non-nullable map/vec is nullable -- drop null values instead of erroring out"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I've renamed it to drop_null_container_values and updated the doc comment to hopefully clarify things.

@nicklan nicklan requested a review from scovich September 18, 2024 16:52
if first_ident != "HashMap" && first_ident != "Vec" {
panic!("Can only use schema_container_values_null on HashMap or Vec fields, not {first_ident:?}");
if first_ident != "HashMap" {
panic!("Can only use drop_null_container_values on HashMap fields, not {first_ident:?}");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC panic technically works, because it just kills the macro invocation and produces a compiler error... but IIRC there's a better way, emitting an error to the output token stream so the span info can give a more precise error message?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could use to_compile_error()

I've created #348 to convert to this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL!

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

if first_ident != "HashMap" && first_ident != "Vec" {
panic!("Can only use schema_container_values_null on HashMap or Vec fields, not {first_ident:?}");
if first_ident != "HashMap" {
panic!("Can only use drop_null_container_values on HashMap fields, not {first_ident:?}");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL!

@nicklan nicklan merged commit b4e5403 into delta-io:main Sep 24, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reading json map with non-nullable value schema doesn't error if values are actually null
3 participants