-
Notifications
You must be signed in to change notification settings - Fork 3.8k
GH-31387: [C++] Check nullability when validating fields on batches or struct arrays #46129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ruct Signed-off-by: Saurabh Kumar Singh <[email protected]>
…ation Signed-off-by: Saurabh Kumar Singh <[email protected]>
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
See also: |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. You'll find some detailed comments. Two more general comments, though:
- most CI builds are failing, and you should have seen test failures locally too. Did you run the unit tests locally before pushing?
- the validation logic for nested arrays (struct, list, etc.) is missing, did you forget it?
auto type = struct_({field("a", int32(), /*nullable=*/false), | ||
field("b", utf8(), /*nullable=*/false), | ||
field("c", list(boolean()), /*nullable=*/false)}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should test more nested situations:
- non-nullable struct inside a list
- non-nullable struct inside a struct
auto struct_arr = ArrayFromJSON( | ||
type, R"([1, "a", [null, false]], [null, "bc", []], [2, null, null]])"); | ||
auto struct_arr_nonull = ArrayFromJSON( | ||
type, R"([[1, "a"], [true, false], [6, "bc", []], [2, "bcj", [true, true]]])"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure the JSON is right here? The brackets don't seem balanced.
cpp/src/arrow/array/array_test.cc
Outdated
auto array_nested_null = ArrayFromJSON(type, "[[0, 1], [3, 4], [2, null]]"); | ||
|
||
ASSERT_RAISES(Invalid, array->ValidateFull()); | ||
ASSERT_RAISES(Invalid, array->ValidateFull()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is repeated, did you mean something else?
ASSERT_RAISES(Invalid, array->ValidateFull()); | ||
} | ||
|
||
TEST_F(TestArray, TestValidateFullNullableFixedSizeList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a fixed-size list in this test, did I miss something?
@@ -70,6 +70,21 @@ TEST(TestUnionArray, TestSliceEquals) { | |||
CheckUnion(batch->column(1)); | |||
} | |||
|
|||
TEST(TestSparseUnionArray, TestValidateFullNullable) { | |||
auto ty = sparse_union({field("ints", int64()), field("strs", utf8(), false)}, {2, 7}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any non-nullable field here.
@@ -70,6 +70,21 @@ TEST(TestUnionArray, TestSliceEquals) { | |||
CheckUnion(batch->column(1)); | |||
} | |||
|
|||
TEST(TestSparseUnionArray, TestValidateFullNullable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test a dense union?
cpp/src/arrow/array/validate.cc
Outdated
@@ -464,8 +465,8 @@ struct ValidateArrayImpl { | |||
return data.buffers[index] != nullptr && data.buffers[index]->address() != 0; | |||
} | |||
|
|||
Status RecurseInto(const ArrayData& related_data) { | |||
ValidateArrayImpl impl{related_data, full_validation}; | |||
Status RecurseInto(const ArrayData& related_data, bool nullable = true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather make the argument mandatory, so that we don't forget to pass it.
Status RecurseInto(const ArrayData& related_data, bool nullable = true) { | |
Status RecurseInto(const ArrayData& related_data, bool nullable) { |
@@ -558,7 +559,7 @@ struct ValidateArrayImpl { | |||
} | |||
|
|||
if (full_validation) { | |||
if (data.null_count != kUnknownNullCount) { | |||
if (data.null_count != kUnknownNullCount || !nullable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but where does it test that actual_null_count
is 0 if nullable
is false?
cpp/src/arrow/table.cc
Outdated
@@ -187,6 +187,9 @@ class SimpleTable : public Table { | |||
ss << "Column " << i << ": " << st.message(); | |||
return st.WithMessage(ss.str()); | |||
} | |||
if (schema_->field(i)->nullable() && col->null_count() > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should check that it's not nullable instead (did you notice the test failures?).
Signed-off-by: Saurabh Kumar Singh <[email protected]>
Rationale for this change
Ensures schema validation catches null values in non-nullable fields, preventing silent errors when writing to formats like Parquet.
What changes are included in this PR?
Fixes: #31387
These tests ensure that
ValidateFull()
fails when nulls are present in non-nullable fields.Are these changes tested?
Yes, new unit tests have been added
Are there any user-facing changes?
Yes, Users who try to validate or write Arrow data with nulls in non-nullable fields will now receive an explicit validation error.