GH-31387: [C++] Check nullability when validating fields on batches or struct arrays #46129

Open
wants to merge 9 commits into
base: main

singh1203
Contributor

@singh1203 singh1203 commented Apr 14, 2025

Rationale for this change

Ensures schema validation catches null values in non-nullable fields, preventing silent errors when writing to formats like Parquet.

What changes are included in this PR?

Fixes: #31387

  • Added nullability checks to ValidateFull() for arrays, struct arrays, union arrays, and record batches.
  • Introduced new validation logic in validate.cc that recursively checks for nulls in non-nullable fields.
  • Added unit tests in:
    • array_test.cc
    • array_struct_test.cc
    • array_union_test.cc
    • record_batch_test.cc

These tests ensure that ValidateFull() fails when nulls are present in non-nullable fields.

Are these changes tested?

Yes, new unit tests have been added.

Are there any user-facing changes?

Yes. Users who try to validate or write Arrow data with nulls in non-nullable fields will now receive an explicit validation error.

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@singh1203 changed the title from "ARROW-15961: [C++] Check nullability when validating fields on batches or struct arrays" to "GH-31387: [C++] Check nullability when validating fields on batches or struct arrays" on Apr 14, 2025

⚠️ GitHub issue #31387 has been automatically assigned in GitHub to PR creator.

@singh1203
Contributor Author

cc: @pitrou
@lidavidm

Member

@pitrou pitrou left a comment

Thanks for the PR. You'll find some detailed comments. Two more general comments, though:

  1. Most CI builds are failing, and you should have seen test failures locally too. Did you run the unit tests locally before pushing?
  2. The validation logic for nested arrays (struct, list, etc.) is missing. Did you forget it?

Comment on lines +175 to +177
auto type = struct_({field("a", int32(), /*nullable=*/false),
field("b", utf8(), /*nullable=*/false),
field("c", list(boolean()), /*nullable=*/false)});

Member

We should test more nested situations:

  • non-nullable struct inside a list
  • non-nullable struct inside a struct
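
To make the nested cases concrete, here is a rough toy model of the kind of recursion such tests would exercise. It is a sketch only, not Arrow code: `ToyArray`, `ToyField`, and `FindNullInNonNullable` are hypothetical names, and the model ignores offsets and the question of child slots that sit under a null parent, which a real implementation would likely need to handle.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are NOT Arrow's classes.
struct ToyArray;

struct ToyField {
  std::string name;
  bool nullable;
  std::shared_ptr<ToyArray> values;  // assumed to be non-null in this sketch
};

struct ToyArray {
  std::vector<bool> validity;      // validity[i] == false means slot i is null
  std::vector<ToyField> children;  // struct fields, or one "item" field for a list
};

// Returns the dotted path of the first non-nullable field that contains a
// null, or an empty string if no such field exists.
std::string FindNullInNonNullable(const ToyField& field,
                                  const std::string& prefix = "") {
  const std::string path = prefix.empty() ? field.name : prefix + "." + field.name;
  if (!field.nullable) {
    for (bool valid : field.values->validity) {
      if (!valid) return path;
    }
  }
  // Recurse into children regardless of this field's own nullability.
  for (const ToyField& child : field.values->children) {
    std::string found = FindNullInNonNullable(child, path);
    if (!found.empty()) return found;
  }
  return "";
}
```

In this model, a nullable list column whose non-nullable struct items contain a non-nullable field `a` with a null would be reported as `col.item.a`, which is the shape of failure the two suggested test cases are meant to catch.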

auto struct_arr = ArrayFromJSON(
type, R"([1, "a", [null, false]], [null, "bc", []], [2, null, null]])");
auto struct_arr_nonull = ArrayFromJSON(
type, R"([[1, "a"], [true, false], [6, "bc", []], [2, "bcj", [true, true]]])");

Member

Are you sure the JSON is right here? The brackets don't seem balanced.
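
One quick way to sanity-check a literal like the first one above is to count square brackets while skipping quoted strings. The helper below is a hypothetical sketch for illustration (`BracketsBalanced` is not part of Arrow or its test utilities), and it only checks bracket balance, not whether each row has the right shape.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not part of Arrow: returns true if every '[' has a
// matching ']' outside of double-quoted strings.
bool BracketsBalanced(std::string_view json) {
  int depth = 0;
  bool in_string = false;
  for (std::size_t i = 0; i < json.size(); ++i) {
    const char c = json[i];
    if (in_string) {
      if (c == '\\') {
        ++i;  // skip the escaped character
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == '"') {
      in_string = true;
    } else if (c == '[') {
      ++depth;
    } else if (c == ']') {
      if (--depth < 0) return false;  // closing bracket with no opener
    }
  }
  return depth == 0 && !in_string;
}
```

Applied to the `struct_arr` literal as quoted in the diff, the depth goes negative at the final `]`, since the outer opening bracket is missing.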

auto array_nested_null = ArrayFromJSON(type, "[[0, 1], [3, 4], [2, null]]");

ASSERT_RAISES(Invalid, array->ValidateFull());
ASSERT_RAISES(Invalid, array->ValidateFull());

Member

This line is repeated, did you mean something else?

ASSERT_RAISES(Invalid, array->ValidateFull());
}

TEST_F(TestArray, TestValidateFullNullableFixedSizeList) {

Member

I don't see a fixed-size list in this test, did I miss something?

@@ -70,6 +70,21 @@ TEST(TestUnionArray, TestSliceEquals) {
CheckUnion(batch->column(1));
}

TEST(TestSparseUnionArray, TestValidateFullNullable) {
auto ty = sparse_union({field("ints", int64()), field("strs", utf8(), false)}, {2, 7});

Member

I don't see any non-nullable field here.

@@ -70,6 +70,21 @@ TEST(TestUnionArray, TestSliceEquals) {
CheckUnion(batch->column(1));
}

TEST(TestSparseUnionArray, TestValidateFullNullable) {

Member

Can you also test a dense union?

@@ -464,8 +465,8 @@ struct ValidateArrayImpl {
return data.buffers[index] != nullptr && data.buffers[index]->address() != 0;
}

- Status RecurseInto(const ArrayData& related_data) {
- ValidateArrayImpl impl{related_data, full_validation};
+ Status RecurseInto(const ArrayData& related_data, bool nullable = true) {

Member

I would rather make the argument mandatory, so that we don't forget to pass it.

Suggested change
- Status RecurseInto(const ArrayData& related_data, bool nullable = true) {
+ Status RecurseInto(const ArrayData& related_data, bool nullable) {

@@ -558,7 +559,7 @@ struct ValidateArrayImpl {
}

if (full_validation) {
- if (data.null_count != kUnknownNullCount) {
+ if (data.null_count != kUnknownNullCount || !nullable) {

Member

Ok, but where does it test that actual_null_count is 0 if nullable is false?
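
For reference, the missing step might look roughly like the sketch below: count the nulls, compare against the declared count when one is known, and then separately reject any nulls when the field is non-nullable. This is a standalone illustration, not the code in validate.cc; `CheckNullCount` is a hypothetical name, the `-1` sentinel is assumed to mirror `kUnknownNullCount`, and error messages are invented.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the check under discussion; not Arrow's code.
// `validity` has one entry per slot; false means the slot is null.
// Returns an empty string on success, or an error message on failure.
std::string CheckNullCount(const std::vector<bool>& validity,
                           int64_t declared_null_count, bool nullable) {
  constexpr int64_t kUnknown = -1;  // assumed sentinel for "null count not computed"
  int64_t actual = 0;
  for (bool valid : validity) {
    if (!valid) ++actual;
  }
  // Existing behaviour: a declared count must agree with the actual count.
  if (declared_null_count != kUnknown && declared_null_count != actual) {
    return "null_count value doesn't match the actual number of nulls";
  }
  // The step the review asks about: non-nullable means zero actual nulls.
  if (!nullable && actual > 0) {
    return "non-nullable field contains " + std::to_string(actual) + " null(s)";
  }
  return "";
}
```

The point of the sketch is that widening the outer condition only ensures the count is computed; a second comparison is still needed to turn a non-zero count into an error.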

@@ -187,6 +187,9 @@ class SimpleTable : public Table {
ss << "Column " << i << ": " << st.message();
return st.WithMessage(ss.str());
}
if (schema_->field(i)->nullable() && col->null_count() > 0) {

Member

You should check that it's not nullable instead (did you notice the test failures?).
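
In other words, the condition presumably needs to be inverted. A minimal standalone sketch of the intended predicate, with `ViolatesNullability` as a hypothetical name that does not exist in Arrow:

```cpp
#include <cstdint>

// Hypothetical helper illustrating the corrected condition; not Arrow's code.
// A column is a violation only when its field is declared non-nullable
// and it nevertheless contains nulls.
bool ViolatesNullability(bool field_nullable, int64_t null_count) {
  return !field_nullable && null_count > 0;
}
```

With the condition as written in the diff, every nullable column that legitimately contains nulls would be rejected, which would be consistent with the test failures mentioned above.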

@github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on May 5, 2025
Successfully merging this pull request may close these issues.

[C++] Check nullability when validating fields on batches or struct arrays
2 participants