Schema DSL for testing #566

Open
bkietz opened this issue Jul 25, 2024 · 6 comments

Comments

@bkietz
Member

bkietz commented Jul 25, 2024

Arrow C++ includes factories for constructing schemas, types, fields, and metadata, which keep the construction of even deeply nested structures concise and expressive:

schema({
  field("some_col", int32(), key_value_metadata({
    {"some_key_field", "some_value_field"},
  })),
}, key_value_metadata({{"some_key", "some_value"}}));

It should be straightforward to write equivalent factories which build a nanoarrow::UniqueSchema.
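
As a rough illustration of what such factories might look like, here is a minimal, dependency-free sketch. Everything in the `sketch` namespace (`Node`, `int32()`, `utf8()`, `struct_()`, `field()`, `Export()`, `Holder`) is a hypothetical name invented for this example and is not nanoarrow API. The `ArrowSchema` struct is redeclared from the Arrow C Data Interface only so that the snippet compiles on its own. A real implementation would presumably include nanoarrow's headers, build on its schema helpers, and return a `nanoarrow::UniqueSchema`. Metadata is left out to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Mirrors the struct defined by the Arrow C Data Interface; real code would
// get it from nanoarrow.h instead of redeclaring it.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};
#define ARROW_FLAG_NULLABLE 2

namespace sketch {  // hypothetical namespace, not part of nanoarrow

// Plain description of a (possibly nested) field. A std::vector of an
// incomplete type as a member requires C++17.
struct Node {
  std::string format;
  std::string name;
  std::vector<Node> children;
};

inline Node int32() { return Node{"i", "", {}}; }
inline Node utf8() { return Node{"u", "", {}}; }
inline Node struct_(std::vector<Node> fields) {
  return Node{"+s", "", std::move(fields)};
}
inline Node field(std::string name, Node type) {
  type.name = std::move(name);
  return type;
}

// Storage that an exported ArrowSchema points into.
struct Owned {
  std::string format, name;
  std::vector<ArrowSchema> children;
  std::vector<ArrowSchema*> child_ptrs;
};

inline void Release(ArrowSchema* schema) {
  assert(schema->release != nullptr);
  auto* owned = static_cast<Owned*>(schema->private_data);
  for (ArrowSchema& child : owned->children) {
    if (child.release != nullptr) child.release(&child);
  }
  delete owned;
  schema->release = nullptr;
}

// Recursively fill `out` from the description in `node`.
inline void Export(const Node& node, ArrowSchema* out) {
  auto* owned = new Owned{node.format, node.name, {}, {}};
  owned->children.resize(node.children.size());
  for (std::size_t i = 0; i < node.children.size(); ++i) {
    Export(node.children[i], &owned->children[i]);
    owned->child_ptrs.push_back(&owned->children[i]);
  }
  *out = ArrowSchema{};
  out->format = owned->format.c_str();
  out->name = owned->name.c_str();
  out->flags = ARROW_FLAG_NULLABLE;  // assumption: nullable by default
  out->n_children = static_cast<int64_t>(owned->child_ptrs.size());
  out->children = owned->child_ptrs.data();
  out->release = &Release;
  out->private_data = owned;
}

// Minimal RAII owner standing in for nanoarrow::UniqueSchema.
struct Holder {
  ArrowSchema schema{};
  explicit Holder(const Node& node) { Export(node, &schema); }
  ~Holder() {
    if (schema.release != nullptr) schema.release(&schema);
  }
  Holder(const Holder&) = delete;
  Holder& operator=(const Holder&) = delete;
};

}  // namespace sketch
```

With these definitions, `sketch::Holder h(sketch::struct_({sketch::field("some_col", sketch::int32())}));` yields a struct schema (`"+s"`) with one child whose format is `"i"`.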

@bkietz
Member Author

bkietz commented Jul 25, 2024

This should include a schema equality utility too

@paleolimbot
Member

We could certainly replicate Arrow C++'s syntax here, although I am hesitant to add scope to nanoarrow or make it seem like we are trying to replace anything about Arrow C++.

> This should include a schema equality utility too

We have a few places that do something like this...for integration testing we have one that is slow (and somewhat specific to the types of schemas that show up in the integration testing) but generates a nice diff:

nanoarrow::testing::TestingJSONComparison comparison;
SetComparisonOptions(&comparison);
NANOARROW_RETURN_NOT_OK(
    comparison.CompareSchema(actual.get(), data.schema.get(), error));
if (comparison.num_differences() > 0) {
  std::stringstream ss;
  comparison.WriteDifferences(ss);
  ArrowErrorSet(error, "Found %d differences:\n%s",
                static_cast<int>(comparison.num_differences()), ss.str().c_str());
  return EINVAL;
}

...and in Python we have one (that should almost certainly be written in C) that performs the check but doesn't generate very useful output on failure:

def type_equals(self, CSchema other, check_nullability: bool=False) -> bool:
    """Test two schemas for data type equality

    Checks two CSchema objects for type equality (i.e., that an array with
    schema ``actual`` contains elements with the same logical meaning as an
    array with schema ``expected``). Notably, this excludes metadata from
    all nodes in the schema.

    Parameters
    ----------
    other : CSchema
        The schema against which to test
    check_nullability : bool
        If True, actual and expected will be considered equal if their
        data type information and marked nullability are identical.
    """
    self._assert_valid()

    if self._ptr == other._ptr:
        return True

    if self.format != other.format:
        return False

    # Nullability is not strictly part of the "type"; however, performing
    # this check recursively is verbose to otherwise accomplish and
    # sometimes this does matter.
    cdef int64_t flags = self.flags
    cdef int64_t other_flags = other.flags
    if not check_nullability:
        flags &= ~ARROW_FLAG_NULLABLE
        other_flags &= ~ARROW_FLAG_NULLABLE

    if flags != other_flags:
        return False

    if self.n_children != other.n_children:
        return False

    for child, other_child in zip(self.children, other.children):
        if not child.type_equals(other_child, check_nullability=check_nullability):
            return False

    if (self.dictionary is None) != (other.dictionary is None):
        return False

    if self.dictionary is not None:
        if not self.dictionary.type_equals(
            other.dictionary,
            check_nullability=check_nullability
        ):
            return False

    return True

Both of those are pretty specific to exactly what we needed them for.
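
For reference, the Cython logic above ports fairly directly to C++ operating on the raw `ArrowSchema` struct. The following is only a sketch under a few assumptions: `SchemaTypeEquals` is a hypothetical name (nothing by that name is claimed to exist in nanoarrow), the struct is redeclared from the Arrow C Data Interface so the snippet stands alone, and both inputs are assumed to be valid, unreleased schemas. Like the Python version, it ignores names and metadata and produces no diagnostic output on mismatch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Mirrors the struct defined by the Arrow C Data Interface; real code would
// get it from nanoarrow.h instead of redeclaring it.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};
#define ARROW_FLAG_NULLABLE 2

// Hypothetical helper: recursive data type equality that ignores field names
// and metadata. Nullability is compared only when check_nullability is true.
inline bool SchemaTypeEquals(const ArrowSchema* actual, const ArrowSchema* expected,
                             bool check_nullability = false) {
  assert(actual != nullptr && expected != nullptr);
  if (actual == expected) return true;

  if (std::strcmp(actual->format, expected->format) != 0) return false;

  int64_t actual_flags = actual->flags;
  int64_t expected_flags = expected->flags;
  if (!check_nullability) {
    actual_flags &= ~static_cast<int64_t>(ARROW_FLAG_NULLABLE);
    expected_flags &= ~static_cast<int64_t>(ARROW_FLAG_NULLABLE);
  }
  if (actual_flags != expected_flags) return false;

  if (actual->n_children != expected->n_children) return false;
  for (int64_t i = 0; i < actual->n_children; ++i) {
    if (!SchemaTypeEquals(actual->children[i], expected->children[i],
                          check_nullability)) {
      return false;
    }
  }

  // Either both have a dictionary or neither does.
  if ((actual->dictionary == nullptr) != (expected->dictionary == nullptr)) {
    return false;
  }
  if (actual->dictionary != nullptr) {
    return SchemaTypeEquals(actual->dictionary, expected->dictionary,
                            check_nullability);
  }
  return true;
}
```

A utility meant for test assertions would likely also want to report where the two schemas differ, which this sketch does not attempt.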

@paleolimbot
Member

I sent this to you offline as well but I'll post here too! For generating integration test JSON we had a similar situation to serializing IPC schemas and went with a helper function plus a lambda to generate the full range of data types:

TEST(NanoarrowTestingTest, NanoarrowTestingTestTypePrimitive) {
  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_NA);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "null"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_BOOL);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "bool"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_INT8);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "int", "bitWidth": 8, "isSigned": true})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_UINT8);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "int", "bitWidth": 8, "isSigned": false})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_HALF_FLOAT);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "floatingpoint", "precision": "HALF"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_FLOAT);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "floatingpoint", "precision": "SINGLE"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_DOUBLE);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "floatingpoint", "precision": "DOUBLE"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_STRING);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "utf8"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_LARGE_STRING);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "largeutf8"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_BINARY);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "binary"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_LARGE_BINARY);
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "largebinary"})");
}

TEST(NanoarrowTestingTest, NanoarrowTestingTestTypeParameterized) {
  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetTypeFixedSize(schema, NANOARROW_TYPE_FIXED_SIZE_BINARY, 123));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "fixedsizebinary", "byteWidth": 123})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetTypeDecimal(schema, NANOARROW_TYPE_DECIMAL128, 10, 3));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "decimal", "bitWidth": 128, "precision": 10, "scale": 3})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(ArrowSchemaSetTypeStruct(schema, 0));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "struct"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_LIST));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_INT32));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "list"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_MAP));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[0]->children[0], NANOARROW_TYPE_STRING));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[0]->children[1], NANOARROW_TYPE_INT32));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "map", "keysSorted": false})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_MAP));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[0]->children[0], NANOARROW_TYPE_STRING));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[0]->children[1], NANOARROW_TYPE_INT32));
        schema->flags = ARROW_FLAG_MAP_KEYS_SORTED;
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "map", "keysSorted": true})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_LARGE_LIST));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_INT32));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "largelist"})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetTypeFixedSize(schema, NANOARROW_TYPE_FIXED_SIZE_LIST, 12));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_INT32));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "fixedsizelist", "listSize": 12})");
}

TEST(NanoarrowTestingTest, NanoarrowTestingTestTypeUnion) {
  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetTypeUnion(schema, NANOARROW_TYPE_SPARSE_UNION, 0));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "union", "mode": "SPARSE", "typeIds": []})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetTypeUnion(schema, NANOARROW_TYPE_SPARSE_UNION, 2));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_STRING));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[1], NANOARROW_TYPE_INT32));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "union", "mode": "SPARSE", "typeIds": [0,1]})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetTypeUnion(schema, NANOARROW_TYPE_DENSE_UNION, 0));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "union", "mode": "DENSE", "typeIds": []})");

  TestWriteJSON(
      [](ArrowSchema* schema) {
        ArrowSchemaInit(schema);
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetTypeUnion(schema, NANOARROW_TYPE_DENSE_UNION, 2));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_STRING));
        NANOARROW_RETURN_NOT_OK(
            ArrowSchemaSetType(schema->children[1], NANOARROW_TYPE_INT32));
        return NANOARROW_OK;
      },
      /*append_expr*/ nullptr, &WriteTypeJSON,
      R"({"name": "union", "mode": "DENSE", "typeIds": [0,1]})");
}

A similar example using Arrow C++ that would be nice to replace:

INSTANTIATE_TEST_SUITE_P(
    NanoarrowIpcTest, ArrowTypeParameterizedTestFixture,
    ::testing::Values(
        arrow::null(), arrow::boolean(), arrow::int8(), arrow::uint8(), arrow::int16(),
        arrow::uint16(), arrow::int32(), arrow::uint32(), arrow::int64(), arrow::uint64(),
        arrow::utf8(), arrow::float16(), arrow::float32(), arrow::float64(),
        arrow::decimal128(10, 3), arrow::decimal256(10, 3), arrow::large_utf8(),
        arrow::binary(), arrow::large_binary(), arrow::fixed_size_binary(123),
        arrow::date32(), arrow::date64(), arrow::time32(arrow::TimeUnit::SECOND),
        arrow::time32(arrow::TimeUnit::MILLI), arrow::time64(arrow::TimeUnit::MICRO),
        arrow::time64(arrow::TimeUnit::NANO), arrow::timestamp(arrow::TimeUnit::SECOND),
        arrow::timestamp(arrow::TimeUnit::MILLI),
        arrow::timestamp(arrow::TimeUnit::MICRO), arrow::timestamp(arrow::TimeUnit::NANO),
        arrow::timestamp(arrow::TimeUnit::SECOND, "UTC"),
        arrow::duration(arrow::TimeUnit::SECOND), arrow::duration(arrow::TimeUnit::MILLI),
        arrow::duration(arrow::TimeUnit::MICRO), arrow::duration(arrow::TimeUnit::NANO),
        arrow::month_interval(), arrow::day_time_interval(),
        arrow::month_day_nano_interval(),
        arrow::list(arrow::field("some_custom_name", arrow::int32())),
        arrow::large_list(arrow::field("some_custom_name", arrow::int32())),
        arrow::fixed_size_list(arrow::field("some_custom_name", arrow::int32()), 123),
        arrow::map(arrow::utf8(), arrow::int64(), false),
        arrow::map(arrow::utf8(), arrow::int64(), true),
        arrow::struct_({arrow::field("col1", arrow::int32()),
                        arrow::field("col2", arrow::utf8())}),
        // Zero-size union doesn't roundtrip through the C Data interface until
        // Arrow 11 (which is not yet available on all platforms)
        // arrow::sparse_union(FieldVector()), arrow::dense_union(FieldVector()),

        // No custom type IDs
        arrow::sparse_union({arrow::field("col1", arrow::int32()),
                             arrow::field("col2", arrow::utf8())}),
        arrow::dense_union({arrow::field("col1", arrow::int32()),
                            arrow::field("col2", arrow::utf8())}),

        // With custom type IDs
        arrow::sparse_union({arrow::field("col1", arrow::int32()),
                             arrow::field("col2", arrow::utf8())},
                            {126, 127}),
        arrow::dense_union({arrow::field("col1", arrow::int32()),
                            arrow::field("col2", arrow::utf8())},
                           {126, 127}),

        // Type with nested metadata
        arrow::list(arrow::field("some_custom_name", arrow::int32(),
                                 arrow::KeyValueMetadata::Make({"key1"}, {"value1"})))

        ));

@bkietz
Member Author

bkietz commented Jul 26, 2024

> I am hesitant to add scope to nanoarrow

If we keep it minimal and closely aligned with the ABI, 100-200 lines would suffice for:

  using namespace nanoarrow::testing::dsl;

  // declare a schema (default format is +s)
  UniqueSchema s = schema{
    // we can make the arguments look kwarg-like
    children{
      {"i", "my int field's name"},
      {"i", dictionary{{"u"}}, "my dictionary field's name",
       metadata{
           "some_key=some_value",
           "some_key2=some_value2",
       },
       ARROW_FLAG_NULLABLE},
    }
  };
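
One piece of the DSL above that can be shown in isolation is turning the `"key=value"` strings into the binary metadata layout that the Arrow C Data Interface expects in `ArrowSchema::metadata`: an `int32` pair count, then for each pair an `int32` key length, the key bytes, an `int32` value length, and the value bytes, all in native endianness. The sketch below is only an illustration: `EncodeMetadata` is a hypothetical helper name, and splitting each entry on the first `=` is an assumption about how the proposed syntax would be parsed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <initializer_list>
#include <string>

// Hypothetical helper: encode "key=value" entries into the metadata layout of
// the Arrow C Data Interface (native-endian int32 lengths, no terminators).
inline std::string EncodeMetadata(std::initializer_list<std::string> pairs) {
  std::string out;

  // Append a 32-bit integer using the platform's native byte order.
  auto append_int32 = [&out](int32_t value) {
    char bytes[sizeof(value)];
    std::memcpy(bytes, &value, sizeof(value));
    out.append(bytes, sizeof(value));
  };

  // Append a length-prefixed byte string.
  auto append_sized = [&](const std::string& s) {
    append_int32(static_cast<int32_t>(s.size()));
    out += s;
  };

  append_int32(static_cast<int32_t>(pairs.size()));
  for (const std::string& pair : pairs) {
    // Assumption: the key ends at the first '=' and the rest is the value.
    const std::size_t eq = pair.find('=');
    assert(eq != std::string::npos && "expected an entry of the form key=value");
    append_sized(pair.substr(0, eq));
    append_sized(pair.substr(eq + 1));
  }
  return out;
}
```

The returned `std::string` owns the bytes, so whatever builds the schema would need to keep it alive (or copy it) for as long as the schema points at `data()`.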

@paleolimbot
Member

I like the idea of putting it in testing (it can move if it becomes popular). Replacing the usage in the Testing JSON generator would probably get you all the unit tests for free!

@paleolimbot
Member

In searching for Array equality utilities, I found that ADBC's validation utility also has a way to create schemas using nanoarrow for use in testing!

https://github.com/apache/arrow-adbc/blob/36f0cd32af2e3f75b12d4397d1ed9b6ecbc1acce/c/validation/adbc_validation_util.h#L252-L434
