Schema DSL for testing #566

bkietz · 2024-07-25T19:48:03Z

Arrow C++ includes factories for constructing schemas, types, fields, and metadata, which keep the construction of even deeply nested structures expressive:

schema({
  field("some_col", int32(), key_value_metadata({
    {"some_key_field", "some_value_field"},
  })),
}, key_value_metadata({{"some_key", "some_value"}})),

It should be straightforward to write equivalent factories which build a nanoarrow::UniqueSchema.
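As a rough illustration of what such factories might look like, here is a minimal sketch that targets the raw C Data Interface `ArrowSchema` struct rather than `nanoarrow::UniqueSchema` itself. Everything in the `dsl_sketch` namespace (`FieldSpec`, `field`, `schema`, `Export`) is hypothetical and not part of nanoarrow; the struct definition is a stand-in copied from the C Data Interface so that the example compiles on its own. Types are given as format strings (`"i"` for int32, `"u"` for utf8, `"+s"` for struct), and metadata and dictionaries are left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the ArrowSchema struct from the Arrow C Data Interface;
// remove these definitions when including nanoarrow.h, which provides them.
#define ARROW_FLAG_NULLABLE 2
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

namespace dsl_sketch {  // hypothetical namespace, not part of nanoarrow

// Declarative description of one schema node.
struct FieldSpec {
  std::string name;
  std::string format;
  std::vector<FieldSpec> children;
  int64_t flags;
};

// Factories mirroring the shape of Arrow C++'s field() and schema().
inline FieldSpec field(std::string name, std::string format,
                       std::vector<FieldSpec> children = {}) {
  return FieldSpec{std::move(name), std::move(format), std::move(children),
                   ARROW_FLAG_NULLABLE};
}

inline FieldSpec schema(std::vector<FieldSpec> children) {
  return FieldSpec{"", "+s", std::move(children), 0};
}

// Strings owned by an exported node, stored in private_data.
struct OwnedStrings {
  std::string name;
  std::string format;
};

inline void ReleaseExported(ArrowSchema* node) {
  for (int64_t i = 0; i < node->n_children; i++) {
    ArrowSchema* child = node->children[i];
    // A consumer may already have moved a child out (release == nullptr).
    if (child->release != nullptr) child->release(child);
    delete child;
  }
  delete[] node->children;
  delete static_cast<OwnedStrings*>(node->private_data);
  node->release = nullptr;
}

// Recursively export a FieldSpec tree into a caller-provided ArrowSchema.
// Not exception-safe; good enough for a test helper sketch.
inline void Export(const FieldSpec& spec, ArrowSchema* out) {
  assert(out != nullptr);
  auto* owned = new OwnedStrings{spec.name, spec.format};
  const std::size_t n = spec.children.size();
  out->format = owned->format.c_str();
  out->name = owned->name.c_str();
  out->metadata = nullptr;
  out->flags = spec.flags;
  out->n_children = static_cast<int64_t>(n);
  out->children = n > 0 ? new ArrowSchema*[n] : nullptr;
  for (std::size_t i = 0; i < n; i++) {
    out->children[i] = new ArrowSchema;
    Export(spec.children[i], out->children[i]);
  }
  out->dictionary = nullptr;
  out->release = &ReleaseExported;
  out->private_data = owned;
}

}  // namespace dsl_sketch
```

With these helpers, `Export(schema({field("some_col", "i")}), &out)` fills `out` with a struct schema holding one nullable int32 child; the caller is responsible for calling `out.release(&out)` (or handing the struct to an owning wrapper) when done.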


bkietz · 2024-07-25T19:53:52Z

This should include a schema equality utility too

paleolimbot · 2024-07-25T20:23:21Z

We could certainly replicate Arrow C++'s syntax here, although I am hesitant to add scope to nanoarrow or make it seem like we are trying to replace anything about Arrow C++.

> This should include a schema equality utility too

We have a few places that do something like this...for integration testing we have one that is slow (and somewhat specific to the types of schemas that show up in the integration testing) but generates a nice diff:

arrow-nanoarrow/src/nanoarrow/integration/c_data_integration.cc

Lines 151 to 162 in 2040e74

    nanoarrow::testing::TestingJSONComparison comparison;
    SetComparisonOptions(&comparison);
    NANOARROW_RETURN_NOT_OK(
        comparison.CompareSchema(actual.get(), data.schema.get(), error));
    if (comparison.num_differences() > 0) {
      std::stringstream ss;
      comparison.WriteDifferences(ss);
      ArrowErrorSet(error, "Found %d differences:\n%s",
                    static_cast<int>(comparison.num_differences()), ss.str().c_str());
      return EINVAL;
    }

...and in Python we have one (that should almost certainly be written in C) that performs the check but doesn't generate very useful output on failure:

arrow-nanoarrow/python/src/nanoarrow/_schema.pyx

Lines 349 to 402 in 2040e74

    def type_equals(self, CSchema other, check_nullability: bool=False) -> bool:
        """Test two schemas for data type equality

        Checks two CSchema objects for type equality (i.e., that an array with
        schema ``actual`` contains elements with the same logical meaning as and
        array with schema ``expected``). Notably, this excludes metadata from
        all nodes in the schema.

        Parameters
        ----------
        other : CSchema
            The schema against which to test
        check_nullability : bool
            If True, actual and expected will be considered equal if their
            data type information and marked nullability are identical.
        """
        self._assert_valid()
        if self._ptr == other._ptr:
            return True
        if self.format != other.format:
            return False
        # Nullability is not strictly part of the "type"; however, performing
        # this check recursively is verbose to otherwise accomplish and
        # sometimes this does matter.
        cdef int64_t flags = self.flags
        cdef int64_t other_flags = other.flags
        if not check_nullability:
            flags &= ~ARROW_FLAG_NULLABLE
            other_flags &= ~ARROW_FLAG_NULLABLE
        if flags != other_flags:
            return False
        if self.n_children != other.n_children:
            return False
        for child, other_child in zip(self.children, other.children):
            if not child.type_equals(other_child, check_nullability=check_nullability):
                return False
        if (self.dictionary is None) != (other.dictionary is None):
            return False
        if self.dictionary is not None:
            if not self.dictionary.type_equals(
                other.dictionary,
                check_nullability=check_nullability
            ):
                return False
        return True

Both of those are pretty specific to exactly what we needed them for.
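For reference, the recursive check in the Python snippet translates fairly directly to the C ABI. The following is only a sketch: `SchemaTypeEquals` is a hypothetical name, not an existing nanoarrow function, and the `ArrowSchema` definition is a stand-in copied from the C Data Interface so that the example compiles on its own. Like the Python version, it ignores names and metadata and returns a plain boolean without reporting where the difference was found.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for the ArrowSchema struct from the Arrow C Data Interface;
// remove these definitions when including nanoarrow.h, which provides them.
#define ARROW_FLAG_NULLABLE 2
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

// Hypothetical helper: compares format strings, flags, children, and
// dictionaries recursively. Field names and metadata are not compared.
bool SchemaTypeEquals(const ArrowSchema* actual, const ArrowSchema* expected,
                      bool check_nullability = false) {
  assert(actual != nullptr && expected != nullptr);
  if (actual == expected) return true;
  if (std::strcmp(actual->format, expected->format) != 0) return false;

  // Mask out the nullable bit unless the caller asked for it to be checked.
  int64_t flags = actual->flags;
  int64_t expected_flags = expected->flags;
  if (!check_nullability) {
    flags &= ~static_cast<int64_t>(ARROW_FLAG_NULLABLE);
    expected_flags &= ~static_cast<int64_t>(ARROW_FLAG_NULLABLE);
  }
  if (flags != expected_flags) return false;

  if (actual->n_children != expected->n_children) return false;
  for (int64_t i = 0; i < actual->n_children; i++) {
    if (!SchemaTypeEquals(actual->children[i], expected->children[i],
                          check_nullability)) {
      return false;
    }
  }

  if ((actual->dictionary == nullptr) != (expected->dictionary == nullptr)) {
    return false;
  }
  if (actual->dictionary != nullptr) {
    return SchemaTypeEquals(actual->dictionary, expected->dictionary,
                            check_nullability);
  }
  return true;
}
```

A version meant for test output would presumably also want to collect a path to the first mismatching node, which is what makes the integration-testing comparison more useful on failure.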

paleolimbot · 2024-07-26T14:26:58Z

I sent this to you offline as well but I'll post here too! For generating integration test JSON we had a similar situation to serializing IPC schemas and went with a helper function plus a lambda to generate the full range of data types:

arrow-nanoarrow/src/nanoarrow/testing/testing_test.cc

Lines 496 to 704 in 2040e74

    TEST(NanoarrowTestingTest, NanoarrowTestingTestTypePrimitive) {
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_NA);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "null"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_BOOL);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "bool"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_INT8);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "int", "bitWidth": 8, "isSigned": true})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_UINT8);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "int", "bitWidth": 8, "isSigned": false})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_HALF_FLOAT);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "floatingpoint", "precision": "HALF"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_FLOAT);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "floatingpoint", "precision": "SINGLE"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_DOUBLE);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "floatingpoint", "precision": "DOUBLE"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_STRING);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "utf8"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_LARGE_STRING);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "largeutf8"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_BINARY);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "binary"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            return ArrowSchemaInitFromType(schema, NANOARROW_TYPE_LARGE_BINARY);
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "largebinary"})");
    }

    TEST(NanoarrowTestingTest, NanoarrowTestingTestTypeParameterized) {
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetTypeFixedSize(schema, NANOARROW_TYPE_FIXED_SIZE_BINARY, 123));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "fixedsizebinary", "byteWidth": 123})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetTypeDecimal(schema, NANOARROW_TYPE_DECIMAL128, 10, 3));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "decimal", "bitWidth": 128, "precision": 10, "scale": 3})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(ArrowSchemaSetTypeStruct(schema, 0));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "struct"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_LIST));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_INT32));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "list"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_MAP));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[0]->children[0], NANOARROW_TYPE_STRING));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[0]->children[1], NANOARROW_TYPE_INT32));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "map", "keysSorted": false})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_MAP));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[0]->children[0], NANOARROW_TYPE_STRING));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[0]->children[1], NANOARROW_TYPE_INT32));
            schema->flags = ARROW_FLAG_MAP_KEYS_SORTED;
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "map", "keysSorted": true})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(schema, NANOARROW_TYPE_LARGE_LIST));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_INT32));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON, R"({"name": "largelist"})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetTypeFixedSize(schema, NANOARROW_TYPE_FIXED_SIZE_LIST, 12));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_INT32));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "fixedsizelist", "listSize": 12})");
    }

    TEST(NanoarrowTestingTest, NanoarrowTestingTestTypeUnion) {
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetTypeUnion(schema, NANOARROW_TYPE_SPARSE_UNION, 0));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "union", "mode": "SPARSE", "typeIds": []})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetTypeUnion(schema, NANOARROW_TYPE_SPARSE_UNION, 2));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_STRING));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[1], NANOARROW_TYPE_INT32));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "union", "mode": "SPARSE", "typeIds": [0,1]})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetTypeUnion(schema, NANOARROW_TYPE_DENSE_UNION, 0));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "union", "mode": "DENSE", "typeIds": []})");
      TestWriteJSON(
          [](ArrowSchema* schema) {
            ArrowSchemaInit(schema);
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetTypeUnion(schema, NANOARROW_TYPE_DENSE_UNION, 2));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_STRING));
            NANOARROW_RETURN_NOT_OK(
                ArrowSchemaSetType(schema->children[1], NANOARROW_TYPE_INT32));
            return NANOARROW_OK;
          },
          /*append_expr*/ nullptr, &WriteTypeJSON,
          R"({"name": "union", "mode": "DENSE", "typeIds": [0,1]})");
    }

A similar example using Arrow C++ that would be nice to replace:

arrow-nanoarrow/src/nanoarrow/ipc/decoder_test.cc

Lines 671 to 716 in 2040e74

    INSTANTIATE_TEST_SUITE_P(
        NanoarrowIpcTest, ArrowTypeParameterizedTestFixture,
        ::testing::Values(
            arrow::null(), arrow::boolean(), arrow::int8(), arrow::uint8(), arrow::int16(),
            arrow::uint16(), arrow::int32(), arrow::uint32(), arrow::int64(), arrow::uint64(),
            arrow::utf8(), arrow::float16(), arrow::float32(), arrow::float64(),
            arrow::decimal128(10, 3), arrow::decimal256(10, 3), arrow::large_utf8(),
            arrow::binary(), arrow::large_binary(), arrow::fixed_size_binary(123),
            arrow::date32(), arrow::date64(), arrow::time32(arrow::TimeUnit::SECOND),
            arrow::time32(arrow::TimeUnit::MILLI), arrow::time64(arrow::TimeUnit::MICRO),
            arrow::time64(arrow::TimeUnit::NANO), arrow::timestamp(arrow::TimeUnit::SECOND),
            arrow::timestamp(arrow::TimeUnit::MILLI),
            arrow::timestamp(arrow::TimeUnit::MICRO), arrow::timestamp(arrow::TimeUnit::NANO),
            arrow::timestamp(arrow::TimeUnit::SECOND, "UTC"),
            arrow::duration(arrow::TimeUnit::SECOND), arrow::duration(arrow::TimeUnit::MILLI),
            arrow::duration(arrow::TimeUnit::MICRO), arrow::duration(arrow::TimeUnit::NANO),
            arrow::month_interval(), arrow::day_time_interval(),
            arrow::month_day_nano_interval(),
            arrow::list(arrow::field("some_custom_name", arrow::int32())),
            arrow::large_list(arrow::field("some_custom_name", arrow::int32())),
            arrow::fixed_size_list(arrow::field("some_custom_name", arrow::int32()), 123),
            arrow::map(arrow::utf8(), arrow::int64(), false),
            arrow::map(arrow::utf8(), arrow::int64(), true),
            arrow::struct_({arrow::field("col1", arrow::int32()),
                            arrow::field("col2", arrow::utf8())}),
            // Zero-size union doesn't roundtrip through the C Data interface until
            // Arrow 11 (which is not yet available on all platforms)
            // arrow::sparse_union(FieldVector()), arrow::dense_union(FieldVector()),
            // No custom type IDs
            arrow::sparse_union({arrow::field("col1", arrow::int32()),
                                 arrow::field("col2", arrow::utf8())}),
            arrow::dense_union({arrow::field("col1", arrow::int32()),
                                arrow::field("col2", arrow::utf8())}),
            // With custom type IDs
            arrow::sparse_union({arrow::field("col1", arrow::int32()),
                                 arrow::field("col2", arrow::utf8())},
                                {126, 127}),
            arrow::dense_union({arrow::field("col1", arrow::int32()),
                                arrow::field("col2", arrow::utf8())},
                               {126, 127}),
            // Type with nested metadata
            arrow::list(arrow::field("some_custom_name", arrow::int32(),
                                     arrow::KeyValueMetadata::Make({"key1"}, {"value1"})))
                ));

bkietz · 2024-07-26T18:16:36Z

> I am hesitant to add scope to nanoarrow

If we keep it minimal and closely aligned with the ABI, 100-200 lines would suffice for:

  using namespace nanoarrow::testing::dsl;

  // declare a schema (default format is +s)
  UniqueSchema s = schema{
    // we can make the arguments look kwarg-like
    children{
      {"i", "my int field's name"},
      {"i", dictionary{{"u"}}, "my dictionary field's name",
       metadata{
           "some_key=some_value",
           "some_key2=some_value2",
       },
       ARROW_FLAG_NULLABLE},
    }
  };
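One piece of the sketch above that is easy to make concrete is the `metadata{"key=value", ...}` argument. The C Data Interface stores metadata as a binary blob: an int32 pair count followed, for each pair, by an int32 key length, the key bytes, an int32 value length, and the value bytes, all in native endianness and without null terminators. A minimal encoder for that layout might look like the following; `EncodeMetadata` is a hypothetical name, and splitting on the first `=` is an assumption about how the DSL would interpret its strings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <initializer_list>
#include <stdexcept>
#include <string>

namespace {

// Append a native-endian int32 to the output buffer.
void AppendInt32(std::string* out, int32_t value) {
  char bytes[sizeof(value)];
  std::memcpy(bytes, &value, sizeof(value));
  out->append(bytes, sizeof(value));
}

}  // namespace

// Hypothetical helper: encode "key=value" strings into the metadata layout
// used by ArrowSchema::metadata. Each string is split on its first '='.
std::string EncodeMetadata(std::initializer_list<const char*> pairs) {
  std::string out;
  AppendInt32(&out, static_cast<int32_t>(pairs.size()));
  for (const char* pair : pairs) {
    const std::string kv(pair);
    // Lengths are stored as int32, so each pair must fit in that range.
    assert(kv.size() < static_cast<std::size_t>(INT32_MAX));
    const std::size_t eq = kv.find('=');
    if (eq == std::string::npos) {
      throw std::invalid_argument("expected a string of the form key=value");
    }
    AppendInt32(&out, static_cast<int32_t>(eq));
    out.append(kv, 0, eq);
    AppendInt32(&out, static_cast<int32_t>(kv.size() - eq - 1));
    out.append(kv, eq + 1, std::string::npos);
  }
  return out;
}
```

The returned `std::string` owns the bytes, so a schema whose `metadata` pointer refers to `encoded.data()` has to keep the string alive (for example in its `private_data`) until it is released.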

paleolimbot · 2024-07-26T20:42:38Z

I like the idea of putting it in testing (it can move if it becomes popular). Replacing the usage in the Testing JSON generator would probably get you all the unit tests for free!

paleolimbot · 2024-08-05T18:41:43Z

In searching for Array equality utilities, I found that ADBC's validation utility also has a way to create schemas using nanoarrow for use in testing!

https://github.com/apache/arrow-adbc/blob/36f0cd32af2e3f75b12d4397d1ed9b6ecbc1acce/c/validation/adbc_validation_util.h#L252-L434
