
SNOW-1665420 add logic to parse Iceberg schema #996

Merged

merged 13 commits into master from bzabek-SNOW-1665420-parse-iceberg-schema on Nov 15, 2024

Conversation

sfc-gh-bzabek
Contributor

@sfc-gh-bzabek sfc-gh-bzabek commented Nov 13, 2024

Overview

SNOW-1665420

The goal of this PR is to implement logic to parse a plain Iceberg schema. That schema will be retrieved from a channel during schema evolution.
I generate an IcebergColumnTree from the Iceberg schema.
I added logic to generate the part of the query that will be used to alter the column. Query generation is out of scope, but it's the best way to test the logic. I haven't handled column nullability yet.
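The idea described above can be sketched as follows. This is a hypothetical, heavily simplified stand-in for IcebergColumnTree (the names FieldNode and buildQueryPart and the hard-coded Snowflake type strings are illustrative, not the PR's actual code): a tree of parsed fields that prints the column part of an ALTER query, matching the OBJECT(...) shape visible in this PR's test expectations.

```java
import java.util.LinkedHashMap;
import java.util.stream.Collectors;

// Simplified sketch: a tree of field nodes that can print the type part of an
// ALTER TABLE ... ADD COLUMN query. Names and structure are illustrative.
public class ColumnTreeSketch {

  static final class FieldNode {
    final String name;
    final String snowflakeType; // leaf type, e.g. NUMBER(10,0); unused for structs
    final LinkedHashMap<String, FieldNode> children = new LinkedHashMap<>();

    FieldNode(String name, String snowflakeType) {
      this.name = name;
      this.snowflakeType = snowflakeType;
    }

    FieldNode child(FieldNode c) {
      children.put(c.name, c);
      return this;
    }

    // A leaf prints "NAME TYPE"; a struct prints "NAME OBJECT(child, child, ...)".
    String buildQueryPart() {
      if (children.isEmpty()) {
        return name + " " + snowflakeType;
      }
      String inner = children.values().stream()
          .map(FieldNode::buildQueryPart)
          .collect(Collectors.joining(", "));
      return name + " OBJECT(" + inner + ")";
    }
  }

  public static void main(String[] args) {
    FieldNode root = new FieldNode("TEST_COLUMN_NAME", null)
        .child(new FieldNode("K1", "NUMBER(10,0)"))
        .child(new FieldNode("K2", "NUMBER(10,0)"));
    System.out.println(root.buildQueryPart());
    // TEST_COLUMN_NAME OBJECT(K1 NUMBER(10,0), K2 NUMBER(10,0))
  }
}
```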

Pre-review checklist

  • This change should be part of a Behavior Change Release. See go/behavior-change.
  • This change has passed Merge gate tests
  • Snowpipe Changes
  • Snowpipe Streaming Changes
  • This change is TEST-ONLY
  • This change is README/Javadocs only
  • This change is protected by the config parameter snowflake.streaming.iceberg.enabled
    • Yes - Added end to end and Unit Tests.
    • No - Suggest why it is not param protected
  • Is this change protected by parameter <PARAMETER_NAME> on the server side?
    • The parameter/feature is not yet active in production (partial rollout or PrPr, see Changes for Unreleased Features and Fixes).
    • If there is an issue, it can be safely mitigated by turning the parameter off. This is also verified by a test (See go/ppp).

@sfc-gh-bzabek sfc-gh-bzabek requested a review from a team as a code owner November 13, 2024 08:54

private final Type schema;

private final String columnName;
Contributor Author

I need to have a columnName field. I can't retrieve it from the schema.

Contributor

I don't understand this comment. Could you elaborate on that?

Contributor Author

When we call channel.getTableSchema().get("COLUMN_NAME").getIcebergSchema(), we receive a schema like the one in my Tree test in return. There is no info about the column name.
Hence I will have to get the column name from InsertError. I am going to pass both into this class.

Contributor

ok, I got it

import org.apache.kafka.connect.data.Date;
import org.apache.kafka.connect.data.Decimal;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.Time;
import org.apache.kafka.connect.data.Timestamp;

public class IcebergColumnTypeMapper extends ColumnTypeMapper {

/**
* See <a href="https://docs.snowflake.com/en/user-guide/tables-iceberg-data-types">Data types for

return "TIME(6)";
case TIMESTAMP:
Types.TimestampType timestamp = (Types.TimestampType) apacheIcebergType;
return timestamp.shouldAdjustToUTC() ? "TIMESTAMP_LTZ" : "TIMESTAMP";
Contributor Author

@sfc-gh-bzabek sfc-gh-bzabek Nov 13, 2024

Here I didn't specify the precision of the timestamp, e.g. TIMESTAMP_LTZ(6). I want the server to resolve it. It's confusing because the docs mention that timestamp maps to TIMESTAMP_LTZ/NTZ(6), but when I manually altered an Iceberg table it created a column with (9) precision. I must come back to this once it works end to end.
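A minimal sketch of the mapping discussed here, assuming plain string type names instead of Apache Iceberg Type objects so it compiles without the Iceberg dependency (the class and method names are illustrative; the int/string/time mappings come from this PR's own test expectations and code hunks):

```java
// Illustrative mapper for a few primitive Iceberg types. The real code switches
// on Apache Iceberg Type objects; a plain string stands in here.
public class IcebergTypeMappingSketch {

  static String mapToSnowflake(String icebergType, boolean adjustToUtc) {
    switch (icebergType) {
      case "int":
        return "NUMBER(10,0)";
      case "string":
        return "VARCHAR(16777216)";
      case "time":
        return "TIME(6)";
      case "timestamp":
        // Precision (6 vs 9) is intentionally left for the server to resolve.
        return adjustToUtc ? "TIMESTAMP_LTZ" : "TIMESTAMP";
      default:
        throw new IllegalArgumentException("Unsupported type: " + icebergType);
    }
  }

  public static void main(String[] args) {
    System.out.println(mapToSnowflake("timestamp", true));  // TIMESTAMP_LTZ
    System.out.println(mapToSnowflake("timestamp", false)); // TIMESTAMP
  }
}
```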

* GlobalServices/modules/data-lake/datalake-api/src/main/java/com/snowflake/metadata/iceberg
* /IcebergDataTypeParser.java
*/
public class IcebergDataTypeParser {
Contributor Author

@sfc-gh-bzabek sfc-gh-bzabek Nov 13, 2024

For now I only use it in tests. However, it's going to be needed later. It's copied from the ingest-sdk.

Contributor

The comment states that it is copied from the monorepo, but I guess it is the same.

Contributor Author

@sfc-gh-bzabek sfc-gh-bzabek Nov 14, 2024

So they copied it from GS and I copied it from them, whatever. I don't remember.

Contributor

I think we shouldn't include the path in the javadoc comment

Contributor Author

Comment removed.

@MethodSource("prepareData")
@Disabled
// Schema evolution for structured types is not yet supported
void shouldEvolveSchemaAndInsertRecords_structuredData(
Contributor Author

It's going to evolve. I'm going to need it later.

Contributor

Sure, but please extract the common parts of both tests.

Contributor Author

@sfc-gh-bzabek sfc-gh-bzabek Nov 14, 2024

I will just save it to my notes. Maybe I shouldn't have added it yet - it confuses the reviewer and it's not yet needed. The code will evolve anyway.

@sfc-gh-bzabek sfc-gh-bzabek marked this pull request as draft November 13, 2024 09:13
@sfc-gh-bzabek sfc-gh-bzabek force-pushed the bzabek-SNOW-1665420-parse-iceberg-schema branch from 2754d3c to a5870be Compare November 13, 2024 09:22
* See <a href="https://docs.snowflake.com/en/user-guide/tables-iceberg-data-types">Data types for
* Apache Iceberg™ tables</a>
*/
public static String mapToSnowflakeDataType(Type apacheIcebergType) {
Contributor Author

I don't like this static but I want it to stay as it is for now.

"TEST_COLUMN_NAME OBJECT(K1 NUMBER(10,0), K2 NUMBER(10,0), NESTED_OBJECT"
+ " OBJECT(NESTED_KEY1 VARCHAR(16777216), NESTED_KEY2 VARCHAR(16777216)))"),
arguments(
"{\"type\":\"struct\",\"fields\":[{\"id\":2,\"name\":\"offset\",\"required\":false,\"type\":\"int\"},{\"id\":3,\"name\":\"topic\",\"required\":false,\"type\":\"string\"},{\"id\":4,\"name\":\"partition\",\"required\":false,\"type\":\"int\"},{\"id\":5,\"name\":\"key\",\"required\":false,\"type\":\"string\"},{\"id\":6,\"name\":\"schema_id\",\"required\":false,\"type\":\"int\"},{\"id\":7,\"name\":\"key_schema_id\",\"required\":false,\"type\":\"int\"},{\"id\":8,\"name\":\"CreateTime\",\"required\":false,\"type\":\"long\"},{\"id\":9,\"name\":\"LogAppendTime\",\"required\":false,\"type\":\"long\"},{\"id\":10,\"name\":\"SnowflakeConnectorPushTime\",\"required\":false,\"type\":\"long\"},{\"id\":11,\"name\":\"headers\",\"required\":false,\"type\":{\"type\":\"map\",\"key-id\":12,\"key\":\"string\",\"value-id\":13,\"value\":\"string\",\"value-required\":false}}]}\n",
Contributor Author

Don't make me format it manually...

@sfc-gh-bzabek sfc-gh-bzabek marked this pull request as ready for review November 13, 2024 11:54
@sfc-gh-bzabek sfc-gh-bzabek force-pushed the bzabek-SNOW-1665420-parse-iceberg-schema branch from a5870be to 61b0e74 Compare November 13, 2024 11:58
pom.xml Outdated
@@ -56,6 +56,7 @@
<confluent.version>7.7.0</confluent.version>
<!--Compatible protobuf version https://github.com/confluentinc/common/blob/v7.7.0/pom.xml#L91 -->
<protobuf.version>3.25.5</protobuf.version>
<iceberg.version>1.5.2</iceberg.version>
Contributor

It's 1.6.1 in the ingest-sdk. Let's try to align.

Contributor Author

Done.

pom.xml Outdated
@@ -338,7 +339,7 @@
<dependency>
<groupId>net.snowflake</groupId>
<artifactId>snowflake-ingest-sdk</artifactId>
<version>2.3.0</version>
<version>3.0.0</version>
Contributor

Please align pom.xml with pom_confluent.xml.

Contributor Author

Sorry, I forgot about pom_confluent.xml. However, I must revert back to 2.3.0; 3.0.0 is not yet available.

public final LinkedHashMap<String, IcebergFieldNode> children;

public IcebergFieldNode(String name, Type apacheIcebergSchema) {
this.name = name.toUpperCase();
Contributor

I am not sure about this toUpperCase(). At least fields inside the nested structures are case sensitive.
We can wait to test details like that once we set everything up e2e, though.

Contributor Author

At least fields inside the nested structures are case sensitive.

So then it won't work. I'll remove this toUpperCase() call.
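A toy illustration of the concern, assuming the server matches nested field names case-sensitively (all names here are made up for the example):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// If nested field names are case sensitive, storing nodes under an uppercased
// key means a lookup by the original field name from the Iceberg schema fails.
public class CaseSensitivityDemo {

  public static void main(String[] args) {
    Map<String, String> children = new LinkedHashMap<>();

    // Storing the node under an uppercased key...
    String originalName = "nested_Key1";
    children.put(originalName.toUpperCase(), "NUMBER(10,0)");

    // ...means a lookup by the original (case-sensitive) name fails.
    System.out.println(children.containsKey(originalName));               // false
    System.out.println(children.containsKey(originalName.toUpperCase())); // true
  }
}
```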


public final LinkedHashMap<String, IcebergFieldNode> children;

public IcebergFieldNode(String name, Type apacheIcebergSchema) {
Contributor

nit: it's pretty heavy for the constructor. Perhaps we could move this logic to some kind of Factory class?

Contributor Author

@sfc-gh-bzabek sfc-gh-bzabek Nov 14, 2024

I don't really feel like moving this logic somewhere else. It's easy to construct nodes inside this class.

Assertions.assertEquals(expectedQuery, tree.buildQuery());
}

static Stream<Arguments> icebergSchemas() {
Contributor

Good job!

@sfc-gh-bzabek sfc-gh-bzabek force-pushed the bzabek-SNOW-1665420-parse-iceberg-schema branch from 059939c to 197770f Compare November 14, 2024 12:07
@sfc-gh-bzabek sfc-gh-bzabek merged commit 14d43f7 into master Nov 15, 2024
53 of 54 checks passed
@sfc-gh-bzabek sfc-gh-bzabek deleted the bzabek-SNOW-1665420-parse-iceberg-schema branch November 15, 2024 10:22