
SNOW-1665420 add logic to parse Iceberg schema #996

Merged

merged 13 commits into master from bzabek-SNOW-1665420-parse-iceberg-schema on Nov 15, 2024

Conversation

sfc-gh-bzabek
Contributor

@sfc-gh-bzabek sfc-gh-bzabek commented Nov 13, 2024

Overview

SNOW-1665420

The goal of this PR is to implement logic to parse a plain Iceberg schema. That schema will be retrieved from a channel during schema evolution.
I generate an IcebergColumnTree from the Iceberg schema.
I added logic to generate the part of the query that will be used to alter the column. Query generation is out of scope, but it's the best way to test the logic. I haven't handled column nullability yet.
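The idea described above can be sketched as follows. This is a hypothetical, heavily simplified stand-in for IcebergColumnTree (the names FieldNode and buildQueryPart and the hard-coded Snowflake type strings are illustrative, not the PR's actual code): a tree of parsed fields that prints the column part of an ALTER query, matching the OBJECT(...) shape visible in this PR's test expectations.

```java
import java.util.LinkedHashMap;
import java.util.stream.Collectors;

// Simplified sketch: a tree of field nodes that can print the type part of an
// ALTER TABLE ... ADD COLUMN query. Names and structure are illustrative.
public class ColumnTreeSketch {

  static final class FieldNode {
    final String name;
    final String snowflakeType; // leaf type, e.g. NUMBER(10,0); unused for structs
    final LinkedHashMap<String, FieldNode> children = new LinkedHashMap<>();

    FieldNode(String name, String snowflakeType) {
      this.name = name;
      this.snowflakeType = snowflakeType;
    }

    FieldNode child(FieldNode c) {
      children.put(c.name, c);
      return this;
    }

    // A leaf prints "NAME TYPE"; a struct prints "NAME OBJECT(child, child, ...)".
    String buildQueryPart() {
      if (children.isEmpty()) {
        return name + " " + snowflakeType;
      }
      String inner = children.values().stream()
          .map(FieldNode::buildQueryPart)
          .collect(Collectors.joining(", "));
      return name + " OBJECT(" + inner + ")";
    }
  }

  public static void main(String[] args) {
    FieldNode root = new FieldNode("TEST_COLUMN_NAME", null)
        .child(new FieldNode("K1", "NUMBER(10,0)"))
        .child(new FieldNode("K2", "NUMBER(10,0)"));
    System.out.println(root.buildQueryPart());
    // TEST_COLUMN_NAME OBJECT(K1 NUMBER(10,0), K2 NUMBER(10,0))
  }
}
```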

Pre-review checklist

  • This change should be part of a Behavior Change Release. See go/behavior-change.
  • This change has passed Merge gate tests
  • Snowpipe Changes
  • Snowpipe Streaming Changes
  • This change is TEST-ONLY
  • This change is README/Javadocs only
  • This change is protected by the config parameter snowflake.streaming.iceberg.enabled
    • Yes - Added end to end and Unit Tests.
    • No - Suggest why it is not param protected
  • Is this change protected by parameter <PARAMETER_NAME> on the server side?
    • The parameter/feature is not yet active in production (partial rollout or PrPr, see Changes for Unreleased Features and Fixes).
    • If there is an issue, it can be safely mitigated by turning the parameter off. This is also verified by a test (See go/ppp).

@sfc-gh-bzabek sfc-gh-bzabek requested a review from a team as a code owner November 13, 2024 08:54

private final Type schema;

private final String columnName;
Contributor Author

I need to have a columnName field. I can't retrieve it from the schema.

Contributor

I don't understand this comment. Could you elaborate on that?

Contributor Author

When we call channel.getTableSchema().get("COLUMN_NAME").getIcebergSchema(), we receive a schema like the one in my Tree test in return. There is no info about the column name.
Hence I will have to get the column name from InsertError. I am going to pass both into this class.

Contributor

ok, I got it

import org.apache.kafka.connect.data.Date;
import org.apache.kafka.connect.data.Decimal;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.Time;
import org.apache.kafka.connect.data.Timestamp;

public class IcebergColumnTypeMapper extends ColumnTypeMapper {

/**
* See <a href="https://docs.snowflake.com/en/user-guide/tables-iceberg-data-types">Data types for

return "TIME(6)";
case TIMESTAMP:
Types.TimestampType timestamp = (Types.TimestampType) apacheIcebergType;
return timestamp.shouldAdjustToUTC() ? "TIMESTAMP_LTZ" : "TIMESTAMP";
Contributor Author

@sfc-gh-bzabek sfc-gh-bzabek Nov 13, 2024

Here I didn't specify the precision of the timestamp, e.g. TIMESTAMP_LTZ(6). I want the server to resolve it. It's confusing because the docs mention that timestamp maps to TIMESTAMP_LTZ/NTZ(6), but when I manually altered an Iceberg table it created a column with (9) precision. I must come back to this once it works end to end.
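A minimal sketch of the mapping discussed here, assuming plain string type names instead of Apache Iceberg Type objects so it compiles without the Iceberg dependency (the class and method names are illustrative; the int/string/time mappings come from this PR's own test expectations and code hunks):

```java
// Illustrative mapper for a few primitive Iceberg types. The real code switches
// on Apache Iceberg Type objects; a plain string stands in here.
public class IcebergTypeMappingSketch {

  static String mapToSnowflake(String icebergType, boolean adjustToUtc) {
    switch (icebergType) {
      case "int":
        return "NUMBER(10,0)";
      case "string":
        return "VARCHAR(16777216)";
      case "time":
        return "TIME(6)";
      case "timestamp":
        // Precision (6 vs 9) is intentionally left for the server to resolve.
        return adjustToUtc ? "TIMESTAMP_LTZ" : "TIMESTAMP";
      default:
        throw new IllegalArgumentException("Unsupported type: " + icebergType);
    }
  }

  public static void main(String[] args) {
    System.out.println(mapToSnowflake("timestamp", true));  // TIMESTAMP_LTZ
    System.out.println(mapToSnowflake("timestamp", false)); // TIMESTAMP
  }
}
```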

* GlobalServices/modules/data-lake/datalake-api/src/main/java/com/snowflake/metadata/iceberg
* /IcebergDataTypeParser.java
*/
public class IcebergDataTypeParser {
Contributor Author

@sfc-gh-bzabek sfc-gh-bzabek Nov 13, 2024

For now I only use it in tests. However, it's going to be needed later. It's copied from the ingest-sdk.

Contributor

The comment states that it is copied from the monorepo, but I guess it is the same.

Contributor Author

@sfc-gh-bzabek sfc-gh-bzabek Nov 14, 2024

So they copied it from GS and I copied it from them, whatever. I don't remember.

Contributor

I think we shouldn't include the path in the javadoc comment

Contributor Author

Comment removed.

@MethodSource("prepareData")
@Disabled
// Schema evolution for structured types is not yet supported
void shouldEvolveSchemaAndInsertRecords_structuredData(
Contributor Author

It's going to evolve. I'm going to need it later.

Contributor

Sure, but please extract the common parts of both tests.

Contributor Author

@sfc-gh-bzabek sfc-gh-bzabek Nov 14, 2024

I will just save it to my notes. Maybe I shouldn't have added it yet - it confuses the reviewer and it's not yet needed. The code will evolve anyway.

@sfc-gh-bzabek sfc-gh-bzabek marked this pull request as draft November 13, 2024 09:13
@sfc-gh-bzabek sfc-gh-bzabek force-pushed the bzabek-SNOW-1665420-parse-iceberg-schema branch from 2754d3c to a5870be Compare November 13, 2024 09:22
* See <a href="https://docs.snowflake.com/en/user-guide/tables-iceberg-data-types">Data types for
* Apache Iceberg™ tables</a>
*/
public static String mapToSnowflakeDataType(Type apacheIcebergType) {
Contributor Author

I don't like this static but I want it to stay as it is for now.

"TEST_COLUMN_NAME OBJECT(K1 NUMBER(10,0), K2 NUMBER(10,0), NESTED_OBJECT"
+ " OBJECT(NESTED_KEY1 VARCHAR(16777216), NESTED_KEY2 VARCHAR(16777216)))"),
arguments(
"{\"type\":\"struct\",\"fields\":[{\"id\":2,\"name\":\"offset\",\"required\":false,\"type\":\"int\"},{\"id\":3,\"name\":\"topic\",\"required\":false,\"type\":\"string\"},{\"id\":4,\"name\":\"partition\",\"required\":false,\"type\":\"int\"},{\"id\":5,\"name\":\"key\",\"required\":false,\"type\":\"string\"},{\"id\":6,\"name\":\"schema_id\",\"required\":false,\"type\":\"int\"},{\"id\":7,\"name\":\"key_schema_id\",\"required\":false,\"type\":\"int\"},{\"id\":8,\"name\":\"CreateTime\",\"required\":false,\"type\":\"long\"},{\"id\":9,\"name\":\"LogAppendTime\",\"required\":false,\"type\":\"long\"},{\"id\":10,\"name\":\"SnowflakeConnectorPushTime\",\"required\":false,\"type\":\"long\"},{\"id\":11,\"name\":\"headers\",\"required\":false,\"type\":{\"type\":\"map\",\"key-id\":12,\"key\":\"string\",\"value-id\":13,\"value\":\"string\",\"value-required\":false}}]}\n",
Contributor Author

Don't make me format it manually...

@sfc-gh-bzabek sfc-gh-bzabek marked this pull request as ready for review November 13, 2024 11:54
@sfc-gh-bzabek sfc-gh-bzabek force-pushed the bzabek-SNOW-1665420-parse-iceberg-schema branch from a5870be to 61b0e74 Compare November 13, 2024 11:58
pom.xml Outdated
@@ -56,6 +56,7 @@
<confluent.version>7.7.0</confluent.version>
<!--Compatible protobuf version https://github.com/confluentinc/common/blob/v7.7.0/pom.xml#L91 -->
<protobuf.version>3.25.5</protobuf.version>
<iceberg.version>1.5.2</iceberg.version>
Contributor

It's 1.6.1 in the ingest-sdk. Let's try to align.

Contributor Author

Done.

pom.xml Outdated
@@ -338,7 +339,7 @@
<dependency>
<groupId>net.snowflake</groupId>
<artifactId>snowflake-ingest-sdk</artifactId>
<version>2.3.0</version>
<version>3.0.0</version>
Contributor

Please align pom.xml with pom_confluent.xml.

Contributor Author

Sorry, I forgot about pom_confluent.xml. However, I must revert back to 2.3.0; 3.0.0 is not yet available.

public final LinkedHashMap<String, IcebergFieldNode> children;

public IcebergFieldNode(String name, Type apacheIcebergSchema) {
this.name = name.toUpperCase();
Contributor

I am not sure about this toUpperCase(). At least fields inside the nested structures are case sensitive.
We can wait to test details like that once we set everything up e2e, though.

Contributor Author

At least fields inside the nested structures are case sensitive.

So then it won't work. I'll remove this toUpperCase() call.
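A toy illustration of the concern, assuming the server matches nested field names case-sensitively (all names here are made up for the example):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// If nested field names are case sensitive, storing nodes under an uppercased
// key means a lookup by the original field name from the Iceberg schema fails.
public class CaseSensitivityDemo {

  public static void main(String[] args) {
    Map<String, String> children = new LinkedHashMap<>();

    // Storing the node under an uppercased key...
    String originalName = "nested_Key1";
    children.put(originalName.toUpperCase(), "NUMBER(10,0)");

    // ...means a lookup by the original (case-sensitive) name fails.
    System.out.println(children.containsKey(originalName));               // false
    System.out.println(children.containsKey(originalName.toUpperCase())); // true
  }
}
```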


public final LinkedHashMap<String, IcebergFieldNode> children;

public IcebergFieldNode(String name, Type apacheIcebergSchema) {
Contributor

nit: it's pretty heavy for the constructor. Perhaps we could move this logic to some kind of Factory class?

Contributor Author

@sfc-gh-bzabek sfc-gh-bzabek Nov 14, 2024

I don't really feel like moving this logic somewhere else. It's easy to construct nodes inside this class.

Assertions.assertEquals(expectedQuery, tree.buildQuery());
}

static Stream<Arguments> icebergSchemas() {
Contributor

Good job!

@sfc-gh-bzabek sfc-gh-bzabek force-pushed the bzabek-SNOW-1665420-parse-iceberg-schema branch from 059939c to 197770f Compare November 14, 2024 12:07
@sfc-gh-bzabek sfc-gh-bzabek merged commit 14d43f7 into master Nov 15, 2024
53 of 54 checks passed
@sfc-gh-bzabek sfc-gh-bzabek deleted the bzabek-SNOW-1665420-parse-iceberg-schema branch November 15, 2024 10:22