-
Notifications
You must be signed in to change notification settings - Fork 63
Introduce branching and support for spark.wap.branch #379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
.../src/main/java/com/linkedin/openhouse/internal/catalog/OpenHouseInternalTableOperations.java
Outdated
Show resolved
Hide resolved
.../src/main/java/com/linkedin/openhouse/internal/catalog/OpenHouseInternalTableOperations.java
Outdated
Show resolved
Hide resolved
.../src/main/java/com/linkedin/openhouse/internal/catalog/OpenHouseInternalTableOperations.java
Outdated
Show resolved
Hide resolved
sumedhsakdeo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
.../src/main/java/com/linkedin/openhouse/internal/catalog/OpenHouseInternalTableOperations.java
Outdated
Show resolved
Hide resolved
jiang95-dev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Christian for the comprehensive tests to ensure safety! I couldn't finish reading all of the tests, but I read through SnapshotDiffApplier and left some comments. I don't have concerns about correctness, but have some regarding readability and easiness to debug.
...e-spark-itest/src/test/java/com/linkedin/openhouse/spark/catalogtest/BranchTestSpark3_5.java
Outdated
Show resolved
Hide resolved
...e-spark-itest/src/test/java/com/linkedin/openhouse/spark/catalogtest/BranchTestSpark3_5.java
Show resolved
Hide resolved
...ternalcatalog/src/main/java/com/linkedin/openhouse/internal/catalog/SnapshotDiffApplier.java
Outdated
Show resolved
Hide resolved
...ternalcatalog/src/main/java/com/linkedin/openhouse/internal/catalog/SnapshotDiffApplier.java
Outdated
Show resolved
Hide resolved
...e-spark-itest/src/test/java/com/linkedin/openhouse/spark/catalogtest/BranchTestSpark3_5.java
Show resolved
Hide resolved
...e-spark-itest/src/test/java/com/linkedin/openhouse/spark/catalogtest/BranchTestSpark3_5.java
Show resolved
Hide resolved
...ternalcatalog/src/main/java/com/linkedin/openhouse/internal/catalog/SnapshotDiffApplier.java
Outdated
Show resolved
Hide resolved
| int appendedCount = | ||
| (int) | ||
| regularSnapshots.stream() | ||
| .filter(s -> !existingById.containsKey(s.snapshotId())) | ||
| .count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to the above comment. We should have regularSnapshots as appended regular snapshots.
...ternalcatalog/src/main/java/com/linkedin/openhouse/internal/catalog/SnapshotDiffApplier.java
Outdated
Show resolved
Hide resolved
...ternalcatalog/src/main/java/com/linkedin/openhouse/internal/catalog/SnapshotDiffApplier.java
Outdated
Show resolved
Hide resolved
8748e63 to
39b6cf1
Compare
sumedhsakdeo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rebase this PR on top of #392 (review) thanks 🙏
| assertTrue(refNames.contains("main")); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test-case here that enables MOR on table, and then updates/deletes rows in the main branch, then test two-scenarios, commit conflict that cannot be retried because main has received new updates and cherry pick that goes through because main has not received new updates.
sumedhsakdeo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should find a way to reuse some of the tests in iceberg repo to work against our OH catalog implementation for complete confidence. I found the following test cases that may be relevant.
We could either leverage the test modules from linkedin/iceberg in linkedin/openhouse, or use OpenHouse catalog in linkedin/iceberg.
https://github.com/linkedin/iceberg/blob/openhouse-1.5.0/core/src/test/java/org/apache/iceberg/TestWapWorkflow.java
https://github.com/linkedin/iceberg/blob/openhouse-1.5.0/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestPartitionedWritesToWapBranch.java
https://github.com/linkedin/iceberg/blob/openhouse-1.5.0/core/src/test/java/org/apache/iceberg/TestSnapshotManager.java
https://github.com/linkedin/iceberg/blob/openhouse-1.5.0/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestBranchDDL.java
https://github.com/linkedin/iceberg/blob/openhouse-1.5.0/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestPartitionedWritesToBranch.java
https://github.com/linkedin/iceberg/blob/openhouse-1.5.0/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestPartitionedWritesToWapBranch.java
https://github.com/linkedin/iceberg/blob/openhouse-1.5.0/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestFastForwardBranchProcedure.java
https://github.com/linkedin/iceberg/blob/openhouse-1.5.0/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestTagDDL.java
https://github.com/linkedin/iceberg/blob/openhouse-1.5.0/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestCherrypickSnapshotProcedure.java
| .sql("SELECT * FROM " + tableName + " VERSION AS OF 'feature_a'") | ||
| .collectAsList() | ||
| .size()); // feature-a unchanged | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a test for select from wap snapshot?
| "Main branch should still have 1 row (staged WAP not visible)"); | ||
|
|
||
| // ===== SELECTIVE WAP PUBLISHING ===== | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
swap the publish order here, wap3 before wap2, then do a time-travel query see if the cherry pick order is maintained.
| .collectAsList(); | ||
| String wap2SnapshotId = String.valueOf(wap2Snapshots.get(0).getLong(0)); | ||
| spark.sql( | ||
| "CALL openhouse.system.cherrypick_snapshot('" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason why you picked cherrypick_snapshot over publish_changes? What is the difference? Should we have a test for both?
Summary
Implement Iceberg table branching functionality in OpenHouse internal catalog with comprehensive testing infrastructure.
The old code was getting unwieldy with nested for-loops, mutable collections, and scattered state management, repeated logic and overall needed a restructure with branching supported first-class. i feel more confident that this is reliable.
Changes
Changes
Testing Done
Additional Information