feat: make output data path of table with identity timestamp partition consistent with java api #1736

Open: sharkdtu wants to merge 1 commit into main

sharkdtu

Resolves: #1735

@sharkdtu
Author

@Fokko Could you please take a look at this PR? Thanks!

@sungwy
Collaborator

sungwy commented Feb 27, 2025

Hi @sharkdtu thank you for working on this PR! 😊

I think consistency is great, but Iceberg currently does not require us to guarantee consistent paths (unlike Hive-style partitioning).

I wanted to make sure we understood the reason for requiring consistent paths in your use case.

I left a comment on the linked issue to facilitate that discussion: #1735

@sharkdtu
Author

@sungwy thank you for reviewing this PR!

Yes, Iceberg does not require us to guarantee consistent paths, and this does not affect correctness. Still, I believe it is best to keep the behavior of the different APIs consistent; otherwise, a table written through both the Python and Java APIs ends up with differently named partition paths, which is misleading.

In our production systems, we need to monitor the storage usage and number of files for tables. Although this information can be obtained through Iceberg metadata, the actual physical storage may differ from the Iceberg metadata due to orphan files, residual files that were marked deleted, and other reasons. Therefore, it is often necessary to check the physical storage information corresponding to the table/partition paths. If the files of a partition are scattered across multiple paths, it can cause significant trouble for operations and maintenance.
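As a rough illustration of the check described above, the following sketch groups data files by their parent directory and reports partitions whose files are spread across more than one directory. It is only a sketch under stated assumptions: the `(partition_value, file_path)` pairs, the function names, and the example paths are all hypothetical, and nothing here calls the PyIceberg or Java APIs.

```python
from collections import defaultdict
from posixpath import dirname


def dirs_per_partition(files):
    """Map each partition value to the set of directories holding its files.

    `files` is an iterable of (partition_value, file_path) pairs, for example
    collected from table metadata. Both fields are treated as opaque strings.
    """
    dirs = defaultdict(set)
    for partition_value, file_path in files:
        dirs[partition_value].add(dirname(file_path))
    return dict(dirs)


def scattered_partitions(files):
    """Return only the partitions whose files live in more than one directory."""
    return {p: d for p, d in dirs_per_partition(files).items() if len(d) > 1}


# Hypothetical listing: partition "p1" was written under two directory names,
# as could happen when two writers format the same partition value differently.
example = [
    ("p1", "warehouse/tbl/data/ts=A/f1.parquet"),
    ("p1", "warehouse/tbl/data/ts=B/f2.parquet"),
    ("p2", "warehouse/tbl/data/ts=C/f3.parquet"),
]
print(scattered_partitions(example))
```

A partition reported by `scattered_partitions` is one where a storage check against a single partition path would miss some of the files.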

Development

Successfully merging this pull request may close these issues.

The identity partition path of timestamp type is inconsistent with java api
2 participants