Skip to content

Fix reading specifying format for date partition projection in Hive connector #25817

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

chenjian2664
Copy link
Contributor

@chenjian2664 chenjian2664 commented May 18, 2025

Description

Closes #25642

This PR assumes that the underlying partition paths already follow the user-defined format, specifically for date projection.

If the projection storage.location.template isn't set, the partition paths follow the default Hive format: partName=${partValue}. In this case, partValue is escaped for special characters (e.g., / becomes %2F), same as the existing writing logic in Hive today.

If the storage.location.template is set, the partition value is inserted into the path as is, without escaping special characters. For example:
• With a custom date projection format and a partition column dt formatted as yyyy/MM/dd, this PR allows reading the partition path: dt=2015%2F10%2F10
• If the storage.location.template is set to something like /aaa/bbb/${dt}-xxx, this PR enables reading the partition path:/aaa/bbb/2015/10/10-xxx

This PR not support writing DateProjection with user-defined format that not compatible with Hive date/timestamp format.

Additional context and related issues

Alternative for #25657

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Hive
* Fix reading specifying format for date partition projection in Hive connector. ({issue}`25642`)

Replace legacy formatting with `DateTimeFormatter` for better compatibility
 with the `java.time` package, providing more precise and predictable behavior
This refactor follows AWS Athena's date-type partition projection format
(https://docs.aws.amazon.com/athena/latest/ug/partition-projection-supported-types.html)
in the description.
Additionally, this commit replaces the error-prone Supplier<Instant>
with a direct Instant type for `leftBound` and `rightBound`.
@cla-bot cla-bot bot added the cla-signed label May 18, 2025
@github-actions github-actions bot added the hive Hive connector label May 18, 2025
@chenjian2664 chenjian2664 force-pushed the ref_date_projection branch from af8efa8 to 6ef1fa6 Compare May 18, 2025 13:52
@chenjian2664 chenjian2664 requested a review from pettyjamesm May 19, 2025 01:05
@chenjian2664 chenjian2664 force-pushed the ref_date_projection branch from 6ef1fa6 to bc09276 Compare May 26, 2025 06:52
@github-actions github-actions bot added the hudi Hudi connector label May 26, 2025
@@ -1719,12 +1722,13 @@ public void finishStatisticsCollection(ConnectorSession session, ConnectorTableH
.map(HiveColumnHandle::getType)
.collect(toImmutableList());

Optional<PartitionProjection> partitionProjection = getPartitionProjectionFromHiveTable(handle);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling partition projection within this path seems incorrect, although I'm not sure I understand the implications of it since the code was already somewhat wrong in this regard.

The issue is that partition projection avoids storing the actual partitions in the metastore and computes the values in memory instead. That means that partition statistics don't exist on a per partition basis and the preferred approach is probably to handle these in the same way that unpartitioned tables are handled.

Copy link
Contributor Author

@chenjian2664 chenjian2664 May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the "unpartitioned tables are handled" ?
It seems make no sense to handle in the "finish" stage, I am going to put the the Optional.empty here, as the partition will mismatch at the begining.

About the projection partition stats, I was thought the partition is mapping to the exactly a path

That means that partition statistics don't exist on a per partition basis and the preferred approach is probably to handle these in the same way that unpartitioned tables are handled.

I have no idea, how to make it to act as the unpartitioned table, could you please help to suggest?
Do we need to even update logic when we collecting the column statistics?
But it confuses me: let's say we have a table with partition projection enabled and with the location template: "s3://bucket/some_path1/{pt1}/some_path2/{pt2}-xxx/some_path3", then if we want to collect the statistics, but if want this act as the unpartitioned table, what is the "table path", is it should be something like :"s3//bucket/some_path1/" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to don't change the method canonicalizePartitionValues and just pass the Optional.empty() when call the method parsePartitionValue inside it.
It's somehow already broken when collecting the partition projection table at the begining, we could just leave this part as it is, or: we could reject the collecting stats at the begin part?

@@ -52,6 +53,31 @@ public PartitionProjection(Optional<String> storageLocationTemplate, Map<String,
this.columnProjections = ImmutableMap.copyOf(requireNonNull(columnProjections, "columnProjections is null"));
}

// TODO: support writing partition projection
public boolean allowWrite()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Control flow here doesn't handle the possibility of a storageLocationTemplate being present that doesn't match Hive's path name assumptions (this can happen with any projection type).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@chenjian2664 chenjian2664 force-pushed the ref_date_projection branch 4 times, most recently from a8cabd2 to e60ed0f Compare May 29, 2025 11:50
@chenjian2664 chenjian2664 force-pushed the ref_date_projection branch from e60ed0f to 6c9b3ea Compare May 30, 2025 02:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector hudi Hudi connector
Development

Successfully merging this pull request may close these issues.

Hive partition projection date partition values are not parsed with partition_projection_format
2 participants