Skip to content

Conversation

@CTTY
Copy link
Contributor

@CTTY CTTY commented Aug 22, 2025

Which issue does this PR close?

What changes are included in this PR?

  • Added PartitionKey to hold necessary partition-related info for path generation
  • [BREAKING] Changed LocationGenerator::generate_location to accept an optional partition key
  • [BREAKING] Have ParquetWriter take in an Option<PartitionKey>

Are these changes tested?

Added ut

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @CTTY for this pr!

}

/// Checks if a partition key is effectively none.
pub fn partition_key_is_none(partition_key: Option<&PartitionKey>) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put it in PartitionKey?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to be pub?

Copy link
Contributor Author

@CTTY CTTY Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think putting this in PartitionKey is a good idea. Currently I have added a PartitionKeyExt trait to have this function work on Option<&PartitionKey> directly.

partition_key_is_none may be used in custom implementations of location generators or used along with some custom writers, so I think it's better to leave it pub

let location = default_location_gen.generate_location(Some(partition_key), file_name);
assert_eq!(
location,
"s3://data.db/table/data/id=42/name=\"alice\"/data-00000.parquet"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THis is incorrect, it should be url encoded.

Copy link
Contributor Author

@CTTY CTTY Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I didn't notice that the implementation of Transform::to_human_string is different from Java. It turns out the Display implementation for Datum escaped the value for string via double quotes here, and partition_path_to_path uses Transform::to_human_string -> Datum.to_string() to get a human readable string.

Do you happen to know why this is the case?

My current solution is to add a new helper to_human_string under Datum to explicitly not add double quotes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you happen to know why this is the case?

I'm not that clear about it, maybe it's an arbitrary decision.

}

/// Extension to help check if a partition key is effectively none.
pub trait PartitionKeyExt {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little over design. What I mean was just like following:

impl PartitionKey {
     fn is_effectively_none(input: Option<&PartitionKey>) -> bool {
    ...
    }
}

impl LocationGenerator for MockLocationGenerator {
fn generate_location(&self, file_name: &str) -> String {
format!("{}/{}", self.root, file_name)
fn generate_location(&self, partition: Option<&PartitionKey>, file_name: &str) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this MockLocationGenerator? Why not just add a method for DefaultLocationGenerator such as with_data_location so that all the tests cover using it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thought as well, not sure why we even need a mock location gen. let me give it a try

Copy link
Contributor Author

@CTTY CTTY Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out there are 20+ usages of Mock loc gen across the code base, and I think it may be better to have a separate PR, I can work on it after this

created an issue: #1645

@CTTY CTTY force-pushed the ctty/partition-writer branch from 290a530 to 724aeeb Compare September 2, 2025 02:48
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @CTTY for this pr!

@liurenjie1024 liurenjie1024 merged commit cd53ac2 into apache:main Sep 2, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make LocationGenerator partition-aware

2 participants