- 
                Notifications
    You must be signed in to change notification settings 
- Fork 338
feat(writer): Make LocationGenerator partition-aware #1625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CTTY for this pr!
        
          
                crates/iceberg/src/spec/partition.rs
              
                Outdated
          
        
      | } | ||
|  | ||
| /// Checks if a partition key is effectively none. | ||
| pub fn partition_key_is_none(partition_key: Option<&PartitionKey>) -> bool { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not put it in PartitionKey?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be pub?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think putting this in PartitionKey is a good idea. Currently I have added a PartitionKeyExt trait to have this function work on Option<&PartitionKey> directly.
partition_key_is_none may be used in custom implementations of location generators or used along with some custom writers, so I think it's better to leave it pub
| let location = default_location_gen.generate_location(Some(partition_key), file_name); | ||
| assert_eq!( | ||
| location, | ||
| "s3://data.db/table/data/id=42/name=\"alice\"/data-00000.parquet" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THis is incorrect, it should be url encoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I didn't notice that the implementation of Transform::to_human_string is different from Java. It turns out the Display implementation for Datum escaped the value for string via double quotes here, and partition_path_to_path uses Transform::to_human_string  -> Datum.to_string() to get a human readable string.
Do you happen to know why this is the case?
My current solution is to add a new helper to_human_string under Datum to explicitly not add double quotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you happen to know why this is the case?
I'm not that clear about it, maybe it's an arbitrary decision.
        
          
                crates/iceberg/src/spec/partition.rs
              
                Outdated
          
        
      | } | ||
|  | ||
| /// Extension to help check if a partition key is effectively none. | ||
| pub trait PartitionKeyExt { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little over design. What I mean was just like following:
impl PartitionKey {
     fn is_effectively_none(input: Option<&PartitionKey>) -> bool {
    ...
    }
}
| impl LocationGenerator for MockLocationGenerator { | ||
| fn generate_location(&self, file_name: &str) -> String { | ||
| format!("{}/{}", self.root, file_name) | ||
| fn generate_location(&self, partition: Option<&PartitionKey>, file_name: &str) -> String { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this MockLocationGenerator? Why not just add a method for DefaultLocationGenerator such as with_data_location so that all the  tests cover using it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the same thought as well, not sure why we even need a mock location gen. let me give it a try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out there are 20+ usages of Mock loc gen across the code base, and I think it may be better to have a separate PR, I can work on it after this
created an issue: #1645
… in location generator
290a530    to
    724aeeb      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CTTY for this pr!
Which issue does this PR close?
What changes are included in this PR?
PartitionKeyto hold necessary partition-related info for path generationLocationGenerator::generate_locationto accept an optional partition keyParquetWritertake in anOption<PartitionKey>Are these changes tested?
Added ut