-
Notifications
You must be signed in to change notification settings - Fork 338
feat: Declarative TableMetadata Builder #1362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Declarative TableMetadata Builder #1362
Conversation
|
@liurenjie1024 @Xuanwo Looking forward to your thoughts! |
| #[builder(default, setter(transform = |stats: Vec<StatisticsFile>| { | ||
| stats.into_iter().map(|s| (s.snapshot_id, s)).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why also default? How is this different from snapshots?
Accepting Vec<StatisticsFile> should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not - I missed the default for snapshots and current_snapshot_id and a few others.
My basic reasoning is that all fields that must be set in order to produce valid metadata should not have a default, even if the Rust type would allow it. An example for a non-default field would be schemas, as we require at least one entry in the HashMap to be able to produce valid Metadata (current_schema_id must be set and be present).
Valid TableMetadata must not contain snapshots though, and current_snapshot_id can be None, so those should have defaults.
| .metadata_log(vec![]) | ||
| .refs(HashMap::new()) | ||
| .build() | ||
| .try_normalize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the test for the declarative builder would be better to avoid unnecessary calls. Do we need metadata.try_normalize? (the test passes without metadata.try_normalize call)
OTOH it's somewhat weird that try_normalize is the only API to finish building.
Could build() do the try_normalize implicitly, so that user doesn't have to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer this design as well, however I can't find an easy way to implement a custom build method without implementing it on the very complex generic type that is derived by the macro (see end of this message) and is hard to keep in sync.
typed_builders only usable customization of the build method is using Into to convert the type. There is no native way to inject try_normalize into the build method. Especially, there is no way to make the generated build method return a Result.
I see the following options:
- Just build
TableMetadatawithout normalization - this would allow to create a potentially invalid TableMetadata, as all other constructors (existingTableMetadataBuilder&Deserialize) use normalization. Hence, I discarded it. - Implement a fallible
try_build(with normalize) method and make the derivedbuildmethod private. This would require to implement thetry_buildon the above type, which isn't very obvious, and would need to be kept in sync with fields inTableMetadataand even worse, with which have a default builder implementation. Due to this complexity, and a high probability of state drift, I discarded this as well. - Don't use typed builder - which requires us to write significantly more code, but it would be a viable option.
- Extend typed builder to support
TryIntoin addition toIntoin the build method. - Use a very small custom type that wraps the notion of unvalidated metadata and offers an easy way to validate it - which is what I did.
If there are any other ideas here or if I missed something, let me know!
Here is the type: Note that default and non-default fields are treated differently:
#[allow(dead_code, non_camel_case_types, missing_docs)]
#[automatically_derived]
impl<
__properties: ::typed_builder::Optional<HashMap<String, String>>,
__current_snapshot_id: ::typed_builder::Optional<Option<i64>>,
__snapshots: ::typed_builder::Optional<HashMap<i64, SnapshotRef>>,
__snapshot_log: ::typed_builder::Optional<Vec<SnapshotLog>>,
__metadata_log: ::typed_builder::Optional<Vec<MetadataLog>>,
__statistics: ::typed_builder::Optional<HashMap<i64, StatisticsFile>>,
__partition_statistics: ::typed_builder::Optional<HashMap<i64, PartitionStatisticsFile>>,
__encryption_keys: ::typed_builder::Optional<HashMap<String, String>>,
>
TableMetadataDeclarativeBuilder<(
(FormatVersion,),
(Uuid,),
(String,),
(i64,),
(i64,),
(i32,),
(HashMap<i32, SchemaRef>,),
(i32,),
(HashMap<i32, PartitionSpecRef>,),
(PartitionSpecRef,),
(StructType,),
(i32,),
__properties,
__current_snapshot_id,
__snapshots,
__snapshot_log,
__metadata_log,
(HashMap<i64, SortOrderRef>,),
(i64,),
(HashMap<String, SnapshotReference>,),
__statistics,
__partition_statistics,
__encryption_keys,
)>There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typed_builders only usable customization of the build method is using Into to convert the type. There is no native way to inject try_normalize into the build method. Especially, there is no way to make the generated build method return a Result.
Looks there is draft design for this problem: idanarye/rust-typed-builder#95 . Another option maybe we can try to make the progress of feature. 😄
| #[derive(Debug, PartialEq, Deserialize, Eq, Clone)] | ||
| #[derive(Debug, PartialEq, Deserialize, Eq, Clone, typed_builder::TypedBuilder)] | ||
| #[builder(builder_method(name=declarative_builder))] | ||
| #[builder(builder_type(name=TableMetadataDeclarativeBuilder, doc="Build a new [`TableMetadata`] in a declarative way. For imperative operations (e.g. `add_snapshot`) and creating new TableMetadata, use [`TableMetadataBuilder`] instead."))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not convinced "declarative" vs "imperative" is the defining distinction.
The TableMetadataBuilder is not "a builder pattern for TableMetadata". It's a facade to perform changes on the TableMetadata to be invoked by higher layer. For that reason, it includes implicit changes that are supposed to be carried on when a metadata change happens, such as updating last_updated_ms when a new snapshot is added
| self.last_updated_ms = Some(snapshot.timestamp_ms()); |
The use-case described in the PR description is "the" standard builder pattern. To me, this is the defining difference.
I suggest changing the doc string here and align the type names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I absolutely agree with this from a rust perspective. From an iceberg perspective other languages use the world Builder for this functionality unfortunately. I am uncertain if renaming the TableMetadataBuilder to a clearer name is worth the breaking change today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @findepi about the naming. To me this is more like a constructor, but since I'm not a native English speaker, I don't have a good suggestion to the name. But it would be lovely if we have a better name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I feel the TypedBuilder is a little hard to maintain in such a large struct, do you mind to create a standalone struct for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I feel the
TypedBuilderis a little hard to maintain in such a large struct, do you mind to create a standalone struct for this?
I don't quite understand this suggestion - isn't the purpose of this PR to create TableMetadata? How would implementing it on another struct help?
Regarding names, @liurenjie1024 would you be OK with the breaking change that Piotrs proposal implies? Renaming the existing TableMetadataBuilder to something else, then naming the new builder TableMetadataBuilder (which does something completely different and is used by far less people?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @liurenjie1024 and @findepi you mean different things.
@liurenjie1024 you just would like to rename TableMetadataDeclarativeBuilder to TableMetadataConstructor - correct? I am onboard with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liurenjie1024 you just would like to rename TableMetadataDeclarativeBuilder to TableMetadataConstructor - correct? I am onboard with that.
Yes. Feel free to suggest anything more intuitive.
I don't quite understand this suggestion - isn't the purpose of this PR to create TableMetadata? How would implementing it on another struct help?
I mean not using TypedBuilder to generate code automatically, rather than manually maintaining the builder/construct data structure. This pr contains many advanced usage of TypeBuilder, and as a reviewer I find it's sth how difficult to catch up all the usages. I also worry about the maintainceness of using too much auto generated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed now to TableMetadataConstructor
|
|
||
| impl UnnormalizedTableMetadata { | ||
| /// Build a new [`UnnormalizedTableMetadata`] using the [`TableMetadataDeclarativeBuilder`]. | ||
| pub fn builder() -> TableMetadataDeclarativeBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused? remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a library and the constructor can be used outside of the crate. I like to attach some constructor to a type if available.
As it's an auxiliary type, I am OK with removing it as well.
| /// Unnormalized table metadata, used as an intermediate type | ||
| /// to build table metadata in a declarative way. | ||
| #[derive(Debug, PartialEq, Deserialize, Eq, Clone)] | ||
| pub struct UnnormalizedTableMetadata(TableMetadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get why we need this. Why not build TableMeadata from a declarative builder directly?🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly due to my point 1) in this comment: #1362 (comment)
I don't think there should be any constructor that allows to create unvalidated / potentially corrupt TableMetadata.
|
@findepi @ZENOTME thanks for the Reviews! The most unclear point is probably the return type of the Build method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @c-thiel for this pr! In general I agree that the requirement is reasonable. But I have concerns with the naming and TypedBuilder usage, left some comments for discussion.
| #[derive(Debug, PartialEq, Deserialize, Eq, Clone)] | ||
| #[derive(Debug, PartialEq, Deserialize, Eq, Clone, typed_builder::TypedBuilder)] | ||
| #[builder(builder_method(name=declarative_builder))] | ||
| #[builder(builder_type(name=TableMetadataDeclarativeBuilder, doc="Build a new [`TableMetadata`] in a declarative way. For imperative operations (e.g. `add_snapshot`) and creating new TableMetadata, use [`TableMetadataBuilder`] instead."))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @findepi about the naming. To me this is more like a constructor, but since I'm not a native English speaker, I don't have a good suggestion to the name. But it would be lovely if we have a better name.
| #[derive(Debug, PartialEq, Deserialize, Eq, Clone)] | ||
| #[derive(Debug, PartialEq, Deserialize, Eq, Clone, typed_builder::TypedBuilder)] | ||
| #[builder(builder_method(name=declarative_builder))] | ||
| #[builder(builder_type(name=TableMetadataDeclarativeBuilder, doc="Build a new [`TableMetadata`] in a declarative way. For imperative operations (e.g. `add_snapshot`) and creating new TableMetadata, use [`TableMetadataBuilder`] instead."))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I feel the TypedBuilder is a little hard to maintain in such a large struct, do you mind to create a standalone struct for this?
|
@liurenjie1024 thanks for your review and sorry for letting this drag so long. |
Which issue does this PR close?
Currently we don't have a way to create
TableMetadatain a declarative way other than deserializing it from JSON.The
TableMetadataBuildermostly works imperatively with methods likeadd_schemaandadd_partition_specthat mutate the state incrementally. While incremental modification is what most users need, it is also helpful to offer a type safe way to create a newTableMetadataobject given all of its attributes, other than deserialization from JSON.My concrete use-case is: Lakekeeper stores TableMetadata in Postgres. We extract data typesafe, so we end up with various objects such as a list of schemas, partition_specs, the last_updated_ms and all other fields required to build a new
TableMetadata. As we already have rust-types, we don't want to serialize them to JSON first. We also can't use theTableMetadatabuilder, as it's incremental nature would result in significant overhead.Design considerations
Instead of using the builder approach, we could also add another method
try_from_partstoimpl TableMetadata. We have this alternative approach implemented here: https://github.com/lakekeeper/iceberg-rust/blob/a8b6509775b139c92772a999b0cbca637e274a03/crates/iceberg/src/spec/table_metadata.rs#L676-L740I feel that a builder is less intrusive though. I don't like adding the
UnnormalizedTableMetadata, but didn't find a way to skip this type without writing significantly more code that needs to be maintained. Because of the derived builder in this PR, the approach is almost maintenance free.What changes are included in this PR?
TableMetadataDeclarativeBuilderUnnormalizedTableMetadataas an intermediate type that is.try_normalize()toTableMetadataAre these changes tested?
yes