-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add pure virtual classes for Catalog, Table, etc. #47
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable overall
src/iceberg/table.h
Outdated
virtual std::string uuid() const = 0; | ||
|
||
/// \brief Refresh the current table metadata | ||
virtual void Refresh() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, do we want a table to be able to do I/O, or should it effectively be a POD (albeit not actually a POD)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind, Table
only triggers the (JSON and/or Avro) readers to perform I/O and it is the implementation's responsibility to do that. TBH, I haven't thought about it in detail yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking Table would be a wrapper around already-parsed data, and so to refresh the table, you'd just load the table again (or is there a reason why it has to be the same object?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(In that model, there would be no need for Table to be abstract, except for perhaps the parts relating to scans/transactions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking Table would be a wrapper around already-parsed data, and so to refresh the table
I think you are referring to SerializableTable
in the java impl. I think a Table abstraction is still needed because we want to differentiate a real table versus metadata table. Table internally holds a TableOperations
object which offloads all I/O operations to the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe good to also look at PyIceberg, where we left out the TableOperations
. There is more abstraction in Java because supports File System tables. In PyIceberg we've designed everything around the catalog. The refresh
operation will refresh the TableMetadata
that (de)serializes from/into the JSON metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PyIceberg, the Table also has a reference back to the catalog, which makes it easier to refresh the underlying TableMetadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To @lidavidm:
Hmm, do we want a table to be able to do I/O, or should it effectively be a POD (albeit not actually a POD)?
I believe we have to support I/O at least to parse metadata file. See my comment at #30 (review).
I was thinking Table would be a wrapper around already-parsed data, and so to refresh the table, you'd just load the table again
I think we can still define an abstract Table
class. A subclass StaticTable
accepts a deserialized TableMetadata
and StaticTable::Refresh()
is a no-op.
We still need the Table abstraction for catalog-backed tables and metadata tables.
To @Fokko:
In PyIceberg, the Table also has a reference back to the catalog, which makes it easier to refresh the underlying TableMetadata.
That makes sense! We can add a CatalogTable
to implement a catalog-backed table.
38a7359
to
8600cb2
Compare
src/iceberg/catalog.h
Outdated
/// \brief Starts a transaction to create the table | ||
/// | ||
/// \return the Transaction to create the table | ||
virtual std::unique_ptr<Transaction> CreateTransaction() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought. To make this more Iceberg REST catalog agnostic, we could also call this Stage
:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko Quick question: Transaction
has a comment A transaction for performing multiple updates to a table
but the BaseTransaction
has a check checkLastOperationCommitted
which throws when previous update is uncommitted. Will multiple updates be supported in the future to remove this inconsistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will multiple updates be supported in the future to remove this inconsistency?
Yes, there is no reason not to support this.
src/iceberg/table_operations.h
Outdated
/// \param uncommittedMetadata uncommitted table metadata | ||
/// \return a temporary table operations that behaves like the uncommitted metadata is | ||
/// current | ||
virtual std::unique_ptr<TableOperations> Temp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this one is more appropriate to be renamed as Stage
... @Fokko
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The staged table creation is where you first call the catalog to create a table. The table won't be visible yet but will be reserved for the client that called it. Next, you can write all your data and once you commit the snapshot, it will be visible in the catalog. This way:
- If you do a
CREATE TABLE AS SELECT * ...
which might take some time, you know that you're not going to get into conflicts. - The catalog might provide S3 credentials the client needs to write the data to S3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You convinced me. I have renamed them.
4b90f63
to
2b6e237
Compare
src/iceberg/type_fwd.h
Outdated
@@ -81,4 +81,36 @@ class TimestampTzType; | |||
class Type; | |||
class UuidType; | |||
|
|||
/// \brief Error types for iceberg. | |||
/// TODO: add more and sort them based on some rules. | |||
enum class ErrorKind { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we create a new file for this ErrorKind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was also thinking of adding a detail error message like:
struct Error {
ErrorKind kind;
std::string message;
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems same as Status
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Status can be OK but here we don't want to include that.
kCommitStateUnknown, | ||
}; | ||
|
||
struct Namespace; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are putting these forward declarations here, should we change the name type_fwd.h to something like spec_fwd.h?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type_fwd.h
is a convention used by arrow-cpp to hold forward declarations and enums in a single header file. It might be a coincident that it starts with the data types...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's not "Iceberg data type forward declarations", it's "C++ data type forward declarations". Open to renaming it (though STL at least uses the fwd
convention too with iosfwd
)
2b6e237
to
2d7ceaa
Compare
I have removed Let me know what you think. @lidavidm @zhjwpku @Fokko @Xuanwo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
virtual ~Catalog() = default; | ||
|
||
/// \brief Return the name for this catalog | ||
virtual std::string_view name() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe Name() ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the coding style convention from arrow-cpp. For non-trivial functions, we use camel case with capitalized initial. For trivial functions (e.g. getters), we simply use lowercased snake case.
/// \param ns a namespace | ||
/// \return a list of identifiers for tables or ErrorKind::kNoSuchNamespace | ||
/// if the namespace does not exist | ||
virtual expected<std::vector<TableIdentifier>, Error> ListTables( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should std::unique_ptr<std::vector> be used
instead of std::vector to avoid copying ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can forward declare TableIdentifierList / TableIdentifierListPtr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the compiler is smart enough to do RVO. BTW, I still prefer std::vector<TableIdentifier>
to TableIdentifierList
because it is short enough and more readable (do not need an extra jump to see its full definition)
int64_t snapshot_id) const = 0; | ||
|
||
/// \brief Get the snapshots of this table | ||
virtual const std::vector<std::shared_ptr<Snapshot>>& snapshots() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return an Enumerator or Iterator type if there are too many snapshots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. Maybe you can add it later? I think the current one makes sense because the Table
object has a full state of TableMetadata
so it anyway caches all Snapshot
s in it.
/// May throw ValidationException if any update cannot be applied to the current table | ||
/// metadata. May throw CommitFailedException if the updates cannot be committed due to | ||
/// conflicts. | ||
virtual void CommitTransaction() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abort or rollback is automatically called when a commit fails ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure about this. But checking the java impl it at least does some cleanup work: https://github.com/apache/iceberg/blob/e1e0a7404740b2bf9e6638afb5f0ff19f2536713/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L324-L349
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that’s the case, then the abort/rollback methods don’t need to exist on the transaction interface.
/// \brief Get the snapshot history of this table | ||
/// | ||
/// \return a vector of history entries | ||
virtual const std::vector<std::shared_ptr<HistoryEntry>>& history() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enumerator or Iterator here too ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason as above.
What are done in this PR
Catalog
,Table
,TableOperations
,Transaction
andLocationProvider
with limited features likeAppendFiles
andTableScan
.Namespace
andTableIdentifier
.Error
enum.What are undecided
FileIO
.StructLike
.