Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add pure virtual classes for Catalog, Table, etc. #47

Merged
merged 3 commits into from
Mar 26, 2025

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Mar 21, 2025

What are done in this PR

  • Added pure virtual classes of Catalog, Table, TableOperations, Transaction and LocationProvider with limited features like AppendFiles and TableScan.
  • Added basic definitions of Namespace and TableIdentifier.
  • Defined initial Error enum.

What are undecided

  • How to define an IO-less FileIO.
  • How to define StructLike.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable overall

virtual std::string uuid() const = 0;

/// \brief Refresh the current table metadata
virtual void Refresh() = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, do we want a table to be able to do I/O, or should it effectively be a POD (albeit not actually a POD)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind, Table only triggers the (JSON and/or Avro) readers to perform I/O and it is the implementation's responsibility to do that. TBH, I haven't thought about it in detail yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking Table would be a wrapper around already-parsed data, and so to refresh the table, you'd just load the table again (or is there a reason why it has to be the same object?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(In that model, there would be no need for Table to be abstract, except for perhaps the parts relating to scans/transactions)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking Table would be a wrapper around already-parsed data, and so to refresh the table

I think you are referring to SerializableTable in the java impl. I think a Table abstraction is still needed because we want to differentiate a real table versus metadata table. Table internally holds a TableOperations object which offloads all I/O operations to the implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe good to also look at PyIceberg, where we left out the TableOperations. There is more abstraction in Java because supports File System tables. In PyIceberg we've designed everything around the catalog. The refresh operation will refresh the TableMetadata that (de)serializes from/into the JSON metadata.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PyIceberg, the Table also has a reference back to the catalog, which makes it easier to refresh the underlying TableMetadata.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To @lidavidm:

Hmm, do we want a table to be able to do I/O, or should it effectively be a POD (albeit not actually a POD)?

I believe we have to support I/O at least to parse metadata file. See my comment at #30 (review).

I was thinking Table would be a wrapper around already-parsed data, and so to refresh the table, you'd just load the table again

I think we can still define an abstract Table class. A subclass StaticTable accepts a deserialized TableMetadata and StaticTable::Refresh() is a no-op.

We still need the Table abstraction for catalog-backed tables and metadata tables.

To @Fokko:

In PyIceberg, the Table also has a reference back to the catalog, which makes it easier to refresh the underlying TableMetadata.

That makes sense! We can add a CatalogTable to implement a catalog-backed table.

@wgtmac wgtmac force-pushed the add_catalog_interfaces branch from 38a7359 to 8600cb2 Compare March 21, 2025 08:41
/// \brief Starts a transaction to create the table
///
/// \return the Transaction to create the table
virtual std::unique_ptr<Transaction> CreateTransaction() = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought. To make this more Iceberg REST catalog agnostic, we could also call this Stage:

https://github.com/apache/iceberg/blob/a4816c1c99063770473920cd6d62f88f90a292dc/open-api/rest-catalog-open-api.yaml#L556-L561

Copy link
Member Author

@wgtmac wgtmac Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko Quick question: Transaction has a comment A transaction for performing multiple updates to a table but the BaseTransaction has a check checkLastOperationCommitted which throws when previous update is uncommitted. Will multiple updates be supported in the future to remove this inconsistency?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will multiple updates be supported in the future to remove this inconsistency?

Yes, there is no reason not to support this.

/// \param uncommittedMetadata uncommitted table metadata
/// \return a temporary table operations that behaves like the uncommitted metadata is
/// current
virtual std::unique_ptr<TableOperations> Temp(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this one is more appropriate to be renamed as Stage... @Fokko

Copy link
Contributor

@Fokko Fokko Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The staged table creation is where you first call the catalog to create a table. The table won't be visible yet but will be reserved for the client that called it. Next, you can write all your data and once you commit the snapshot, it will be visible in the catalog. This way:

  • If you do a CREATE TABLE AS SELECT * ... which might take some time, you know that you're not going to get into conflicts.
  • The catalog might provide S3 credentials the client needs to write the data to S3.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You convinced me. I have renamed them.

@wgtmac wgtmac force-pushed the add_catalog_interfaces branch 4 times, most recently from 4b90f63 to 2b6e237 Compare March 21, 2025 11:01
@@ -81,4 +81,36 @@ class TimestampTzType;
class Type;
class UuidType;

/// \brief Error types for iceberg.
/// TODO: add more and sort them based on some rules.
enum class ErrorKind {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create a new file for this ErrorKind?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was also thinking of adding a detail error message like:

struct Error {
  ErrorKind kind;
  std::string message;
};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems same as Status?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status can be OK but here we don't want to include that.

kCommitStateUnknown,
};

struct Namespace;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are putting these forward declarations here, should we change the name type_fwd.h to something like spec_fwd.h?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type_fwd.h is a convention used by arrow-cpp to hold forward declarations and enums in a single header file. It might be a coincident that it starts with the data types...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's not "Iceberg data type forward declarations", it's "C++ data type forward declarations". Open to renaming it (though STL at least uses the fwd convention too with iosfwd)

@wgtmac wgtmac force-pushed the add_catalog_interfaces branch from 2b6e237 to 2d7ceaa Compare March 25, 2025 06:42
@wgtmac
Copy link
Member Author

wgtmac commented Mar 25, 2025

I have removed TableOperations as suggested. Now the Catalog API reflects the operationId from the Rest Catalog API to be consistent. Please note that these APIs are still subject to change when we find any issue. With these minimal set of APIs, developers can work on different parts without blocking each other.

Let me know what you think. @lidavidm @zhjwpku @Fokko @Xuanwo

@wgtmac wgtmac changed the title Add pure virtual classes for Catalog, Table, etc. feat: add pure virtual classes for Catalog, Table, etc. Mar 25, 2025
Copy link
Collaborator

@zhjwpku zhjwpku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

virtual ~Catalog() = default;

/// \brief Return the name for this catalog
virtual std::string_view name() const = 0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe Name() ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the coding style convention from arrow-cpp. For non-trivial functions, we use camel case with capitalized initial. For trivial functions (e.g. getters), we simply use lowercased snake case.

/// \param ns a namespace
/// \return a list of identifiers for tables or ErrorKind::kNoSuchNamespace
/// if the namespace does not exist
virtual expected<std::vector<TableIdentifier>, Error> ListTables(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should std::unique_ptr<std::vector> be used
instead of std::vector to avoid copying ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can forward declare TableIdentifierList / TableIdentifierListPtr.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the compiler is smart enough to do RVO. BTW, I still prefer std::vector<TableIdentifier> to TableIdentifierList because it is short enough and more readable (do not need an extra jump to see its full definition)

int64_t snapshot_id) const = 0;

/// \brief Get the snapshots of this table
virtual const std::vector<std::shared_ptr<Snapshot>>& snapshots() const = 0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return an Enumerator or Iterator type if there are too many snapshots?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. Maybe you can add it later? I think the current one makes sense because the Table object has a full state of TableMetadata so it anyway caches all Snapshots in it.

/// May throw ValidationException if any update cannot be applied to the current table
/// metadata. May throw CommitFailedException if the updates cannot be committed due to
/// conflicts.
virtual void CommitTransaction() = 0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abort or rollback is automatically called when a commit fails ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that’s the case, then the abort/rollback methods don’t need to exist on the transaction interface.

/// \brief Get the snapshot history of this table
///
/// \return a vector of history entries
virtual const std::vector<std::shared_ptr<HistoryEntry>>& history() const = 0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enumerator or Iterator here too ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason as above.

@Fokko
Copy link
Contributor

Fokko commented Mar 26, 2025

It looks like there is consensus; let's move this forward! Thanks @wgtmac for working on this, and thanks @lidavidm, @zhjwpku, @gt-yu and @mapleFU for the reviews 🚀

@Fokko Fokko merged commit 0358d7f into apache:main Mar 26, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants