feat: add pure virtual classes for Catalog, Table, etc. #47

wgtmac · 2025-03-21T07:34:52Z

What is done in this PR

Added pure virtual classes for Catalog, Table, TableOperations, Transaction and LocationProvider, with a limited set of features such as AppendFiles and TableScan.
Added basic definitions of Namespace and TableIdentifier.
Defined initial Error enum.

What is undecided

How to define an IO-less FileIO.
How to define StructLike.

lidavidm

Seems reasonable overall

lidavidm · 2025-03-21T07:48:29Z

src/iceberg/table.h

+  virtual std::string uuid() const = 0;
+
+  /// \brief Refresh the current table metadata
+  virtual void Refresh() = 0;


Hmm, do we want a table to be able to do I/O, or should it effectively be a POD (albeit not actually a POD)?

In my mind, Table only triggers the (JSON and/or Avro) readers to perform I/O and it is the implementation's responsibility to do that. TBH, I haven't thought about it in detail yet.

I was thinking Table would be a wrapper around already-parsed data, and so to refresh the table, you'd just load the table again (or is there a reason why it has to be the same object?)

(In that model, there would be no need for Table to be abstract, except for perhaps the parts relating to scans/transactions)

I was thinking Table would be a wrapper around already-parsed data, and so to refresh the table

I think you are referring to SerializableTable in the Java implementation. I think a Table abstraction is still needed because we want to differentiate a real table from a metadata table. Table internally holds a TableOperations object, which offloads all I/O operations to the implementation.

It may also be good to look at PyIceberg, where we left out TableOperations. There is more abstraction in Java because it supports file-system tables. In PyIceberg we've designed everything around the catalog. The refresh operation refreshes the TableMetadata, which is (de)serialized from/into the JSON metadata.

In PyIceberg, the Table also has a reference back to the catalog, which makes it easier to refresh the underlying TableMetadata.

To @lidavidm:

Hmm, do we want a table to be able to do I/O, or should it effectively be a POD (albeit not actually a POD)?

I believe we have to support I/O, at least to parse the metadata file. See my comment at #30 (review).

I was thinking Table would be a wrapper around already-parsed data, and so to refresh the table, you'd just load the table again

I think we can still define an abstract Table class. A subclass StaticTable accepts a deserialized TableMetadata and StaticTable::Refresh() is a no-op.

We still need the Table abstraction for catalog-backed tables and metadata tables.

To @Fokko:

In PyIceberg, the Table also has a reference back to the catalog, which makes it easier to refresh the underlying TableMetadata.

That makes sense! We can add a CatalogTable to implement a catalog-backed table.
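The split discussed in this thread could be sketched roughly as follows. This is only an illustration under stated assumptions: `TableMetadata` is reduced to two fields, `last_updated_ms()` is a hypothetical accessor added so that the effect of `Refresh()` can be observed, and the `MetadataLoader` callback stands in for the reference back to the catalog. None of these are the API merged in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Hypothetical, heavily reduced stand-in for parsed table metadata.
struct TableMetadata {
  std::string uuid;
  int64_t last_updated_ms;
};

// Abstract interface, mirroring uuid() and Refresh() from the diff above.
class Table {
 public:
  virtual ~Table() = default;
  virtual std::string uuid() const = 0;
  virtual void Refresh() = 0;
  // Hypothetical accessor, present only to make Refresh() observable.
  virtual int64_t last_updated_ms() const = 0;
};

// Wraps already-parsed metadata; there is nothing to reload.
class StaticTable : public Table {
 public:
  explicit StaticTable(TableMetadata metadata) : metadata_(std::move(metadata)) {}
  std::string uuid() const override { return metadata_.uuid; }
  void Refresh() override {}  // intentionally a no-op
  int64_t last_updated_ms() const override { return metadata_.last_updated_ms; }

 private:
  TableMetadata metadata_;
};

// Catalog-backed table; the loader callback stands in for the catalog.
class CatalogTable : public Table {
 public:
  using MetadataLoader = std::function<TableMetadata()>;
  explicit CatalogTable(MetadataLoader loader)
      : loader_(std::move(loader)), metadata_(loader_()) {}
  std::string uuid() const override { return metadata_.uuid; }
  void Refresh() override { metadata_ = loader_(); }
  int64_t last_updated_ms() const override { return metadata_.last_updated_ms; }

 private:
  MetadataLoader loader_;   // declared first so it is initialized before metadata_
  TableMetadata metadata_;
};

// Usage: the catalog-backed table picks up new metadata, the static one does not.
inline void ExampleRefresh() {
  int64_t clock = 0;
  CatalogTable live([&clock] { return TableMetadata{"1234", ++clock}; });
  assert(live.last_updated_ms() == 1);
  live.Refresh();
  assert(live.last_updated_ms() == 2);

  StaticTable frozen(TableMetadata{"1234", 7});
  frozen.Refresh();
  assert(frozen.last_updated_ms() == 7);
}
```

With this shape, all I/O stays behind the loader, so `StaticTable` remains usable without any catalog or file access.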

Fokko · 2025-03-21T09:14:41Z

src/iceberg/catalog.h

+    /// \brief Starts a transaction to create the table
+    ///
+    /// \return the Transaction to create the table
+    virtual std::unique_ptr<Transaction> CreateTransaction() = 0;


Just a thought. To make this more Iceberg REST catalog agnostic, we could also call this Stage:

https://github.com/apache/iceberg/blob/a4816c1c99063770473920cd6d62f88f90a292dc/open-api/rest-catalog-open-api.yaml#L556-L561

@Fokko Quick question: Transaction has the comment "A transaction for performing multiple updates to a table", but BaseTransaction has a check, checkLastOperationCommitted, which throws when the previous update is uncommitted. Will multiple updates be supported in the future to remove this inconsistency?

Will multiple updates be supported in the future to remove this inconsistency?

Yes, there is no reason not to support this.

wgtmac · 2025-03-21T09:23:12Z

src/iceberg/table_operations.h

+  /// \param uncommittedMetadata uncommitted table metadata
+  /// \return a temporary table operations that behaves like the uncommitted metadata is
+  /// current
+  virtual std::unique_ptr<TableOperations> Temp(


I thought this one would be the more appropriate one to rename to Stage... @Fokko

The staged table creation is where you first call the catalog to create a table. The table won't be visible yet but will be reserved for the client that called it. Next, you can write all your data and once you commit the snapshot, it will be visible in the catalog. This way:

If you do a CREATE TABLE AS SELECT * ... which might take some time, you know that you're not going to get into conflicts.

The catalog might provide S3 credentials the client needs to write the data to S3.

You convinced me. I have renamed them.
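The staged-creation flow described above (reserve the name, write data, become visible on commit) can be illustrated with a toy in-memory model. Everything here is hypothetical: `ToyCatalog`, `StageCreateTable`, `CommitStagedTable` and `TableExists` are names invented for this sketch and do not reflect the REST catalog protocol or the API in this PR.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model: a table name is either staged (reserved, invisible) or visible.
class ToyCatalog {
 public:
  // Reserves the name. Fails if the name is already staged or visible,
  // so a long-running CREATE TABLE AS SELECT cannot be raced by another client.
  bool StageCreateTable(const std::string& name) {
    if (visible_.count(name) > 0 || staged_.count(name) > 0) return false;
    staged_.insert(name);
    return true;
  }

  // Makes a previously staged table visible. Fails if it was never staged.
  bool CommitStagedTable(const std::string& name) {
    if (staged_.erase(name) == 0) return false;
    visible_.insert(name);
    return true;
  }

  // Only committed tables are visible.
  bool TableExists(const std::string& name) const {
    return visible_.count(name) > 0;
  }

 private:
  std::set<std::string> staged_;
  std::set<std::string> visible_;
};

// Usage: the table is reserved but invisible until the commit.
inline void ExampleStagedCreate() {
  ToyCatalog catalog;
  assert(catalog.StageCreateTable("db.orders"));
  assert(!catalog.TableExists("db.orders"));       // not visible yet
  assert(!catalog.StageCreateTable("db.orders"));  // name is reserved
  assert(catalog.CommitStagedTable("db.orders"));
  assert(catalog.TableExists("db.orders"));
}
```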

zhjwpku · 2025-03-24T12:43:34Z

src/iceberg/type_fwd.h

@@ -81,4 +81,36 @@ class TimestampTzType;
 class Type;
 class UuidType;

+/// \brief Error types for iceberg.
+/// TODO: add more and sort them based on some rules.
+enum class ErrorKind {


Should we create a new file for this ErrorKind?

Yes, I was also thinking of adding a detailed error message like:

struct Error { ErrorKind kind; std::string message; };

This seems the same as Status?

Status can be OK but here we don't want to include that.
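To make the distinction concrete, here is a minimal sketch of an `Error` that has no "OK" state, paired with a value-or-error return type. The `Result<T>` class is a hypothetical stand-in built on `std::variant` so that the example compiles without an `expected` implementation; the enum lists only the two kinds that appear in this PR's excerpts, and `CountTables` is invented for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Only error kinds are listed; there is deliberately no kOk value.
enum class ErrorKind { kNoSuchNamespace, kCommitStateUnknown };

struct Error {
  ErrorKind kind;
  std::string message;
};

// Hypothetical stand-in for expected<T, Error>: holds a value or an Error.
template <typename T>
class Result {
 public:
  Result(T value) : storage_(std::move(value)) {}
  Result(Error error) : storage_(std::move(error)) {}

  bool ok() const { return std::holds_alternative<T>(storage_); }
  const T& value() const { return std::get<T>(storage_); }
  const Error& error() const { return std::get<Error>(storage_); }

 private:
  std::variant<T, Error> storage_;
};

// Hypothetical function: success is carried by the value, not by the Error.
inline Result<int> CountTables(const std::string& ns) {
  if (ns.empty()) {
    return Error{ErrorKind::kNoSuchNamespace, "namespace is empty"};
  }
  return 3;  // placeholder count for the sketch
}

// Usage: callers branch on ok() and never see an "OK" error object.
inline void ExampleError() {
  Result<int> good = CountTables("db");
  assert(good.ok() && good.value() == 3);

  Result<int> bad = CountTables("");
  assert(!bad.ok());
  assert(bad.error().kind == ErrorKind::kNoSuchNamespace);
}
```

A Status type has to represent success itself, whereas here success lives in the value alternative and `Error` is only ever constructed on failure.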

zhjwpku · 2025-03-24T12:49:33Z

src/iceberg/type_fwd.h

+  kCommitStateUnknown,
+};
+
+struct Namespace;


Since we are putting these forward declarations here, should we change the name type_fwd.h to something like spec_fwd.h?

type_fwd.h is a convention used by arrow-cpp to hold forward declarations and enums in a single header file. It might be a coincidence that it starts with the data types...

Yeah, it's not "Iceberg data type forward declarations", it's "C++ data type forward declarations". Open to renaming it (though STL at least uses the fwd convention too with iosfwd)
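As a rough single-file illustration of why a `type_fwd.h`-style header is useful: declarations that only mention a type by reference or smart pointer compile against the forward declaration alone. The three sections below would normally live in separate files. The `Namespace` layout, the concrete `Table` and the helper functions `Describe` and `MakeTable` are assumptions made for this sketch, not the definitions in this PR.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// --- Section 1: what a type_fwd.h-style header might contain ---
class Table;
struct Namespace;

// --- Section 2: an API header that needs only the names ---
// References and unique_ptr return types are fine with incomplete types.
std::string Describe(const Namespace& ns);
std::unique_ptr<Table> MakeTable(const std::string& uuid);

// --- Section 3: the full definitions (assumed shapes) ---
struct Namespace {
  std::vector<std::string> levels;
};

class Table {
 public:
  explicit Table(std::string uuid) : uuid_(std::move(uuid)) {}
  const std::string& uuid() const { return uuid_; }

 private:
  std::string uuid_;
};

// Joins namespace levels with '.', e.g. {"db", "schema"} -> "db.schema".
std::string Describe(const Namespace& ns) {
  std::string out;
  for (const std::string& level : ns.levels) {
    if (!out.empty()) out += '.';
    out += level;
  }
  return out;
}

std::unique_ptr<Table> MakeTable(const std::string& uuid) {
  return std::make_unique<Table>(uuid);
}

// Usage: callers that include the full definitions use the API as normal.
inline void ExampleForwardDeclarations() {
  Namespace ns;
  ns.levels.push_back("db");
  ns.levels.push_back("schema");
  assert(Describe(ns) == "db.schema");
  assert(MakeTable("1234")->uuid() == "1234");
}
```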

wgtmac · 2025-03-25T06:48:21Z

I have removed TableOperations as suggested. The Catalog API now mirrors the operationId values from the REST Catalog API for consistency. Please note that these APIs are still subject to change if we find any issues. With this minimal set of APIs, developers can work on different parts without blocking each other.

Let me know what you think. @lidavidm @zhjwpku @Fokko @Xuanwo

zhjwpku

LGTM

gt-yu · 2025-03-25T14:34:28Z

src/iceberg/catalog.h

+  virtual ~Catalog() = default;
+
+  /// \brief Return the name for this catalog
+  virtual std::string_view name() const = 0;


Maybe Name()?

This follows the coding style convention from arrow-cpp. Non-trivial functions use CamelCase with a capitalized first letter. Trivial functions (e.g. getters) simply use lower-case snake_case.
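A small sketch of how the two naming rules look side by side. `InMemoryCatalog`, `RegisterTable` and `ListTableNames` are hypothetical names used only to show the convention.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

class InMemoryCatalog {
 public:
  explicit InMemoryCatalog(std::string name) : name_(std::move(name)) {}

  // Trivial getter: lower-case snake_case.
  std::string_view name() const { return name_; }

  // Non-trivial functions: CamelCase with a capitalized first letter.
  void RegisterTable(std::string table_name) {
    tables_.push_back(std::move(table_name));
  }

  // Returns the registered names that start with the given prefix.
  std::vector<std::string> ListTableNames(std::string_view prefix) const {
    std::vector<std::string> matches;
    for (const std::string& table : tables_) {
      if (std::string_view(table).substr(0, prefix.size()) == prefix) {
        matches.push_back(table);
      }
    }
    return matches;
  }

 private:
  std::string name_;
  std::vector<std::string> tables_;
};

// Usage: the call site makes cheap accessors easy to tell apart from real work.
inline void ExampleNaming() {
  InMemoryCatalog catalog("demo");
  catalog.RegisterTable("db.orders");
  catalog.RegisterTable("db.users");
  catalog.RegisterTable("tmp.scratch");
  assert(catalog.name() == "demo");
  assert(catalog.ListTableNames("db.").size() == 2);
}
```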

gt-yu · 2025-03-25T14:40:11Z

src/iceberg/catalog.h

+  /// \param ns a namespace
+  /// \return a list of identifiers for tables or ErrorKind::kNoSuchNamespace
+  /// if the namespace does not exist
+  virtual expected<std::vector<TableIdentifier>, Error> ListTables(


Should std::unique_ptr<std::vector> be used instead of std::vector to avoid copying?

We can forward declare TableIdentifierList / TableIdentifierListPtr.

I think the compiler is smart enough to do RVO. BTW, I still prefer std::vector<TableIdentifier> to TableIdentifierList because it is short enough and more readable (no extra jump is needed to see its full definition).
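The claim about returning `std::vector` by value can be checked with a small experiment. The sketch below counts copy constructions of the element type. Returning a local vector either elides the copy (NRVO) or falls back to a move, so no element should be copied in either case. `CountingIdentifier`, `CopyCount` and `ListIdentifiers` are hypothetical names for this experiment, not the PR's types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Global copy counter, wrapped in a function to avoid static-init order issues.
inline int& CopyCount() {
  static int count = 0;
  return count;
}

// Element type that records every copy construction or copy assignment.
struct CountingIdentifier {
  std::string name;

  explicit CountingIdentifier(std::string n) : name(std::move(n)) {}
  CountingIdentifier(const CountingIdentifier& other) : name(other.name) {
    ++CopyCount();
  }
  CountingIdentifier(CountingIdentifier&&) noexcept = default;
  CountingIdentifier& operator=(const CountingIdentifier& other) {
    name = other.name;
    ++CopyCount();
    return *this;
  }
  CountingIdentifier& operator=(CountingIdentifier&&) noexcept = default;
};

// Builds a vector locally and returns it by value.
inline std::vector<CountingIdentifier> ListIdentifiers() {
  std::vector<CountingIdentifier> result;
  result.reserve(3);
  result.emplace_back("db.orders");
  result.emplace_back("db.users");
  result.emplace_back("db.events");
  return result;  // NRVO or implicit move; elements are not copied
}

// Usage: receiving the vector by value triggers no element copies.
inline void ExampleReturnByValue() {
  CopyCount() = 0;
  std::vector<CountingIdentifier> tables = ListIdentifiers();
  assert(tables.size() == 3);
  assert(CopyCount() == 0);
}
```

Wrapping the vector in `std::unique_ptr` would add a heap allocation and an extra indirection without saving any copies in this pattern.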

gt-yu · 2025-03-25T14:54:47Z

src/iceberg/table.h

+      int64_t snapshot_id) const = 0;
+
+  /// \brief Get the snapshots of this table
+  virtual const std::vector<std::shared_ptr<Snapshot>>& snapshots() const = 0;


Should we return an Enumerator or Iterator type if there are too many snapshots?

That's a good question. Maybe you can add it later? I think the current one makes sense because the Table object holds the full state of TableMetadata, so it caches all Snapshots anyway.
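The reasoning here is that the accessor only exposes a vector the table already owns. A minimal sketch of that idea, assuming a reduced `Snapshot` with a single field and a hypothetical concrete class `CachedTable`:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Reduced stand-in for a snapshot; the real type carries much more state.
struct Snapshot {
  int64_t snapshot_id;
};

// Hypothetical table that owns every snapshot loaded from its metadata.
class CachedTable {
 public:
  explicit CachedTable(std::vector<std::shared_ptr<Snapshot>> snapshots)
      : snapshots_(std::move(snapshots)) {}

  // Returns a reference to the cached vector; nothing is copied or loaded.
  const std::vector<std::shared_ptr<Snapshot>>& snapshots() const {
    return snapshots_;
  }

 private:
  std::vector<std::shared_ptr<Snapshot>> snapshots_;
};

// Usage: repeated calls hand back the same vector and share no new owners.
inline void ExampleSnapshots() {
  std::vector<std::shared_ptr<Snapshot>> loaded;
  loaded.push_back(std::make_shared<Snapshot>(Snapshot{1}));
  loaded.push_back(std::make_shared<Snapshot>(Snapshot{2}));
  CachedTable table(std::move(loaded));

  const auto& first = table.snapshots();
  const auto& second = table.snapshots();
  assert(&first == &second);          // same underlying vector
  assert(first.size() == 2);
  assert(first[0].use_count() == 1);  // no shared_ptr copies were made
}
```

Since the data is resident either way, an iterator-based accessor would mainly change the interface rather than the memory footprint, and it could be added later if snapshots stop being fully cached.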

gt-yu · 2025-03-25T14:59:55Z

src/iceberg/transaction.h

+  /// May throw ValidationException if any update cannot be applied to the current table
+  /// metadata. May throw CommitFailedException if the updates cannot be committed due to
+  /// conflicts.
+  virtual void CommitTransaction() = 0;


Is abort or rollback called automatically when a commit fails?

I'm not 100% sure about this. But checking the Java implementation, it at least does some cleanup work: https://github.com/apache/iceberg/blob/e1e0a7404740b2bf9e6638afb5f0ff19f2536713/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L324-L349

If that’s the case, then the abort/rollback methods don’t need to exist on the transaction interface.

gt-yu · 2025-03-25T15:00:56Z

src/iceberg/table.h

+  /// \brief Get the snapshot history of this table
+  ///
+  /// \return a vector of history entries
+  virtual const std::vector<std::shared_ptr<HistoryEntry>>& history() const = 0;


Enumerator or Iterator here too?

Same reason as above.

Fokko · 2025-03-26T13:08:38Z

It looks like there is consensus; let's move this forward! Thanks @wgtmac for working on this, and thanks @lidavidm, @zhjwpku, @gt-yu and @mapleFU for the reviews 🚀

lidavidm reviewed Mar 21, 2025

wgtmac force-pushed the add_catalog_interfaces branch from 38a7359 to 8600cb2 Compare March 21, 2025 08:41

lidavidm approved these changes Mar 21, 2025

Fokko reviewed Mar 21, 2025

wgtmac commented Mar 21, 2025

wgtmac force-pushed the add_catalog_interfaces branch 4 times, most recently from 4b90f63 to 2b6e237 Compare March 21, 2025 11:01

zhjwpku reviewed Mar 24, 2025

wgtmac added 3 commits March 25, 2025 11:24

Add pure virtual classes for Catalog, Table, etc.

a3b4117

use const ref and unique_ptr where possible

1aaa348

remove TableOperations

2d7ceaa

wgtmac force-pushed the add_catalog_interfaces branch from 2b6e237 to 2d7ceaa Compare March 25, 2025 06:42

wgtmac changed the title ~~Add pure virtual classes for Catalog, Table, etc.~~ feat: add pure virtual classes for Catalog, Table, etc. Mar 25, 2025

lidavidm approved these changes Mar 25, 2025

Fokko approved these changes Mar 25, 2025

zhjwpku approved these changes Mar 25, 2025

gt-yu reviewed Mar 25, 2025

gt-yu approved these changes Mar 26, 2025

Fokko merged commit 0358d7f into apache:main Mar 26, 2025
6 checks passed
