refactor: Adopt ObjectProvider in Table #1227

Xuanwo · 2025-04-17T08:33:34Z

Which issue does this PR close?

Part of #1226

What changes are included in this PR?

This PR unlock our users to use and tune their own object cache at needs.

Are these changes tested?

Unit tests, integration tests.

Signed-off-by: Xuanwo <[email protected]>

liurenjie1024

Thanks @Xuanwo for this pr, generally LGTM!

liurenjie1024 · 2025-04-18T13:01:31Z

crates/iceberg/src/spec/manifest_list.rs

-    pub async fn load_manifest(&self, file_io: &FileIO) -> Result<Manifest> {
+    ///
+    /// If `object_cache` is provided, it will be used to cache the manifest.
+    pub async fn load_manifest(


I think we should keep the original public api, while keeping the one with cache provider crate private. The version with cache provider is more likely being called by internal usage.

Hi, exposing a public API without an object cache would make the cache meaningless.

Perhaps we should design our API differently, such as using table.load_manifest(), which might make more sense. This can also avoid pass FileIO all the way.

Hi, exposing a public API without an object cache would make the cache meaningless.

It's still useful when user want to deal with single manifest or mainfest list file alone, for example doing some debugging or writing some simple tests. I'm fine with exposing the version with cache as public api, but I think we still need to keep the original one.

It's still useful when user want to deal with single manifest or mainfest list file alone, for example doing some debugging or writing some simple tests.

It should be very clear to use None to bypass the ObjectCache.

Another idea I have is to extract something like a TableContext that only contains the table's runtime context (for now, it's just FileIO and ObjectCcache), so it doesn't need to be passed around everywhere.

It should be very clear to use None to bypass the ObjectCache.

How about keeping both apis? I'm worrying about breaking public api unnecessarily.

Another idea I have is to extract something like a TableContext that only contains the table's runtime context (for now, it's just FileIO and ObjectCcache), so it doesn't need to be passed around everywhere.

I also thought about this, but I thought it's a little over design. Another idea is to have sth like ManifestListLoader/ManifestLoader, which could be created from table. But these are optional.

Perhaps we should design our API differently, such as using table.load_manifest(), which might make more sense. This can also avoid pass FileIO all the way.

I think this has the opportunity to provide the most ergonomic result (whilst also potentially more disruptive to existing users, although we may be able to mitigate disruption by exposing both old and new methods for a while and mark the old ones as deprecated).

Most of the time, I'd anticipate that most users would want to use the same FileIO and ObjectCache instance across all loaded tables and for each call to table methods. But sometimes a user may want to have an instance of a Table that has different FileIO / ObjectCache to other tables.

With that in mind, I'd suggest an API where Catalog builders would expose with_file_io() / with_object_cache() methods that specify an FileIO and ObjectCache that are provided by default to the table builder when the catalog's load_table method is called.

Then, the table builders could expose with_file_io() / with_object_cache() as well so that consumers can specify an alternative FileIO and/or ObjectCache at the point at which they instantiate the Table, that would be used instead of the default ones passed from the Catalog.

Putting both of these FileIO / ObjectCache / TableMetadata objects into a TableContext struct feels like a good approach to avoid a proliferation of arguments being passed around like here in load_manifest and below in load_manifest_list.

I agree that we should have with_file_io/with_object_cache in CatalogBuilder/TableBuilder interface. However, an extra TableContext in public api seems a little odd to me. All these things, FileIO, ObjectCache also exists in Catalog. WDYT the ManifestLoader/ManifestListLoader solution?

I agree that we should have with_file_io/with_object_cache in CatalogBuilder/TableBuilder interface. However, an extra TableContext in public api seems a little odd to me. All these things, FileIO, ObjectCache also exists in Catalog. WDYT the ManifestLoader/ManifestListLoader solution?

That's fair, I'm happy to not have TableContext for now and to re-assess if we discover we need to pass a third item around in the same way.

Re ManifestLoader / ManifestListLoader: I'm not 100% clear what your vision is here. Are you thinking that, if a public consumer needed direct access to manifests / manifest lists for a table, they'd do something like:

let manifest_list_loader: ManifestListLoader = table_foo.manifest_list_loader(); let manifest_loader: ManifestLoader = table_foo.manifest_loader(); let manifest_list: ManifestList = manifest_list_loader.load_for_current_snapshot()?; // or load via snapshot ID or file URI? let manifest: Manifest = manifest_loader.load("s3://blah")?; // re-use manifest loader to load as many manifests as you like for table foo

If so, I think this seems reasonable. Are you thinking that the table scan plan would itself use the ManifestLoader also, internally?

Are you thinking that the table scan plan would itself use the ManifestLoader also, internally?

I think we should also use ManifestLoader internally if we can, so that advanced users colud learn how to use them for source code. But if there is a strong reason we can't do it that way, I'm open to other solutions.

Let me give the loader api a try.

liurenjie1024 · 2025-04-18T13:01:51Z

crates/iceberg/src/spec/snapshot.rs

        file_io: &FileIO,
+        object_cache: Option<&ObjectCacheProvider>,


refactor: Adopt ObjectProvider in Table

815e99a

Signed-off-by: Xuanwo <[email protected]>

Xuanwo mentioned this pull request Apr 17, 2025

[EPIC] Iceberg Cache #1226

Open

3 tasks

Xuanwo requested review from sdd and liurenjie1024 April 17, 2025 08:33

Xuanwo added 2 commits April 17, 2025 16:43

iceberg don't need moka directly

f8862eb

Signed-off-by: Xuanwo <[email protected]>

Fix tests

fe5434f

Signed-off-by: Xuanwo <[email protected]>

liurenjie1024 reviewed Apr 18, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Adopt ObjectProvider in Table #1227

refactor: Adopt ObjectProvider in Table #1227

Xuanwo commented Apr 17, 2025

liurenjie1024 left a comment

liurenjie1024 Apr 18, 2025

Xuanwo Apr 18, 2025

liurenjie1024 Apr 18, 2025

Xuanwo Apr 18, 2025 •

edited

Loading

liurenjie1024 Apr 19, 2025

sdd Apr 22, 2025

liurenjie1024 Apr 22, 2025

sdd Apr 22, 2025

liurenjie1024 Apr 23, 2025

Xuanwo Apr 23, 2025

liurenjie1024 Apr 18, 2025

		file_io: &FileIO,
		object_cache: Option<&ObjectCacheProvider>,

refactor: Adopt ObjectProvider in Table #1227

Are you sure you want to change the base?

refactor: Adopt ObjectProvider in Table #1227

Conversation

Xuanwo commented Apr 17, 2025

Which issue does this PR close?

What changes are included in this PR?

Are these changes tested?

liurenjie1024 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Xuanwo Apr 18, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Xuanwo Apr 18, 2025 •

edited

Loading