Skip to content

refactor: Adopt ObjectProvider in Table #1227

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Xuanwo
Copy link
Member

@Xuanwo Xuanwo commented Apr 17, 2025

Which issue does this PR close?

Part of #1226

What changes are included in this PR?

This PR unlock our users to use and tune their own object cache at needs.

Are these changes tested?

Unit tests, integration tests.

@Xuanwo Xuanwo mentioned this pull request Apr 17, 2025
3 tasks
@Xuanwo Xuanwo requested review from sdd and liurenjie1024 April 17, 2025 08:33
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Xuanwo for this pr, generally LGTM!

pub async fn load_manifest(&self, file_io: &FileIO) -> Result<Manifest> {
///
/// If `object_cache` is provided, it will be used to cache the manifest.
pub async fn load_manifest(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the original public api, while keeping the one with cache provider crate private. The version with cache provider is more likely being called by internal usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, exposing a public API without an object cache would make the cache meaningless.

Perhaps we should design our API differently, such as using table.load_manifest(), which might make more sense. This can also avoid pass FileIO all the way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, exposing a public API without an object cache would make the cache meaningless.

It's still useful when user want to deal with single manifest or mainfest list file alone, for example doing some debugging or writing some simple tests. I'm fine with exposing the version with cache as public api, but I think we still need to keep the original one.

Copy link
Member Author

@Xuanwo Xuanwo Apr 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still useful when user want to deal with single manifest or mainfest list file alone, for example doing some debugging or writing some simple tests.

It should be very clear to use None to bypass the ObjectCache.

Another idea I have is to extract something like a TableContext that only contains the table's runtime context (for now, it's just FileIO and ObjectCcache), so it doesn't need to be passed around everywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be very clear to use None to bypass the ObjectCache.

How about keeping both apis? I'm worrying about breaking public api unnecessarily.

Another idea I have is to extract something like a TableContext that only contains the table's runtime context (for now, it's just FileIO and ObjectCcache), so it doesn't need to be passed around everywhere.

I also thought about this, but I thought it's a little over design. Another idea is to have sth like ManifestListLoader/ManifestLoader, which could be created from table. But these are optional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should design our API differently, such as using table.load_manifest(), which might make more sense. This can also avoid pass FileIO all the way.

I think this has the opportunity to provide the most ergonomic result (whilst also potentially more disruptive to existing users, although we may be able to mitigate disruption by exposing both old and new methods for a while and mark the old ones as deprecated).

Most of the time, I'd anticipate that most users would want to use the same FileIO and ObjectCache instance across all loaded tables and for each call to table methods. But sometimes a user may want to have an instance of a Table that has different FileIO / ObjectCache to other tables.

With that in mind, I'd suggest an API where Catalog builders would expose with_file_io() / with_object_cache() methods that specify an FileIO and ObjectCache that are provided by default to the table builder when the catalog's load_table method is called.

Then, the table builders could expose with_file_io() / with_object_cache() as well so that consumers can specify an alternative FileIO and/or ObjectCache at the point at which they instantiate the Table, that would be used instead of the default ones passed from the Catalog.

Putting both of these FileIO / ObjectCache / TableMetadata objects into a TableContext struct feels like a good approach to avoid a proliferation of arguments being passed around like here in load_manifest and below in load_manifest_list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should have with_file_io/with_object_cache in CatalogBuilder/TableBuilder interface. However, an extra TableContext in public api seems a little odd to me. All these things, FileIO, ObjectCache also exists in Catalog. WDYT the ManifestLoader/ManifestListLoader solution?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should have with_file_io/with_object_cache in CatalogBuilder/TableBuilder interface. However, an extra TableContext in public api seems a little odd to me. All these things, FileIO, ObjectCache also exists in Catalog. WDYT the ManifestLoader/ManifestListLoader solution?

That's fair, I'm happy to not have TableContext for now and to re-assess if we discover we need to pass a third item around in the same way.

Re ManifestLoader / ManifestListLoader: I'm not 100% clear what your vision is here. Are you thinking that, if a public consumer needed direct access to manifests / manifest lists for a table, they'd do something like:

let manifest_list_loader: ManifestListLoader = table_foo.manifest_list_loader();
let manifest_loader: ManifestLoader = table_foo.manifest_loader();

let manifest_list: ManifestList = manifest_list_loader.load_for_current_snapshot()?;
// or load via snapshot ID or file URI?

let manifest: Manifest = manifest_loader.load("s3://blah")?;
// re-use manifest loader to load as many manifests as you like for table foo

If so, I think this seems reasonable. Are you thinking that the table scan plan would itself use the ManifestLoader also, internally?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking that the table scan plan would itself use the ManifestLoader also, internally?

I think we should also use ManifestLoader internally if we can, so that advanced users colud learn how to use them for source code. But if there is a strong reason we can't do it that way, I'm open to other solutions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me give the loader api a try.

file_io: &FileIO,
object_cache: Option<&ObjectCacheProvider>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants