-
Notifications
You must be signed in to change notification settings - Fork 235
refactor: Adopt ObjectProvider in Table #1227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Xuanwo for this pr, generally LGTM!
pub async fn load_manifest(&self, file_io: &FileIO) -> Result<Manifest> { | ||
/// | ||
/// If `object_cache` is provided, it will be used to cache the manifest. | ||
pub async fn load_manifest( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep the original public api, while keeping the one with cache provider crate private. The version with cache provider is more likely being called by internal usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, exposing a public API without an object cache would make the cache meaningless.
Perhaps we should design our API differently, such as using table.load_manifest()
, which might make more sense. This can also avoid pass FileIO
all the way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, exposing a public API without an object cache would make the cache meaningless.
It's still useful when user want to deal with single manifest or mainfest list file alone, for example doing some debugging or writing some simple tests. I'm fine with exposing the version with cache as public api, but I think we still need to keep the original one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still useful when user want to deal with single manifest or mainfest list file alone, for example doing some debugging or writing some simple tests.
It should be very clear to use None
to bypass the ObjectCache
.
Another idea I have is to extract something like a TableContext
that only contains the table's runtime context (for now, it's just FileIO
and ObjectCcache
), so it doesn't need to be passed around everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be very clear to use None to bypass the ObjectCache.
How about keeping both apis? I'm worrying about breaking public api unnecessarily.
Another idea I have is to extract something like a TableContext that only contains the table's runtime context (for now, it's just FileIO and ObjectCcache), so it doesn't need to be passed around everywhere.
I also thought about this, but I thought it's a little over design. Another idea is to have sth like ManifestListLoader
/ManifestLoader
, which could be created from table. But these are optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should design our API differently, such as using
table.load_manifest()
, which might make more sense. This can also avoid passFileIO
all the way.
I think this has the opportunity to provide the most ergonomic result (whilst also potentially more disruptive to existing users, although we may be able to mitigate disruption by exposing both old and new methods for a while and mark the old ones as deprecated).
Most of the time, I'd anticipate that most users would want to use the same FileIO and ObjectCache instance across all loaded tables and for each call to table methods. But sometimes a user may want to have an instance of a Table that has different FileIO / ObjectCache to other tables.
With that in mind, I'd suggest an API where Catalog builders would expose with_file_io()
/ with_object_cache()
methods that specify an FileIO and ObjectCache that are provided by default to the table builder when the catalog's load_table
method is called.
Then, the table builders could expose with_file_io()
/ with_object_cache()
as well so that consumers can specify an alternative FileIO and/or ObjectCache at the point at which they instantiate the Table, that would be used instead of the default ones passed from the Catalog.
Putting both of these FileIO / ObjectCache / TableMetadata objects into a TableContext
struct feels like a good approach to avoid a proliferation of arguments being passed around like here in load_manifest
and below in load_manifest_list
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should have with_file_io
/with_object_cache
in CatalogBuilder
/TableBuilder
interface. However, an extra TableContext
in public api seems a little odd to me. All these things, FileIO
, ObjectCache
also exists in Catalog
. WDYT the ManifestLoader
/ManifestListLoader
solution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should have
with_file_io
/with_object_cache
inCatalogBuilder
/TableBuilder
interface. However, an extraTableContext
in public api seems a little odd to me. All these things,FileIO
,ObjectCache
also exists inCatalog
. WDYT theManifestLoader
/ManifestListLoader
solution?
That's fair, I'm happy to not have TableContext
for now and to re-assess if we discover we need to pass a third item around in the same way.
Re ManifestLoader
/ ManifestListLoader
: I'm not 100% clear what your vision is here. Are you thinking that, if a public consumer needed direct access to manifests / manifest lists for a table, they'd do something like:
let manifest_list_loader: ManifestListLoader = table_foo.manifest_list_loader();
let manifest_loader: ManifestLoader = table_foo.manifest_loader();
let manifest_list: ManifestList = manifest_list_loader.load_for_current_snapshot()?;
// or load via snapshot ID or file URI?
let manifest: Manifest = manifest_loader.load("s3://blah")?;
// re-use manifest loader to load as many manifests as you like for table foo
If so, I think this seems reasonable. Are you thinking that the table scan plan would itself use the ManifestLoader also, internally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking that the table scan plan would itself use the ManifestLoader also, internally?
I think we should also use ManifestLoader
internally if we can, so that advanced users colud learn how to use them for source code. But if there is a strong reason we can't do it that way, I'm open to other solutions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me give the loader api a try.
file_io: &FileIO, | ||
object_cache: Option<&ObjectCacheProvider>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
Which issue does this PR close?
Part of #1226
What changes are included in this PR?
This PR unlock our users to use and tune their own object cache at needs.
Are these changes tested?
Unit tests, integration tests.