fix: get prefix from offset path #699

Open: wants to merge 6 commits into base: main

roeap
Collaborator

@roeap roeap commented Feb 14, 2025

What changes are proposed in this pull request?

Our list_from implementation for the object_store-based filesystem client is currently broken: it does not behave as documented and required for that function. Specifically, when given a file path, we should list the files in its parent folder, using the path as the offset to list from.

In a follow-up PR we then need to lift the assumption that all URLs will always be under the same store to get proper URL handling.

This PR affects the following public APIs

DefaultEngine::new no longer requires a table_root parameter. I do expect some more changes in an immediate follow-up PR where we update object store handling to account for files stored in separate stores.

How was this change tested?

Current unit tests.

codecov bot commented Feb 14, 2025

Codecov Report

Attention: Patch coverage is 91.48936% with 8 lines in your changes missing coverage. Please review.

Project coverage is 84.23%. Comparing base (16d2557) to head (c0b028e).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/engine/default/filesystem.rs | 70.58% | 4 Missing and 1 partial ⚠️ |
| kernel/src/engine/default/json.rs | 0.00% | 0 Missing and 1 partial ⚠️ |
| kernel/src/engine/default/mod.rs | 90.00% | 0 Missing and 1 partial ⚠️ |
| kernel/src/engine/sync/json.rs | 66.66% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #699      +/-   ##
==========================================
+ Coverage   84.22%   84.23%   +0.01%     
==========================================
  Files          77       78       +1     
  Lines       17926    17988      +62     
  Branches    17926    17988      +62     
==========================================
+ Hits        15098    15153      +55     
- Misses       2110     2114       +4     
- Partials      718      721       +3     


@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 14, 2025
@roeap roeap requested review from zachschuermann, scovich, nicklan and OussamaSaoudi and removed request for zachschuermann and scovich February 15, 2025 00:11
Signed-off-by: Robert Pack <[email protected]>
Comment on lines +349 to +353
/// List the paths in the same directory that are lexicographically greater than
/// (UTF-8 sorting) the given `path`. The result should also be sorted by the file name.
///
/// If the path is directory-like (ends with '/'), the result should contain
/// all the files in the directory.
Collaborator Author

Not 100% sure this is the behavior we want to go for, but thought I'd put up the PR for discussion.
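
To make the discussion concrete, here is a minimal in-memory sketch of the contract stated in the doc comment above. It is not the kernel's implementation: `list_from` here is a hypothetical free function, `all_paths` stands in for the contents of the store, and reading "same directory" as "direct children only" is an assumption.

```rust
/// Hypothetical in-memory model of the documented `list_from` contract.
/// `all_paths` stands in for the object store's contents.
fn list_from(all_paths: &[&str], path: &str) -> Vec<String> {
    // Directory-like paths (trailing '/') list the whole directory;
    // otherwise the parent directory is the prefix and `path` is the offset.
    let (prefix, offset) = if path.ends_with('/') {
        (path, None)
    } else {
        let parent_end = path.rfind('/').map_or(0, |i| i + 1);
        (&path[..parent_end], Some(path))
    };

    let mut out: Vec<String> = all_paths
        .iter()
        .copied()
        .filter(|p| p.starts_with(prefix))
        // Assumption: "same directory" means direct children only.
        .filter(|p| !p[prefix.len()..].contains('/'))
        // Keep only entries strictly greater than the offset (byte-wise, i.e. UTF-8 order).
        .filter(|p| offset.map_or(true, |o| *p > o))
        .map(String::from)
        .collect();
    // The contract asks for results sorted by file name.
    out.sort();
    out
}
```

Under these assumptions, passing `log/00000.json` would return the later files in `log/`, while passing `log/` would return every file directly inside that directory.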

Comment on lines +48 to +50
let offset = Path::from_url_path(path.path())?;
let prefix = if url.path().ends_with('/') {
    offset.clone()
Collaborator

I'm having a bit of trouble following this, but I think it's doing a directory listing rather than a traditional lexicographical start-after listing? That doesn't seem correct given the documented behavior of this function.

Collaborator

Update: The offset is used for list-after; the prefix is used to restrict the listing to a specific directory. And Path provides no easy way to check whether a name is directory-like, because it strips trailing /, so we're reduced to this manual manipulation.
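
A small sketch of why the trailing-slash check has to happen on the raw URL path. `normalize` is a hypothetical stand-in for a path type that drops empty segments when parsing (it is not the object_store API), and `is_directory_like` is likewise an illustrative name.

```rust
/// Hypothetical stand-in for a path type that drops empty segments,
/// and with them any leading or trailing '/', when parsing.
fn normalize(raw: &str) -> String {
    raw.split('/')
        .filter(|segment| !segment.is_empty())
        .collect::<Vec<_>>()
        .join("/")
}

/// The directory-like check looks at the raw URL path,
/// because the normalized form no longer carries the trailing '/'.
fn is_directory_like(raw_url_path: &str) -> bool {
    raw_url_path.ends_with('/')
}
```

After normalization, `/table/_delta_log/` and `/table/_delta_log` are indistinguishable, which is why the check is done on the original string.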

Collaborator

Some code comments explaining all this might be helpful

Comment on lines +52 to +59
let parts = offset.parts().collect_vec();
if parts.is_empty() {
    return Err(Error::generic(format!(
        "Offset path must not be a root directory. Got: '{}'",
        url.as_str()
    )));
}
Path::from_iter(parts[..parts.len() - 1].iter().cloned())
Collaborator

I think this is just

Suggested change
- let parts = offset.parts().collect_vec();
- if parts.is_empty() {
-     return Err(Error::generic(format!(
-         "Offset path must not be a root directory. Got: '{}'",
-         url.as_str()
-     )));
- }
- Path::from_iter(parts[..parts.len() - 1].iter().cloned())
+ let parts = offset.parts().collect_vec();
+ if parts.pop().is_empty() {
+     return Err(Error::generic(format!(
+         "Offset path must not be a root directory. Got: '{}'",
+         url.as_str()
+     )));
+ }
+ Path::from_iter(parts)
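
A standalone sketch of the pop-based idea, using plain strings in place of the PR's path segments. `parent_prefix` is a hypothetical helper, not code from this PR. Note that `Vec::pop` needs a mutable binding and returns an `Option`, so the root-directory case is detected with `is_none()` in this version.

```rust
/// Hypothetical helper: derive the listing prefix (the parent directory)
/// from the segments of an offset path.
fn parent_prefix(offset: &str) -> Result<String, String> {
    let mut parts: Vec<&str> = offset
        .split('/')
        .filter(|segment| !segment.is_empty())
        .collect();
    // `Vec::pop` returns `None` when there is nothing to remove,
    // which is exactly the root-directory case.
    if parts.pop().is_none() {
        return Err(format!(
            "Offset path must not be a root directory. Got: '{offset}'"
        ));
    }
    // The remaining segments form the parent directory.
    Ok(parts.join("/"))
}
```

Popping the last segment both validates the input and leaves the parent segments behind, so no slicing by `len() - 1` is needed.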

@@ -48,9 +45,19 @@ impl<E: TaskExecutor> FileSystemClient for ObjectStoreFileSystemClient<E> {
        path: &Url,
    ) -> DeltaResult<Box<dyn Iterator<Item = DeltaResult<FileMeta>>>> {
        let url = path.clone();
Collaborator

Aside: I don't understand why we need this extra copy when path is never consumed. It makes the code harder to understand because the reader has to keep track of two of the same thing.
