Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up and expand log path parsing utilities #347

Merged
merged 5 commits into from
Sep 25, 2024

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Sep 19, 2024

Today's log path parsing utilities have two weaknesses:

  1. Everything is wrapped in Option because of the possibility that some path failed to parse. This complicates code that uses it.
  2. The parsing is a thin wrapper around Url, which can only exist very temporarily due to borrowed references. This causes repeated parsing.

To solve both problems, rewrite the parsing utilities to take ownership of -- and embed -- the FileMeta they are derived from, so they can be instantiated once, passed around as needed, and the metadata extracted easily at the end. Also amp up the unit tests to cover the functionality more completely.

@scovich scovich requested review from nicklan and hntd187 September 19, 2024 19:36
@zachschuermann zachschuermann self-requested a review September 19, 2024 19:36
Copy link

codecov bot commented Sep 19, 2024

Codecov Report

Attention: Patch coverage is 95.96774% with 15 lines in your changes missing coverage. Please review.

Project coverage is 74.73%. Comparing base (4c9a6ac) to head (128d7a3).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/path.rs 97.59% 1 Missing and 7 partials ⚠️
kernel/src/snapshot.rs 85.71% 1 Missing and 4 partials ⚠️
acceptance/src/data.rs 0.00% 0 Missing and 1 partial ⚠️
ffi/src/lib.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #347      +/-   ##
==========================================
+ Coverage   74.03%   74.73%   +0.70%     
==========================================
  Files          43       43              
  Lines        8137     8368     +231     
  Branches     8137     8368     +231     
==========================================
+ Hits         6024     6254     +230     
+ Misses       1733     1727       -6     
- Partials      380      387       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice thanks! just a couple small nits

ffi/src/lib.rs Outdated Show resolved Hide resolved
get_filename(path)
.and_then(|f| f.split_once('.'))
.and_then(|(name, _)| get_version_opt(Some(name), VERSION_LEN))
impl CanTryIntoLogPath for FileMeta {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CanTryIntoLogPath should be called AsUrl or AsLogPathUrl. It's not 'trying' (the method should return a result/option if it was), and it's converting to a Url.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a marker trait for ParsedLogPath::try_from. I was hoping it would let me work around foreign trait restrictions, but it didn't. I changed it to AsUrl tho, simpler name is better.

Copy link
Collaborator Author

@scovich scovich Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I also considered just using AsRef<Url>, but AsRef as a general mechanism for exposing struct fields seems weird and error prone, so I decided against it)

kernel/src/path.rs Show resolved Hide resolved
kernel/src/path.rs Outdated Show resolved Hide resolved
@scovich scovich requested a review from nicklan September 23, 2024 16:10
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! thanks

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very awesome, love the new ergonomics! just a quick question and comment

kernel/src/path.rs Show resolved Hide resolved
Comment on lines 110 to 117
let file_type = match split.len() {
// Commit file: <n>.json
1 if extension == "json" => LogPathFileType::Commit,

// Single-part checkpoint: <n>.checkpoint.parquet
2 if split[0] == "checkpoint" && extension == "parquet" => {
LogPathFileType::SinglePartCheckpoint
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could also do split.as_slice()? (just for another idea if that helps readability)

Suggested change
let file_type = match split.len() {
// Commit file: <n>.json
1 if extension == "json" => LogPathFileType::Commit,
// Single-part checkpoint: <n>.checkpoint.parquet
2 if split[0] == "checkpoint" && extension == "parquet" => {
LogPathFileType::SinglePartCheckpoint
}
let file_type = match split.as_slice() {
// Commit file: <n>.json
[_] if extension == "json" => LogPathFileType::Commit,
// Single-part checkpoint: <n>.checkpoint.parquet
["checkpoint", _] && extension == "parquet" => {
LogPathFileType::SinglePartCheckpoint
}
...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also maybe doing pop_back for the extension would help with clarity so it isn't still in split? (would require VecDeque though..)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, that's nice!

@scovich scovich merged commit e88ae2d into delta-io:main Sep 25, 2024
12 checks passed
nicklan added a commit that referenced this pull request Oct 2, 2024
…354)

If we have a `_last_checkpoint` that is out of date, things can get
confused. This code:

1. Cleans up the listing function a bit
2. Ensures we end up with the real latest checkpoint
3. Drops any commit files from the listing that are older than the last
checkpoint
4. `warns!` if ` _last_checkpoint` is out of date
5. Adds a test for this case

This code will conflict with #347, so maybe hold of merging until that
merges and then I can rebase and clean this up more.

---------

Co-authored-by: Nick Lanham <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Stephen Carman <[email protected]>
OussamaSaoudi-db pushed a commit to OussamaSaoudi-db/delta-kernel-rs that referenced this pull request Oct 7, 2024
…elta-io#354)

If we have a `_last_checkpoint` that is out of date, things can get
confused. This code:

1. Cleans up the listing function a bit
2. Ensures we end up with the real latest checkpoint
3. Drops any commit files from the listing that are older than the last
checkpoint
4. `warns!` if ` _last_checkpoint` is out of date
5. Adds a test for this case

This code will conflict with delta-io#347, so maybe hold of merging until that
merges and then I can rebase and clean this up more.

---------

Co-authored-by: Nick Lanham <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Stephen Carman <[email protected]>
@scovich scovich deleted the parsed-log-path branch November 8, 2024 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants