Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement chunk comparison and selective extraction for borg extract (#5638) #8632

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

alighazi288
Copy link
Contributor

@alighazi288 alighazi288 commented Jan 12, 2025

Archive File Chunk Comparison and Extraction

This implementation provides efficient file restoration from archives by comparing and extracting chunks. Instead of blindly extracting entire files, it:

  1. Compares existing file content with archived chunks
  2. Only fetches and updates chunks that differ
  3. Handles various edge cases:
    • Partial chunks at end of files
    • Files longer/shorter than archive version
    • Empty files
    • Cross-chunk boundary changes

@alighazi288
Copy link
Contributor Author

@ThomasWaldmann Since this is a significant change, I wanted to open this PR as a draft to get your feedback before proceeding further.

Could you please review the current implementation and provide any suggestions or improvements?

Copy link

codecov bot commented Jan 12, 2025

Codecov Report

Attention: Patch coverage is 78.26087% with 10 lines in your changes missing coverage. Please review.

Project coverage is 81.80%. Comparing base (1559a1e) to head (57760ef).
Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
src/borg/archive.py 78.26% 6 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8632      +/-   ##
==========================================
- Coverage   81.83%   81.80%   -0.04%     
==========================================
  Files          74       74              
  Lines       13319    13393      +74     
  Branches     1963     1981      +18     
==========================================
+ Hits        10900    10956      +56     
- Misses       1755     1767      +12     
- Partials      664      670       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@ThomasWaldmann ThomasWaldmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some tests, so that codecov does not spam the whole diff view?

@alighazi288
Copy link
Contributor Author

Done.

Copy link
Member

@ThomasWaldmann ThomasWaldmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some first feedback

src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archiver/extract_cmd.py Outdated Show resolved Hide resolved
src/borg/testsuite/archive_test.py Outdated Show resolved Hide resolved
src/borg/testsuite/archive_test.py Outdated Show resolved Hide resolved
src/borg/testsuite/archive_test.py Outdated Show resolved Hide resolved
Copy link
Member

@ThomasWaldmann ThomasWaldmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some feedback

src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/testsuite/archive_test.py Outdated Show resolved Hide resolved
src/borg/testsuite/archive_test.py Outdated Show resolved Hide resolved
src/borg/testsuite/archive_test.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/testsuite/archive_test.py Show resolved Hide resolved
src/borg/testsuite/archive_test.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
- Add compare_and_extract_chunks functionality
- Add comprehensive test coverage
- Fix file state tracking with st parameter
@alighazi288 alighazi288 marked this pull request as ready for review January 20, 2025 21:14
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
src/borg/archive.py Outdated Show resolved Hide resolved
Comment on lines 864 to 865
os.unlink(path)
st = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you think about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThomasWaldmann I didn't realize that this would break the continue_extraction functionality. The issue is compare_and_extract_chunks still tries to use stale st info after the file is unlinked.

I have tried:

  • Tracking unlink state with flags
  • Checking inode/link count
  • Modifying comparison logic

The only fix in my mind is the extra OS call to check the file's existence. Maybe I'm missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the fs file is a normal file, your code requires it to be there, so it can be "updated" - thus it must not be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we also need to think about what to do with existing metadata, like acls, xattrs, fs flags, ... - the current code assumes that there is no existing metadata and just adds the stuff from the archive item.

Copy link
Contributor Author

@alighazi288 alighazi288 Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThomasWaldmann another solution I can think of is:

try:
    st = os.stat(path, follow_symlinks=False)
    if continue_extraction and same_item(item, st):
        return  # done! we already have fully extracted this file in a previous run.
    if stat.S_ISREG(st.st_mode) and not continue_extraction:
        if self.compare_and_extract_chunks(item, path, st=st, pi=pi):
            return
    elif stat.S_ISDIR(st.st_mode):
        os.rmdir(path)
    else:
        os.unlink(path)

This way we can try an in-place update attempt before any removal/recreation. If the function returns True, we're done. Otherwise, we fall back to the original remove/recreate behavior.

Also, since I'm using restore_attrs() just like the existing code and not handling metadata directly at all, shouldn't it be consistent with how Borg already works?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I already said: the restore_attrs code expects a fresh state of the file (like newly created, no acls, no xattrs) and just adds the stuff from the archived item.

But if you are updating an existing file, there can be already acls or xattrs that do not match what's in the archive (and what shall be the final state).

@alighazi288
Copy link
Contributor Author

alighazi288 commented Jan 27, 2025

@ThomasWaldmann This is getting a bit too advanced for my understanding, but I've still tried to implement and verify the attribute restoration. I'm still stuck on the recreate_cmd test failures - would appreciate some guidance there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants