Implement chunk comparison and selective extraction for borg extract (#5638) #8632

alighazi288 · 2025-01-12T05:05:11Z

Archive File Chunk Comparison and Extraction

This implementation provides efficient file restoration from archives by comparing and extracting chunks. Instead of blindly extracting entire files, it:

- Compares existing file content with archived chunks
- Only fetches and updates chunks that differ
- Handles various edge cases:
  - Partial chunks at end of files
  - Files longer/shorter than archive version
  - Empty files
  - Cross-chunk boundary changes
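
The approach described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `update_file_in_place`, the `(chunk_id, size)` layout, the `fetch` callable, and the use of plain SHA-256 as chunk id are all assumptions made for the example (borg computes chunk ids with its own keyed id hash).

```python
import hashlib
import os
import tempfile


def update_file_in_place(path, chunks, fetch):
    """Make the file at `path` match the archived `chunks`.

    chunks: list of (chunk_id, size) in file order (hypothetical layout).
    fetch:  callable chunk_id -> bytes, standing in for a repository read.
    Returns the number of chunks that had to be fetched.
    """
    fetched = 0
    offset = 0
    with open(path, "r+b") as fd:
        for chunk_id, size in chunks:
            fd.seek(offset)
            existing = fd.read(size)  # shorter than `size` if the file ends early
            same = len(existing) == size and hashlib.sha256(existing).digest() == chunk_id
            if not same:
                fd.seek(offset)
                fd.write(fetch(chunk_id))  # only differing chunks are fetched
                fetched += 1
            offset += size
        fd.truncate(offset)  # drop trailing data if the file was longer than the archive
    return fetched


# Usage: the middle chunk differs and the file has extra trailing bytes.
blobs = [b"aaaa", b"bbbb", b"cc"]
store = {hashlib.sha256(b).digest(): b for b in blobs}
chunks = [(hashlib.sha256(b).digest(), len(b)) for b in blobs]
```

An empty archived file is covered by the final `truncate(0)`, and a file shorter than the archived version is covered by the short read, which never matches the expected chunk size.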

alighazi288 · 2025-01-12T05:08:32Z

@ThomasWaldmann Since this is a significant change, I wanted to open this PR as a draft to get your feedback before proceeding further.

Could you please review the current implementation and provide any suggestions or improvements?

codecov · 2025-01-12T05:16:53Z

Codecov Report

Attention: Patch coverage is 78.26087% with 10 lines in your changes missing coverage. Please review.

Project coverage is 81.80%. Comparing base (1559a1e) to head (57760ef).
Report is 17 commits behind head on master.

Files with missing lines	Patch %	Lines
src/borg/archive.py	78.26%	6 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8632      +/-   ##
==========================================
- Coverage   81.83%   81.80%   -0.04%     
==========================================
  Files          74       74              
  Lines       13319    13393      +74     
  Branches     1963     1981      +18     
==========================================
+ Hits        10900    10956      +56     
- Misses       1755     1767      +12     
- Partials      664      670       +6

ThomasWaldmann

can you add some tests, so that codecov does not spam the whole diff view?

alighazi288 · 2025-01-13T05:09:48Z

Done.

ThomasWaldmann

some first feedback

src/borg/archive.py

src/borg/archiver/extract_cmd.py

src/borg/testsuite/archive_test.py

ThomasWaldmann

some feedback

src/borg/archive.py

src/borg/testsuite/archive_test.py

src/borg/archive.py

src/borg/testsuite/archive_test.py

src/borg/archive.py

src/borg/archive.py

ThomasWaldmann · 2025-01-20T23:31:15Z

src/borg/archive.py

                os.unlink(path)
+                st = None


did you think about this?

alighazi288 · 2025-01-21

@ThomasWaldmann I didn't realize that this would break the continue_extraction functionality. The issue is that compare_and_extract_chunks still tries to use the stale st info after the file has been unlinked.

I have tried:

- Tracking unlink state with flags
- Checking inode/link count
- Modifying comparison logic

The only fix I can think of is an extra OS call to check whether the file exists. Maybe I'm missing something?
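
To illustrate the stale-`st` problem discussed here: an `os.stat` result is a snapshot, so it still looks like a regular file after the path has been unlinked. The sketch below assumes a hypothetical helper `can_update_in_place` (not a borg function) and shows that resetting `st` to `None` right after the unlink avoids the extra existence check.

```python
import os
import stat
import tempfile


def can_update_in_place(st):
    """Hypothetical check: only an existing regular file can be updated in place."""
    return st is not None and stat.S_ISREG(st.st_mode)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file")
        with open(path, "wb") as f:
            f.write(b"data")
        st = os.stat(path, follow_symlinks=False)
        os.unlink(path)
        stale = can_update_in_place(st)  # snapshot still claims a regular file
        st = None                        # forget the snapshot once the file is gone
        fresh = can_update_in_place(st)
        return stale, fresh, os.path.exists(path)
```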

ThomasWaldmann · 2025-01-21

If the file in the filesystem is a regular file, your code requires it to be there so that it can be "updated"; thus it must not be removed.

ThomasWaldmann · 2025-01-21

We also need to think about what to do with existing metadata, like ACLs, xattrs, fs flags, ...: the current code assumes that there is no existing metadata and just adds the stuff from the archive item.

alighazi288 · 2025-01-21

@ThomasWaldmann Another solution I can think of is:

    try:
        st = os.stat(path, follow_symlinks=False)
        if continue_extraction and same_item(item, st):
            return  # done! we already have fully extracted this file in a previous run.
        if stat.S_ISREG(st.st_mode) and not continue_extraction:
            if self.compare_and_extract_chunks(item, path, st=st, pi=pi):
                return
        elif stat.S_ISDIR(st.st_mode):
            os.rmdir(path)
        else:
            os.unlink(path)

This way we can attempt an in-place update before any removal/recreation. If the function returns True, we're done. Otherwise, we fall back to the original remove/recreate behavior.

Also, since I'm using restore_attrs() just like the existing code and not handling metadata directly at all, shouldn't it be consistent with how Borg already works?
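
The intended order (try an in-place update first, otherwise remove and recreate) could be sketched in isolation like this. `prepare_target` and the `try_update` callable are hypothetical names for illustration; `try_update` stands in for compare_and_extract_chunks. In this variant a failed in-place attempt explicitly falls through to the removal step.

```python
import os
import stat
import tempfile


def prepare_target(path, try_update):
    """Return True if `path` was updated in place, False if the caller must recreate it."""
    try:
        st = os.stat(path, follow_symlinks=False)
    except FileNotFoundError:
        return False  # nothing there: the caller creates the file from scratch
    if stat.S_ISREG(st.st_mode) and try_update(path, st):
        return True  # updated in place, so the file must stay where it is
    # Fallback: remove whatever is there so that it can be recreated.
    if stat.S_ISDIR(st.st_mode):
        os.rmdir(path)
    else:
        os.unlink(path)
    return False
```

The continue_extraction case is left out of this sketch on purpose; it only shows the fallback ordering.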

ThomasWaldmann · 2025-01-24

As I already said: the restore_attrs code expects a fresh state of the file (like newly created, no ACLs, no xattrs) and just adds the stuff from the archived item.

But if you are updating an existing file, there can already be ACLs or xattrs that do not match what's in the archive (and what shall be the final state).
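
One way to picture the difference between "adding" metadata to a fresh file and reconciling metadata on an existing file is the following sketch. It is not how restore_attrs works; `plan_xattr_changes` and the name-to-value dict layout are assumptions for illustration, and only xattrs are covered (ACLs and fs flags would need analogous handling).

```python
def plan_xattr_changes(existing, archived):
    """Return (names_to_remove, items_to_set) so that `existing` ends up equal to `archived`.

    Both arguments map xattr name (bytes) -> value (bytes); hypothetical layout.
    """
    # Attributes present on the file but absent from the archive must be removed,
    # otherwise they would survive the in-place update.
    to_remove = sorted(name for name in existing if name not in archived)
    # Attributes that are missing or have a different value must be (re)set.
    to_set = {name: value for name, value in archived.items() if existing.get(name) != value}
    return to_remove, to_set


existing = {b"user.old": b"1", b"user.keep": b"x", b"user.chg": b"a"}
archived = {b"user.keep": b"x", b"user.chg": b"b", b"user.new": b"n"}
to_remove, to_set = plan_xattr_changes(existing, archived)
```

On Linux, the `existing` mapping could be gathered with `os.listxattr` and `os.getxattr`, and the plan applied with `os.removexattr` and `os.setxattr`. On a freshly created file `existing` is empty, so nothing has to be removed, which matches the assumption described above.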

alighazi288 · 2025-01-27T00:00:35Z

@ThomasWaldmann This is getting a bit too advanced for my understanding, but I've still tried to implement and verify the attribute restoration. I'm still stuck on the recreate_cmd test failures - would appreciate some guidance there.

ThomasWaldmann reviewed Jan 12, 2025

ThomasWaldmann requested changes Jan 13, 2025

alighazi288 force-pushed the master branch from 737f618 to e81ef60 Compare January 14, 2025 07:59

ThomasWaldmann reviewed Jan 14, 2025

ThomasWaldmann requested changes Jan 16, 2025

ThomasWaldmann requested changes Jan 17, 2025

ThomasWaldmann requested changes Jan 19, 2025

alighazi288 force-pushed the master branch from 17ffef9 to 564b040 Compare January 20, 2025 20:10

Implement chunk comparison and selective extraction

57760ef

- Add compare_and_extract_chunks functionality
- Add comprehensive test coverage
- Fix file state tracking with st parameter

alighazi288 force-pushed the master branch from 564b040 to 57760ef Compare January 20, 2025 20:23

alighazi288 marked this pull request as ready for review January 20, 2025 21:14

ThomasWaldmann requested changes Jan 20, 2025

Refactor compare_and_extract_chunks

80862e5
