Slow on macOS tiny file removal ( Sequoia 15.3 | git 2.48 | Python 3.12 ) #642

Closed
protoEvangelion opened this issue Feb 14, 2025 · 7 comments

protoEvangelion commented Feb 14, 2025

What is happening

Running this to remove a 9-byte file takes over 1 hour:

git filter-repo --path foo.bar --invert-paths

Environment Specs

  • Repo size according to du -sh .git: 104 MB
  • Number of commits: 12k
  • macOS Sequoia 15.3
  • git 2.48.1
  • Python 3.12.9
  • 32 GB RAM (about 10 GB still free)
  • CPU usage stays low while it runs

What I tried

I added logging throughout to track down what is taking so long and narrowed it down to GitUtils.get_refs(target_working_dir), which calls git show-ref.

I tried several versions of Python from 3.10 to 3.12.9, and I updated git from 2.39 to 2.48.
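
For readers unfamiliar with what a helper like this does, here is a minimal sketch of reading refs via git show-ref. It is not the actual git-filter-repo implementation; the names parse_show_ref and read_refs are hypothetical, and the only thing taken from the thread is that get_refs calls git show-ref.

```python
import subprocess


def parse_show_ref(output: bytes) -> dict[bytes, bytes]:
    """Map each ref name to its object hash from `git show-ref` output."""
    refs = {}
    for line in output.splitlines():
        if not line:
            continue
        # Each line has the form "<object hash> <ref name>".
        object_hash, refname = line.split(b" ", 1)
        refs[refname] = object_hash
    return refs


def read_refs(repo_dir: str = ".") -> dict[bytes, bytes]:
    """Run `git show-ref` once in repo_dir and parse whatever it prints."""
    result = subprocess.run(
        ["git", "show-ref"], cwd=repo_dir, capture_output=True, check=False
    )
    # A nonzero exit status is not treated as an error in this sketch;
    # stdout is simply parsed as-is (empty output gives an empty dict).
    return parse_show_ref(result.stdout)
```

Each call spawns a git process and parses every ref, so the cost grows with the number of refs in the repository.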

Workarounds?

@newren Are there any workarounds you are aware of on macOS, such as giving it more resources somehow or commenting out unnecessary parts?

@newren
Owner

newren commented Feb 14, 2025

You're saying that git show-ref takes most of an hour?? Did you run that separately to confirm?

What's the output of git count-objects -v?

Your repository isn't that large from the stats you provided; I'd expect it to run much faster. Could you run git-sizer on it to see if you have a very unusual data shape of some form, and provide that output here as well?
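
One way to confirm whether a single git command is slow on its own is to time it outside of git-filter-repo. The helper below is a small sketch for that; time_command is a hypothetical name, not part of git or git-filter-repo.

```python
import subprocess
import time


def time_command(argv, cwd="."):
    """Run argv once and return (elapsed_seconds, exit_status)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv,
        cwd=cwd,
        stdout=subprocess.DEVNULL,  # discard output; only the timing matters
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return time.perf_counter() - start, completed.returncode
```

For example, time_command(["git", "show-ref"], cwd="path/to/repo") times one invocation. If that finishes in a fraction of a second, the hour is more likely coming from how often the command is called than from the command itself.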

@protoEvangelion
Author

@newren I added a log here before getRefs in the for loop & it logs about 1000 times in ~4 minutes. I'm guessing each log corresponds to one commit? So the whole cmd takes ~1 hour.

git count-objects -v

count: 3161
size: 18164
in-pack: 280778
packs: 34
size-pack: 182743
prune-packable: 95
garbage: 0
size-garbage: 0

git-sizer

Processing blobs: 57049                        
Processing trees: 165774                        
Processing commits: 23899                        
Matching commits to trees: 23899                        
Processing annotated tags: 19117                        
Processing references: 19805                        
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Biggest checkouts            |           |                                |
| * Maximum path depth     [1] |    13     | *                              |
| * Maximum path length    [2] |   188 B   | *                              |

[1]  05d849c84d1b3b42a610a1e157d92fd82e33cef0 (refs/heads/ai-hackathon-news^{tree})
[2]  f1f1c01fa24059b44451d5aaaccf274ce044c4e6 (90f98622d1699c9116aebace6e7b9cab404635c0^{tree})

@newren
Owner

newren commented Feb 14, 2025

> @newren I added a log here before getRefs in the for loop & it logs about 1000 times in ~4 minutes. I'm guessing each log corresponds to one commit? So the whole cmd takes ~1 hour.

The line you highlighted should be run exactly once per program invocation, so if you see it hitting that line 1000 times in ~4 minutes, then something is horribly broken with the setup. Where did you get git-filter-repo from? What version? What other modifications did you (or others) make to it?

> git count-objects -v

Your output looks reasonable. The repo is getting close to needing a repack (git gc); it's just a little below the thresholds at which automatic gc kicks in. But that shouldn't make much of a difference.
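
As a rough illustration of that comparison, the sketch below parses git count-objects -v output and relates it to the documented defaults of gc.auto (6700 loose objects) and gc.autoPackLimit (50 packs). The function names are hypothetical, a repository may override both settings, and git's own check is a heuristic that differs in detail, so treat the ratios as approximate.

```python
# Documented defaults; a repository's config may set different values.
DEFAULT_GC_AUTO = 6700
DEFAULT_GC_AUTO_PACK_LIMIT = 50


def parse_count_objects(text):
    """Turn `git count-objects -v` output into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value)
    return stats


def gc_usage(stats, loose_limit=DEFAULT_GC_AUTO,
             pack_limit=DEFAULT_GC_AUTO_PACK_LIMIT):
    """Return how much of each auto-gc threshold is used, as fractions."""
    return {
        "loose_objects": stats["count"] / loose_limit,
        "packs": stats["packs"] / pack_limit,
    }


reported = """count: 3161
size: 18164
in-pack: 280778
packs: 34
size-pack: 182743
prune-packable: 95
garbage: 0
size-garbage: 0"""

usage = gc_usage(parse_count_objects(reported))
print(f"loose objects: {usage['loose_objects']:.0%} of default gc.auto")
print(f"packs: {usage['packs']:.0%} of default gc.autoPackLimit")
```

With the numbers reported above, this comes to about 47% of the loose-object default and 68% of the pack default.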

> git-sizer

Sorry, can you re-run with the --verbose option?

@protoEvangelion
Author

Hey @newren, I'm using a copy-pasted version of the git-filter-repo file from the main branch, and I've only added the three logging lines shown below. I actually linked to the wrong get_refs call. Here is the right one.

    counter=0

    for refname, pair in old_ref_map.items():
      ...
      else: # Must be either an annotated tag, or a ref whose tip was pruned
        if not new_refs_initialized:
          target_working_dir = self._args.target or b'.'

          counter+=1
          print(counter)

          new_refs = GitUtils.get_refs(target_working_dir)

Here is the log from a git clone with a depth of 3k (the counting section accounts for about 99% of the run time):

 ✝  ~/dev/temp/temp3|(master) => python3 ../gfr.py --path foo.bar --invert-paths
NOTICE: Removing 'origin' remote; see 'Why is my origin removed?'
        in the manual if you want to push back there.
        (was [email protected]:foo/bar.git)
Parsed 3000 commits
New history written in 4.34 seconds; now repacking/cleaning...
1
2
3
...
6304
Repacking your repo and cleaning out old unneeded objects
HEAD is now at f3ec4b47 Update README.md
Enumerating objects: 51569, done.
Counting objects: 100% (51569/51569), done.
Delta compression using up to 10 threads
Compressing objects: 100% (16626/16626), done.
Writing objects: 100% (51569/51569), done.
Total 51569 (delta 32980), reused 51569 (delta 32980), pack-reused 0
Completely finished after 1144.74 seconds.

git-sizer --verbose output:

Processing blobs: 57053                        
Processing trees: 165780                        
Processing commits: 23904                        
Matching commits to trees: 23904                        
Processing annotated tags: 19119                        
Processing references: 19811                        
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |  23.9 k   |                                |
|   * Total size               |  7.06 MiB |                                |
| * Trees                      |           |                                |
|   * Count                    |   166 k   |                                |
|   * Total size               |  90.9 MiB |                                |
|   * Total tree entries       |  2.42 M   |                                |
| * Blobs                      |           |                                |
|   * Count                    |  57.1 k   |                                |
|   * Total size               |  3.72 GiB |                                |
| * Annotated tags             |           |                                |
|   * Count                    |  19.1 k   |                                |
| * References                 |           |                                |
|   * Count                    |  19.8 k   |                                |
|     * Branches               |   653     |                                |
|     * Tags                   |  19.1 k   |                                |
|     * Remote-tracking refs   |    37     |                                |
|     * Git stash              |     1     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  1.90 KiB |                                |
|   * Maximum parents      [2] |     2     |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |   180     |                                |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  7.31 MiB |                                |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |  7.33 k   |                                |
| * Maximum tag depth      [5] |     1     |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [6] |   756     |                                |
| * Maximum path depth     [7] |    13     | *                              |
| * Maximum path length    [8] |   188 B   | *                              |
| * Number of files        [6] |  2.83 k   |                                |
| * Total size of files    [9] |  24.5 MiB |                                |
| * Number of symlinks         |     0     |                                |
| * Number of submodules       |     0     |                                |

[1]  e15daea129b88c614b93f38b3450ebfa10c575d5
[2]  a3565a137e86838c0725a3cfabc86c425606bcc9 (refs/stash)
[3]  62b13938ee685cdbbeabc52dc6c13d49772f6b97 (refs/stash:foo)
[4]  76a00879d21a1205181a359d3779181b7d184715 (cdc302316bc649be99f136d332c3ef47ada35e57:package-lock.json)
[5]  f05f3bc671e8fad6fd1e184ff89b3b96197806f2 (refs/tags/0.1.100)
[6]  f30bd4eb2406bd5e035e8a5ff241054e17e06144 (6c7cc288c4f7987f62a31891dbec42efe4e70614^{tree})
[7]  05d849c84d1b3b42a610a1e157d92fd82e33cef0 (refs/heads/foo^{tree})
[8]  f1f1c01fa24059b44451d5aaaccf274ce044c4e6 (90f98622d1699c9116aebace6e7b9cab404635c0^{tree})
[9]  893bb7c91b71feea4112c22262c68b39e4a8555e (cdc302316bc649be99f136d332c3ef47ada35e57^{tree})

@newren
Owner

newren commented Feb 19, 2025

Well, I feel a bit dumb. What if you make this change:

diff --git a/git-filter-repo b/git-filter-repo
index 72629c4..41f3a02 100755
--- a/git-filter-repo
+++ b/git-filter-repo
@@ -4630,6 +4630,7 @@ class RepoFilter(object):
         if not new_refs_initialized:
           target_working_dir = self._args.target or b'.'
           new_refs = GitUtils.get_refs(target_working_dir)
+          new_refs_initialized = True
         if refname in new_refs:
           new_hash = new_refs[refname]
         else:
@@ -4639,6 +4640,7 @@ class RepoFilter(object):
       if not new_refs_initialized:
         target_working_dir = self._args.target or b'.'
         new_refs = GitUtils.get_refs(target_working_dir)
+        new_refs_initialized = True
       for ref, new_hash in new_refs.items():
         if ref not in orig_refs and not ref.startswith(b'refs/replace/'):
           old_hash = b'0'*len(new_hash)

Does that speed it up considerably?
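
To make the effect of those two added lines concrete, here is a simplified, self-contained model of the pattern. It is not the real RepoFilter code: update_refs, load_refs, and set_flag are hypothetical names, and the real loop only reaches the guard for annotated tags or refs whose tip was pruned. The point is just that a lazy-initialization guard whose flag is never set reloads the data on every iteration.

```python
def update_refs(refnames, load_refs, set_flag):
    """Look up each ref, loading the ref map lazily.

    Returns how many times load_refs was called.
    """
    new_refs = {}
    new_refs_initialized = False
    loads = 0
    for refname in refnames:
        if not new_refs_initialized:
            new_refs = load_refs()  # stands in for the expensive get_refs call
            loads += 1
            if set_flag:
                # The fix: remember that the map has already been loaded.
                new_refs_initialized = True
        new_refs.get(refname)
    return loads


refnames = [b"refs/tags/%d" % i for i in range(6304)]
fake_refs = {name: b"0" * 40 for name in refnames}

buggy_loads = update_refs(refnames, lambda: fake_refs, set_flag=False)
fixed_loads = update_refs(refnames, lambda: fake_refs, set_flag=True)
print(buggy_loads, fixed_loads)
```

Without the flag being set, the loader runs once per ref; with it, the loader runs once in total.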

@newren
Owner

newren commented Feb 19, 2025

Even if it doesn't fix your case that's an important fix, so I committed and pushed it as ccc1885 (filter-repo: avoid parsing the repository refs repeatedly, 2025-02-18). So, you can also try out the fix by just grabbing the latest version. Let me know if it fixes it for you.

@protoEvangelion
Author

@newren YUP 7s now. Nice one!

Parsed 3000 commits
New history written in 4.00 seconds; now repacking/cleaning...
1
Repacking your repo and cleaning out old unneeded objects
HEAD is now at f3ec4b47 Update README.md
Enumerating objects: 51569, done.
Counting objects: 100% (51569/51569), done.
Delta compression using up to 10 threads
Compressing objects: 100% (16626/16626), done.
Writing objects: 100% (51569/51569), done.
Total 51569 (delta 32980), reused 51569 (delta 32980), pack-reused 0
Completely finished after 7.08 seconds.
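
A back-of-the-envelope check on the two logs, as a sketch. It assumes the whole difference between the runs comes from the extra ref-loading calls (6304 in the slow log, 1 in the fast one). The extrapolation to the full clone further assumes roughly one call per annotated tag, using the 19.1 k count from git-sizer; that is an assumption, not something measured in this thread.

```python
SLOW_TOTAL_S = 1144.74   # "Completely finished after" in the slow log
FAST_TOTAL_S = 7.08      # same line in the fixed run
SLOW_CALLS = 6304        # last counter value printed in the slow log
FAST_CALLS = 1           # counter value printed in the fixed run

extra_calls = SLOW_CALLS - FAST_CALLS
seconds_per_call = (SLOW_TOTAL_S - FAST_TOTAL_S) / extra_calls
speedup = SLOW_TOTAL_S / FAST_TOTAL_S

# Assumption: about one call per annotated tag in the full clone.
ASSUMED_FULL_CLONE_CALLS = 19_100
estimated_full_minutes = ASSUMED_FULL_CLONE_CALLS * seconds_per_call / 60

print(f"cost per extra call: {seconds_per_call:.2f} s")
print(f"speedup on the 3k-depth clone: {speedup:.0f}x")
print(f"estimated overhead on the full clone: {estimated_full_minutes:.0f} min")
```

Under those assumptions each redundant call costs a bit under 0.2 s, which puts the full-clone overhead on the order of an hour, in line with the original report.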
