
Conversation

@githubnemo (Collaborator) commented Jan 14, 2025

This PR introduces CI caching for datasets and Hub artifacts across runner operating systems, with the goal of minimizing the number of test runs that fail because of network faults. As an additional bonus, it might make the CI a bit faster.

The following artifacts are cached: ${HF_HOME}/hub/**

Note that we're excluding .lock files as well as *.pyc files. We're not simply caching all of $HF_HOME, since it also contains the datasets and modules directories: the former was acting up when testing (no details, just dropped; we may explore this later, but we're not using that many datasets), and the latter is just code, which is probably not a good idea to cache anyway.

There is a post-process step for the cache action which uploads new data to the cache; only one runner is allowed to upload. This is done because GitHub Actions locks cache creation, so if there are concurrent cache creations, both may fail. This runner is currently set to Ubuntu in the Python 3.10 run.

If this modification turns out to be ineffective, we can move to forbidding access to the Hub in general (HF_HUB_OFFLINE=1) and updating the cache once per day, but let's first see whether this is already enough to decrease the failure rate.
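
To make this concrete, here is a minimal sketch of the kind of cache step described above. It is not the exact workflow in this PR: the key scheme and the actions/cache version are assumptions.

```yaml
# Sketch only; the key scheme and action version are assumptions,
# not this PR's exact configuration.
- name: Cache HF hub artifacts
  uses: actions/cache@v4
  with:
    path: |
      ${{ env.HF_HOME }}/hub/**
      !${{ env.HF_HOME }}/**/*.lock
      !${{ env.HF_HOME }}/**/*.pyc
    key: hf-hub-cache-${{ runner.os }}
```

To get the "only one runner uploads" behavior, the restore and save halves can be split into `actions/cache/restore` (run on every job) and `actions/cache/save` (run only in the designated Ubuntu / Python 3.10 job).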

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

nemo added 9 commits January 14, 2025 14:26

  • Especially avoid lock files, code and pyc files
  • Remove `-` in front of lookup key to avoid an accidental numerical sort (just a precaution)
  • This makes sure that we're always saving the newest models into the cache even if we had a cache hit (we might still have updated models)
@githubnemo marked this pull request as ready for review January 15, 2025 17:32
@githubnemo changed the title from "Draft: Attempt at adding a cache for models" to "Attempt at adding a cache for models" Jan 15, 2025
@githubnemo (Collaborator, Author)

The failing zizmor run is fixed here: #2331

@githubnemo (Collaborator, Author)

@Wauplin do you think this is an approach that will work or did I overlook something?

@Wauplin (Contributor) commented Jan 16, 2025

The approach looks reasonable to me, though I never really tried it myself. A few things that come to mind though:

  • it might take longer to load the cache on the GitHub runner than downloading weights from the Hub. So as always with this kind of stuff, better to test
  • I'm not sure why !${{ env.HF_HOME }}/**/*.pyc is needed since only the /hub folder is cached (which contains models/datasets). Doesn't hurt to have it but wonder if it's necessary
  • I don't know how GitHub cache works with symlinks. I suspect that it might not. Since we are heavily relying on symlinks in the cache, it might lead to suboptimal cases:
    • typically we have in cache model A on revision 123. If the model card gets updated on the Hub, the revision is updated to 456, but the model weights did not change. On a machine with symlinks, we would not need to redownload the weights. But if symlinking doesn't work, then we have a degraded cache experience (typically on Windows without dev mode 👀 ). So in this case with GitHub cache we might 1. load the weights from the GH cache and then 2. load the weights from HF.
      • even worse, next time we load the cache, the weights file will be duplicated between revision 123 and revision 456 even though revision 123 will never be used again. As a workaround you could hardcode revisions in the tests.
  • no matter the cache size, I think there should be a cleaning process at the end of the CI. If a file is in the cache and has not been accessed once during the tests, then it should be removed so that it's not reloaded at the next CI run. This might be tricky to implement; check out https://huggingface.co/docs/huggingface_hub/guides/manage-cache#scan-cache-from-python and https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-cache-from-python on how to scan/clean the cache (see the sketch after this list).
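
For illustration, a rough sketch of that scan-and-clean idea using the huggingface_hub APIs from the guides above (the 30-day threshold is arbitrary, and the eventual script may look different):

```python
from datetime import datetime, timezone

from huggingface_hub import scan_cache_dir

MAX_AGE_DAYS = 30  # arbitrary threshold for this sketch

cache_info = scan_cache_dir()  # scans the hub cache under HF_HOME
now = datetime.now(timezone.utc).timestamp()

# Collect the commit hashes of every revision belonging to a repo
# that has not been accessed within the threshold.
stale_hashes = []
for repo in cache_info.repos:
    age_days = (now - repo.last_accessed) / 86400
    if age_days > MAX_AGE_DAYS:
        stale_hashes.extend(rev.commit_hash for rev in repo.revisions)

# delete_revisions only builds a deletion plan; nothing is removed
# until execute() is called.
strategy = cache_info.delete_revisions(*stale_hashes)
print(f"Freeing {strategy.expected_freed_size_str}")
strategy.execute()
```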

Note: I know @ivarflakstad is working on something similar (see private Slack thread). You should both get in touch if you start building such a tool. It might even become a GitHub Action of its own.

@BenjaminBossan (Member)

Thanks for the detailed response. Regarding duplication and clean up: As a practical albeit not very efficient solution, could we just purge the whole cache periodically?

@githubnemo (Collaborator, Author)

Thanks a lot :)

  • it might take longer to load the cache on the GitHub runner than downloading weights from the Hub. So as always with this kind of stuff, better to test

Yep. I think it's OK though, since we're not optimizing for speed but against test flakiness due to failed downloads. However, in my testing I didn't see a big difference (this might change over time, of course).

  • I'm not sure why !${{ env.HF_HOME }}/**/*.pyc is needed since only the /hub folder is cached (which contains models/datasets). Doesn't hurt to have it but wonder if it's necessary

Not sure; in case the Hub repo also hosts code, won't the Python file be compiled in the cache, resulting in a .pyc file?

  • I don't know how GitHub cache works with symlinks. I suspect that it might not. Since we are heavily relying on symlinks in the cache, it might lead to suboptimal cases:

    • typically we have in cache model A on revision 123. If the model card gets updated on the Hub, the revision is updated to 456, but the model weights did not change. On a machine with symlinks, we would not need to redownload the weights. But if symlinking doesn't work, then we have a degraded cache experience (typically on Windows without dev mode 👀 ). So in this case with GitHub cache we might 1. load the weights from the GH cache and then 2. load the weights from HF.

    • even worse, next time we load the cache, the weights file will be duplicated between revision 123 and revision 456 even though revision 123 will never be used again. As a workaround you could hardcode revisions in the tests.

I think that symlinks do work, since the cache directory is simply packed and unpacked with tar. I don't know how it deals with symlinks under Windows, though; that might be a problem. But then again, it would happen anyway if I understood you correctly, right?

Agreed, great pointers. Thank you! I will look into a cache cleaning utility.

@Wauplin (Contributor) commented Jan 20, 2025

Not sure; in case the Hub repo also hosts code, won't the Python file be compiled in the cache, resulting in a .pyc file?

I'm not exactly sure from where transformers imports the modules hosted on the Hub, TBH. I thought they were copied to a temporary directory (after being downloaded), but honestly this is not a part that I know much about.

we're not optimizing for speed but against test flakiness due to failed downloads.

Oh ok. Didn't know that. I assumed speed was the goal here.

I don't know how it deals with symlinks under Windows, though; that might be a problem. But then again, it would happen anyway if I understood you correctly, right?

Even on Windows it will work, yes. It's simply less optimized, so it might redownload some stuff from time to time.

Regarding duplication and clean up: As a practical albeit not very efficient solution, could we just purge the whole cache periodically?

That would also be a simpler solution than having a complex script 😄

nemo added 5 commits January 20, 2025 15:12

  • This script finds hub cache entries with a last access time older than 30 days and deletes them from the cache. This way, if a model is no longer used by the CI, it does not needlessly eat up disk space.
  • It's not critical to break the CI
@githubnemo (Collaborator, Author)

I've attempted to write a script for cleaning the cache. Whether it is too complex I cannot judge alone, but I think it is fairly straightforward at least. I would also be OK with removing the whole cache periodically, though. This might even be happening already, since we only have 10GB of cache space and the Python package cache is OS- and Python-version-dependent, so we're regularly being cleaned up by GitHub :)

Let me know what you think.

@BenjaminBossan (Member) left a comment

From my POV, this looks good. We can try it and if it doesn't help with the goal of making CI more stable, we can revert it.

Comment on lines +24 to +27
```python
revisions = [(i.revisions, i.last_accessed) for i in scan_results.repos]
revisions_ages = [(rev, (now - dt.fromtimestamp(ts_access)).days) for rev, ts_access in revisions]
delete_candidates = [rev for rev, age in revisions_ages if age > max_age_days]
hashes = [n.commit_hash for rev in delete_candidates for n in rev]
```
@BenjaminBossan (Member)

Just wondering if it wouldn't be easier to write this as a for-loop instead of having 4 list comprehensions.
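
Something along these lines, perhaps (a sketch reusing the snippet's names):

```python
# Equivalent for-loop form of the four comprehensions above.
hashes = []
for repo in scan_results.repos:
    age_days = (now - dt.fromtimestamp(repo.last_accessed)).days
    if age_days > max_age_days:
        hashes.extend(rev.commit_hash for rev in repo.revisions)
```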

@githubnemo (Collaborator, Author)

Noted for the future :)

@ivarflakstad (Member)

Since I was mentioned I just wanted to say that this implementation is better than the little script I had ☺️
Carry on hehe

@githubnemo merged commit 9c11a3e into huggingface:main Jan 23, 2025
15 checks passed