Attempt at adding a cache for models #2327
Conversation
Force-pushed from 1a2fc18 to 644310e
Force-pushed from 644310e to 79713b1
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Especially avoid lock files, code and pyc files
Remove `-` in front of lookup key to avoid weird numerical sort by accident (just a precaution)
This makes sure that we're always saving the newest models into the cache even if we had a cache hit (we might still have updated models)
Force-pushed from cd903fa to 534fe01
The failing zizmor run is fixed here: #2331
@Wauplin do you think this is an approach that will work, or did I overlook something?
The approach looks reasonable to me, though I never really tried it myself. A few things that come to mind:
Note: I know @ivarflakstad is working on something similar (see private Slack thread). You should both get in touch if you start building such a tool. It might even become a GitHub action on its own.
Thanks for the detailed response. Regarding duplication and cleanup: as a practical, albeit not very efficient, solution, could we just purge the whole cache periodically?
Thanks a lot :)
Yep. I think it's OK though, since we're not optimizing for speed but against test flakiness due to failed downloads. However, in my testing I didn't see a big difference (might change over time, of course).
Not sure: in case the hub repo also hosts code, won't the Python file be compiled in the cache, resulting in a `.pyc` file?
I think that symlinks do work, since the cache directory is simply packed and unpacked with `tar`.
Agreed, great pointers. Thank you! I will look into a cache cleaning utility.
I'm not exactly sure from where
Oh ok. Didn't know that. I assumed speed was the goal here.
Even on Windows it will work, yes. It's simply less optimized, so it might redownload some stuff from time to time.
Would also be a simpler solution than having a complex script 😄
This script finds hub cache entries that have a last access time older than 30 days and deletes them from the cache. This way, if a model is not used by the CI anymore, it does not eat up disk space needlessly.
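For reference, a minimal sketch of what such a cleanup could look like, assuming huggingface_hub's `scan_cache_dir` / `delete_revisions` API and a 30-day threshold (the names here are illustrative, not necessarily the exact script in this PR):

```python
from datetime import datetime as dt

from huggingface_hub import scan_cache_dir

MAX_AGE_DAYS = 30
now = dt.now()

# Scan the local hub cache (${HF_HOME}/hub) and collect the commit hashes of all
# revisions belonging to repos that have not been accessed for more than 30 days.
scan_results = scan_cache_dir()
stale_hashes = [
    rev.commit_hash
    for repo in scan_results.repos
    if (now - dt.fromtimestamp(repo.last_accessed)).days > MAX_AGE_DAYS
    for rev in repo.revisions
]

if stale_hashes:
    # Build a delete strategy for the stale revisions and execute it.
    strategy = scan_results.delete_revisions(*stale_hashes)
    print(f"Freeing {strategy.expected_freed_size_str} from the hub cache")
    strategy.execute()
```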
It's not critical enough to break the CI over
I've attempted to write a script for cleaning the cache. Whether it is too complex I cannot judge alone, but I think it is fairly straightforward at least. I would be OK with removing the cache periodically as well, though. This might even be the case already, since we only have 10GB of cache space and the Python package cache is OS- and Python-version-dependent, so we're regularly being cleaned up by GitHub :) Let me know what you think.
From my POV, this looks good. We can try it and if it doesn't help with the goal of making CI more stable, we can revert it.
revisions = [(i.revisions, i.last_accessed) for i in scan_results.repos]
revisions_ages = [(rev, (now - dt.fromtimestamp(ts_access)).days) for rev, ts_access in revisions]
delete_candidates = [rev for rev, age in revisions_ages if age > max_age_days]
hashes = [n.commit_hash for rev in delete_candidates for n in rev]
Just wondering if it wouldn't be easier to write this as a for-loop instead of having 4 list comprehensions.
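For illustration, the suggested for-loop could look roughly like this (assuming the same surrounding variables `scan_results`, `now`, `max_age_days`, and `dt` as in the snippet above):

```python
# Equivalent of the four list comprehensions above, collapsed into one loop.
hashes = []
for repo in scan_results.repos:
    age_days = (now - dt.fromtimestamp(repo.last_accessed)).days
    if age_days > max_age_days:
        for revision in repo.revisions:
            hashes.append(revision.commit_hash)
```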
Noted for the future :)
Since I was mentioned, I just wanted to say that this implementation is better than the little script I had.
* update issue template
* Add checklist for bug report template
* Fix formatting in bug report template
* Update bug report template with additional instructions for code formatting and screenshots
* Update bug report template with code formatting instructions
* Update bug report template with code examples
* Update code block placeholder in bug report template
* Update .github/ISSUE_TEMPLATE/bug-report.yml

Co-authored-by: Kashif Rasul <[email protected]>
This PR introduces CI caching for datasets and hub artifacts across runner operating systems, with the intended goal of minimizing the number of failed test runs due to network faults. As an additional bonus it might make the CI a bit faster.

The following artifacts are cached:

`${HF_HOME}/hub/**`

Note that we're avoiding `.lock` files as well as `*.pyc` files. We're not simply caching `$HF_HOME`, since it also contains the `datasets` and `modules` directories: the former was acting up when testing (no details, just dropped; we may explore this later, but we're not using that many datasets) and the latter is just code, which is probably not a good idea to cache anyway.

There is a post-process for the cache action which uploads new data to the cache; only one runner can access the cache for uploading. This is done because GitHub Actions locks cache creation, so if there are concurrent cache creations, both may fail. This runner is currently set to ubuntu in the Python 3.10 run.

If this modification turns out to be ineffective, we can move to forbidding access to the hub in general (`HF_HUB_OFFLINE=1`) and updating the cache once per day, but let's first try out whether this is already enough to decrease the fail rate.
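For illustration only, this is roughly what that offline fallback would mean for test code; the repo id and filename below are placeholders, not something from this PR:

```python
import os

# HF_HUB_OFFLINE must be set before huggingface_hub is imported.
os.environ["HF_HUB_OFFLINE"] = "1"

from huggingface_hub import hf_hub_download

# With offline mode enabled, this resolves the file from the local cache and raises
# instead of hitting the network if the file is not already cached.
path = hf_hub_download(repo_id="some-org/some-model", filename="config.json")
print(path)
```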