
Conversation

@githubnemo (Collaborator) commented Jan 14, 2025

This PR introduces CI caching for datasets and Hub artifacts across runner operating systems, with the goal of minimizing the number of test runs that fail because of network faults. As an additional bonus, it might make the CI a bit faster.

The following artifacts are cached: ${HF_HOME}/hub/**

Note that we're excluding .lock files as well as *.pyc files. We're not simply caching all of $HF_HOME, since it also contains the datasets and modules directories: the former was acting up when testing (no details, just dropped; we may explore this later, but we're not using that many datasets), and the latter is just code, which is probably not a good idea to cache anyway.

There is a post-process step for the cache action which uploads new data to the cache; only one runner is allowed to upload. This is done because GitHub Actions locks cache creation, so if there are concurrent cache creations, both may fail. This runner is currently set to Ubuntu in the Python 3.10 run.

If this modification turns out to be ineffective, we can move to forbidding access to the Hub in general (HF_HUB_OFFLINE=1) and updating the cache once per day, but let's first see whether this is already enough to decrease the failure rate.
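
To make this concrete, here is a minimal sketch of the kind of cache step described above. It is not the exact workflow in this PR: the key scheme and the actions/cache version are assumptions.

```yaml
# Sketch only; the key scheme and action version are assumptions,
# not this PR's exact configuration.
- name: Cache HF hub artifacts
  uses: actions/cache@v4
  with:
    path: |
      ${{ env.HF_HOME }}/hub/**
      !${{ env.HF_HOME }}/**/*.lock
      !${{ env.HF_HOME }}/**/*.pyc
    key: hf-hub-cache-${{ runner.os }}
```

To get the "only one runner uploads" behavior, the restore and save halves can be split into `actions/cache/restore` (run on every job) and `actions/cache/save` (run only in the designated Ubuntu / Python 3.10 job).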

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

nemo added 9 commits January 14, 2025 14:26

  • Especially avoid lock files, code and pyc files
  • Remove `-` in front of lookup key to avoid an accidental numerical sort (just a precaution)
  • This makes sure that we're always saving the newest models into the cache even if we had a cache hit (we might still have updated models)
@githubnemo marked this pull request as ready for review January 15, 2025 17:32
@githubnemo changed the title from "Draft: Attempt at adding a cache for models" to "Attempt at adding a cache for models" Jan 15, 2025
@githubnemo (Collaborator, Author)

The failing zizmor run is fixed here: #2331

@githubnemo (Collaborator, Author)

@Wauplin do you think this is an approach that will work or did I overlook something?

@Wauplin (Contributor) commented Jan 16, 2025

The approach looks reasonable to me, though I never really tried it myself. A few things that come to mind though:

  • it might take longer to load the cache on the GitHub runner than downloading weights from the Hub. So as always with this kind of stuff, better to test
  • I'm not sure why !${{ env.HF_HOME }}/**/*.pyc is needed since only the /hub folder is cached (which contains models/datasets). Doesn't hurt to have it but wonder if it's necessary
  • I don't know how GitHub cache works with symlinks. I suspect that it might not. Since we are heavily relying on symlinks in the cache, it might lead to suboptimal cases:
    • typically we have in cache model A on revision 123. If the model card gets updated on the Hub, the revision is updated to 456, but the model weights did not change. On a machine with symlinks, we would not need to redownload the weights. But if symlinking doesn't work, then we have a degraded cache experience (typically on Windows without dev mode 👀 ). So in this case with GitHub cache we might 1. load the weights from the GH cache and then 2. load the weights from HF.
      • even worse, next time we load the cache, the weights file will be duplicated between revision 123 and revision 456 even though revision 123 will never be used again. As a workaround you could hardcode revisions in the tests.
  • no matter the cache size, I think there should be a cleaning process at the end of the CI. If a file is in the cache and has not been accessed once during the tests, then it should be removed so that it's not reloaded at the next CI run. This might be tricky to implement; check out https://huggingface.co/docs/huggingface_hub/guides/manage-cache#scan-cache-from-python and https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-cache-from-python on how to scan/clean the cache (see the sketch after this list).
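
For illustration, a rough sketch of that scan-and-clean idea using the huggingface_hub APIs from the guides above (the 30-day threshold is arbitrary, and the eventual script may look different):

```python
from datetime import datetime, timezone

from huggingface_hub import scan_cache_dir

MAX_AGE_DAYS = 30  # arbitrary threshold for this sketch

cache_info = scan_cache_dir()  # scans the hub cache under HF_HOME
now = datetime.now(timezone.utc).timestamp()

# Collect the commit hashes of every revision belonging to a repo
# that has not been accessed within the threshold.
stale_hashes = []
for repo in cache_info.repos:
    age_days = (now - repo.last_accessed) / 86400
    if age_days > MAX_AGE_DAYS:
        stale_hashes.extend(rev.commit_hash for rev in repo.revisions)

# delete_revisions only builds a deletion plan; nothing is removed
# until execute() is called.
strategy = cache_info.delete_revisions(*stale_hashes)
print(f"Freeing {strategy.expected_freed_size_str}")
strategy.execute()
```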

Note: I know @ivarflakstad is working on something similar (see private Slack thread). You should both get in touch if you start building such a tool. It might even become a GitHub Action of its own.

@BenjaminBossan (Member)

Thanks for the detailed response. Regarding duplication and clean up: As a practical albeit not very efficient solution, could we just purge the whole cache periodically?

@githubnemo (Collaborator, Author)

Thanks a lot :)

  • it might take longer to load the cache on the GitHub runner than downloading weights from the Hub. So as always with this kind of stuff, better to test

Yep. I think it's OK though, since we're not optimizing for speed but against test flakiness due to failed downloads. However, in my testing I didn't see a big difference (this might change over time, of course).

  • I'm not sure why !${{ env.HF_HOME }}/**/*.pyc is needed since only the /hub folder is cached (which contains models/datasets). Doesn't hurt to have it but wonder if it's necessary

Not sure; in case the Hub repo also hosts code, won't the Python file be compiled in the cache, resulting in a .pyc file?

  • I don't know how GitHub cache works with symlinks. I suspect that it might not. Since we are heavily relying on symlinks in the cache, it might lead to suboptimal cases:

    • typically we have in cache model A on revision 123. If the model card gets updated on the Hub, the revision is updated to 456, but the model weights did not change. On a machine with symlinks, we would not need to redownload the weights. But if symlinking doesn't work, then we have a degraded cache experience (typically on Windows without dev mode 👀 ). So in this case with GitHub cache we might 1. load the weights from the GH cache and then 2. load the weights from HF.

    • even worse, next time we load the cache, the weights file will be duplicated between revision 123 and revision 456 even though revision 123 will never be used again. As a workaround you could hardcode revisions in the tests.

I think that symlinks do work, since the cache directory is simply packed and unpacked with tar. I don't know how it deals with symlinks under Windows, though; that might be a problem. But then again, it would happen anyway if I understood you correctly, right?

Agreed, great pointers. Thank you! I will look into a cache cleaning utility.

@Wauplin (Contributor) commented Jan 20, 2025

Not sure; in case the Hub repo also hosts code, won't the Python file be compiled in the cache, resulting in a .pyc file?

I'm not exactly sure from where transformers imports the modules hosted on the Hub, TBH. I thought they were copied to a temporary directory (after being downloaded), but honestly this is not a part that I know much about.

we're not optimizing for speed but against test flakiness due to failed downloads.

Oh ok. Didn't know that. I assumed speed was the goal here.

I don't know how it deals with symlinks under Windows, though; that might be a problem. But then again, it would happen anyway if I understood you correctly, right?

Even on Windows it will work, yes. It's simply less optimized, so it might redownload some stuff from time to time.

Regarding duplication and clean up: As a practical albeit not very efficient solution, could we just purge the whole cache periodically?

That would also be a simpler solution than having a complex script 😄

nemo added 5 commits January 20, 2025 15:12

  • This script finds hub cache entries with a last access time older than 30 days and deletes them from the cache. This way, if a model is no longer used by the CI, it does not needlessly eat up disk space.
  • It's not critical to break the CI
@githubnemo (Collaborator, Author)

I've attempted to write a script for cleaning the cache. Whether it is too complex I cannot judge alone, but I think it is fairly straightforward at least. I would also be OK with removing the whole cache periodically, though. This might even be happening already, since we only have 10GB of cache space and the Python package cache is OS- and Python-version-dependent, so we're regularly being cleaned up by GitHub :)

Let me know what you think.

@BenjaminBossan (Member) left a comment

From my POV, this looks good. We can try it and if it doesn't help with the goal of making CI more stable, we can revert it.

Comment on lines +24 to +27
```python
revisions = [(i.revisions, i.last_accessed) for i in scan_results.repos]
revisions_ages = [(rev, (now - dt.fromtimestamp(ts_access)).days) for rev, ts_access in revisions]
delete_candidates = [rev for rev, age in revisions_ages if age > max_age_days]
hashes = [n.commit_hash for rev in delete_candidates for n in rev]
```
@BenjaminBossan (Member)

Just wondering if it wouldn't be easier to write this as a for-loop instead of having 4 list comprehensions.
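
Something along these lines, perhaps (a sketch reusing the snippet's names):

```python
# Equivalent for-loop form of the four comprehensions above.
hashes = []
for repo in scan_results.repos:
    age_days = (now - dt.fromtimestamp(repo.last_accessed)).days
    if age_days > max_age_days:
        hashes.extend(rev.commit_hash for rev in repo.revisions)
```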

@githubnemo (Collaborator, Author)

Noted for the future :)

@ivarflakstad (Member)

Since I was mentioned I just wanted to say that this implementation is better than the little script I had ☺️
Carry on hehe

@githubnemo merged commit 9c11a3e into huggingface:main Jan 23, 2025
15 checks passed