more efficient caching #848
@satra So you're saying that if a file about to be digested has the same mtime and size as any previously digested file, it should be assumed to be the same file and the previous digest returned? That seems liable to frequently give wrong results.
@jwodder - mtime + size by itself is insufficient as a checksum. however, in the specific setting of a dandiset, it can be used as a proxy. in the cache i would maintain the path and allow the option of replacing the path with a new path, so if someone does a rename, this should still work. for example, say ...
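A minimal sketch of that idea, under the assumption of a hand-rolled cache (the class name `DigestCache` and its `move` method are hypothetical, not the fscacher or dandi-cli API): entries are keyed by path but validated by (size, mtime), and a rename can be recorded explicitly so the previously computed digest follows the file instead of being recomputed.

```python
import os
from pathlib import Path


class DigestCache:
    """Hypothetical digest cache keyed by path and validated by (size, mtime).

    This is only a sketch of the proposal above, not an existing API: a rename
    or move is recorded explicitly so the cached digest is reused.
    """

    def __init__(self):
        # resolved path -> (size, mtime_ns, digest)
        self._entries = {}

    def get(self, path):
        st = os.stat(path)
        entry = self._entries.get(str(Path(path).resolve()))
        if entry is not None and entry[:2] == (st.st_size, st.st_mtime_ns):
            # size and mtime unchanged -> assume same content, reuse digest
            return entry[2]
        return None

    def set(self, path, digest):
        st = os.stat(path)
        self._entries[str(Path(path).resolve())] = (st.st_size, st.st_mtime_ns, digest)

    def move(self, old_path, new_path):
        # record a rename/move so the previously computed digest follows the file
        entry = self._entries.pop(str(Path(old_path).resolve()), None)
        if entry is not None:
            self._entries[str(Path(new_path).resolve())] = entry
```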
I really would like to avoid mtime+size alone as some kind of a "proxy" measure. IMHO it is more important to have it correct (even if slow) than likely to be correct (but faster). In principle, a complementary/alternative approach is for people to use git-annex (and/or datalad) to version their data (unless on Windows or a filesystem without symlink support); then fscacher, AFAIK (since we have the test https://github.com/con/fscacher/blob/HEAD/src/fscacher/tests/test_cache.py#L146 and con/fscacher#44 open), would resolve the symlink to the actual file with the content, thus avoiding redigesting. If it still redigests -- we should have that fixed, I guess.
also if we are talking about ...
slow is what we are trying to avoid, so we need the engineering to avoid it.
indeed, i am not a fan of digests being invalidated by those libraries' versions. the intent of a digest is that it is not dependent on a library version. git-annex / datalad can be an option when it becomes something most neuroscientists can do. at present we will have a lot of microscopy data coming in, and most of the awardees have indicated that the terminal is not something they use daily. we have to adjust our efforts towards ease of use.
well, if a library "fixes" the digest algorithm, the cache should be invalidated. But indeed, since this functionality is already well tested, @jwodder -- maybe instead of library versions, just add some explicit token value (e.g.
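A hedged sketch of what such an explicit token could look like (the constant name and key layout below are illustrative only; whether and how fscacher exposes a token for this should be checked against its actual API): the token is mixed into the cache key, so bumping it invalidates all previously cached digests without tying invalidation to library versions.

```python
import hashlib
import json

# Hypothetical invalidation token: bump this only when previously cached
# digests become wrong (e.g. the digest algorithm itself is fixed), not on
# every library upgrade.
DIGEST_CACHE_TOKEN = "dandi-etag-v1"


def cache_key(path, size, mtime_ns):
    # Mix the token into the key so changing it orphans all old entries.
    payload = json.dumps(
        {
            "token": DIGEST_CACHE_TOKEN,
            "path": str(path),
            "size": size,
            "mtime_ns": mtime_ns,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```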
well, caching already allows avoiding it for many scenarios. Besides explicit ...

As for support for some ...

Re-implementing support for "more efficient caching" straight in dandi-cli could, I guess, be done, but at large it would probably end up being just an "overfit fork" of the fscacher and/or joblib (which fscacher relies upon) implementation... it might need to be done, but I think we should first research more into possible solutions/approaches.
FWIW: I checked the code. Apparently we do not add any token ATM, so no upgrades should invalidate that cache (maybe only if joblib does some invalidation upon its own upgrade; didn't check): https://github.com/dandi/dandi-cli/blob/master/dandi/support/digests.py#L76 . So nothing to do in this regard.
the two imaging dandisets are large and will continuously run into caching efficiency issues. giacomo’s is only about 5TB, but lee’s is around 120TB and growing. any kind of bids-related rewrite could thus involve significant checksum computation overhead that could take weeks. i would say it’s time to consider the efficiency of local checksum computation for both zarr versions and large files. the overall problem is to ensure that a local directory can be checksummed efficiently.
one easy way is to maintain a table of mtime+size alongside the dandi-etag in the cache. a rename or a move of a file doesn’t change these values, and a file can even be copied across filesystems with both of them preserved. thus a table that is simply an LRU-type cache would allow for local moves instead of tying the digest to a path name.
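A minimal sketch of that table, assuming an in-memory LRU keyed by (size, mtime) and storing the dandi-etag (the class name `EtagLRUTable` and the default size are hypothetical; a persistent backend such as SQLite would likely be needed in practice):

```python
from collections import OrderedDict


class EtagLRUTable:
    """Hypothetical LRU table mapping (size, mtime_ns) -> dandi-etag.

    Sketch of the proposal above: because a rename, a move, or a
    metadata-preserving copy across filesystems keeps size and mtime, the
    previously computed etag can be reused without tying the entry to a path.
    """

    def __init__(self, maxsize=1_000_000):
        self.maxsize = maxsize
        self._table = OrderedDict()  # (size, mtime_ns) -> etag

    def get(self, size, mtime_ns):
        key = (size, mtime_ns)
        etag = self._table.get(key)
        if etag is not None:
            self._table.move_to_end(key)  # mark as recently used
        return etag

    def put(self, size, mtime_ns, etag):
        key = (size, mtime_ns)
        self._table[key] = etag
        self._table.move_to_end(key)
        while len(self._table) > self.maxsize:
            self._table.popitem(last=False)  # evict least recently used
```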