more efficient caching #848
@satra So you're saying that if a file about to be digested has the same mtime and size as any previously digested file, it should be assumed to be the same file and the previous digest returned? That seems liable to frequently give wrong results.
@jwodder - mtime + size by itself is insufficient as a checksum. however, in the specific setting of a dandiset, it can be used as a proxy. in the cache i would maintain the path and allow the option of replacing the path with a new path, so if someone does a rename, this should still work. for example, say ...
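A minimal sketch of that idea, under the assumption of a hand-rolled cache (the class name `DigestCache` and its `move` method are hypothetical, not the fscacher or dandi-cli API): entries are keyed by path but validated by (size, mtime), and a rename can be recorded explicitly so the previously computed digest follows the file instead of being recomputed.

```python
import os
from pathlib import Path


class DigestCache:
    """Hypothetical digest cache keyed by path and validated by (size, mtime).

    This is only a sketch of the proposal above, not an existing API: a rename
    or move is recorded explicitly so the cached digest is reused.
    """

    def __init__(self):
        # resolved path -> (size, mtime_ns, digest)
        self._entries = {}

    def get(self, path):
        st = os.stat(path)
        entry = self._entries.get(str(Path(path).resolve()))
        if entry is not None and entry[:2] == (st.st_size, st.st_mtime_ns):
            # size and mtime unchanged -> assume same content, reuse digest
            return entry[2]
        return None

    def set(self, path, digest):
        st = os.stat(path)
        self._entries[str(Path(path).resolve())] = (st.st_size, st.st_mtime_ns, digest)

    def move(self, old_path, new_path):
        # record a rename/move so the previously computed digest follows the file
        entry = self._entries.pop(str(Path(old_path).resolve()), None)
        if entry is not None:
            self._entries[str(Path(new_path).resolve())] = entry
```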
I really would like to avoid mtime+size alone as some kind of a "proxy" measure. IMHO it is more important to have it correct (even if slow) than likely to be correct (but faster). In principle, a complementary/alternative approach is for people to use git-annex (and/or datalad) to version their data (unless on Windows or a filesystem without symlink support); then fscacher, AFAIK (since we have the test https://github.com/con/fscacher/blob/HEAD/src/fscacher/tests/test_cache.py#L146 and con/fscacher#44 open), would resolve the symlink to the actual file with the content, thus avoiding redigesting. If it still redigests -- we should have that fixed, I guess.
also if we are talking about ...
slow is what we are trying to avoid, so we need the engineering to avoid it.
indeed, i am not a fan of digests being invalidated by those libraries' versions. the intent of a digest is that it is not dependent on a library version. git-annex / datalad can be an option when it becomes something most neuroscientists can do. at present we will have a lot of microscopy data coming in, and most of the awardees have indicated that the terminal is not something they use daily. we have to adjust our efforts towards ease of use.
well, if a library "fixes" the digest algorithm, the cache should be invalidated. But indeed, since this functionality is already well tested, @jwodder -- maybe instead of library versions, just add some explicit token value (e.g.
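A hedged sketch of what such an explicit token could look like (the constant name and key layout below are illustrative only; whether and how fscacher exposes a token for this should be checked against its actual API): the token is mixed into the cache key, so bumping it invalidates all previously cached digests without tying invalidation to library versions.

```python
import hashlib
import json

# Hypothetical invalidation token: bump this only when previously cached
# digests become wrong (e.g. the digest algorithm itself is fixed), not on
# every library upgrade.
DIGEST_CACHE_TOKEN = "dandi-etag-v1"


def cache_key(path, size, mtime_ns):
    # Mix the token into the key so changing it orphans all old entries.
    payload = json.dumps(
        {
            "token": DIGEST_CACHE_TOKEN,
            "path": str(path),
            "size": size,
            "mtime_ns": mtime_ns,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```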
well, caching already allows avoiding it for many scenarios. Besides explicit ...

As for support for some ...

Re-implementing support for "more efficient caching" straight in dandi-cli could, I guess, be done, but at large it would probably end up being just an "overfit fork" of the fscacher and/or joblib (which fscacher relies upon) implementation... it might need to be done, but I think we should first research more into possible solutions/approaches.
FWIW: I checked the code. Apparently we do not add any token ATM, so no upgrades should invalidate that cache (maybe only if joblib does some invalidation upon its own upgrade; didn't check): https://github.com/dandi/dandi-cli/blob/master/dandi/support/digests.py#L76 . So nothing to do in this regard.
the two imaging dandisets are large and will continuously run into caching efficiency issues. giacomo’s is only about 5TB, but lee’s is around 120TB and growing. any kind of bids-related rewrite could thus involve significant checksum computation overhead that could take weeks. i would say it’s time to consider the efficiency of local checksum computation for both zarr versions and large files. the overall problem is to ensure that a local directory can be checksummed efficiently.
one easy way is to maintain a table of mtime+size alongside the dandi-etag in the cache. a rename or a move of a file doesn’t change these values, and a file can even be copied across filesystems with both of them preserved. thus a table that is simply an LRU-type cache would allow for local moves instead of tying the digest to a path name.
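A minimal sketch of that table, assuming an in-memory LRU keyed by (size, mtime) and storing the dandi-etag (the class name `EtagLRUTable` and the default size are hypothetical; a persistent backend such as SQLite would likely be needed in practice):

```python
from collections import OrderedDict


class EtagLRUTable:
    """Hypothetical LRU table mapping (size, mtime_ns) -> dandi-etag.

    Sketch of the proposal above: because a rename, a move, or a
    metadata-preserving copy across filesystems keeps size and mtime, the
    previously computed etag can be reused without tying the entry to a path.
    """

    def __init__(self, maxsize=1_000_000):
        self.maxsize = maxsize
        self._table = OrderedDict()  # (size, mtime_ns) -> etag

    def get(self, size, mtime_ns):
        key = (size, mtime_ns)
        etag = self._table.get(key)
        if etag is not None:
            self._table.move_to_end(key)  # mark as recently used
        return etag

    def put(self, size, mtime_ns, etag):
        key = (size, mtime_ns)
        self._table[key] = etag
        self._table.move_to_end(key)
        while len(self._table) > self.maxsize:
            self._table.popitem(last=False)  # evict least recently used
```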