
Wrap cache-file-management operations for CachingFileSystem in thread locks #1127

Draft · wants to merge 2 commits into base: master

Conversation

@jwodder (Contributor) commented Nov 29, 2022

Tries to address #1126.

```python
        existcond = all(os.path.exists(storage) for storage in self.storage)
        if timecond or not existcond:
            self.load_cache()
        with self.lock:
```
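The hunk under discussion can be read in context with a minimal sketch of the pattern the PR applies: a shared re-entrant lock around cache-index management, with read-only checks left unlocked. Class and attribute layout here are illustrative, not fsspec's actual implementation:

```python
import os
import threading

class CachingFileSystemSketch:
    """Illustrative skeleton (not the real fsspec class) of the locking
    pattern the PR applies to cache-file management."""

    def __init__(self, storage):
        self.storage = storage          # list of cache directories
        self.cached_files = [{}]
        # RLock, so a locked method may safely call another locked method
        self.lock = threading.RLock()

    def load_cache(self):
        # Re-reading the on-disk index mutates shared state, so it takes
        # the lock itself rather than relying on every caller to do so
        with self.lock:
            self.cached_files = [{} for _ in self.storage]

    def check_cache(self, timecond=False):
        # The existence check is read-only and safe to run unlocked
        existcond = all(os.path.exists(s) for s in self.storage)
        if timecond or not existcond:
            self.load_cache()
```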
Review comment (Contributor):
self.load_cache already takes a lock, and I think all the other points are safe for parallel execution (nothing binds to self, and the file-presence check is fine without a lock), so why lock here?

Overall -- please analyze the code and see if maybe we could avoid as much locking as possible, and maybe even avoid using a re-entrant lock (RLock), by clearly identifying the "problematic" code blocks

Review reply from @jwodder (Contributor Author):

@yarikoptic I've removed this lock. I don't think any more can be removed, and RLock is needed due to things like clear_cache().
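The reentrancy point can be demonstrated in isolation: when a method holding the lock calls another method that also acquires it, a plain `Lock` deadlocks while an `RLock` lets the same thread re-acquire. The function bodies below are invented purely for illustration:

```python
import threading

lock = threading.RLock()  # swap in threading.Lock() and clear_cache() hangs

def load_cache():
    with lock:               # inner acquisition by the same thread
        return {}

def clear_cache():
    with lock:               # outer acquisition
        return load_cache()  # re-enters the lock: fine with RLock only
```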

@yarikoptic (Contributor)

is this PR enough to revert datalad/datalad-fuse#83 and get the dandi healthcheck to work without erroring out? if so -- do we see any more sensible performance from healthcheck?

@yarikoptic (Contributor)

> is this PR enough to revert datalad/datalad-fuse#83 and get the dandi healthcheck to work without erroring out? if so -- do we see any more sensible performance from healthcheck?

if this is the PR currently being tested on drogon, it seems it did improve the situation over locking at the datalad-fuse level. I see more of the MATLAB processes getting at least somewhat busy, total load is about 20, not 60, and the datalad fusefs process is quite busy. py-spy top'ing it, though, shows

```text
  %Own   %Total  OwnTime  TotalTime  Function (filename)
 61.00%  75.00%    2.79s     3.24s   save_cache (fsspec/implementations/cached.py)
  9.00%   9.00%   0.290s    0.290s   read (ssl.py)
 14.00%  14.00%   0.250s    0.250s   exists (genericpath.py)
  4.00%   6.00%   0.210s    0.250s   atomic_write (fsspec/implementations/cached.py)
  7.00%   7.00%   0.200s    0.210s   _execute_child (subprocess.py)
  1.00%  13.00%   0.170s    0.680s   _read_ready__data_received (asyncio/selector_events.py)
  1.00%   1.00%   0.090s    0.090s   write (asyncio/selector_events.py)
```

and getting 100 samples of any /cache file being accessed:

```console
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ for s in `seq 1 100`; do ls -l /proc/3986540/fd/ 2>&1 | grep 'fsspec/cache'; done
lr-x------ 1 dandi dandi 64 Dec  5 11:44 29 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lrwx------ 1 dandi dandi 64 Dec  5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lrwx------ 1 dandi dandi 64 Dec  5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lrwx------ 1 dandi dandi 64 Dec  5 11:44 73 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-htwu0fk9
lr-x------ 1 dandi dandi 64 Dec  5 11:44 59 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 15 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lrwx------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache-h3r32ds3
lr-x------ 1 dandi dandi 64 Dec  5 11:44 6 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
lr-x------ 1 dandi dandi 64 Dec  5 11:44 14 -> /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ ls -ld /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
-rw------- 1 dandi dandi 92584 Dec  5 11:44 /mnt/backup/dandi/dandisets-healthstatus/dandisets/000003/.git/datalad/cache/fsspec/cache
```

This shows that we are spending lots of time writing/re-reading the cache file. Could that maybe be avoided?

@martindurant (Member)

The continuous writing of the cache file is designed for situations where you might have simultaneous access to the cache from multiple processes. If you only ever have the one FUSE process using the cache, I don't suppose there's any good reason to write the cache until program end (if you can guarantee not to crash).

@yarikoptic (Contributor)

Thanks @martindurant, so it feels like ideally it could be an option like cache_dump : ["always", "exit", "periodic"], where "always" is the current implementation, "exit" would dump it only atexit (/ destructor call?), and "periodic" would have some thread look after the cache and, if it was modified since the last save, save it (but then we would need some extra setting for how often :-/ ). Maybe we could start with getting "always" and "exit" implemented, @jwodder?
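A rough sketch of what the "always"/"exit" knob could look like. The `cache_dump` keyword and the `CacheSaver` class are hypothetical illustrations of the proposal, not existing fsspec API:

```python
import atexit
import json
import os
import tempfile
import threading

class CacheSaver:
    """Hypothetical sketch of "always" vs "exit" cache-dump strategies."""

    def __init__(self, path, cache_dump="always"):
        self.path = path
        self.cache_dump = cache_dump
        self.cached_files = {}
        self.lock = threading.RLock()
        if cache_dump == "exit":
            # defer the (expensive) dump until interpreter shutdown
            atexit.register(self.save_cache)

    def update(self, key, value):
        with self.lock:
            self.cached_files[key] = value
            if self.cache_dump == "always":
                self.save_cache()  # current behaviour: write on every change

    def save_cache(self):
        with self.lock:
            # atomic write: dump to a temp file, then rename into place
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path))
            with os.fdopen(fd, "w") as f:
                json.dump(self.cached_files, f)
            os.replace(tmp, self.path)
```

A "periodic" strategy would add a background timer thread calling `save_cache` when a dirty flag is set, at the cost of the extra interval setting mentioned above.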

@martindurant (Member)

Exactly. Again, it would be nice to decompose the caching logic into

  • what is stored
  • where it is stored
  • how/often it is updated
  • how target files map to local files
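One possible shape for that decomposition, sketched as structural interfaces. All names here are invented for illustration, not a proposed fsspec API:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class CacheStorage(Protocol):
    """Where the cache index is stored (local file, remote store, DB, ...)."""
    def load(self) -> dict: ...
    def save(self, state: dict) -> None: ...

@runtime_checkable
class UpdatePolicy(Protocol):
    """How/when the stored index is flushed ("always", "exit", "periodic")."""
    def on_change(self, storage: CacheStorage, state: dict) -> None: ...

@runtime_checkable
class PathMapper(Protocol):
    """How target files map to local (or near-remote) cached files."""
    def local_name(self, remote_path: str) -> str: ...
```

With such seams in place, alternative backends (a DiskCache-backed `CacheStorage`, say) could be swapped in without touching the filesystem class itself.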

@yarikoptic (Contributor)

@martindurant that sounds like a good amount of work to overhaul the entire caching layer. Would you consider a PR which would first just augment the life cycle of the cache with "always" and "exit" strategies for when it is dumped?

@martindurant (Member)

I can understand hesitating to do the whole job, and the augmentation may be all right. What I worry about is extra keyword arguments that we will find hard to go back on later.

@yarikoptic (Contributor)

@martindurant before any other major refactoring -- what do you think about this particular PR? IMHO it is "the best we could do" ATM to ensure thread-safety; it seems to work for us in that at least we do not crash (right, @jwodder?) and should not have any noticeable penalty for users (whose run-time performance would be bottlenecked primarily by all those cache reads/writes anyway)

@martindurant (Member)

I think it's OK... but I still share the worry that the locking is too aggressive. Are the list operations really problematic, or just the file ops (which is what the original issue was about)?
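The narrower alternative being asked about can be sketched as locking only the file operation, on the assumption that the in-memory bookkeeping is already safe. The `evict` helper and its index layout are invented for illustration:

```python
import os
import threading

lock = threading.Lock()

def evict(cache_index, fn):
    """Drop fn from the in-memory index and delete its local copy.
    Assumes the index mutation is already safe (a single dict op under
    CPython's GIL); only the on-disk removal can race with other file
    operations, so only it sits inside the critical section."""
    local_path = cache_index.pop(fn, None)
    if local_path is None:
        return False
    with lock:  # narrow critical section: just the file op
        if os.path.exists(local_path):
            os.remove(local_path)
    return True
```

Whether the dict/list operations really are thread-confined in `CachingFileSystem` is exactly the question raised above; if they are not, the broader lock in this PR is the safer default.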

@amotl (Contributor) commented Dec 25, 2022

Hi. I have not looked too deeply into the details of the struggles here, but I wanted to take the chance to humbly remind you about GH-895 by @gutzbenj. The patch probably needs a refresh, but the core idea was to start using DiskCache, as it promises to be a rock-solid (thread-safe and process-safe) solution to disk-based caching, without needing to run any kind of cache or database server. It is also tested on Linux, macOS, and Windows; we think this package is one of a kind.
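For context on what DiskCache provides: it builds on SQLite, whose file locking serializes writers across both threads and processes. A stdlib-only sketch of that same idea (this is not DiskCache's actual API, just an illustration of the mechanism):

```python
import sqlite3

class SQLiteCache:
    """Minimal sqlite3-backed key/value store illustrating the kind of
    thread- and process-safety a SQLite-based cache gets for free."""

    def __init__(self, path):
        self.path = path
        self._execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)"
        )

    def _execute(self, sql, params=()):
        # A fresh connection per operation keeps this usable from any
        # thread or process; SQLite handles the cross-process locking.
        conn = sqlite3.connect(self.path)
        try:
            with conn:  # commit (or roll back) the transaction
                return conn.execute(sql, params).fetchall()
        finally:
            conn.close()

    def set(self, key, value):
        self._execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

    def get(self, key, default=None):
        rows = self._execute("SELECT v FROM kv WHERE k = ?", (key,))
        return rows[0][0] if rows else default
```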

@martindurant (Member)

You may be right, but DiskCache should only be optional, as fsspec is very careful not to require any dependencies.

Also, you should be aware that we wish to be able to cache files to remote stores too! It may seem counterintuitive, but some remotes are faster than others, and egress fees often mean that copying into a near remote once is worthwhile if there are to be many reads.

@amotl (Contributor) commented Dec 31, 2022

@martindurant: Thank you for sharing those important details we have not been aware of. Happy new year everyone!

@martindurant (Member)

So where shall we go from here? A full rewrite of the cache classes seems to be beyond anyone's capacity right now :(

@yarikoptic (Contributor)

note that diskcache might be unfit for cases where the location is on NFS. grantjenks/python-diskcache#198 was "resolved", but it is not clear to me whether there is an NFS-safe way to use diskcache.

I hope that @jwodder could have a look and maybe see how to refactor/structure the interface to encapsulate caching so that alternative implementations could be provided.
