-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rmtree for nfs #2434
base: main
Are you sure you want to change the base?
Rmtree for nfs #2434
Conversation
Can one of the admins verify this patch? |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2434 +/- ##
==========================================
- Coverage 98.38% 98.16% -0.23%
==========================================
Files 85 85
Lines 3477 3495 +18
==========================================
+ Hits 3421 3431 +10
- Misses 56 64 +8 ☔ View full report in Codecov by Sentry. |
@zulissimeta: Happy to accept something along these lines. First, a quick question though: is it unavoidable that the .nfs files will be present? For instance, this comment suggests that I/O on the node could cause this (e.g. from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor things (in addition to my question above).
def rmtree(*args, max_retries=3, retry_wait=10, **kwargs): | ||
""" | ||
rmtree that will retry if we get common NFS errors (files still being deleted, etc) | ||
Adapted from https://github.com/teojgo/reframe/blob/master/reframe/utility/osext.py | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can have some docstrings and type hints here, that would be greatly appreciated. For instance, it's not immediately clear what **kwargs
is until you read the source code.
if i == max_retries: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we raise a specific exception and have a useful message?
else: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here with the exception to raise.
@@ -174,3 +177,25 @@ def terminate(tmpdir: Path | str, exception: Exception) -> None: | |||
symlink_path.symlink_to(job_failed_dir, target_is_directory=True) | |||
|
|||
raise JobFailure(job_failed_dir, message=msg, parent_error=exception) from exception | |||
|
|||
|
|||
def rmtree(*args, max_retries=3, retry_wait=10, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're good with a 10 s default? Not too long for you?
I haven't been able to reproduce it consistently. Is the logger writing local files? |
I think by default it writes to |
@zulissimeta: Out of curiosity, are you running ASE relaxations? I think there might be a problem with the trajectory not closing properly. I have vague recollections of this... |
Summary of Changes
Some filesystems like NFS don't delete files right away, especially if another node might be holding access to them. This manifests as files like
.nfs....
in the directory being deleted.rmtree
will complain after deleting files and trying to delete the folder.This PR adapts a utility from another open sources (BSD-3-clause) license repo that deals with this problem for clusters with NFS. Basically, if we hit one of the common errors (resource or device is busy), we wait a few seconds and try again.
I don't know how to reproduce this test without an NFS filesystem, but the existing tests should cover this working as expected for file deletion.
Requirements
main
).Note: If you are an external contributor, you will see a comment from @buildbot-princeton. This is solely for the maintainers.