-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Avoid global umask for setting file mode. #7547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@@ -411,11 +412,14 @@ def temp_file_manager(mode="w+b"): | |||
# GET file object | |||
fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm) | |||
|
|||
# get the permissions of the temp file | |||
temp_file_mode = stat.S_IMODE(os.stat(temp_file.name).st_mode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When looking more at the history of this code, I am not sure the umask block below is needed at all anymore and could just be deleted.
datasets/src/datasets/utils/file_utils.py
Lines 582 to 626 in f5f6c7e
with FileLock(lock_path): | |
if resume_download: | |
incomplete_path = cache_path + ".incomplete" | |
@contextmanager | |
def _resumable_file_manager(): | |
with open(incomplete_path, "a+b") as f: | |
yield f | |
temp_file_manager = _resumable_file_manager | |
if os.path.exists(incomplete_path): | |
resume_size = os.stat(incomplete_path).st_size | |
else: | |
resume_size = 0 | |
else: | |
temp_file_manager = partial(tempfile.NamedTemporaryFile, dir=cache_dir, delete=False) | |
resume_size = 0 | |
# Download to temporary file, then copy to cache dir once finished. | |
# Otherwise you get corrupt cache entries if the download gets interrupted. | |
with temp_file_manager() as temp_file: | |
logger.info(f"{url} not found in cache or force_download set to True, downloading to {temp_file.name}") | |
# GET file object | |
if scheme == "ftp": | |
ftp_get(url, temp_file) | |
elif scheme not in ("http", "https"): | |
fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc) | |
else: | |
http_get( | |
url, | |
temp_file, | |
proxies=proxies, | |
resume_size=resume_size, | |
headers=headers, | |
cookies=cookies, | |
max_retries=max_retries, | |
desc=download_desc, | |
) | |
logger.info(f"storing {url} in cache at {cache_path}") | |
shutil.move(temp_file.name, cache_path) | |
umask = os.umask(0o666) | |
os.umask(umask) | |
os.chmod(cache_path, 0o666 & ~umask) |
NamedTemporaryFile
was used to create the files. NamedTemporaryFile
always set 600 permissions and didn't respect umask so umask code was added to fix the permissions.
d536e37 removed the NamedTemporaryFile
and moved to open()
, which respects the umask.
We should just be able to remove this temp_file_mode
+ updated chmod lines and then remove:
umask = os.umask(0o666)
os.umask(umask)
os.chmod(cache_path, 0o666 & ~umask)
Thoughts?
This PR updates the method for setting the permissions on
cache_path
after callingshutil.move
. The call toshutil.move
may not preserve permissions if the source and destination are on different filesystems. Reading and resetting umask can cause race conditions, so directly read what permissions were set for thetemp_file
instead.This fixes #7536.