[Errno 13] Permission denied: on .incomplete file #7536

@ryan-clancy

Describe the bug

When downloading a dataset, we frequently hit the Permission Denied error shown below. It appears to happen with datasets hosted on (at least) HF, S3, and GCS.

It looks like the temp_file passed to fsspec_get in get_from_cache (see the traceback below) can sometimes be created with 000 permissions, which leads to the permission denied error (the user running the code is still the owner of the file). Deleting that particular file and re-running the code with no changes usually succeeds.
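
A minimal sketch of that manual workaround, assuming the affected files all sit directly in the cache's downloads directory and end in `.incomplete` (as in the traceback below). The helper name `remove_unreadable_incomplete_files` is hypothetical and not part of `datasets`; it should only be run when no download is in progress, since it deletes partial files.

```python
import stat
from pathlib import Path


def remove_unreadable_incomplete_files(downloads_dir: str) -> list[str]:
    """Delete `*.incomplete` files whose permission bits are all zero.

    Hypothetical helper, not a `datasets` API. Returns the sorted names
    of the files that were removed.
    """
    removed = []
    for path in Path(downloads_dir).glob("*.incomplete"):
        # stat() only needs search permission on the directory, so it
        # works even when the file itself has mode 000.
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode == 0:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

After calling it on the downloads directory, re-running `load_dataset` re-creates the temp file from scratch.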

Is there a race condition between the file creation and a change to the umask, which is process-global?
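
To illustrate why a umask race could produce this, here is a small sketch. It assumes that some other thread temporarily sets the process umask to `0o666` (for example, to read the current value, since `os.umask` can only return the old mask by setting a new one). That assumption is not confirmed by the traceback. The function `create_file_under_umask` is hypothetical and reproduces the effect deterministically in a single thread by creating a file while the given mask is active.

```python
import os
import stat
import tempfile


def create_file_under_umask(directory: str, mask: int) -> int:
    """Create a temp file while `mask` is the process umask.

    Returns the permission bits of the created file. Illustrative only;
    this simulates what a file created inside another thread's
    "set umask, then restore" window would look like.
    """
    previous = os.umask(mask)  # sets the new mask, returns the old one
    try:
        # mkstemp requests mode 0o600; the kernel clears the umask bits.
        fd, path = tempfile.mkstemp(dir=directory, suffix=".incomplete")
        os.close(fd)
    finally:
        os.umask(previous)  # always restore the original mask
    return stat.S_IMODE(os.stat(path).st_mode)
```

With a mask of `0o666` the requested `0o600` is reduced to `0o000`, so a later `open(path, "wb")` by a non-root owner fails with `[Errno 13]`; with a typical mask such as `0o022` the file keeps `0o600`. Because the umask is shared by every thread in the process, any file created by a download thread during such a window would be affected.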

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.12/site-packages/datasets/load.py:2084: in load_dataset
    builder_instance.download_and_prepare(
.venv/lib/python3.12/site-packages/datasets/builder.py:925: in download_and_prepare
    self._download_and_prepare(
.venv/lib/python3.12/site-packages/datasets/builder.py:1649: in _download_and_prepare
    super()._download_and_prepare(
.venv/lib/python3.12/site-packages/datasets/builder.py:979: in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
.venv/lib/python3.12/site-packages/datasets/packaged_modules/folder_based_builder/folder_based_builder.py:120: in _split_generators
    downloaded_files = dl_manager.download(files)
.venv/lib/python3.12/site-packages/datasets/download/download_manager.py:159: in download
    downloaded_path_or_paths = map_nested(
.venv/lib/python3.12/site-packages/datasets/utils/py_utils.py:514: in map_nested
    _single_map_nested((function, obj, batched, batch_size, types, None, True, None))
.venv/lib/python3.12/site-packages/datasets/utils/py_utils.py:382: in _single_map_nested
    return [mapped_item for batch in iter_batched(data_struct, batch_size) for mapped_item in function(batch)]
.venv/lib/python3.12/site-packages/datasets/download/download_manager.py:206: in _download_batched
    return thread_map(
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
.venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py:51: in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
.venv/lib/python3.12/site-packages/tqdm/std.py:1181: in __iter__
    for obj in iterable:
../../../_tool/Python/3.12.10/x64/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
../../../_tool/Python/3.12.10/x64/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
../../../_tool/Python/3.12.10/x64/lib/python3.12/concurrent/futures/_base.py:449: in result
    return self.__get_result()
../../../_tool/Python/3.12.10/x64/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
../../../_tool/Python/3.12.10/x64/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
.venv/lib/python3.12/site-packages/datasets/download/download_manager.py:229: in _download_single
    out = cached_path(url_or_filename, download_config=download_config)
.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py:206: in cached_path
    output_path = get_from_cache(
.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py:412: in get_from_cache
    fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)
.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py:331: in fsspec_get
    fs.get_file(path, temp_file.name, callback=callback)
.venv/lib/python3.12/site-packages/fsspec/asyn.py:118: in wrapper
    return sync(self.loop, func, *args, **kwargs)
.venv/lib/python3.12/site-packages/fsspec/asyn.py:103: in sync
    raise return_result
.venv/lib/python3.12/site-packages/fsspec/asyn.py:56: in _runner
    result[0] = await coro
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <s3fs.core.S3FileSystem object at 0x7f27c18b2e70>
rpath = '<my-bucket>/<my-prefix>/img_1.jpg'
lpath = '/home/runner/_work/_temp/hf_cache/downloads/6c97983efa4e24e534557724655df8247a0bd04326cdfc4a95b638c11e78222d.incomplete'
callback = <datasets.utils.file_utils.TqdmCallback object at 0x7f27c00cdbe0>
version_id = None, kwargs = {}
_open_file = <function S3FileSystem._get_file.<locals>._open_file at 0x7f27628d1120>
body = <StreamingBody at 0x7f276344fa80 for ClientResponse at 0x7f27c015fce0>
content_length = 521923, failed_reads = 0, bytes_read = 0

    async def _get_file(
        self, rpath, lpath, callback=_DEFAULT_CALLBACK, version_id=None, **kwargs
    ):
        if os.path.isdir(lpath):
            return
        bucket, key, vers = self.split_path(rpath)
    
        async def _open_file(range: int):
            kw = self.req_kw.copy()
            if range:
                kw["Range"] = f"bytes={range}-"
            resp = await self._call_s3(
                "get_object",
                Bucket=bucket,
                Key=key,
                **version_id_kw(version_id or vers),
                **kw,
            )
            return resp["Body"], resp.get("ContentLength", None)
    
        body, content_length = await _open_file(range=0)
        callback.set_size(content_length)
    
        failed_reads = 0
        bytes_read = 0
    
        try:
>           with open(lpath, "wb") as f0:
E           PermissionError: [Errno 13] Permission denied: '/home/runner/_work/_temp/hf_cache/downloads/6c97983efa4e24e534557724655df8247a0bd04326cdfc4a95b638c11e78222d.incomplete'

.venv/lib/python3.12/site-packages/s3fs/core.py:1355: PermissionError

Steps to reproduce the bug

I believe this is a race condition and cannot reliably reproduce it, but it happens fairly frequently in our GitHub Actions tests and can also be reproduced (less frequently) on cloud VMs.

Expected behavior

The dataset loads properly with no permission denied error.

Environment info

  • datasets version: 3.5.0
  • Platform: Linux-5.10.0-34-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.12.10
  • huggingface_hub version: 0.30.2
  • PyArrow version: 19.0.1
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0
