snapshot_download ignores 5xx errors during resume, returns unsafe cached data #3007

Open
NearBirdEZ opened this issue Apr 15, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@NearBirdEZ
NearBirdEZ commented Apr 15, 2025

Describe the bug

Description
When resuming a download, snapshot_download catches the errors
[requests.exceptions.ConnectionError, requests.exceptions.Timeout, huggingface_hub.errors.OfflineModeIsEnabled, requests.HTTPError]
(e.g., during Hub downtime) and falls back to locally cached files without raising. This happens because the code prioritizes cache availability over server reachability, so a partial cache is returned as if it were complete.

Reproduction

import os
import random
from typing import Never, Type
from unittest.mock import patch

import huggingface_hub
import requests


class BadHfApi(huggingface_hub.HfApi):
    """HfApi stub whose repo_info always fails with one of the errors
    that snapshot_download catches before falling back to the cache."""

    def repo_info(self, *args, **kwargs) -> Never:
        expected_errors: list[Type[Exception]] = [
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
            huggingface_hub.errors.OfflineModeIsEnabled,
            requests.HTTPError,
        ]
        error_class: Type[Exception] = random.choice(expected_errors)
        print(f"Will raise {error_class=}")
        raise error_class("test timeout from hf")


def main() -> None:
    print("Continuing the download")
    repo_id: str = "deepcogito/cogito-v1-preview-qwen-32B"
    revision: str = "118e6e43c46a51667f544dae9cbe4027f59eb697"
    local_path: str = f"models/deepcogito/cogito_v1_preview_qwen_32b/{revision}"
    print(f"Before download: {os.listdir(local_path)=}")
    with patch("huggingface_hub._snapshot_download.HfApi", new=BadHfApi):
        result: str = huggingface_hub.snapshot_download(
            repo_id=repo_id,
            local_dir=local_path,
            revision=revision
        )
    print(f"{result=}")
    print(f"After download attempt: {os.listdir(local_path)=}")



if __name__ == "__main__":
    main()

Logs

Continuing the download
Before download: os.listdir(local_path)=['added_tokens.json', 'model-00001-of-00014.safetensors', 'images', 'config.json', 'README.md', 'merges.txt', '.gitattributes', '.cache']
Will raise error_class=<class 'requests.exceptions.Timeout'>
result='/Users/smith/work/hf_bug/models/deepcogito/cogito_v1_preview_qwen_32b/118e6e43c46a51667f544dae9cbe4027f59eb697'
After download attempt: os.listdir(local_path)=['added_tokens.json', 'model-00001-of-00014.safetensors', 'images', 'config.json', 'README.md', 'merges.txt', '.gitattributes', '.cache']
Returning existing local_dir `models/deepcogito/cogito_v1_preview_qwen_32b/118e6e43c46a51667f544dae9cbe4027f59eb697` as remote repo cannot be accessed in `snapshot_download` (test timeout from hf).

Process finished with exit code 0

System info

- huggingface_hub version: 0.30.2
- Platform: macOS-13.6.3-arm64-arm-64bit
- Python version: 3.12.0
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /Users/smith/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: osxkeychain
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: N/A
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: N/A
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: N/A
- pydantic: N/A
- aiohttp: N/A
- hf_xet: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /Users/smith/.cache/huggingface/hub
- HF_ASSETS_CACHE: /Users/smith/.cache/huggingface/assets
- HF_TOKEN_PATH: /Users/smith/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /Users/smith/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
@NearBirdEZ NearBirdEZ added the bug Something isn't working label Apr 15, 2025
@Wauplin
Contributor

Wauplin commented Apr 15, 2025

Hi @NearBirdEZ, can you elaborate on what you expect versus what you currently get? Falling back to the existing cache instead of raising an error is actually a feature implemented on purpose, not a bug. If you've spotted a bug in a specific case, please let us know precisely.

@NearBirdEZ
Author

Hi @Wauplin,
Today, for example, a model download was interrupted by a 500 error from HF.
When I retried, instead of the download resuming (or an error that the server is unavailable), snapshot_download returned the local path as if the entire model were already available.
So if the download is automated and runs without close attention, code may end up working with incomplete data.
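For automated pipelines, one way to guard against this is to check for leftover partial-download markers before trusting the returned path. The sketch below assumes snapshot_download's internal convention of suffixing in-progress files with `.incomplete` under the hidden `.cache` directory inside `local_dir`; this is internal layout, not a documented API, so treat it as a heuristic:

```python
from pathlib import Path


def has_incomplete_files(local_dir: str) -> bool:
    """Return True if the snapshot directory still contains partial downloads.

    Assumption: snapshot_download stores in-progress chunks as *.incomplete
    files under a hidden .cache directory inside local_dir (internal layout
    of huggingface_hub, not a stable public API).
    """
    return any(Path(local_dir).rglob("*.incomplete"))
```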

@Wauplin
Contributor

Wauplin commented Apr 15, 2025

Yes indeed, you are correct. When implementing this, we decided that the benefit of returning a local cache outweighs the drawback of potentially returning a half-empty folder. It's not perfect, but it's a compromise we knowingly chose. So this is not a bug, and not really something we want to iterate on at the moment. If we see strong interest in this topic in the future, we might re-evaluate.
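For anyone who needs stricter behavior today, a thin wrapper can fail fast instead of relying on the fallback: probe the Hub explicitly and let any error propagate before calling snapshot_download. A minimal sketch (the wrapper name is mine; it only uses the public `HfApi.repo_info` and `snapshot_download` calls):

```python
import huggingface_hub


def strict_snapshot_download(repo_id: str, revision: str, local_dir: str) -> str:
    # Probe the Hub first: if the server is unreachable (timeout, 5xx, ...),
    # this raises instead of letting snapshot_download silently fall back
    # to a possibly incomplete local cache.
    huggingface_hub.HfApi().repo_info(repo_id=repo_id, revision=revision)
    return huggingface_hub.snapshot_download(
        repo_id=repo_id, revision=revision, local_dir=local_dir
    )
```

This is still racy (the Hub could go down between the probe and the download), but it turns the silent fallback into an explicit error for the common outage case.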
