snapshot_download ignores 5xx errors during resume, returns unsafe cached data #3007

Open
NearBirdEZ opened this issue Apr 15, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@NearBirdEZ
NearBirdEZ commented Apr 15, 2025

Describe the bug

Description
When resuming a download, snapshot_download catches the errors
[requests.exceptions.ConnectionError, requests.exceptions.Timeout, huggingface_hub.errors.OfflineModeIsEnabled, requests.HTTPError]
(e.g., during Hub downtime) and falls back to locally cached files without raising. This happens because the code prioritizes cache availability over server reachability, so a partial cache is returned as if it were complete.

Reproduction

import os
import random
from typing import Never, Type
from unittest.mock import patch

import huggingface_hub
import requests


class BadHfApi(huggingface_hub.HfApi):
    """HfApi stub whose repo_info always fails with one of the errors
    that snapshot_download catches before falling back to the cache."""

    def repo_info(self, *args, **kwargs) -> Never:
        expected_errors: list[Type[Exception]] = [
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
            huggingface_hub.errors.OfflineModeIsEnabled,
            requests.HTTPError,
        ]
        error_class: Type[Exception] = random.choice(expected_errors)
        print(f"Will raise {error_class=}")
        raise error_class("test timeout from hf")


def main() -> None:
    print("Continuing the download")
    repo_id: str = "deepcogito/cogito-v1-preview-qwen-32B"
    revision: str = "118e6e43c46a51667f544dae9cbe4027f59eb697"
    local_path: str = f"models/deepcogito/cogito_v1_preview_qwen_32b/{revision}"
    print(f"Before download: {os.listdir(local_path)=}")
    with patch("huggingface_hub._snapshot_download.HfApi", new=BadHfApi):
        result: str = huggingface_hub.snapshot_download(
            repo_id=repo_id,
            local_dir=local_path,
            revision=revision
        )
    print(f"{result=}")
    print(f"After download attempt: {os.listdir(local_path)=}")



if __name__ == "__main__":
    main()

Logs

Continuing the download
Before download: os.listdir(local_path)=['added_tokens.json', 'model-00001-of-00014.safetensors', 'images', 'config.json', 'README.md', 'merges.txt', '.gitattributes', '.cache']
Will raise error_class=<class 'requests.exceptions.Timeout'>
result='/Users/smith/work/hf_bug/models/deepcogito/cogito_v1_preview_qwen_32b/118e6e43c46a51667f544dae9cbe4027f59eb697'
After download attempt: os.listdir(local_path)=['added_tokens.json', 'model-00001-of-00014.safetensors', 'images', 'config.json', 'README.md', 'merges.txt', '.gitattributes', '.cache']
Returning existing local_dir `models/deepcogito/cogito_v1_preview_qwen_32b/118e6e43c46a51667f544dae9cbe4027f59eb697` as remote repo cannot be accessed in `snapshot_download` (test timeout from hf).

Process finished with exit code 0

System info

- huggingface_hub version: 0.30.2
- Platform: macOS-13.6.3-arm64-arm-64bit
- Python version: 3.12.0
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /Users/smith/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: osxkeychain
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: N/A
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: N/A
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: N/A
- pydantic: N/A
- aiohttp: N/A
- hf_xet: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /Users/smith/.cache/huggingface/hub
- HF_ASSETS_CACHE: /Users/smith/.cache/huggingface/assets
- HF_TOKEN_PATH: /Users/smith/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /Users/smith/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
@NearBirdEZ NearBirdEZ added the bug Something isn't working label Apr 15, 2025
@Wauplin
Contributor

Wauplin commented Apr 15, 2025

Hi @NearBirdEZ, can you elaborate on what you expect versus what you currently get? Falling back to the existing cache instead of raising an error is actually a feature implemented on purpose, not a bug. If you've spotted a bug in a specific case, please let us know precisely.

@NearBirdEZ
Author

Hi @Wauplin,
Today, for example, a model download was interrupted by a 500 error from HF.
When I retried, instead of the download resuming (or an error that the server is unavailable), snapshot_download returned the local path as if the entire model were already available.
So if the download is automated and runs without close attention, code may end up working with incomplete data.
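For automated pipelines, one way to guard against this is to check for leftover partial-download markers before trusting the returned path. The sketch below assumes snapshot_download's internal convention of suffixing in-progress files with `.incomplete` under the hidden `.cache` directory inside `local_dir`; this is internal layout, not a documented API, so treat it as a heuristic:

```python
from pathlib import Path


def has_incomplete_files(local_dir: str) -> bool:
    """Return True if the snapshot directory still contains partial downloads.

    Assumption: snapshot_download stores in-progress chunks as *.incomplete
    files under a hidden .cache directory inside local_dir (internal layout
    of huggingface_hub, not a stable public API).
    """
    return any(Path(local_dir).rglob("*.incomplete"))
```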

@Wauplin
Contributor

Wauplin commented Apr 15, 2025

Yes indeed, you are correct. When implementing this, we decided that the benefit of returning a local cache outweighs the drawback of potentially returning a half-empty folder. It's not perfect, but it's a compromise we knowingly chose. So this is not a bug, and not really something we want to iterate on at the moment. If we see strong interest in this topic in the future, we might re-evaluate.
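For anyone who needs stricter behavior today, a thin wrapper can fail fast instead of relying on the fallback: probe the Hub explicitly and let any error propagate before calling snapshot_download. A minimal sketch (the wrapper name is mine; it only uses the public `HfApi.repo_info` and `snapshot_download` calls):

```python
import huggingface_hub


def strict_snapshot_download(repo_id: str, revision: str, local_dir: str) -> str:
    # Probe the Hub first: if the server is unreachable (timeout, 5xx, ...),
    # this raises instead of letting snapshot_download silently fall back
    # to a possibly incomplete local cache.
    huggingface_hub.HfApi().repo_info(repo_id=repo_id, revision=revision)
    return huggingface_hub.snapshot_download(
        repo_id=repo_id, revision=revision, local_dir=local_dir
    )
```

This is still racy (the Hub could go down between the probe and the download), but it turns the silent fallback into an explicit error for the common outage case.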
