Broken Huggingface Datasets integration #10700

Open
maxstrobel opened this issue Mar 5, 2025 · 2 comments
Labels
bug Did we break something?

Comments

@maxstrobel

Bug Report

Description

The DVC integration with Hugging Face Datasets seems to be broken: loading a dvc:// file with load_dataset raises a TypeError.
I followed this guide: https://dvc.org/doc/user-guide/integrations/huggingface

Reproduce

from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files="dvc://workshop/satellite-data/jan_train.csv",
    storage_options={"url": "https://github.com/iterative/dataset-registry.git"},
)

print(dataset)

Traceback (most recent call last):
  File "C:\tmp\test\load.py", line 3, in <module>
    dataset = load_dataset(
              ^^^^^^^^^^^^^
  File "C:\tmp\test\.venv\Lib\site-packages\datasets\load.py", line 2151, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\tmp\test\.venv\Lib\site-packages\datasets\builder.py", line 808, in download_and_prepare
    fs, output_dir = url_to_fs(output_dir, **(storage_options or {}))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: url_to_fs() got multiple values for argument 'url'

Expected

The integration works: the indicated file is downloaded and opened.

Environment information

Python version

python --version
Python 3.11.10

Venv (pip install datasets dvc):

Package                Version
---------------------- -----------
aiohappyeyeballs       2.4.6
aiohttp                3.11.13
aiohttp-retry          2.9.1
aiosignal              1.3.2
amqp                   5.3.1
annotated-types        0.7.0
antlr4-python3-runtime 4.9.3
appdirs                1.4.4
asyncssh               2.20.0
atpublic               5.1
attrs                  25.1.0
billiard               4.2.1
celery                 5.4.0
certifi                2025.1.31
cffi                   1.17.1
charset-normalizer     3.4.1
click                  8.1.8
click-didyoumean       0.3.1
click-plugins          1.1.1
click-repl             0.3.0
colorama               0.4.6
configobj              5.0.9
cryptography           44.0.1
datasets               3.3.2
dictdiffer             0.9.0
dill                   0.3.8
diskcache              5.6.3
distro                 1.9.0
dpath                  2.2.0
dulwich                0.22.7
dvc                    3.59.1
dvc-data               3.16.9
dvc-http               2.32.0
dvc-objects            5.1.0
dvc-render             1.0.2
dvc-studio-client      0.21.0
dvc-task               0.40.2
entrypoints            0.4
filelock               3.17.0
flatten-dict           0.4.2
flufl-lock             8.1.0
frozenlist             1.5.0
fsspec                 2024.12.0
funcy                  2.0
gitdb                  4.0.12
gitpython              3.1.44
grandalf               0.8
gto                    1.7.2
huggingface-hub        0.29.1
hydra-core             1.3.2
idna                   3.10
iterative-telemetry    0.0.10
kombu                  5.4.2
markdown-it-py         3.0.0
mdurl                  0.1.2
multidict              6.1.0
multiprocess           0.70.16
networkx               3.4.2
numpy                  2.2.3
omegaconf              2.3.0
orjson                 3.10.15
packaging              24.2
pandas                 2.2.3
pathspec               0.12.1
platformdirs           4.3.6
prompt-toolkit         3.0.50
propcache              0.3.0
psutil                 7.0.0
pyarrow                19.0.1
pycparser              2.22
pydantic               2.10.6
pydantic-core          2.27.2
pydot                  3.0.4
pygit2                 1.17.0
pygments               2.19.1
pygtrie                2.5.0
pyparsing              3.2.1
python-dateutil        2.9.0.post0
pytz                   2025.1
pywin32                308
pyyaml                 6.0.2
requests               2.32.3
rich                   13.9.4
ruamel-yaml            0.18.10
ruamel-yaml-clib       0.2.12
scmrepo                3.3.10
semver                 3.0.4
setuptools             75.8.0
shellingham            1.5.4
shortuuid              1.0.13
shtab                  1.7.1
six                    1.17.0
smmap                  5.0.2
sqltrie                0.11.2
tabulate               0.9.0
tomlkit                0.13.2
tqdm                   4.67.1
typer                  0.15.1
typing-extensions      4.12.2
tzdata                 2025.1
urllib3                2.3.0
vine                   5.1.0
voluptuous             0.15.2
wcwidth                0.2.13
xxhash                 3.5.0
yarl                   1.18.3
zc-lockfile            3.0.post1

Additional Information (if any):

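Unfortunately, url is a reserved argument in fsspec.url_to_fs, so the url key in storage_options collides with it when the options are unpacked as keyword arguments. Ideally, file system implementations like DVC should use another argument name to avoid this kind of error.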
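
The collision can be reproduced without datasets, fsspec, or DVC installed. The snippet below is only a sketch: url_to_fs_stub and url_to_fs_positional_only are hypothetical stand-ins that copy the shape of the call in the traceback (a positional url plus **kwargs). They are not the real fsspec function, and output_dir is a placeholder value. The second stand-in shows that, in plain Python, a positional-only parameter does not clash with a keyword of the same name. That is a general language feature, not a claim about how the linked fsspec change works.

```python
# Hypothetical stand-in with the same parameter shape as the call in the
# traceback: a positional parameter named `url` plus arbitrary keywords.
def url_to_fs_stub(url, **kwargs):
    return url, kwargs


# Same storage_options as in the reproduction above.
storage_options = {"url": "https://github.com/iterative/dataset-registry.git"}
output_dir = "some/cache/dir"  # placeholder, not the real cache path

# `output_dir` binds to `url` positionally, then `url=...` arrives again
# through the unpacked dict, so Python rejects the call.
try:
    url_to_fs_stub(output_dir, **storage_options)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)


# Hypothetical variant: `url` is positional-only (note the `/`), so a
# keyword named `url` simply lands in **kwargs instead of clashing.
def url_to_fs_positional_only(url, /, **kwargs):
    return url, kwargs


path, options = url_to_fs_positional_only(output_dir, **storage_options)
print(path, options)
```

With the first stand-in, the call fails in the same way as in the traceback. With the second, the url key from storage_options is passed through untouched.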

@shcheklein
Member

Should be fixed when fsspec/filesystem_spec#1802 lands

@maxstrobel
Author

@shcheklein Great, thanks for the support! 🎉

2 participants