Describe the bug
Concurrent processes modifying and pushing a dataset can overwrite each others' dataset card, leaving the dataset unusable.
Consider this scenario:
- we have an Arrow dataset
- there are N configs of the dataset
- there are N independent processes operating on each of the individual configs (e.g. adding a column, `new_col`)
- each process calls `push_to_hub` on their particular config when they're done processing
- all calls to `push_to_hub` succeed
- the `README.md` now has some configs with `new_col` added and some with `new_col` missing
Any attempt to load a config (using `load_dataset`) where `new_col` is missing will fail because of a schema mismatch between `README.md` and the Arrow files. Fixing the dataset requires updating `README.md` by hand with the correct schema for the affected config. In effect, `push_to_hub` is doing a `git push --force` (I found this behavior quite surprising).
We have hit this issue every time we run processing jobs over our datasets and have had to fix the corrupted schemas by hand.
Reading through the code, it seems that specifying a `parent_commit` hash around here https://github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py#L5794 would get us a normal, non-forced git push and avoid schema corruption. I'm not familiar enough with the code to know how to determine the commit hash from which the in-memory dataset card was loaded.
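To illustrate what I have in mind (the repo id and overall flow below are my own assumptions for illustration, not the actual `datasets` internals): record the repo's head commit when the card is read, then pass it as `parent_commit` so the Hub rejects the commit instead of force-overwriting if another process pushed in between.

```python
# Rough sketch only -- repo id and flow are assumptions, not the actual
# datasets internals around arrow_dataset.py#L5794.
from huggingface_hub import CommitOperationAdd, DatasetCard, HfApi

api = HfApi()
repo_id = "my-org/my-dataset"  # hypothetical dataset repo

# Remember which commit the dataset card was read from.
parent_commit = api.dataset_info(repo_id).sha
card = DatasetCard.load(repo_id)

# ... update card.data with the new schema for this config only ...

api.create_commit(
    repo_id=repo_id,
    repo_type="dataset",
    operations=[
        CommitOperationAdd(
            path_in_repo="README.md",
            path_or_fileobj=card.content.encode(),
        ),
        # ... plus the CommitOperationAdd entries for the Arrow shards ...
    ],
    commit_message="Update schema for one config",
    # With parent_commit set, the commit is rejected instead of silently
    # overwriting README.md if another process pushed in the meantime.
    parent_commit=parent_commit,
)
```

If `push_to_hub` threaded the card's source commit through to `create_commit` like this, a concurrent push would raise an error the caller could retry, rather than corrupting the schema.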
Steps to reproduce the bug
See above.
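If it helps, here is a minimal sketch of the scenario with two configs (repo id and config names are made up; any per-config processing that changes the schema should do):

```python
# Minimal sketch of the scenario: two workers, each processing a different
# config of the same dataset and calling push_to_hub when done.
# Repo id and config names are hypothetical.
from multiprocessing import Process

from datasets import load_dataset

REPO_ID = "my-org/my-dataset"


def process_config(config_name: str) -> None:
    ds = load_dataset(REPO_ID, config_name, split="train")
    ds = ds.add_column("new_col", [0] * len(ds))
    # Each push rewrites README.md from this worker's stale copy of the card,
    # so whichever push lands last overwrites the other config's schema update.
    ds.push_to_hub(REPO_ID, config_name=config_name)


if __name__ == "__main__":
    workers = [
        Process(target=process_config, args=(name,))
        for name in ("config_a", "config_b")
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Afterwards, loading the config whose card entry was overwritten fails with a schema mismatch between `README.md` and the Arrow files.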
Expected behavior
Concurrent edits to disjoint configs of a dataset should never corrupt the dataset schema.
Environment info
- `datasets` version: 2.20.0
- Platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- `huggingface_hub` version: 0.30.2
- PyArrow version: 19.0.1
- Pandas version: 2.2.2
- `fsspec` version: 2023.9.0