push_to_hub is not concurrency safe (dataset schema corruption) #7600

Closed
@sharvil

Describe the bug

Concurrent processes modifying and pushing a dataset can overwrite each other's dataset card, leaving the dataset unusable.

Consider this scenario:

  • we have an Arrow dataset
  • there are N configs of the dataset
  • there are N independent processes, each operating on one of the individual configs (e.g. adding a column, new_col)
  • each process calls push_to_hub on its own config when it's done processing
  • all calls to push_to_hub succeed
  • the README.md now has some configs with new_col added and some with new_col missing

Any attempt to load a config (using load_dataset) whose README.md entry is missing new_col will fail because of a schema mismatch between README.md and the Arrow files. Fixing the dataset requires updating README.md by hand with the correct schema for the affected config. In effect, push_to_hub behaves like a git push --force (I found this behavior quite surprising).
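The lost update described above can be reproduced without touching the Hub. The following is a minimal in-memory sketch, not the actual push_to_hub code path: FakeHub, force_push and mismatched_configs are hypothetical names, and the model assumes each process reads the whole card, edits only its own config's entry, and then writes the whole card back unconditionally.

```python
import copy


class FakeHub:
    """Hypothetical in-memory stand-in for a dataset repo (not a real API)."""

    def __init__(self, configs):
        # Schema recorded in README.md, keyed by config name.
        self.card = {name: ["text"] for name in configs}
        # Schema of the data files actually stored for each config.
        self.data = {name: ["text"] for name in configs}

    def read_card(self):
        # Each process works on its own private copy of the card.
        return copy.deepcopy(self.card)

    def force_push(self, config, columns, card):
        # Upload the data for one config and overwrite the WHOLE card,
        # without checking whether the card changed since it was read.
        self.data[config] = list(columns)
        self.card = copy.deepcopy(card)


def mismatched_configs(hub):
    """Return configs whose card schema disagrees with their data files."""
    return sorted(name for name in hub.data if hub.card[name] != hub.data[name])


hub = FakeHub(["a", "b"])

# Both processes read the card before either one pushes.
card_a = hub.read_card()
card_b = hub.read_card()

# Each process adds new_col to its own config only.
card_a["a"] = ["text", "new_col"]
card_b["b"] = ["text", "new_col"]

hub.force_push("a", ["text", "new_col"], card_a)
hub.force_push("b", ["text", "new_col"], card_b)  # silently reverts the entry for "a"

print(mismatched_configs(hub))
```

Both pushes succeed, yet the second push restores the stale card entry for the first config, so that config's card no longer matches its data files.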

We have hit this issue every time we run processing jobs over our datasets and have to fix corrupted schemas by hand.

Reading through the code, it seems that specifying a parent_commit hash around here https://github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py#L5794 would get us to a normal, non-forced git push, and avoid schema corruption. I'm not familiar enough with the code to know how to determine the commit hash from which the in-memory dataset card was loaded.
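To illustrate what a parent-commit check would buy, here is a minimal in-memory sketch of optimistic concurrency with retry. It is a model under assumptions, not the datasets implementation: FakeRepo, StaleParentError and push_with_retry are hypothetical names, and an integer head counter stands in for a commit hash. As far as I can tell, huggingface_hub's HfApi.create_commit accepts a parent_commit argument with comparable semantics (the commit is rejected if the branch no longer points at that commit), but how push_to_hub would obtain and thread that hash through is left open here.

```python
import copy


class StaleParentError(Exception):
    """Raised when a push is based on a revision that is no longer the head."""


class FakeRepo:
    """Hypothetical in-memory repo; `head` stands in for a commit hash."""

    def __init__(self, configs):
        self.head = 0
        self.card = {name: ["text"] for name in configs}
        self.data = {name: ["text"] for name in configs}

    def snapshot(self):
        # Return the current revision together with a private copy of the card.
        return self.head, copy.deepcopy(self.card)

    def push(self, parent, config, columns, card):
        # Non-forced push: reject it if someone else committed since `parent`.
        if parent != self.head:
            raise StaleParentError(f"expected head {parent}, found {self.head}")
        self.data[config] = list(columns)
        self.card = copy.deepcopy(card)
        self.head += 1


def push_with_retry(repo, config, columns, snapshot=None, max_attempts=5):
    """Push one config; on a stale parent, re-read the card and try again."""
    if snapshot is None:
        snapshot = repo.snapshot()
    for attempt in range(1, max_attempts + 1):
        parent, card = snapshot
        card[config] = list(columns)  # edit only this process's own entry
        try:
            repo.push(parent, config, columns, card)
            return attempt
        except StaleParentError:
            snapshot = repo.snapshot()  # pick up the other process's changes
    raise RuntimeError(f"gave up after {max_attempts} attempts")


repo = FakeRepo(["a", "b"])

# Both processes snapshot the repo before either one pushes.
snap_a = repo.snapshot()
snap_b = repo.snapshot()

attempts_a = push_with_retry(repo, "a", ["text", "new_col"], snap_a)
attempts_b = push_with_retry(repo, "b", ["text", "new_col"], snap_b)

print(attempts_a, attempts_b, repo.card == repo.data)
```

In this model the second process's first push is rejected instead of overwriting the card, and its retry starts from the updated card, so the card and the data files stay consistent for both configs.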

Steps to reproduce the bug

See above.

Expected behavior

Concurrent edits to disjoint configs of a dataset should never corrupt the dataset schema.

Environment info

  • datasets version: 2.20.0
  • Platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • huggingface_hub version: 0.30.2
  • PyArrow version: 19.0.1
  • Pandas version: 2.2.2
  • fsspec version: 2023.9.0
