Describe the bug
Concurrent processes modifying and pushing a dataset can overwrite each others' dataset card, leaving the dataset unusable.
Consider this scenario:
- we have an Arrow dataset
- there are N configs of the dataset
- there are N independent processes operating on each of the individual configs (e.g. adding a column, `new_col`)
- each process calls `push_to_hub` on their particular config when they're done processing
- all calls to `push_to_hub` succeed
- the `README.md` now has some configs with `new_col` added and some with `new_col` missing
Any attempt to load a config (using `load_dataset`) where `new_col` is missing will fail because of a schema mismatch between `README.md` and the Arrow files. Fixing the dataset requires updating `README.md` by hand with the correct schema for the affected config. In effect, `push_to_hub` is doing a `git push --force` (I found this behavior quite surprising).
We have hit this issue every time we run processing jobs over our datasets and have had to fix the corrupted schemas by hand.
Reading through the code, it seems that specifying a `parent_commit` hash around here https://github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py#L5794 would get us a normal, non-forced git push and avoid schema corruption. I'm not familiar enough with the code to know how to determine the commit hash from which the in-memory dataset card was loaded.
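To illustrate what I have in mind (the repo id and overall flow below are my own assumptions for illustration, not the actual `datasets` internals): record the repo's head commit when the card is read, then pass it as `parent_commit` so the Hub rejects the commit instead of force-overwriting if another process pushed in between.

```python
# Rough sketch only -- repo id and flow are assumptions, not the actual
# datasets internals around arrow_dataset.py#L5794.
from huggingface_hub import CommitOperationAdd, DatasetCard, HfApi

api = HfApi()
repo_id = "my-org/my-dataset"  # hypothetical dataset repo

# Remember which commit the dataset card was read from.
parent_commit = api.dataset_info(repo_id).sha
card = DatasetCard.load(repo_id)

# ... update card.data with the new schema for this config only ...

api.create_commit(
    repo_id=repo_id,
    repo_type="dataset",
    operations=[
        CommitOperationAdd(
            path_in_repo="README.md",
            path_or_fileobj=card.content.encode(),
        ),
        # ... plus the CommitOperationAdd entries for the Arrow shards ...
    ],
    commit_message="Update schema for one config",
    # With parent_commit set, the commit is rejected instead of silently
    # overwriting README.md if another process pushed in the meantime.
    parent_commit=parent_commit,
)
```

If `push_to_hub` threaded the card's source commit through to `create_commit` like this, a concurrent push would raise an error the caller could retry, rather than corrupting the schema.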
Steps to reproduce the bug
See above.
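If it helps, here is a minimal sketch of the scenario with two configs (repo id and config names are made up; any per-config processing that changes the schema should do):

```python
# Minimal sketch of the scenario: two workers, each processing a different
# config of the same dataset and calling push_to_hub when done.
# Repo id and config names are hypothetical.
from multiprocessing import Process

from datasets import load_dataset

REPO_ID = "my-org/my-dataset"


def process_config(config_name: str) -> None:
    ds = load_dataset(REPO_ID, config_name, split="train")
    ds = ds.add_column("new_col", [0] * len(ds))
    # Each push rewrites README.md from this worker's stale copy of the card,
    # so whichever push lands last overwrites the other config's schema update.
    ds.push_to_hub(REPO_ID, config_name=config_name)


if __name__ == "__main__":
    workers = [
        Process(target=process_config, args=(name,))
        for name in ("config_a", "config_b")
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Afterwards, loading the config whose card entry was overwritten fails with a schema mismatch between `README.md` and the Arrow files.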
Expected behavior
Concurrent edits to disjoint configs of a dataset should never corrupt the dataset schema.
Environment info
- `datasets` version: 2.20.0
- Platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- `huggingface_hub` version: 0.30.2
- PyArrow version: 19.0.1
- Pandas version: 2.2.2
- `fsspec` version: 2023.9.0