Describe the bug
This code fails to load the dataset it just saved:
from datasets import load_dataset
from transformers import AutoTokenizer
MODEL = "google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
dataset = load_dataset("yelp_review_full")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.save_to_disk("dataset")
tokenized_datasets = load_dataset("dataset/") # raises
It raises ValueError: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('test'): ('json', {})}.
I believe this bug is caused by the logic that infers the dataset format: it picks the most common file extension found in each split directory. However, a small dataset can fit in a single .arrow file and have two JSON metadata files, causing the format to be inferred as JSON:
$ ls -l dataset/test
-rw-r--r-- 1 sliedes sliedes 191498784 Jul 1 13:55 data-00000-of-00001.arrow
-rw-r--r-- 1 sliedes sliedes 1730 Jul 1 13:55 dataset_info.json
-rw-r--r-- 1 sliedes sliedes 249 Jul 1 13:55 state.json
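To illustrate the suspected failure mode, here is a minimal sketch of majority-vote format inference. It is not the library's actual code: the function name `infer_format_by_majority`, the `EXTENSION_TO_FORMAT` mapping, and the three-shard train listing are all hypothetical. Only the test split listing matches the directory shown above.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to data format (illustration only).
EXTENSION_TO_FORMAT = {".arrow": "arrow", ".json": "json"}


def infer_format_by_majority(filenames):
    """Return the format whose extension occurs most often, or None if no file matches."""
    counts = Counter(
        EXTENSION_TO_FORMAT[PurePosixPath(name).suffix]
        for name in filenames
        if PurePosixPath(name).suffix in EXTENSION_TO_FORMAT
    )
    return counts.most_common(1)[0][0] if counts else None


# Files from the test split above: one data shard, two JSON metadata files.
test_split_files = [
    "data-00000-of-00001.arrow",
    "dataset_info.json",
    "state.json",
]

# Assumed layout for a larger split: three data shards, same two metadata files.
train_split_files = [
    "data-00000-of-00003.arrow",
    "data-00001-of-00003.arrow",
    "data-00002-of-00003.arrow",
    "dataset_info.json",
    "state.json",
]

print(infer_format_by_majority(test_split_files))   # json
print(infer_format_by_majority(train_split_files))  # arrow
```

Under this assumption, the metadata files outvote the single data shard in the small split, while a split with three or more shards is still inferred as Arrow, which would produce the per-split mismatch reported in the error.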
Steps to reproduce the bug
Execute the code above.
Expected behavior
The dataset is loaded successfully.
Environment info
- datasets version: 2.20.0
- Platform: Linux-6.9.3-arch1-1-x86_64-with-glibc2.39
- Python version: 3.12.4
- huggingface_hub version: 0.23.4
- PyArrow version: 16.1.0
- Pandas version: 2.2.2
- fsspec version: 2024.5.0