Unable to use 'convert_dataset.py' to load data #36

sandeep-krutrim · 2024-06-28T07:06:54Z

I get a server disconnected error when I run 'convert_dataset.py', even for the bookcorpus or wikipedia dataset.
If I set stream=False in the code, I get the following error instead:

Downloading data: 100%|██████████| 41/41 [03:06<00:00, 4.55s/files]
Generating train split: 100%|██████████| 6458670/6458670 [01:06<00:00, 96995.20 examples/s]
Loading dataset shards: 100%|██████████| 41/41 [00:00<00:00, 1805.01it/s]
Traceback (most recent call last):
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 524, in <module>
    main(parse_args())
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 489, in main
    loader = build_dataloader(dataset=dataset, batch_size=512)
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 397, in build_dataloader
    num_workers = min(64, dataset.hf_dataset.n_shards)  # type: ignore
AttributeError: 'Dataset' object has no attribute 'n_shards'

Please help me resolve this, as I am stuck on reproducing the training pipeline.


DanFu09 · 2024-06-28T15:34:42Z

This looks like it was caused by a change in the HuggingFace API between versions. Which version of HuggingFace transformers are you using?
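For anyone hitting the same traceback: the error message says the object is a plain 'Dataset', which is what a non-streaming load typically returns, while 'n_shards' is an attribute of the streaming (iterable) dataset type. The sketch below is one possible workaround, not the script's actual code. The helper name pick_num_workers and its fallback parameter are hypothetical, and the check for a num_shards attribute assumes that some newer releases of the datasets library use that name.

```python
def pick_num_workers(hf_dataset, max_workers=64, fallback=1):
    """Hypothetical helper: choose a DataLoader worker count for either
    a streaming (iterable) dataset or a map-style dataset.

    Assumption: streaming datasets expose a shard count as `n_shards`
    (or `num_shards` in some newer releases); map-style datasets do not.
    """
    for attr in ("n_shards", "num_shards"):
        shards = getattr(hf_dataset, attr, None)
        if isinstance(shards, int) and shards > 0:
            # Never start more workers than there are shards to read.
            return min(max_workers, shards)
    # No shard count available (e.g. a non-streaming Dataset): use a
    # conservative default instead of raising AttributeError.
    return fallback
```

Under these assumptions, line 397 in the traceback could become num_workers = pick_num_workers(dataset.hf_dataset). The fallback is deliberately small: if the wrapper iterates the whole dataset in every worker, a larger worker count on an unsharded dataset might duplicate samples, so it is worth checking the output before raising it.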

