I am getting a "server disconnected" error when using convert_dataset.py, even for the bookcorpus or wikipedia datasets.
If I set stream=False in the code, I get the following error instead:
Downloading data: 100%|██████████| 41/41 [03:06<00:00, 4.55s/files]
Generating train split: 100%|██████████| 6458670/6458670 [01:06<00:00, 96995.20 examples/s]
Loading dataset shards: 100%|██████████| 41/41 [00:00<00:00, 1805.01it/s]
Traceback (most recent call last):
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 524, in <module>
    main(parse_args())
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 489, in main
    loader = build_dataloader(dataset=dataset, batch_size=512)
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 397, in build_dataloader
    num_workers = min(64, dataset.hf_dataset.n_shards)  # type: ignore
AttributeError: 'Dataset' object has no attribute 'n_shards'
Please help me resolve this, as I am stuck on reproducing the training pipeline.
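For context, the traceback shows `build_dataloader` reading `n_shards` from `dataset.hf_dataset`, and the non-streaming `Dataset` object does not have that attribute. Below is a minimal sketch of a guard that tolerates both cases. The helper name `pick_num_workers` and the two stand-in classes are hypothetical and only illustrate the idea; the fallback to a single worker is an assumption, chosen because it is not clear from the traceback how the script divides a non-streaming dataset among workers.

```python
def pick_num_workers(hf_dataset, max_workers=64):
    """Return a DataLoader worker count for streaming or non-streaming datasets.

    Hypothetical helper, not part of convert_dataset.py.
    """
    # The object in the traceback ('Dataset') has no n_shards attribute,
    # so read it defensively instead of accessing it directly.
    n_shards = getattr(hf_dataset, "n_shards", None)
    if n_shards is None:
        # Assumption: use one worker when no shard count is available.
        return 1
    return max(1, min(max_workers, n_shards))


# Hypothetical stand-ins for the two dataset kinds, used only for illustration.
class FakeStreamingDataset:
    n_shards = 41


class FakeMapStyleDataset:
    pass


if __name__ == "__main__":
    print(pick_num_workers(FakeStreamingDataset()))
    print(pick_num_workers(FakeMapStyleDataset()))
```

If this approach fits the script, line 397 could call such a helper in place of the direct `n_shards` access. Whether one worker is fast enough for a dataset of this size is a separate question.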
Posted by sandeep-krutrim on Fri, Jun 28, 2024 at 12:07 AM.