Description
Hi there,
Thanks for releasing the code. I encountered a cuDNN error while training the model on the Spider dataset. I am using PyTorch 1.7.0 with CUDA 11.0 and Python 3.7 on a GeForce RTX 3090.
I only changed the batch size and deleted some samples from the original Spider dataset. The log is shown below. Any idea what might be causing this?
Model initialization (xavier)
encoder_embeddings.trans_parameters.embeddings.word_embeddings.weight (skipped)
encoder_embeddings.trans_parameters.embeddings.position_embeddings.weight (skipped)
encoder_embeddings.trans_parameters.embeddings.token_type_embeddings.weight (skipped)
encoder_embeddings.trans_parameters.embeddings.LayerNorm.weight (skipped)
encoder_embeddings.trans_parameters.embeddings.LayerNorm.bias (skipped)
encoder_embeddings.trans_parameters.encoder.layer.0.attention.self.query.weight (skipped)
encoder_embeddings.trans_parameters.encoder.layer.0.attention.self.query.bias (skipped)
encoder_embeddings.trans_parameters.encoder.layer.0.attention.self.key.weight (skipped)
encoder_embeddings.trans_parameters.encoder.layer.0.attention.self.key.bias (skipped)
encoder_embeddings.trans_parameters.encoder.layer.0.attention.self.value.weight (skipped)
.......
Model Parameters
.....
mdl.encoder.text_encoder.rnn.rnn.rnn.bias_ih_l0 800 requires_grad=True
mdl.encoder.text_encoder.rnn.rnn.rnn.bias_hh_l0 800 requires_grad=True
mdl.encoder.text_encoder.rnn.rnn.rnn.weight_ih_l0_reverse 320000 requires_grad=True
Total # parameters = 342157588
wandb: Tracking run with wandb version 0.8.30
wandb: Wandb version 0.10.33 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Run data is saved locally in wandb/run-20210706_030553-1b8pui3b
wandb: Syncing run spider.bridge.lstm.meta.ts.ppl-0.85.2.dn.eo.feat.bert-large-uncased.xavier-1024-400-400-8-2-0.0005-inv-sqr-0.0005-4000-6e-05-inv-sqr-3e-05-4000-0.3-0.3-0.0-0.0-1-8-0.0-0.0-res-0.2-0.0-ff-0.4-0.0.210706-030553.daud
wandb: ⭐️ View project at https://app.wandb.ai/ningzheng/smore-spider-group--final
wandb: 🚀 View run at https://app.wandb.ai/ningzheng/smore-spider-group--final/runs/1b8pui3b
wandb: Run wandb off to turn off syncing.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [19:18<00:00, 1.73it/s]
Step 999.5: average training loss = 1.4097217868864536
0 pre-computed prediction order reconstruction cached
10%|█████████████▎ | 30/302 [05:23<48:53, 10.79s/it]
Traceback (most recent call last):
File "/home/ningzheng/miniconda3/envs/tsp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ningzheng/miniconda3/envs/tsp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ningzheng/TabularSemanticParsing/src/experiments.py", line 407, in <module>
run_experiment(args)
File "/home/ningzheng/TabularSemanticParsing/src/experiments.py", line 392, in run_experiment
train(sp)
File "/home/ningzheng/TabularSemanticParsing/src/experiments.py", line 63, in train
sp.run_train(train_data, dev_data)
wandb: Waiting for W&B process to finish, PID 1765293
File "/home/ningzheng/TabularSemanticParsing/src/common/learn_framework.py", line 250, in run_train
engine=engine, inline_eval=True, verbose=False)
File "/home/ningzheng/TabularSemanticParsing/src/semantic_parser/learn_framework.py", line 157, in inference
outputs = self.forward(formatted_batch, model_ensemble)
File "/home/ningzheng/TabularSemanticParsing/src/semantic_parser/learn_framework.py", line 129, in forward
decoder_ptr_value_ids=decoder_ptr_value_ids)
File "/home/ningzheng/miniconda3/envs/tsp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ningzheng/TabularSemanticParsing/src/semantic_parser/bridge.py", line 100, in forward
no_from=(self.dataset_name == 'wikisql'))
File "/home/ningzheng/TabularSemanticParsing/src/semantic_parser/decoding_algorithms.py", line 251, in beam_search
last_output=input)
File "/home/ningzheng/miniconda3/envs/tsp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ningzheng/TabularSemanticParsing/src/semantic_parser/bridge.py", line 349, in forward
output, hidden = self.rnn(input_sa, hidden)
File "/home/ningzheng/miniconda3/envs/tsp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ningzheng/TabularSemanticParsing/src/common/nn_modules.py", line 98, in forward
return self.rnn(inputs, hidden)
File "/home/ningzheng/miniconda3/envs/tsp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ningzheng/miniconda3/envs/tsp/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 582, in forward
self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [15,0,0], thread: [4,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [15,0,0], thread: [5,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [15,0,0], thread: [6,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [15,0,0], thread: [7,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [15,0,0], thread: [8,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [15,0,0], thread: [9,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [18,0,0], thread: [3,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [18,0,0], thread: [4,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [18,0,0], thread: [5,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [18,0,0], thread: [6,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [18,0,0], thread: [7,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [18,0,0], thread: [8,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [18,0,0], thread: [9,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: Run summary:
wandb: _step 4
wandb: _timestamp 1625541921.4755883
wandb: learning_rate/spider 0.0005
wandb: _runtime 1190.0860652923584
wandb: fine_tuning_rate/spider 3.37575e-05
wandb: cross_entropy_loss/spider 1.4097217868864536
wandb: Syncing files in wandb/run-20210706_030553-1b8pui3b:
wandb: code/src/experiments.py
wandb: plus 7 W&B file(s) and 1 media file(s)
wandb:
wandb: Synced spider.bridge.lstm.meta.ts.ppl-0.85.2.dn.eo.feat.bert-large-uncased.xavier-1024-400-400-8-2-0.0005-inv-sqr-0.0005-4000-6e-05-inv-sqr-3e-05-4000-0.3-0.3-0.0-0.0-1-8-0.0-0.0-res-0.2-0.0-ff-0.4-0.0.210706-030553.daud: https://app.wandb.ai/ningzheng/smore-spider-group--final/runs/1b8pui3b
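In case it helps with narrowing things down: the ScatterGatherKernel assertions look like the actual failure (an out-of-range index reaching a CUDA scatter/gather kernel), and the cuDNN error at the LSTM call is probably just where the device-side assert happened to surface, since CUDA kernels run asynchronously. Below is a minimal sketch (hypothetical tensors, not the project's code) of how this kind of assertion is typically triggered:

```python
import torch

# Illustration only: feeding an out-of-range index to a CUDA gather/scatter op
# trips the same "idx_dim >= 0 && idx_dim < index_size" assertion seen in the log.
src = torch.randn(2, 5, device="cuda")
idx = torch.tensor([[0, 7], [1, 7]], device="cuda")  # 7 is out of bounds for dim 1 (size 5)
out = torch.gather(src, 1, idx)                      # failing kernel is launched asynchronously
torch.cuda.synchronize()                             # the device-side assert is reported here,
                                                     # or at whatever CUDA call runs next
```

Rerunning with the environment variable CUDA_LAUNCH_BLOCKING=1 should make the error point at the exact op that receives the bad index rather than at the LSTM; possibly it is related to the samples I deleted, but I am not sure.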