Error during validation when using 2 GPUs (but it's OK when using one GPU) #17

Open
ZHO9504 opened this issue Jul 20, 2019 · 9 comments

ZHO9504 commented Jul 20, 2019

Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
EM: 61.2193, f1: 69.6262, qas_used_fraction: 1.0000, loss: 4.3453 ||: : 17502it [6:26:59, 1.33s/it]
2019-07-20 15:09:22,954 - INFO - allennlp.training.trainer - Validating
EM: 48.9301, f1: 59.0550, qas_used_fraction: 1.0000, loss: 5.1889 ||: : 94it [00:41, 2.15it/s]Traceback (most recent call last):
File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 21, in <module>
run()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
args.func(args)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
args.cache_prefix)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
cache_directory, cache_prefix)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 243, in train_model
metrics = trainer.train()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 493, in train
val_loss, num_batches = self._validation_loss()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss
loss = self.batch_loss(batch_group, for_training=False)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 258, in batch_loss
output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/util.py", line 336, in data_parallel
losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, outputs)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: tensor.ndimension() == static_cast<int64_t>(expected_size.size()) ASSERT FAILED at /pytorch/torch/csrc/cuda/comm.cpp:232, please report a bug to PyTorch. (gather at /pytorch/torch/csrc/cuda/comm.cpp:232)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f6d3dad8441 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f6d3dad7d7a in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::gather(c10::ArrayRef<at::Tensor>, long, c10::optional) + 0x962 (0x7f6d132be792 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #3: + 0x5a3d1c (0x7f6d33e0bd1c in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x130fac (0x7f6d33998fac in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: _PyMethodDef_RawFastCallKeywords + 0x264 (0x5567e0e3c6e4 in python3.7)
frame #6: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #7: _PyEval_EvalFrameDefault + 0x4e8c (0x5567e0e982bc in python3.7)
frame #8: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #9: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #10: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #11: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #12: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #13: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7f6d33c1c301 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #14: PyCFunction_Call + 0xe7 (0x5567e0dffbe7 in python3.7)
frame #15: _PyEval_EvalFrameDefault + 0x5d21 (0x5567e0e99151 in python3.7)
frame #16: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #17: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #18: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #19: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #20: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #21: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #22: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #23: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #24: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #25: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #26: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #27: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #28: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #29: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #30: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #31: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #32: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #33: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #35: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #36: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #37: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #38: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #39: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #40: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #41: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #42: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #43: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #44: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #45: PyEval_EvalCodeEx + 0x44 (0x5567e0dda3c4 in python3.7)
frame #46: PyEval_EvalCode + 0x1c (0x5567e0dda3ec in python3.7)
frame #47: + 0x1e004d (0x5567e0ea304d in python3.7)
frame #48: _PyMethodDef_RawFastCallKeywords + 0xe9 (0x5567e0e3c569 in python3.7)
frame #49: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #50: _PyEval_EvalFrameDefault + 0x4755 (0x5567e0e97b85 in python3.7)
frame #51: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #52: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #53: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #54: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #55: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #56: + 0x222d77 (0x5567e0ee5d77 in python3.7)
frame #57: + 0x23ae95 (0x5567e0efde95 in python3.7)
frame #58: _Py_UnixMain + 0x3c (0x5567e0efdf7c in python3.7)
frame #59: __libc_start_main + 0xf0 (0x7f6d4ea12830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #60: + 0x1e0122 (0x5567e0ea3122 in python3.7)

I don't know why this happens.
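
The assertion at comm.cpp:232 fires when the tensors handed to gather do not all have the same number of dimensions. A minimal sketch of that failure mode (a hypothetical reproduction, not code from the baseline; it assumes two visible GPUs):

```python
# Hypothetical repro of the gather assertion (assumes 2 GPUs are visible):
# torch.cuda.comm.gather expects every gathered tensor to have the same number
# of dimensions, so mixing a 1-D and a 0-D loss fails like the trace above.
import torch
from torch.cuda import comm

loss_a = torch.tensor([1.0], device="cuda:0")  # 1-D loss from replica 0
loss_b = torch.tensor(2.0, device="cuda:1")    # 0-D (scalar) loss from replica 1
comm.gather([loss_a, loss_b], dim=0, destination=0)
# RuntimeError: tensor.ndimension() == static_cast<int64_t>(expected_size.size()) ASSERT FAILED ...
```

In the trace, allennlp/training/util.py line 336 does call `unsqueeze(0)` on every replica's loss before gathering, so the mismatch presumably originates in what each replica returns as `loss`; that part is an assumption, not something confirmed in this thread.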

ZHO9504 (Author) commented Jul 20, 2019

My run command is:
python3.7 -m allennlp.run train /home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/baseline/MRQA_BERTLarge.jsonnet -s Models/large_f5/ -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': '/home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/data/train/TriviaQA-web.jsonl.gz', 'validation_data_path': '/home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/data/dev-indomain/TriviaQA-web.jsonl.gz', 'trainer': {'cuda_device': [0,1], 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '50000'}}}" --include-package mrqa_allennlp

This happens regardless of the train_data_path.
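
Since the failure only shows up with `'cuda_device': [0,1]`, one hypothetical way to narrow it down (a debugging suggestion, not part of the MRQA baseline) is to add a temporary print just above the `gather(...)` call in allennlp/training/util.py (line 336 in the trace above), where `outputs` is the list of per-replica output dicts and `used_device_ids` is already in scope:

```python
# Hypothetical debugging snippet to paste just above the gather(...) call in
# allennlp/training/util.py::data_parallel; it logs each replica's loss shape
# so the replica producing an unexpected shape is visible right before the crash.
for device_id, output in zip(used_device_ids, outputs):
    loss = output.get("loss")
    shape = tuple(loss.shape) if loss is not None else None
    print(f"cuda:{device_id} -> loss shape {shape}")
```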

@alontalmor (Collaborator)

Hi ZHO9504, I will try to reproduce this, but since it does not happen on 1 GPU it is likely an allennlp multi-GPU problem. Which version of allennlp are you using? Thanks.

ZHO9504 closed this as completed Jul 21, 2019
ZHO9504 (Author) commented Jul 21, 2019

Hi ZHO9504, i will try to reproduce this, but because this does not happen on 1 GPU it's likely to be an allennlp problem with multiGPU, which version of allennlp are you using? thanks

  1. Thank you for your reply. The version of allennlp I use:
     $ allennlp --version
     allennlp 0.8.5-unreleased
     I had the same issue with v0.8.4. torch 1.1.0.
  2. Validation is OK on HotpotQA/SearchQA using one or two GPUs,
     but the issue appears when validating TriviaQA/NaturalQuestionsShort/SearchQA with 2 GPUs.
     A little strange...

ZHO9504 reopened this Jul 21, 2019
@alontalmor (Collaborator)

It sounds like some edge case that's a bit difficult to reproduce...
Does it happen when you evaluate only on TriviaQA or NaturalQuestionsShort?

ZHO9504 (Author) commented Jul 22, 2019

It sounds like some edge case that's a bit difficult to reproduce...
Does it happen when you evaluate only on TriviaQA or NaturalQuestionsShort?

Yes, I evaluated on each of them, but only HotpotQA and SearchQA went well.
As long as the evaluation data includes a dataset such as TriviaQA, the procedure errors out.

@alontalmor (Collaborator)

OK, I'm trying to recreate and solve this, but it may take a few days.

@Alex-Fabbri

I also got this error during multi-GPU validation, but it is fine on a single GPU. Using allennlp v0.8.4 and torch 1.1.0.


Kaimary commented Oct 26, 2019

+1. I also got this error during the multi-GPU validation phase. Using allennlp v0.8.4 and torch 1.1.0.

@lucadiliello

I was able to train on every MRQA task with any number of GPUs using pytorch-lightning. I published the scripts here: https://github.com/lucadiliello/mrqa-lightning
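
For reference, a minimal generic pytorch-lightning multi-GPU sketch (an illustration only, not code taken from the mrqa-lightning repository; the toy module and data are placeholders):

```python
# Generic multi-GPU pytorch-lightning sketch (placeholder model/data, not from
# the mrqa-lightning repo above). Recent Lightning versions pick DDP across the
# requested devices automatically.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.linear(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.linear(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

if __name__ == "__main__":
    data = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    loader = DataLoader(data, batch_size=8)
    trainer = pl.Trainer(accelerator="gpu", devices=2, max_epochs=1)
    trainer.fit(ToyModule(), train_dataloaders=loader, val_dataloaders=loader)
```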
