
Description
Hi Andrew,
This repo is super helpful, but I ran into an error while running validation:
```
Traceback (most recent call last):
  File "/train.py", line 283, in <module>
    main(args)
  File "/train.py", line 227, in main
    start_iter, running_loss = run_epoch(start_iter, running_loss, dh_model, summary_loss, model_opt, train_loader, val_loader, train_log_interval, val_log_interval, device, beam, gen_len, k, decoding_strategy, accum_iter, "FT Training Epoch [{}/{}]".format(i + 1, args.num_epochs_ft), save_dir, logger, text_encoder, show_progress=args.show_progress)
  File "/train.py", line 132, in run_epoch
    val_loss, scores = evaluate(val_loader, train_log_interval, model, text_encoder, device, beam, gen_len, k, decoding_strategy, summary_loss if summary_loss else compute_loss_fct)
  File "/train.py", line 97, in evaluate
    src_strs, new_refs, new_hyps = generate_outputs(model, pad_seq, mask_seq, text_encoder, device, beam, gen_len, k, decoding_strategy)
  File "/generate.py", line 18, in generate_outputs
    outputs = model(pad_output, mask_output, text_encoder, device, beam=beam, gen_len=gen_len, k=k, decoding_strategy=decoding_strategy, generate=True, min_len=min_len)
  File "/anaconda3/envs/apex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/anaconda3/envs/apex/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/anaconda3/envs/apex/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/anaconda3/envs/apex/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/anaconda3/envs/apex/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/anaconda3/envs/apex/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/anaconda3/envs/apex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/model_pytorch.py", line 220, in forward
    return self.generate(pad_output, mask_output, text_encoder, device, beam, gen_len, k, decoding_strategy, min_len=min_len)
  File "/model_pytorch.py", line 365, in generate
    generated_toks = self.beam_search(XMB, mask, classify_idx, text_encoder, beam=beam, gen_len=gen_len, min_len=min_len)
  File "/model_pytorch.py", line 333, in beam_search
    if finished_mask[i].item() == 1:
RuntimeError: CUDA error: an illegal memory access was encountered
```
It seems like an error related to the PyTorch and cudatoolkit versions. (My env: pytorch==1.3.0, cudatoolkit==10.1.)
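One thing I tried to narrow it down: since CUDA kernels launch asynchronously, the line shown in the traceback may not be the real fault site. Forcing synchronous kernel launches (a standard PyTorch debugging step) makes the error surface at the line that actually failed. The flag has to be set before CUDA is initialized:

```python
import os

# Must be set before the first CUDA call (i.e. before importing/using torch's
# CUDA functionality). With it, each kernel launch blocks until completion, so
# an illegal-memory-access error is raised at the real faulting line instead
# of at a later, unrelated .item() call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Running on a single GPU (without `DataParallel`) with this flag set should give a much more precise traceback.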
Have you encountered this error?
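As a possible workaround while this is open: the crash surfaces at `finished_mask[i].item()`, which synchronizes with the GPU on every loop iteration. One pattern I've seen for loops like this is to copy the mask to the CPU once and iterate there. A minimal sketch (the tensor below is just a stand-in for the real beam-search mask, which would live on the CUDA device):

```python
import torch

# Stand-in for the GPU-side finished_mask inside beam_search.
finished_mask = torch.tensor([1, 0, 1, 1])

# One device-to-host transfer instead of one .item() sync per index.
finished = finished_mask.cpu().tolist()
finished_beams = [i for i, f in enumerate(finished) if f == 1]
```

I don't know whether this fixes the underlying illegal access, but it at least moves the per-element device reads out of the Python loop.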
Besides, the code cannot be debugged in the PyCharm IDE.
Thanks :)