Unit and integration tests currently need to be run with pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py. Otherwise, for instance with pytest tests/, test_trainer will be executed before test_trainer_distributed and the latter will fail without any error message (see the conftest.py sketch below for one way to enforce this order).
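As a minimal sketch (not part of the repo), a conftest.py hook could pin the collection order so that pytest tests/ also works:

```python
# conftest.py -- hypothetical sketch, not part of optimum-habana.
# Reorders collected tests so test_trainer_distributed runs before
# test_trainer, matching the order that currently works.
ORDER = ["test_gaudi_configuration", "test_trainer_distributed", "test_trainer"]

def pytest_collection_modifyitems(session, config, items):
    def rank(item):
        # "test_trainer_distributed" is checked before "test_trainer",
        # so the substring overlap between the two names is harmless.
        for i, name in enumerate(ORDER):
            if name in item.nodeid:
                return i
        return len(ORDER)  # unknown tests keep their place at the end

    items.sort(key=rank)  # stable sort preserves order within each file
```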
The following code snippet in training_args.py should actually not be executed in single-card mode and is responsible for this error:
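(The snippet itself is not reproduced here. As a loose reconstruction of the suspected pattern, under the assumption that the Habana plugin registers the "hccl" backend, it is the kind of distributed setup that should be guarded by a world-size check:)

```python
import os
import torch

# Hypothetical reconstruction of the suspected issue, NOT the actual
# optimum-habana source: the HCCL process group gets initialized even
# when only one card is used. Guarding on the world size skips the
# distributed setup in single-card mode.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
if world_size > 1:
    # Assumes habana_frameworks has registered the "hccl" backend.
    torch.distributed.init_process_group(backend="hccl")
```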
However, even when this is corrected, I still get the following error:
Traceback (most recent call last):
  File "/root/shared/optimum-habana/tests/test_trainer_distributed.py", line 117, in <module>
    trainer = GaudiTrainer(
  File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 118, in __init__
    super().__init__(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 382, in __init__
    self._move_model_to_device(model, args.device)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 548, in _move_model_to_device
    model = model.to(device)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: Device acquire failed.
I think this is because one process may still be running on an HPU when Torch tries to acquire devices; a sketch of one possible workaround follows.
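If that is what happens, one workaround would be to wait for the previous workers to release the devices before the next test starts. A rough sketch, assuming hl-smi (Habana's counterpart to nvidia-smi) is on PATH and that its idle output contains a recognizable marker:

```python
import subprocess
import time

def wait_for_hpu_release(timeout=60.0, poll_interval=2.0):
    """Poll hl-smi until no compute process is listed on the HPUs.

    The marker string is an assumption about hl-smi's output format;
    adjust it to match the installed SynapseAI version.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(["hl-smi"], capture_output=True, text=True).stdout
        if "No running processes found" in out:  # assumed idle marker
            return True
        time.sleep(poll_interval)
    return False
```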
This could be occurring for a number of reasons. From what I can see, GPU access seems unavailable, which could be caused by:
- Improper GPU configuration
- GPU drivers not installed
- GPU in use by another process
If this problem is occurring because another process is holding the device, as you suggested, try running "nvidia-smi" (if you are using NVIDIA) to see which processes are using the GPU; a sketch of that check follows below.
Also, check that all dependencies/libraries are up to date on your local device.
Hope this can help.
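For reference, a quick way to script that nvidia-smi check (NVIDIA GPUs only; on Gaudi machines hl-smi plays the same role):

```python
import subprocess

# Print the PID, name, and memory usage of every process currently
# holding the GPU. These query flags are standard nvidia-smi options.
result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True,
    text=True,
)
print(result.stdout)
```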