Unit and integration tests currently need to be run with pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py. Otherwise, for instance with pytest tests/, test_trainer will be executed before test_trainer_distributed and the latter will fail without any error message (see the conftest.py sketch below for one way to enforce this order).
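As a minimal sketch (not part of the repo), a conftest.py hook could pin the collection order so that pytest tests/ also works:

```python
# conftest.py -- hypothetical sketch, not part of optimum-habana.
# Reorders collected tests so test_trainer_distributed runs before
# test_trainer, matching the order that currently works.
ORDER = ["test_gaudi_configuration", "test_trainer_distributed", "test_trainer"]

def pytest_collection_modifyitems(session, config, items):
    def rank(item):
        # "test_trainer_distributed" is checked before "test_trainer",
        # so the substring overlap between the two names is harmless.
        for i, name in enumerate(ORDER):
            if name in item.nodeid:
                return i
        return len(ORDER)  # unknown tests keep their place at the end

    items.sort(key=rank)  # stable sort preserves order within each file
```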
The following code snippet in training_args.py should actually not be executed in single-card mode and is responsible for this error:
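(The snippet itself is not reproduced here. As a loose reconstruction of the suspected pattern, under the assumption that the Habana plugin registers the "hccl" backend, it is the kind of distributed setup that should be guarded by a world-size check:)

```python
import os
import torch

# Hypothetical reconstruction of the suspected issue, NOT the actual
# optimum-habana source: the HCCL process group gets initialized even
# when only one card is used. Guarding on the world size skips the
# distributed setup in single-card mode.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
if world_size > 1:
    # Assumes habana_frameworks has registered the "hccl" backend.
    torch.distributed.init_process_group(backend="hccl")
```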
However, even when this is corrected, I still get the following error:
Traceback (most recent call last):
  File "/root/shared/optimum-habana/tests/test_trainer_distributed.py", line 117, in <module>
    trainer = GaudiTrainer(
  File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 118, in __init__
    super().__init__(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 382, in __init__
    self._move_model_to_device(model, args.device)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 548, in _move_model_to_device
    model = model.to(device)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: Device acquire failed.
I think this is because one process may still be running on an HPU when Torch tries to acquire devices; a sketch of one possible workaround follows.
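If that is what happens, one workaround would be to wait for the previous workers to release the devices before the next test starts. A rough sketch, assuming hl-smi (Habana's counterpart to nvidia-smi) is on PATH and that its idle output contains a recognizable marker:

```python
import subprocess
import time

def wait_for_hpu_release(timeout=60.0, poll_interval=2.0):
    """Poll hl-smi until no compute process is listed on the HPUs.

    The marker string is an assumption about hl-smi's output format;
    adjust it to match the installed SynapseAI version.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(["hl-smi"], capture_output=True, text=True).stdout
        if "No running processes found" in out:  # assumed idle marker
            return True
        time.sleep(poll_interval)
    return False
```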
This could be occurring for a number of reasons. From what I can see, GPU access seems unavailable, which could be caused by:
- Improper GPU configuration
- GPU drivers not installed
- GPU in use by another process
If this problem is occurring because another process is holding the device, as you suggested, try running "nvidia-smi" (if you are using NVIDIA) to see which processes are using the GPU; a sketch of that check follows below.
Also, check that all dependencies/libraries are up to date on your local device.
Hope this can help.
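For reference, a quick way to script that nvidia-smi check (NVIDIA GPUs only; on Gaudi machines hl-smi plays the same role):

```python
import subprocess

# Print the PID, name, and memory usage of every process currently
# holding the GPU. These query flags are standard nvidia-smi options.
result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True,
    text=True,
)
print(result.stdout)
```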