
[v2.8] Neuron inference trace analyzer/bucketing unit tests hanging at GetParameterIdTensorMapping/TransferFromDevice #9378

Open
@jeffhataws

Description

🐛 Bug

We have multiple unit tests (Neuron inference trace analyzer/bucketing) that hang with the following stack trace:

#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000764de392105c in absl::lts_20230802::synchronization_internal::FutexWaiter::WaitUntil(std::atomic<int>*, int, absl::lts_20230802::synchronization_internal::KernelTimeout) ()
   from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#2  0x0000764de3921122 in absl::lts_20230802::synchronization_internal::FutexWaiter::Wait(absl::lts_20230802::synchronization_internal::KernelTimeout) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#3  0x0000764de3921343 in AbslInternalPerThreadSemWait_lts_20230802 ()
   from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#4  0x0000764de3923053 in absl::lts_20230802::Mutex::Block(absl::lts_20230802::base_internal::PerThreadSynch*) ()
   from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#5  0x0000764dd8822cf7 in absl::lts_20230802::Mutex::LockSlowWithDeadline(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, absl::lts_20230802::synchronization_internal::KernelTimeout, int) [clone .cold] ()
   from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#6  0x0000764dd8822d0c in absl::lts_20230802::Mutex::LockSlow(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, int) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#7  0x0000764de3924552 in absl::lts_20230802::Notification::WaitForNotification() const ()
   from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#8  0x0000764de2067508 in tsl::BlockUntilReady(tsl::AsyncValue*) ()
   from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#9  0x0000764dd8f7b20e in torch_xla::runtime::PjRtComputationClient::TransferFromDevice(absl::lts_20230802::Span<std::shared_ptr<torch_xla::runtime::ComputationClient::Data> const>) ()
   from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#10 0x0000764dd898494c in torch_xla::(anonymous namespace)::PyLoweringContext::GetParameterIdTensorMapping() ()

After bisecting the torch-xla nightlies, I narrowed the regression down to commit 8dc5b49. Reverting this commit resolves the hang.

To Reproduce

I will work on a self-contained unit test to demonstrate the hang, since the unit tests above depend on torch-neuronx.
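In the meantime, one way to turn a hang like this into a deterministic test failure is to run the suspect code in a child interpreter under a timeout. The following is a minimal sketch using only the Python standard library. `run_with_timeout` is a hypothetical helper name, and the `candidate` snippet is a placeholder that does not call torch_xla or torch-neuronx; the real reproducer body would have to be substituted once it exists.

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout_s: float) -> str:
    """Run `code` in a fresh interpreter and classify the outcome.

    Returns "ok" on exit code 0, "failed" on a non-zero exit code,
    and "hang" if the child does not finish within `timeout_s` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # Placeholder only: replace with the code path that is suspected to hang.
    candidate = "import time; time.sleep(0.1)"
    print(run_with_timeout(candidate, timeout_s=30.0))
```

Running the candidate in a separate process keeps a blocked native call (such as the wait in the stack trace above) from stalling the test runner itself, which also makes the check usable as the pass/fail step when bisecting nightlies.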

Expected behavior

No hang

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: Neuron
  • torch_xla version: 2.8

Additional context

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-analyze.html#torch-neuronx-analyze-api

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-trace.html#torch-neuronx-trace-api
