🐛 Bug
We have multiple unit tests (Neuron inference trace analyzer/bucketing) that hang with the following backtrace:
```
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000764de392105c in absl::lts_20230802::synchronization_internal::FutexWaiter::WaitUntil(std::atomic<int>*, int, absl::lts_20230802::synchronization_internal::KernelTimeout) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#2 0x0000764de3921122 in absl::lts_20230802::synchronization_internal::FutexWaiter::Wait(absl::lts_20230802::synchronization_internal::KernelTimeout) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#3 0x0000764de3921343 in AbslInternalPerThreadSemWait_lts_20230802 ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#4 0x0000764de3923053 in absl::lts_20230802::Mutex::Block(absl::lts_20230802::base_internal::PerThreadSynch*) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#5 0x0000764dd8822cf7 in absl::lts_20230802::Mutex::LockSlowWithDeadline(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, absl::lts_20230802::synchronization_internal::KernelTimeout, int) [clone .cold] ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#6 0x0000764dd8822d0c in absl::lts_20230802::Mutex::LockSlow(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, int) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#7 0x0000764de3924552 in absl::lts_20230802::Notification::WaitForNotification() const ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#8 0x0000764de2067508 in tsl::BlockUntilReady(tsl::AsyncValue*) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#9 0x0000764dd8f7b20e in torch_xla::runtime::PjRtComputationClient::TransferFromDevice(absl::lts_20230802::Span<std::shared_ptr<torch_xla::runtime::ComputationClient::Data> const>) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#10 0x0000764dd898494c in torch_xla::(anonymous namespace)::PyLoweringContext::GetParameterIdTensorMapping() ()
```
After bisecting the torch-xla nightlies, I narrowed it down to commit 8dc5b49. Reverting this commit resolves the hang.
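For reference, the bisection amounted to installing each nightly and checking whether the affected tests complete or time out. A rough, hypothetical sketch of such a hang check (the test module name and the timeout are placeholders, not the actual torch-neuronx test):

```python
# Hypothetical hang check used while bisecting nightlies: run the failing test
# with a timeout and treat a timeout as "hangs on this nightly".
# "test_trace_bucketing.py" and the 300 s timeout are placeholders.
import subprocess
import sys


def test_hangs(timeout_s: int = 300) -> bool:
    """Return True if the repro test does not finish within timeout_s seconds."""
    try:
        subprocess.run(
            [sys.executable, "-m", "pytest", "-x", "test_trace_bucketing.py"],
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # Still blocked (e.g. inside TransferFromDevice), so count it as a hang.
        return True
    return False


if __name__ == "__main__":
    print("hangs" if test_hangs() else "finishes normally")
```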
To Reproduce
I will work on a self-contained unit test to demonstrate the hang, since the failing tests above depend on torch-neuronx.
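For now, here is a rough, unverified sketch of the shape such a test might take, based on the backtrace above. It assumes the hang is reachable through the public `LoweringContext` binding (`torch_xla._XLAC.lowering.LoweringContext` and its `parameter_id_tensor_mapping` method, which correspond to frame #10); the tensor shapes and the context name are arbitrary placeholders, and I have not yet confirmed this minimal path hangs without torch-neuronx.

```python
# Hypothetical minimal repro sketch (not yet verified): trace a small lazy
# computation, then call the LoweringContext binding that sits behind frame
# #10 (PyLoweringContext::GetParameterIdTensorMapping) in the backtrace above.
# Run with e.g. PJRT_DEVICE=NEURON (or CPU) set in the environment.
import torch
import torch_xla
import torch_xla.core.xla_model as xm


def main():
    device = xm.xla_device()
    x = torch.randn(4, 4, device=device)
    y = x + 1  # leaves a pending lazy computation on the XLA device

    ctx = torch_xla._XLAC.lowering.LoweringContext("ReproContext")
    ctx.build([y])
    # This call reaches TransferFromDevice (frames #10 -> #9), which is where
    # the tests block after commit 8dc5b49.
    mapping = ctx.parameter_id_tensor_mapping()
    print(mapping)


if __name__ == "__main__":
    main()
```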
Expected behavior
No hang
Environment
- Reproducible on XLA backend [CPU/TPU/CUDA]: Neuron
- torch_xla version: 2.8