🐛 Bug
We have multiple unit tests (Neuron inference trace analyzer/bucketing) that hang with the following backtrace:
```
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000764de392105c in absl::lts_20230802::synchronization_internal::FutexWaiter::WaitUntil(std::atomic<int>*, int, absl::lts_20230802::synchronization_internal::KernelTimeout) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#2 0x0000764de3921122 in absl::lts_20230802::synchronization_internal::FutexWaiter::Wait(absl::lts_20230802::synchronization_internal::KernelTimeout) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#3 0x0000764de3921343 in AbslInternalPerThreadSemWait_lts_20230802 ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#4 0x0000764de3923053 in absl::lts_20230802::Mutex::Block(absl::lts_20230802::base_internal::PerThreadSynch*) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#5 0x0000764dd8822cf7 in absl::lts_20230802::Mutex::LockSlowWithDeadline(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, absl::lts_20230802::synchronization_internal::KernelTimeout, int) [clone .cold] ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#6 0x0000764dd8822d0c in absl::lts_20230802::Mutex::LockSlow(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, int) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#7 0x0000764de3924552 in absl::lts_20230802::Notification::WaitForNotification() const ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#8 0x0000764de2067508 in tsl::BlockUntilReady(tsl::AsyncValue*) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#9 0x0000764dd8f7b20e in torch_xla::runtime::PjRtComputationClient::TransferFromDevice(absl::lts_20230802::Span<std::shared_ptr<torch_xla::runtime::ComputationClient::Data> const>) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#10 0x0000764dd898494c in torch_xla::(anonymous namespace)::PyLoweringContext::GetParameterIdTensorMapping() ()
```
After bisecting the torch-xla nightlies, I narrowed it down to commit 8dc5b49. Reverting this commit resolves the hang.
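For reference, the bisection amounted to installing each nightly and checking whether the affected tests complete or time out. A rough, hypothetical sketch of such a hang check (the test module name and the timeout are placeholders, not the actual torch-neuronx test):

```python
# Hypothetical hang check used while bisecting nightlies: run the failing test
# with a timeout and treat a timeout as "hangs on this nightly".
# "test_trace_bucketing.py" and the 300 s timeout are placeholders.
import subprocess
import sys


def test_hangs(timeout_s: int = 300) -> bool:
    """Return True if the repro test does not finish within timeout_s seconds."""
    try:
        subprocess.run(
            [sys.executable, "-m", "pytest", "-x", "test_trace_bucketing.py"],
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # Still blocked (e.g. inside TransferFromDevice), so count it as a hang.
        return True
    return False


if __name__ == "__main__":
    print("hangs" if test_hangs() else "finishes normally")
```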
To Reproduce
I will work on a self-contained unit test to demonstrate the hang, since the failing tests above depend on torch-neuronx.
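For now, here is a rough, unverified sketch of the shape such a test might take, based on the backtrace above. It assumes the hang is reachable through the public `LoweringContext` binding (`torch_xla._XLAC.lowering.LoweringContext` and its `parameter_id_tensor_mapping` method, which correspond to frame #10); the tensor shapes and the context name are arbitrary placeholders, and I have not yet confirmed this minimal path hangs without torch-neuronx.

```python
# Hypothetical minimal repro sketch (not yet verified): trace a small lazy
# computation, then call the LoweringContext binding that sits behind frame
# #10 (PyLoweringContext::GetParameterIdTensorMapping) in the backtrace above.
# Run with e.g. PJRT_DEVICE=NEURON (or CPU) set in the environment.
import torch
import torch_xla
import torch_xla.core.xla_model as xm


def main():
    device = xm.xla_device()
    x = torch.randn(4, 4, device=device)
    y = x + 1  # leaves a pending lazy computation on the XLA device

    ctx = torch_xla._XLAC.lowering.LoweringContext("ReproContext")
    ctx.build([y])
    # This call reaches TransferFromDevice (frames #10 -> #9), which is where
    # the tests block after commit 8dc5b49.
    mapping = ctx.parameter_id_tensor_mapping()
    print(mapping)


if __name__ == "__main__":
    main()
```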
Expected behavior
No hang
Environment
- Reproducible on XLA backend [CPU/TPU/CUDA]: Neuron
- torch_xla version: 2.8