We previously used torch_xla==2.4 to run our experiments on v5e TPU nodes.
While upgrading to v6e, we ran into some issues. First, running the same code raised this error at the call to xmp.spawn(_mp_fn, args=(), start_method='fork'):
File "/usr/lib/python3.10/concurrent/futures/process.py", line 611, in __init__
raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
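For reference, the exception itself comes from Python's standard library rather than from torch_xla: ProcessPoolExecutor rejects a worker count of zero or less. Below is a minimal sketch that reproduces the message without a TPU. That the launcher ended up with a device count of 0 on v6e is our assumption, not something we have confirmed, and num_devices is a hypothetical stand-in for that count.

```python
from concurrent.futures import ProcessPoolExecutor


def try_pool(num_workers):
    """Return the error message raised for this worker count, or None."""
    try:
        with ProcessPoolExecutor(max_workers=num_workers):
            return None
    except ValueError as exc:
        return str(exc)


# Hypothetical stand-in for "no TPU devices were detected".
num_devices = 0
message = try_pool(num_devices)
print(message)
```

If this is what happens, the real question is why the device count comes back as zero under torch_xla 2.4 on v6e.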
Since v6e is newer hardware, we upgraded to torch_xla==2.6.0 using this install:
pip install "torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev20241201-cp310-cp310-linux_x86_64.whl" -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
pip3 install torch==2.6.0.dev20241201+cpu torchvision==0.20.0.dev20241201+cpu --index-url https://download.pytorch.org/whl/nightly/cpu
After updating the code for the newer release (i.e. switching to xla.launch), none of the intermediate print statements registered with xm.add_step_closure ran during training.
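For context, our mental model of xm.add_step_closure is that it only queues the callable, which then runs once the step is marked. The toy class below is pure Python and only models that deferred behaviour; it is not torch_xla code, and ToyStepContext and its methods are hypothetical names chosen to mirror the real API.

```python
class ToyStepContext:
    """Pure-Python stand-in for deferred step closures (not torch_xla code)."""

    def __init__(self):
        self._pending = []

    def add_step_closure(self, fn, args=()):
        # Only queue the call; nothing runs yet.
        self._pending.append((fn, args))

    def mark_step(self):
        # Run and clear every queued closure, in registration order.
        pending, self._pending = self._pending, []
        for fn, args in pending:
            fn(*args)


logged = []
ctx = ToyStepContext()
for step in range(3):
    ctx.add_step_closure(logged.append, args=(f"step {step}",))

print(logged)  # nothing has run before the step is marked
ctx.mark_step()
print(logged)  # the queued closures have now run
```

Under this model, prints that never appear would mean the step is never being marked in our training loop.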
We assume the closures never fire because no barrier is reached, so we tried to force one with xm.optimizer_step(optimizer, barrier=True) instead of xm.optimizer_step(optimizer, barrier=False). That leads to the following errors:
F0120 12:29:19.154320 34242 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x7c16a78a55c4 (unknown)
@ 0x7c16a78a50f8 (unknown)
@ 0x7c16a7ada9a9 (unknown)
@ 0x7c16a1b9a092 (unknown)
@ 0x7c16a1b9d225 (unknown)
@ 0x7c16a1b9bf84 (unknown)
@ 0x7c16a1b9fda0 (unknown)
@ 0x7c16a1ba04fa (unknown)
@ 0x7c169fbd07de (unknown)
@ 0x7c169fbd35cc (unknown)
@ 0x7c169fbd6d57 (unknown)
@ 0x7c16a7469733 (unknown)
@ 0x7c16a746f9f6 (unknown)
@ 0x7c16a74784c5 (unknown)
@ 0x7c16a7723583 (unknown)
@ 0x7c18a0094ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=7c16a78a55c4,7c16a78a50f7,7c16a7ada9a8,7c16a1b9a091,7c16a1b9d224,7c16a1b9bf83,7c16a1b9fd9f,7c16a1ba04f9,7c169fbd07dd,7c169fbd35cb,7c169fbd6d56,7c16a7469732,7c16a746f9f5,7c16a74784c4,7c16a7723582,7c18a0094ac2&map=
https://symbolize.stripped_domain/r/?trace=7c18a00969fc,7c18a004251f&map=
*** SIGABRT received by PID 28602 (TID 34242) on cpu 20 from PID 28602; ***
E0120 12:29:19.315807 34242 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:19.315822 34242 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:19.315829 34242 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:19.315832 34242 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:19.315861 34242 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:19.315865 34242 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:19.830939 34285 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x7d32016a55c4 (unknown)
@ 0x7d32016a50f8 (unknown)
@ 0x7d32018da9a9 (unknown)
@ 0x7d31fb99a092 (unknown)
@ 0x7d31fb99d225 (unknown)
@ 0x7d31fb99bf84 (unknown)
@ 0x7d31fb99fda0 (unknown)
@ 0x7d31fb9a04fa (unknown)
@ 0x7d31f99d07de (unknown)
@ 0x7d31f99d35cc (unknown)
@ 0x7d31f99d6d57 (unknown)
@ 0x7d3201269733 (unknown)
@ 0x7d320126f9f6 (unknown)
@ 0x7d32012784c5 (unknown)
@ 0x7d3201523583 (unknown)
@ 0x7d33fa094ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=7d32016a55c4,7d32016a50f7,7d32018da9a8,7d31fb99a091,7d31fb99d224,7d31fb99bf83,7d31fb99fd9f,7d31fb9a04f9,7d31f99d07dd,7d31f99d35cb,7d31f99d6d56,7d3201269732,7d320126f9f5,7d32012784c4,7d3201523582,7d33fa094ac2&map=
https://symbolize.stripped_domain/r/?trace=7d33fa0969fc,7d33fa04251f&map=
*** SIGABRT received by PID 28601 (TID 34285) on cpu 134 from PID 28601; ***
E0120 12:29:19.847869 34285 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:19.847887 34285 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:19.847894 34285 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:19.847898 34285 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:19.847931 34285 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:19.847935 34285 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:23.305308 34355 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x70cb5d4a55c4 (unknown)
@ 0x70cb5d4a50f8 (unknown)
@ 0x70cb5d6da9a9 (unknown)
@ 0x70cb5779a092 (unknown)
@ 0x70cb5779d225 (unknown)
@ 0x70cb5779bf84 (unknown)
@ 0x70cb5779fda0 (unknown)
@ 0x70cb577a04fa (unknown)
@ 0x70cb557d07de (unknown)
@ 0x70cb557d35cc (unknown)
@ 0x70cb557d6d57 (unknown)
@ 0x70cb5d069733 (unknown)
@ 0x70cb5d06f9f6 (unknown)
@ 0x70cb5d0784c5 (unknown)
@ 0x70cb5d323583 (unknown)
@ 0x70cd55e94ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=70cb5d4a55c4,70cb5d4a50f7,70cb5d6da9a8,70cb5779a091,70cb5779d224,70cb5779bf83,70cb5779fd9f,70cb577a04f9,70cb557d07dd,70cb557d35cb,70cb557d6d56,70cb5d069732,70cb5d06f9f5,70cb5d0784c4,70cb5d323582,70cd55e94ac2&map=
https://symbolize.stripped_domain/r/?trace=70cd55e969fc,70cd55e4251f&map=
*** SIGABRT received by PID 28603 (TID 34355) on cpu 65 from PID 28603; ***
E0120 12:29:23.405821 34355 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:23.405837 34355 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:23.405845 34355 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:23.405848 34355 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:23.405884 34355 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:23.405888 34355 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:19.154320 34242 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:24.483387 34242 process_state.cc:806] RAW: Raising signal 6 with default behavior
F0120 12:29:24.800956 34326 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x75e6f8ea55c4 (unknown)
@ 0x75e6f8ea50f8 (unknown)
@ 0x75e6f90da9a9 (unknown)
@ 0x75e6f319a092 (unknown)
@ 0x75e6f319d225 (unknown)
@ 0x75e6f319bf84 (unknown)
@ 0x75e6f319fda0 (unknown)
@ 0x75e6f31a04fa (unknown)
@ 0x75e6f11d07de (unknown)
@ 0x75e6f11d35cc (unknown)
@ 0x75e6f11d6d57 (unknown)
@ 0x75e6f8a69733 (unknown)
@ 0x75e6f8a6f9f6 (unknown)
@ 0x75e6f8a784c5 (unknown)
@ 0x75e6f8d23583 (unknown)
@ 0x75e8f1694ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=75e6f8ea55c4,75e6f8ea50f7,75e6f90da9a8,75e6f319a091,75e6f319d224,75e6f319bf83,75e6f319fd9f,75e6f31a04f9,75e6f11d07dd,75e6f11d35cb,75e6f11d6d56,75e6f8a69732,75e6f8a6f9f5,75e6f8a784c4,75e6f8d23582,75e8f1694ac2&map=
https://symbolize.stripped_domain/r/?trace=75e8f16969fc,75e8f164251f&map=
*** SIGABRT received by PID 28598 (TID 34326) on cpu 53 from PID 28598; ***
E0120 12:29:24.820820 34326 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:24.820834 34326 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:24.820839 34326 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:24.820842 34326 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:24.825330 34326 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:24.825456 34326 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:19.830939 34285 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:26.228373 34285 process_state.cc:806] RAW: Raising signal 6 with default behavior
F0120 12:29:24.800956 34326 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:27.231930 34326 process_state.cc:806] RAW: Raising signal 6 with default behavior
F0120 12:29:23.305308 34355 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:27.291814 34355 process_state.cc:806] RAW: Raising signal 6 with default behavior
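Because the log above interleaves output from four processes, a small script like the following can condense it to one entry per crashing thread. It is only a sketch: the regular expression assumes the prefix layout seen in the lines above (severity letter, MMDD, time, thread id, file:line), and summarize_fatals is a hypothetical helper name.

```python
import re

# Assumed prefix layout, e.g.:
# F0120 12:29:19.154320 34242 fusion_emitter.cc:6694] message
FATAL_LINE = re.compile(
    r"^F(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<tid>\d+) "
    r"(?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)


def summarize_fatals(log_text):
    """Map each thread id to the first fatal (source, message) it logged."""
    first_fatal = {}
    for line in log_text.splitlines():
        match = FATAL_LINE.match(line.strip())
        if match:
            first_fatal.setdefault(match["tid"], (match["src"], match["msg"]))
    return first_fatal


# Sample lines copied from the log above.
sample = """\
F0120 12:29:19.154320 34242 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:24.483387 34242 process_state.cc:806] RAW: Raising signal 6 with default behavior
F0120 12:29:19.830939 34285 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
F0120 12:29:19.830939 34285 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
"""

for tid, (src, msg) in summarize_fatals(sample).items():
    print(tid, src, msg)
```

On our full log this shows the same fusion_emitter.cc:6694 check failing in every process, which suggests a single compiler-side failure rather than several distinct ones.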
The repo is quite large, and I can't reproduce this with a minimal example. Do you have any advice on how to troubleshoot this, or have you seen it before?
Is it possible to use torch_xla 2.4 on v6e compute nodes?
Thank you very much for the help.