issues with v6e tpu deployment #8591

Open
ttdd11 opened this issue Jan 20, 2025 · 0 comments
ttdd11 commented Jan 20, 2025

We previously used torch_xla==2.4 to run our experiments on v5e TPU nodes.

While upgrading to v6e, we ran into several issues. First, running the same code raised this error at the call to xmp.spawn(_mp_fn, args=(), start_method='fork'):

File "/usr/lib/python3.10/concurrent/futures/process.py", line 611, in __init__
raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
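
For reference, the message itself comes from the standard library: ProcessPoolExecutor rejects a worker count that is zero or negative. The sketch below reproduces only that check. The helper name try_pool is hypothetical, and the idea that the launcher ended up with a device count of 0 on v6e is an unverified assumption, not something we confirmed.

```python
from concurrent.futures import ProcessPoolExecutor


def try_pool(num_workers):
    """Return the error message from creating a pool, or None on success."""
    try:
        # max_workers is validated in the constructor, before any work is submitted.
        pool = ProcessPoolExecutor(max_workers=num_workers)
    except ValueError as exc:
        return str(exc)
    pool.shutdown()
    return None


# Hypothetical input: a worker count of 0, as if no TPU devices were detected.
print(try_pool(0))
```

If this matches what happens inside xmp.spawn, the thing to check would be how many devices the 2.4 runtime actually sees on the v6e host.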

Since v6e is newer hardware, we upgraded to torch_xla==2.6.0 using this install:

pip install "torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev20241201-cp310-cp310-linux_x86_64.whl" -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
pip3 install torch==2.6.0.dev20241201+cpu torchvision==0.20.0.dev20241201+cpu --index-url https://download.pytorch.org/whl/nightly/cpu
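
To confirm which versions actually ended up installed after these two commands, a small check along these lines could be run. It uses only importlib.metadata from the standard library. The helper name installed_versions is hypothetical, and the distribution names in the list (in particular libtpu) are assumptions that may need adjusting to match the output of pip list.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match `pip list`.
    names = ["torch", "torch_xla", "torchvision", "libtpu"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```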

After modifying our code to follow the newer API (i.e., using xla.launch), none of the intermediate print statements registered through xm.add_step_closure executed during the training run.

We assume this is because the run is not hitting a barrier for some reason, so we tried to force one by calling xm.optimizer_step(optimizer, barrier=True) instead of xm.optimizer_step(optimizer, barrier=False). That leads to the following errors:
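
To make the expectation concrete, here is a toy model in plain Python of how we understand step closures to behave: they are queued when registered and only run once a step boundary is reached. This is not torch_xla code; ToyStepQueue and run_toy_loop are hypothetical names used purely for illustration.

```python
class ToyStepQueue:
    """Toy stand-in for deferred step closures; NOT torch_xla's implementation."""

    def __init__(self):
        self._pending = []

    def add_step_closure(self, fn, args=()):
        # Only queue the call; nothing runs yet.
        self._pending.append((fn, args))

    def mark_step(self):
        # A step boundary flushes every queued closure in registration order.
        pending, self._pending = self._pending, []
        for fn, args in pending:
            fn(*args)


def run_toy_loop(steps, hit_barrier):
    """Return the messages the closures emitted during a toy training loop."""
    seen = []
    queue = ToyStepQueue()
    for step in range(steps):
        queue.add_step_closure(seen.append, args=(f"step {step}",))
        if hit_barrier:
            queue.mark_step()
    return seen
```

In this model, run_toy_loop(3, hit_barrier=False) produces no output at all, which is the symptom we see in the real run.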

F0120 12:29:19.154320 34242 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x7c16a78a55c4 (unknown)
@ 0x7c16a78a50f8 (unknown)
@ 0x7c16a7ada9a9 (unknown)
@ 0x7c16a1b9a092 (unknown)
@ 0x7c16a1b9d225 (unknown)
@ 0x7c16a1b9bf84 (unknown)
@ 0x7c16a1b9fda0 (unknown)
@ 0x7c16a1ba04fa (unknown)
@ 0x7c169fbd07de (unknown)
@ 0x7c169fbd35cc (unknown)
@ 0x7c169fbd6d57 (unknown)
@ 0x7c16a7469733 (unknown)
@ 0x7c16a746f9f6 (unknown)
@ 0x7c16a74784c5 (unknown)
@ 0x7c16a7723583 (unknown)
@ 0x7c18a0094ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=7c16a78a55c4,7c16a78a50f7,7c16a7ada9a8,7c16a1b9a091,7c16a1b9d224,7c16a1b9bf83,7c16a1b9fd9f,7c16a1ba04f9,7c169fbd07dd,7c169fbd35cb,7c169fbd6d56,7c16a7469732,7c16a746f9f5,7c16a74784c4,7c16a7723582,7c18a0094ac2&map=

https://symbolize.stripped_domain/r/?trace=7c18a00969fc,7c18a004251f&map=
*** SIGABRT received by PID 28602 (TID 34242) on cpu 20 from PID 28602; ***
E0120 12:29:19.315807 34242 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:19.315822 34242 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:19.315829 34242 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:19.315832 34242 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:19.315861 34242 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:19.315865 34242 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:19.830939 34285 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x7d32016a55c4 (unknown)
@ 0x7d32016a50f8 (unknown)
@ 0x7d32018da9a9 (unknown)
@ 0x7d31fb99a092 (unknown)
@ 0x7d31fb99d225 (unknown)
@ 0x7d31fb99bf84 (unknown)
@ 0x7d31fb99fda0 (unknown)
@ 0x7d31fb9a04fa (unknown)
@ 0x7d31f99d07de (unknown)
@ 0x7d31f99d35cc (unknown)
@ 0x7d31f99d6d57 (unknown)
@ 0x7d3201269733 (unknown)
@ 0x7d320126f9f6 (unknown)
@ 0x7d32012784c5 (unknown)
@ 0x7d3201523583 (unknown)
@ 0x7d33fa094ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=7d32016a55c4,7d32016a50f7,7d32018da9a8,7d31fb99a091,7d31fb99d224,7d31fb99bf83,7d31fb99fd9f,7d31fb9a04f9,7d31f99d07dd,7d31f99d35cb,7d31f99d6d56,7d3201269732,7d320126f9f5,7d32012784c4,7d3201523582,7d33fa094ac2&map=

https://symbolize.stripped_domain/r/?trace=7d33fa0969fc,7d33fa04251f&map=
*** SIGABRT received by PID 28601 (TID 34285) on cpu 134 from PID 28601; ***
E0120 12:29:19.847869 34285 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:19.847887 34285 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:19.847894 34285 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:19.847898 34285 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:19.847931 34285 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:19.847935 34285 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:23.305308 34355 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x70cb5d4a55c4 (unknown)
@ 0x70cb5d4a50f8 (unknown)
@ 0x70cb5d6da9a9 (unknown)
@ 0x70cb5779a092 (unknown)
@ 0x70cb5779d225 (unknown)
@ 0x70cb5779bf84 (unknown)
@ 0x70cb5779fda0 (unknown)
@ 0x70cb577a04fa (unknown)
@ 0x70cb557d07de (unknown)
@ 0x70cb557d35cc (unknown)
@ 0x70cb557d6d57 (unknown)
@ 0x70cb5d069733 (unknown)
@ 0x70cb5d06f9f6 (unknown)
@ 0x70cb5d0784c5 (unknown)
@ 0x70cb5d323583 (unknown)
@ 0x70cd55e94ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=70cb5d4a55c4,70cb5d4a50f7,70cb5d6da9a8,70cb5779a091,70cb5779d224,70cb5779bf83,70cb5779fd9f,70cb577a04f9,70cb557d07dd,70cb557d35cb,70cb557d6d56,70cb5d069732,70cb5d06f9f5,70cb5d0784c4,70cb5d323582,70cd55e94ac2&map=

https://symbolize.stripped_domain/r/?trace=70cd55e969fc,70cd55e4251f&map=
*** SIGABRT received by PID 28603 (TID 34355) on cpu 65 from PID 28603; ***
E0120 12:29:23.405821 34355 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:23.405837 34355 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:23.405845 34355 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:23.405848 34355 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:23.405884 34355 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:23.405888 34355 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:19.154320 34242 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:24.483387 34242 process_state.cc:806] RAW: Raising signal 6 with default behavior
F0120 12:29:24.800956 34326 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x75e6f8ea55c4 (unknown)
@ 0x75e6f8ea50f8 (unknown)
@ 0x75e6f90da9a9 (unknown)
@ 0x75e6f319a092 (unknown)
@ 0x75e6f319d225 (unknown)
@ 0x75e6f319bf84 (unknown)
@ 0x75e6f319fda0 (unknown)
@ 0x75e6f31a04fa (unknown)
@ 0x75e6f11d07de (unknown)
@ 0x75e6f11d35cc (unknown)
@ 0x75e6f11d6d57 (unknown)
@ 0x75e6f8a69733 (unknown)
@ 0x75e6f8a6f9f6 (unknown)
@ 0x75e6f8a784c5 (unknown)
@ 0x75e6f8d23583 (unknown)
@ 0x75e8f1694ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=75e6f8ea55c4,75e6f8ea50f7,75e6f90da9a8,75e6f319a091,75e6f319d224,75e6f319bf83,75e6f319fd9f,75e6f31a04f9,75e6f11d07dd,75e6f11d35cb,75e6f11d6d56,75e6f8a69732,75e6f8a6f9f5,75e6f8a784c4,75e6f8d23582,75e8f1694ac2&map=

https://symbolize.stripped_domain/r/?trace=75e8f16969fc,75e8f164251f&map=
*** SIGABRT received by PID 28598 (TID 34326) on cpu 53 from PID 28598; ***
E0120 12:29:24.820820 34326 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:24.820834 34326 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:24.820839 34326 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:24.820842 34326 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:24.825330 34326 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:24.825456 34326 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:19.830939 34285 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:26.228373 34285 process_state.cc:806] RAW: Raising signal 6 with default behavior
F0120 12:29:24.800956 34326 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:27.231930 34326 process_state.cc:806] RAW: Raising signal 6 with default behavior
F0120 12:29:23.305308 34355 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:27.291814 34355 process_state.cc:806] RAW: Raising signal 6 with default behavior

The repo is quite large, and I can't reproduce this with a minimal example. Do you have any advice on how to troubleshoot this, or have you seen it before?

Is it possible to use version 2.4 on v6e compute nodes?

Thank you very much for the help.
